Quickstart#

This page introduces the basics of a shuffle using rapidsmpf.

rapidsmpf is designed as a service that plugs into other libraries. This means it isn’t typically used as a standalone library, and is expected to operate in some larger runtime.

Dask-cuDF Example#

rapidsmpf can be used with Dask-cuDF to shuffle a Dask DataFrame. This toy example just loads the shuffled data into GPU memory. In practice, you would reduce the output or write it to disk after shuffling.

import dask.distributed
import dask.dataframe as dd
from dask_cuda import LocalCUDACluster

from rapidsmpf.examples.dask import dask_cudf_shuffle


df = dask.datasets.timeseries().reset_index(drop=True).to_backend("cudf")

# RapidsMPF is compatible with `dask_cuda` workers.
# Use an rmm pool for optimal performance.
with LocalCUDACluster(rmm_pool_size=0.8) as cluster:
    with dask.distributed.Client(cluster) as client:
        shuffled = dask_cudf_shuffle(df, on=["name"])

        # collect the results in memory.
        result = shuffled.compute()

After shuffling on name, all of the records with a particular name will be in the same partition. See Dask Integration for more.