Quickstart#
This page introduces the basics of a shuffle using rapidsmpf
.
rapidsmpf
is designed as a service that plugs into other libraries. This means
it isn’t typically used as a standalone library, and is expected to operate in
some larger runtime.
Dask-cuDF Example#
rapidsmpf
can be used with Dask-cuDF to shuffle a Dask DataFrame. This toy
example just loads the shuffled data into GPU memory. In practice, you would
reduce the output or write it to disk after shuffling.
import dask.distributed
import dask.dataframe as dd
from dask_cuda import LocalCUDACluster
from rapidsmpf.examples.dask import dask_cudf_shuffle
df = dask.datasets.timeseries().reset_index(drop=True).to_backend("cudf")
# RapidsMPF is compatible with `dask_cuda` workers.
# Use an rmm pool for optimal performance.
with LocalCUDACluster(rmm_pool_size=0.8) as cluster:
with dask.distributed.Client(cluster) as client:
shuffled = dask_cudf_shuffle(df, on=["name"])
# collect the results in memory.
result = shuffled.compute()
After shuffling on name
, all of the records with a particular name will be in
the same partition. See Dask Integration for more.