Enabling UCX communication
A CUDA cluster using UCX communication can be started automatically with LocalCUDACluster or manually with the
dask-cuda-worker CLI tool.
In either case, a
dask.distributed.Client must be made for the worker cluster using the same Dask UCX configuration; see UCX Integration – Configuration for details on all available options.
When using LocalCUDACluster with UCX communication, all required UCX configuration is handled through arguments supplied at construction; see API – Cluster for a complete list of these arguments. To connect a client to a cluster with all supported transports and an RMM pool:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(
    protocol="ucx",
    interface="ib0",
    enable_tcp_over_ucx=True,
    enable_nvlink=True,
    enable_infiniband=True,
    enable_rdmacm=True,
    ucx_net_devices="auto",
    rmm_pool_size="1GB",
)
client = Client(cluster)
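As a quick sanity check (a sketch, not part of the required setup), you can confirm from the client that the cluster is communicating over UCX and that the GPU workers have registered:

# The scheduler address should use the ucx:// protocol, and each GPU worker
# should appear in the scheduler's view of the cluster.
print(cluster.scheduler_address)                 # e.g. "ucx://10.0.0.1:8786"
print(list(client.scheduler_info()["workers"]))  # one address per GPU worker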
When using dask-cuda-worker with UCX communication, the scheduler, workers, and client must all be started manually, each using the same UCX configuration.
UCX configuration options will need to be specified for
dask-scheduler as environment variables; see Dask Configuration – Environment Variables for more details on the mapping between environment variables and options.
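As an illustration of that mapping, Dask strips the leading DASK_ prefix, treats double underscores as nesting, and lowercases the remainder of the variable name. The sketch below shows how the variables used in the scheduler command map onto configuration keys and how to inspect the resulting values from Python; the exact keys consumed by distributed may vary across versions:

# Rough mapping (an illustration; exact keys read by distributed may differ
# between versions):
#
#   DASK_UCX__CUDA_COPY=True  ->  {"ucx": {"cuda_copy": True}}
#   DASK_RMM__POOL_SIZE=1GB   ->  {"rmm": {"pool_size": "1GB"}}
import dask

# These return None unless the corresponding environment variables were set
# before the Python process started.
print(dask.config.get("ucx", default=None))
print(dask.config.get("rmm.pool_size", default=None))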
To start a Dask scheduler using UCX with all supported transports and a 1 GB RMM pool:
$ DASK_UCX__CUDA_COPY=True \
> DASK_UCX__TCP=True \
> DASK_UCX__NVLINK=True \
> DASK_UCX__INFINIBAND=True \
> DASK_UCX__RDMACM=True \
> DASK_UCX__NET_DEVICES=mlx5_0:1 \
> DASK_RMM__POOL_SIZE=1GB \
> dask-scheduler --protocol ucx --interface ib0
Note the specification of "mlx5_0:1" as our UCX net device; because the scheduler does not rely upon Dask-CUDA, it cannot automatically detect InfiniBand interfaces, so we must specify one explicitly. We communicate to the scheduler that we will be using UCX with the --protocol option, and that it should use the InfiniBand interface with the --interface option.
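If you are unsure which device and port to specify, a tool such as ibdev2netdev (shipped with Mellanox OFED; shown here as a sketch, and the output format may vary by system) lists how InfiniBand devices map to network interfaces:

$ ibdev2netdev
mlx5_0 port 1 ==> ib0 (Up)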
All UCX configuration options have analogous options in
dask-cuda-worker; see API – Worker for a complete list of these options.
To start workers with all supported transports and an RMM pool:
$ dask-cuda-worker ucx://<scheduler_address>:8786 \
> --enable-tcp-over-ucx \
> --enable-nvlink \
> --enable-infiniband \
> --enable-rdmacm \
> --net-devices="auto" \
> --rmm-pool-size="1GB"
A client can be configured to use UCX by using dask_cuda.initialize, a utility which takes the same UCX configuration arguments as LocalCUDACluster and adds them to the current Dask configuration, which is then used when the client is created; see API – Client initialization for a complete list of arguments.
To connect a client to the cluster we have made:
from dask.distributed import Client
from dask_cuda.initialize import initialize

initialize(
    enable_tcp_over_ucx=True,
    enable_nvlink=True,
    enable_infiniband=True,
    enable_rdmacm=True,
    net_devices="mlx5_0:1",
)
client = Client("ucx://<scheduler_address>:8786")
Note again the specification of "mlx5_0:1" as our UCX net device, because the client does not support automatic detection of InfiniBand interfaces.