Communication can be a major bottleneck in distributed systems. Dask-CUDA addresses this by supporting integration with UCX, an optimized communication framework that provides high-performance networking and supports a variety of transport methods, including NVLink and InfiniBand for systems with specialized hardware, and TCP for systems without it. This integration is enabled through UCX-Py, an interface that provides Python bindings for UCX.
To use UCX with NVLink or InfiniBand, relevant GPUs must be connected with NVLink bridges or NVIDIA Mellanox InfiniBand Adapters, respectively. NVIDIA provides comparison charts for both NVLink bridges and InfiniBand adapters.
UCX integration requires an environment with both UCX and UCX-Py installed; see UCX-Py Installation for detailed instructions on this process.
When using UCX, each NVLink and InfiniBand memory buffer must create a mapping between each unique pair of processes they are transferred across; this can be quite costly, potentially in the range of hundreds of milliseconds per mapping. For this reason, it is strongly recommended to use RAPIDS Memory Manager (RMM) to allocate a memory pool that is only prone to a single mapping operation, which all subsequent transfers may rely upon. A memory pool also prevents the Dask scheduler from deserializing CUDA data, which will cause a crash.
In addition to installations of UCX and UCX-Py on your system, several options must be specified within your Dask configuration to enable the integration.
Typically, these will affect UCX_TLS and UCX_SOCKADDR_TLS_PRIORITY, environment variables used by UCX to decide what transport methods to use and which to prioritize, respectively.
However, some will affect related libraries, such as RMM:
ucx.cuda_copy: true – required. Adds cuda_copy to UCX_TLS, enabling CUDA transfers over UCX.
ucx.tcp: true – required. Adds tcp to UCX_TLS, enabling TCP transfers over UCX; this is required for very small transfers, which are inefficient over NVLink and InfiniBand.
ucx.nvlink: true – required for NVLink. Adds the NVLink transport to UCX_TLS, enabling NVLink transfers over UCX; affects intra-node communication only.
ucx.infiniband: true – required for InfiniBand. Adds the InfiniBand transport to UCX_TLS, enabling InfiniBand transfers over UCX.
For optimal performance with UCX 1.11 and above, it is recommended to also set the environment variable UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda; see the UCX documentation for more details on this variable.
ucx.rdmacm: true – recommended for InfiniBand. Sets rdmacm in UCX_SOCKADDR_TLS_PRIORITY, enabling remote direct memory access (RDMA) for InfiniBand transfers. This is recommended by UCX for use with InfiniBand, and will not work if InfiniBand is disabled.
ucx.net-devices: <str> – recommended for UCX 1.9 and older. Explicitly sets UCX_NET_DEVICES instead of defaulting to "all", which can result in suboptimal performance. If using InfiniBand, set to "auto" to automatically detect the InfiniBand interface closest to each GPU on UCX 1.9 and below. If InfiniBand is disabled, set to a UCX-compatible ethernet interface, e.g. "enp1s0f0" on a DGX-1. All available UCX-compatible interfaces can be listed by running ucx_info -d.
UCX 1.11 and above is capable of identifying the closest interfaces without setting UCX_NET_DEVICES, and "auto" is deprecated for UCX 1.11 and above, so it is recommended not to set ucx.net-devices in most cases. However, some recommendations for optimal performance still apply; see the documentation on ucx.infiniband above for details.
Setting ucx.net-devices: "auto" is deprecated for UCX 1.11 and above, and assumes that all InfiniBand interfaces on the system are connected and properly configured; undefined behavior may occur otherwise.
rmm.pool-size: <str|int> – recommended. Allocates an RMM pool of the specified size for the process; size can be provided as an integer number of bytes or in human-readable format, e.g. "4GB". It is recommended to set the pool size to at least the minimum amount of memory used by the process; if possible, one can map all GPU memory to a single pool, to be utilized for the lifetime of the process.
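The options above can be supplied programmatically through Dask's configuration system. The following is a minimal sketch; the full key paths (under the "distributed" namespace) are an assumption based on the option names on this page, and the "4GB" pool size is illustrative, so consult the configuration reference for your Dask/Dask-CUDA version for the exact schema:

```python
import dask

# Sketch: applying the UCX/RMM options through Dask's configuration
# system. Key paths are assumptions derived from the option names on
# this page ("ucx.cuda_copy", "rmm.pool-size", ...); verify them
# against your installed version's configuration reference.
dask.config.set({
    "distributed.comm.ucx.cuda_copy": True,    # required
    "distributed.comm.ucx.tcp": True,          # required
    "distributed.comm.ucx.nvlink": True,       # required for NVLink
    "distributed.comm.ucx.infiniband": False,  # this example system has no InfiniBand
    "distributed.rmm.pool-size": "4GB",        # RMM pool allocated per process
})
```

The same options can also be provided through DASK_-prefixed environment variables or a YAML configuration file, as with any other Dask configuration values.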
These options can be used with mainline Dask.distributed. However, some features are exclusive to Dask-CUDA, such as the automatic detection of InfiniBand interfaces. See Dask-CUDA – Motivation for more details on the benefits of using Dask-CUDA.
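As a sketch of the Dask-CUDA side, LocalCUDACluster exposes these options as constructor arguments. This requires a CUDA-capable machine with UCX and UCX-Py installed, and the "4GB" pool size is again illustrative:

```python
from distributed import Client
from dask_cuda import LocalCUDACluster

# Sketch: a local Dask-CUDA cluster communicating over UCX with
# NVLink enabled. Not runnable without CUDA-capable hardware.
cluster = LocalCUDACluster(
    protocol="ucx",            # use UCX instead of plain TCP
    enable_tcp_over_ucx=True,  # ucx.tcp
    enable_nvlink=True,        # ucx.nvlink (intra-node only)
    rmm_pool_size="4GB",       # rmm.pool-size; illustrative value
)
client = Client(cluster)
```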
See Enabling UCX communication for examples of UCX usage with different supported transports.