API
Cluster
- class dask_cuda.LocalCUDACluster(CUDA_VISIBLE_DEVICES=None, n_workers=None, threads_per_worker=1, memory_limit='auto', device_memory_limit='default', enable_cudf_spill=False, cudf_spill_stats=0, local_directory=None, protocol=None, enable_tcp_over_ucx=None, enable_infiniband=None, enable_nvlink=None, enable_rdmacm=None, rmm_pool_size=None, rmm_maximum_pool_size=None, rmm_managed_memory=False, rmm_async=False, rmm_allocator_external_lib_list=None, rmm_release_threshold=None, rmm_log_directory=None, rmm_track_allocations=False, log_spilling=False, pre_import=None, **kwargs)[source]
A variant of `dask.distributed.LocalCluster` that uses one GPU per process.
This assigns a different `CUDA_VISIBLE_DEVICES` environment variable to each Dask worker process.
For machines with a complex architecture mapping CPUs, GPUs, and network hardware, such as NVIDIA DGX-1 and DGX-2, this class creates a local cluster that tries to respect this hardware as much as possible.
Each worker process is automatically assigned the correct CPU cores and network interface cards to maximize performance. If UCX and distributed-ucxx are available, InfiniBand and NVLink connections can be used to optimize data transfer performance.
- Parameters:
- CUDA_VISIBLE_DEVICES str, list of int, or None, default None
  GPUs to restrict activity to. Can be a string (like `"0,1,2,3"`), a list (like `[0, 1, 2, 3]`), or `None` to use all available GPUs.
- n_workers int or None, default None
  Number of workers. Can be an integer or `None` to fall back on the GPUs specified by `CUDA_VISIBLE_DEVICES`. When `CUDA_VISIBLE_DEVICES` is specified, `n_workers` must be smaller than or equal to the number of GPUs it lists, and if smaller, only the first `n_workers` GPUs will be used.
- threads_per_worker int, default 1
Number of threads to be used for each Dask worker process.
- memory_limit int, float, str, or None, default "auto"
  Size of the host LRU cache, which is used to determine when the worker starts spilling to disk. Can be an integer (bytes), a float (fraction of total system memory), a string (like `"5GB"` or `"5000M"`), or `"auto"`, `0`, or `None` for no memory management.
- device_memory_limit int, float, str, or None, default "default"
  Size of the CUDA device LRU cache, which is used to determine when the worker starts spilling to host memory. Can be an integer (bytes), a float (fraction of total device memory), a string (like `"5GB"` or `"5000M"`), or `"auto"`, `0`, or `None` to disable spilling to host (i.e., allow full device memory usage). The special value `"default"` (the default) applies Dask-CUDA's recommended settings: 80% of the total device memory (equivalent to `0.8`), or disabled spilling (equivalent to `"auto"`/`0`) on devices without a dedicated memory resource, such as system-on-a-chip (SoC) devices.
- enable_cudf_spill bool, default False
Enable automatic cuDF spilling.
- cudf_spill_stats int, default 0
  Set the cuDF spilling statistics level. This option has no effect if `enable_cudf_spill=False`.
- local_directory str or None, default None
  Path on the local machine to store temporary files. Can be a string (like `"path/to/files"`) or `None` to fall back on the value of `dask.temporary-directory` in the local Dask configuration, using the current working directory if this is not set.
- protocol str or None, default None
  Protocol to use for communication. Can be a string (like `"tcp"` or `"ucx"`) or `None` to automatically choose the correct protocol.
- enable_tcp_over_ucx bool, default None
Set environment variables to enable TCP over UCX, even if InfiniBand and NVLink are not supported or disabled.
- enable_infiniband bool, default None
  Set environment variables to enable UCX over InfiniBand. Requires `protocol="ucx"` and implies `enable_tcp_over_ucx=True` when `True`.
- enable_nvlink bool, default None
  Set environment variables to enable UCX over NVLink. Requires `protocol="ucx"` and implies `enable_tcp_over_ucx=True` when `True`.
- enable_rdmacm bool, default None
  Set environment variables to enable UCX RDMA connection manager support. Requires `protocol="ucx"` and `enable_infiniband=True`.
- rmm_pool_size int, str or None, default None
  RMM pool size with which to initialize each worker. Can be an integer (bytes), a float (fraction of total device memory), a string (like `"5GB"` or `"5000M"`), or `None` to disable RMM pools.
  Note
  This size is a per-worker configuration, and not cluster-wide.
- rmm_maximum_pool_size int, str or None, default None
  When `rmm_pool_size` is set, this argument indicates the maximum pool size. Can be an integer (bytes), a float (fraction of total device memory), a string (like `"5GB"` or `"5000M"`), or `None`. By default, the total available memory on the GPU is used.
  Note
  When paired with `rmm_async=True`, the maximum size cannot be guaranteed due to fragmentation.
  Note
  This size is a per-worker configuration, and not cluster-wide.
- rmm_managed_memory bool, default False
  Initialize each worker with RMM and set it to use managed memory. If disabled, RMM may still be used by specifying `rmm_pool_size`.
  Warning
  Managed memory is currently incompatible with NVLink. Trying to enable both will result in an exception.
- rmm_async bool, default False
  Initialize each worker with RMM and set it to use RMM's asynchronous allocator. See `rmm.mr.CudaAsyncMemoryResource` for more info.
  Warning
  The asynchronous allocator is incompatible with RMM pools and managed memory. Trying to enable both will result in an exception.
- rmm_allocator_external_lib_list str, list or None, default None
  List of external libraries for which to set RMM as the allocator. Supported options are `["torch", "cupy"]`. Can be a comma-separated string (like `"torch,cupy"`) or a list of strings (like `["torch", "cupy"]`). If `None`, no external libraries will use RMM as their allocator.
- rmm_release_threshold int, str or None, default None
  When `rmm_async=True` and the pool size grows beyond this value, unused memory held by the pool will be released at the next synchronization point. Can be an integer (bytes), a float (fraction of total device memory), a string (like `"5GB"` or `"5000M"`), or `None`. By default, this feature is disabled.
  Note
  This size is a per-worker configuration, and not cluster-wide.
- rmm_log_directory str or None, default None
  Directory to which to write per-worker RMM log files. The client and scheduler are not logged here. Can be a string (like `"/path/to/logs/"`) or `None` to disable logging.
  Note
  Logging will only be enabled if `rmm_pool_size` is specified or `rmm_managed_memory=True`.
- rmm_track_allocations bool, default False
  If `True`, wraps the memory resource used by each worker with a `rmm.mr.TrackingResourceAdaptor`, which tracks the amount of memory allocated.
  Note
  This option enables additional diagnostics to be collected and reported by the Dask dashboard. However, there is significant overhead associated with this, and it should only be used for debugging and memory profiling.
- log_spilling bool, default False
  Enable logging of spilling operations directly to `distributed.Worker` with an `INFO` log level.
- pre_import str, list or None, default None
  Pre-import libraries as a Worker plugin to prevent long import times bleeding through later Dask operations. Can be a comma-separated string of names (like `"cudf,rmm"`) or a list of strings (like `["cudf", "rmm"]`).
- Raises:
- TypeError
  If InfiniBand or NVLink are enabled and `protocol != "ucx"`.
- ValueError
  If an RMM pool, RMM managed memory, or the RMM asynchronous allocator is requested but RMM cannot be imported; if RMM managed memory and the asynchronous allocator are both enabled; if an RMM maximum pool size is set but an RMM pool size is not; if an RMM maximum pool size is set together with the RMM asynchronous allocator; or if an RMM release threshold is set but the RMM asynchronous allocator is not being used.
See also
LocalCluster
Examples
>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client
>>> cluster = LocalCUDACluster()
>>> client = Client(cluster)
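To illustrate the per-worker GPU assignment described above, here is a small, self-contained sketch of the idea behind distinct `CUDA_VISIBLE_DEVICES` values per worker. This is an illustration only, not Dask-CUDA's actual implementation: the function name `visible_devices_for_worker` is hypothetical, and the rotation scheme (worker *i* sees device *i* first) is an assumption about the assignment strategy.

```python
def visible_devices_for_worker(worker_index, devices):
    """Rotate the device list so each worker sees a different first GPU.

    Frameworks that honor CUDA_VISIBLE_DEVICES treat the first entry as
    the default device, so rotating the list gives each worker its own
    primary GPU while keeping the others visible for peer transfers.
    """
    n = len(devices)
    rotated = [devices[(worker_index + i) % n] for i in range(n)]
    return ",".join(str(d) for d in rotated)

# Four workers over GPUs 0-3: each gets a different primary device.
assignments = [visible_devices_for_worker(i, [0, 1, 2, 3]) for i in range(4)]
```

With four GPUs, worker 0 would receive `"0,1,2,3"` and worker 2 would receive `"2,3,0,1"`, so each worker defaults to a distinct GPU.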
CLI
Worker
dask cuda worker
Launch a distributed worker with GPUs attached to an existing scheduler.
A scheduler can be specified either through a URI passed through the SCHEDULER
argument or a scheduler file passed through the --scheduler-file option.
See https://docs.rapids.ai/api/dask-cuda/stable/quickstart.html#dask-cuda-worker for info.
Usage
dask cuda worker [OPTIONS] [SCHEDULER] [PRELOAD_ARGV]...
Options
- --host <host>
  IP address of the serving host; should be visible to the scheduler and other workers. Can be a string (like `"127.0.0.1"`) or `None` to fall back on the address of the interface specified by `--interface` or the default interface. See `--listen-address` and `--contact-address` if you need different listen and contact addresses.
- --worker-port <worker_port>
  Serving computation port; defaults to random. When using multiple GPUs, a sequential range of ports may be specified like `<first-port>:<last-port>`. For example, `--worker-port=9000:9003` will use ports 9000, 9001, 9002, and 9003 across 4 GPUs.
- --listen-address <listen_address>
  The address to which the worker binds. Example: `tcp://0.0.0.0:9000`, or `tcp://:9000` for IPv4 and IPv6.
- --contact-address <contact_address>
  The address the worker advertises to the scheduler for communication with it and other workers. Example: `tcp://127.0.0.1:9000`
- --nthreads <nthreads>
Number of threads to be used for each Dask worker process.
- Default:
1
- --name <name>
  A unique name for the worker. Can be a string (like `"worker-1"`) or `None` for a nameless worker.
- --memory-limit <memory_limit>
  Size of the host LRU cache, which is used to determine when the worker starts spilling to disk. Can be an integer (bytes), a float (fraction of total system memory), a string (like `"5GB"` or `"5000M"`), or `"auto"` or `0` for no memory management.
- Default:
  'auto'
- --device-memory-limit <device_memory_limit>
  Size of the CUDA device LRU cache, which is used to determine when the worker starts spilling to host memory. Can be an integer (bytes), a float (fraction of total device memory), a string (like `"5GB"` or `"5000M"`), or `"auto"` or `0` to disable spilling to host (i.e., allow full device memory usage). The special value `"default"` (the default) applies Dask-CUDA's recommended settings: 80% of the total device memory (equivalent to `0.8`), or disabled spilling (equivalent to `"auto"`/`0`) on devices without a dedicated memory resource, such as system-on-a-chip (SoC) devices.
- Default:
  'default'
- --enable-cudf-spill, --disable-cudf-spill
Enable automatic cuDF spilling.
- Default:
False
- --cudf-spill-stats <cudf_spill_stats>
  Set the cuDF spilling statistics level. This option has no effect if `--enable-cudf-spill` is not specified.
- --rmm-pool-size <rmm_pool_size>
  RMM pool size with which to initialize each worker. Can be an integer (bytes), a float (fraction of total device memory), a string (like `"5GB"` or `"5000M"`), or `None` to disable RMM pools.
  Note
  This size is a per-worker configuration, and not cluster-wide.
- --rmm-maximum-pool-size <rmm_maximum_pool_size>
  The maximum pool size, when using an RMM pool memory resource. Can be an integer (bytes), a float (fraction of total device memory), a string (like `"5GB"` or `"5000M"`), or `None`. By default, the total available memory on the GPU is used.
  Note
  When paired with `--rmm-async`, the maximum size cannot be guaranteed due to fragmentation.
  Note
  This size is a per-worker configuration, and not cluster-wide.
- --rmm-managed-memory, --no-rmm-managed-memory
  Initialize each worker with RMM and set it to use managed memory. If disabled, RMM may still be used by specifying `--rmm-pool-size`.
  Warning
  Managed memory is currently incompatible with NVLink. Trying to enable both will result in failure.
- Default:
  False
- --rmm-async, --no-rmm-async
  Initialize each worker with RMM and set it to use RMM's asynchronous allocator. See `rmm.mr.CudaAsyncMemoryResource` for more info.
  Warning
  The asynchronous allocator is incompatible with RMM pools and managed memory; trying to enable both will result in failure.
- Default:
  False
- --set-rmm-allocator-for-libs <rmm_allocator_external_lib_list>
Set RMM as the allocator for external libraries. Provide a comma-separated list of libraries to set, e.g., “torch,cupy”.
- Options:
cupy | torch
- --rmm-release-threshold <rmm_release_threshold>
  When `--rmm-async` is enabled and the pool size grows beyond this value, unused memory held by the pool will be released at the next synchronization point. Can be an integer (bytes), a float (fraction of total device memory), a string (like `"5GB"` or `"5000M"`), or `None`. By default, this feature is disabled.
  Note
  This size is a per-worker configuration, and not cluster-wide.
- --rmm-log-directory <rmm_log_directory>
  Directory to which to write per-worker RMM log files. The client and scheduler are not logged here. Can be a string (like `"/path/to/logs/"`) or `None` to disable logging.
  Note
  Logging will only be enabled if `--rmm-pool-size` or `--rmm-managed-memory` is specified.
- --rmm-track-allocations, --no-rmm-track-allocations
  Track memory allocations made by RMM. If enabled, wraps the memory resource of each worker with a `rmm.mr.TrackingResourceAdaptor`, which allows querying the amount of memory allocated by RMM.
- Default:
  False
- --pid-file <pid_file>
File to write the process PID.
- --resources <resources>
  Resources for task constraints, like `"GPU=2 MEM=10e9"`.
- --dashboard, --no-dashboard
Launch the dashboard.
- Default:
True
- --dashboard-address <dashboard_address>
Relative address to serve the dashboard (if enabled).
- Default:
':0'
- --local-directory <local_directory>
  Path on the local machine to store temporary files. Can be a string (like `"path/to/files"`) or `None` to fall back on the value of `dask.temporary-directory` in the local Dask configuration, using the current working directory if this is not set.
- --scheduler-file <scheduler_file>
  Filename of JSON-encoded scheduler information. To be used in conjunction with the equivalent `dask scheduler` option.
- --protocol <protocol>
  Protocol like `tcp`, `tls`, or `ucx`.
- --interface <interface>
  External interface used to connect to the scheduler. Usually an Ethernet interface is used for connection, and not an InfiniBand interface (if one is available). Can be a string (like `"eth0"` for Ethernet or `"ib0"` for InfiniBand) or `None` to fall back on the default interface.
- --preload <preload>
  Module that should be loaded by each worker process, like `"foo.bar"` or `"/path/to/foo.py"`.
- --death-timeout <death_timeout>
Seconds to wait for a scheduler before closing
- --dashboard-prefix <dashboard_prefix>
  Prefix for the dashboard. Can be a string (like …) or `None` for no prefix.
- --tls-ca-file <tls_ca_file>
  CA certificate(s) file for TLS (in PEM format). Can be a string (like `"path/to/certs"`) or `None` for no certificate(s).
- --tls-cert <tls_cert>
  Certificate file for TLS (in PEM format). Can be a string (like `"path/to/certs"`) or `None` for no certificate(s).
- --tls-key <tls_key>
  Private key file for TLS (in PEM format). Can be a string (like `"path/to/certs"`) or `None` for no private key.
- --enable-tcp-over-ucx, --disable-tcp-over-ucx
Set environment variables to enable TCP over UCX, even if InfiniBand and NVLink are not supported or disabled.
- --enable-infiniband, --disable-infiniband
  Set environment variables to enable UCX over InfiniBand; implies `--enable-tcp-over-ucx` when enabled.
- --enable-nvlink, --disable-nvlink
  Set environment variables to enable UCX over NVLink; implies `--enable-tcp-over-ucx` when enabled.
- --enable-rdmacm, --disable-rdmacm
  Set environment variables to enable UCX RDMA connection manager support; requires `--enable-infiniband`.
- --worker-class <worker_class>
  Use a different class than Distributed's default (`distributed.Worker`) to spawn `distributed.Nanny`.
- --pre-import <pre_import>
  Pre-import libraries as a Worker plugin to prevent long import times bleeding through later Dask operations. Should be a comma-separated string of names, such as `"cudf,rmm"`.
- --multiprocessing-method <multiprocessing_method>
Method used to start new processes with multiprocessing
- Options:
spawn | fork | forkserver
Arguments
- SCHEDULER
Optional argument
- PRELOAD_ARGV
Optional argument(s)
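As a concrete illustration of the options above, the following hypothetical invocation starts one worker per visible GPU against an existing scheduler; the scheduler address and memory sizes are placeholders, not recommendations:

```shell
# Start GPU workers connected to a scheduler at a placeholder address,
# each with a 10 GB RMM pool, spilling device memory to host past 30 GB,
# and with automatic cuDF spilling enabled.
dask cuda worker tcp://scheduler-host:8786 \
    --rmm-pool-size 10GB \
    --device-memory-limit 30GB \
    --enable-cudf-spill
```

The same worker could instead locate the scheduler through a shared file by replacing the address with `--scheduler-file /path/to/scheduler.json`.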
Cluster configuration
dask cuda config
Query an existing GPU cluster’s configuration.
A cluster can be specified either through a URI passed through the SCHEDULER
argument or a scheduler file passed through the --scheduler-file option.
Usage
dask cuda config [OPTIONS] [SCHEDULER] [PRELOAD_ARGV]...
Options
- --scheduler-file <scheduler_file>
  Filename of JSON-encoded scheduler information. To be used in conjunction with the equivalent `dask scheduler` option.
- --tls-ca-file <tls_ca_file>
  CA certificate(s) file for TLS (in PEM format). Can be a string (like `"path/to/certs"`) or `None` for no certificate(s).
- --tls-cert <tls_cert>
  Certificate file for TLS (in PEM format). Can be a string (like `"path/to/certs"`) or `None` for no certificate(s).
- --tls-key <tls_key>
  Private key file for TLS (in PEM format). Can be a string (like `"path/to/certs"`) or `None` for no private key.
Arguments
- SCHEDULER
Optional argument
- PRELOAD_ARGV
Optional argument(s)
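For example, the cluster's configuration could be queried either by scheduler address or through a scheduler file; both values below are placeholders:

```shell
# Query a GPU cluster's configuration via the scheduler's address...
dask cuda config tcp://scheduler-host:8786

# ...or via a scheduler file written by `dask scheduler`.
dask cuda config --scheduler-file /path/to/scheduler.json
```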
Client initialization
- dask_cuda.initialize.initialize(create_cuda_context=True, enable_tcp_over_ucx=None, enable_infiniband=None, enable_nvlink=None, enable_rdmacm=None)[source]
Create CUDA context and initialize UCXX configuration.
Sometimes it is convenient to initialize the CUDA context, particularly before starting up Dask worker processes which create a variety of threads.
To ensure UCX works correctly, it is important that it is initialized with the correct options. This is especially important for the client, which cannot be configured to use UCX through arguments the way `LocalCUDACluster` and `dask cuda worker` can. This function ensures the client is provided a UCX configuration based on the flags and options passed by the user.
This function can also be used within a worker preload script for UCX configuration of mainline Dask.distributed: https://docs.dask.org/en/latest/setup/custom-startup.html
You can add it to your global config with the following YAML:
distributed:
  worker:
    preload:
      - dask_cuda.initialize
See https://docs.dask.org/en/latest/configuration.html for more information about Dask configuration.
- Parameters:
- create_cuda_context bool, default True
  Create a CUDA context on initialization.
- enable_tcp_over_ucx bool, default None
  Set environment variables to enable TCP over UCX, even if InfiniBand and NVLink are not supported or disabled.
- enable_infiniband bool, default None
  Set environment variables to enable UCX over InfiniBand; implies `enable_tcp_over_ucx=True` when `True`.
- enable_nvlink bool, default None
  Set environment variables to enable UCX over NVLink; implies `enable_tcp_over_ucx=True` when `True`.
- enable_rdmacm bool, default None
  Set environment variables to enable UCX RDMA connection manager support; requires `enable_infiniband=True`.