Advanced Topics
===============

Here we cover a few assorted topics that may be of interest to more advanced
use cases.

CUDA Streams and Synchronization
--------------------------------

Functions and methods in cuML are written using a variety of technologies. As
such, while *most* methods run on the CUDA `per-thread default stream`_ (PTDS),
some methods might run on the `legacy default stream`_ (also known as the NULL
stream) instead.

cuML does not currently expose stream selection as part of its public API and
makes no guarantees on whether a particular method runs on the PTDS or legacy
default stream. Likewise, there is no guarantee that the output of a cuML method
or function has been synchronized before returning.

For users, following these guidelines should avoid concurrency issues:

- Device memory input arrays should be either fully computed or currently
  being computed on the PTDS or legacy default stream.

- Device memory output arrays should be operated on using either the PTDS or
  legacy default stream, or the PTDS of the thread that ran the method should
  be synchronized before further access.

- Inputs and outputs using host memory have no restrictions and shouldn't be
  prone to concurrency issues.

Selecting the CUDA Device
-------------------------

All single-GPU cuML methods run on device 0 by default. Setting a device via
the :class:`cupy.cuda.Device` or :class:`cuda.core.Device` APIs is currently not
supported.

To specify a device to run on, we recommend using the ``CUDA_VISIBLE_DEVICES``
(`doc `_) environment variable. For example:

.. code-block:: shell

   CUDA_VISIBLE_DEVICES=2 python myscript.py

cuML does contain a few single-node multi-GPU implementations. When available,
these take a ``device_ids`` parameter to specify which devices to run on. See
the `cuml.manifold.UMAP` docs for an example.

Configuring the Memory Allocator
--------------------------------

Memory allocations in cuML are made using the `RAPIDS Memory Manager`_ (RMM).
We don't do any configuration of RMM on import; allocations are made using the
default memory resource (:class:`rmm.mr.CudaMemoryResource`).

Some applications may run better using an alternative memory resource. A few
common options:

- A good default to try is the :class:`rmm.mr.CudaAsyncMemoryResource`. This is
  a stream-ordered pooling resource, and may be faster for your application.

  .. code-block:: python

     import rmm

     rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

- Users working with large data may want to enable cuML to use `CUDA Unified
  Memory`_ to enable GPU memory oversubscription. To do this, we recommend using
  :class:`rmm.mr.ManagedMemoryResource` wrapped in a
  :class:`rmm.mr.PrefetchResourceAdaptor` to minimize paging overhead.

  .. code-block:: python

     import rmm

     rmm.mr.set_current_device_resource(
         rmm.mr.PrefetchResourceAdaptor(rmm.mr.ManagedMemoryResource())
     )

For more details, see the `RMM documentation`_.

.. _per-thread default stream: https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/asynchronous-execution.html#per-thread-default-stream
.. _legacy default stream: https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/asynchronous-execution.html#legacy-default-stream
.. _Rapids Memory Manager:
.. _RMM documentation: https://docs.rapids.ai/api/rmm/stable/
.. _CUDA Unified Memory: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html
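
As a rough sketch of the ``CUDA_VISIBLE_DEVICES`` recommendation from the
device-selection section, the same restriction can be applied when launching a
script from Python instead of from a shell. The helper names
``env_for_device`` and ``run_on_device`` are hypothetical and are not part of
cuML; the sketch only relies on the Python standard library and assumes the
child script is the one that imports cuML.

.. code-block:: python

   import os
   import subprocess
   import sys


   def env_for_device(device_index, base_env=None):
       """Return a copy of the environment that exposes only one GPU.

       Hypothetical helper, not a cuML API.
       """
       env = dict(os.environ if base_env is None else base_env)
       env["CUDA_VISIBLE_DEVICES"] = str(device_index)
       return env


   def run_on_device(script, device_index):
       """Run ``script`` in a child Python process restricted to one GPU."""
       return subprocess.run(
           [sys.executable, script],
           env=env_for_device(device_index),
           check=True,
       )

Under these assumptions, ``run_on_device("myscript.py", 2)`` should behave
like the shell example: inside the child process only the selected GPU is
visible, and it appears as device 0.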