User Guide
Achieving optimal performance in GPU-centric workflows frequently requires customizing how GPU ("device") memory is allocated.
RMM is a package that enables you to allocate device memory in a highly configurable way. For example, it enables you to allocate and use pools of GPU memory, or to use managed memory for allocations.
You can also easily configure other libraries like Numba and CuPy to use RMM for allocating device memory.
Installation
See the project README for how to install RMM.
Using RMM
There are two ways to use RMM in Python code:
1. Using the rmm.DeviceBuffer API to explicitly create and manage device memory allocations
2. Transparently via external libraries such as CuPy and Numba
RMM provides a MemoryResource abstraction to control how device
memory is allocated in both the above uses.
DeviceBuffer Objects
A DeviceBuffer represents an untyped, uninitialized device memory
allocation. DeviceBuffers can be created by providing the
size of the allocation in bytes:
>>> import rmm
>>> buf = rmm.DeviceBuffer(size=100)
The size of the allocation and the memory address associated with it
can be accessed via the .size and .ptr attributes respectively:
>>> buf.size
100
>>> buf.ptr
140202544726016
DeviceBuffers can also be created by copying data from host memory:
>>> import rmm
>>> import numpy as np
>>> a = np.array([1, 2, 3], dtype='float64')
>>> buf = rmm.DeviceBuffer.to_device(a.view("uint8")) # to_device expects an unsigned 8-bit dtype
>>> buf.size
24
Conversely, the data underlying a DeviceBuffer can be copied to the host:
>>> np.frombuffer(buf.tobytes())
array([1., 2., 3.])
Prefetching a DeviceBuffer
CUDA Unified Memory, also known as managed memory, can be allocated using an
rmm.mr.ManagedMemoryResource explicitly, or by calling rmm.reinitialize
with managed_memory=True.
A DeviceBuffer backed by managed memory or other
migratable memory (such as
HMM/ATS
memory) may be prefetched to a specified device, for example to reduce or eliminate page faults.
>>> import rmm
>>> rmm.reinitialize(managed_memory=True)
>>> buf = rmm.DeviceBuffer(size=100)
>>> buf.prefetch()
The above example prefetches the DeviceBuffer memory to the current CUDA device
on the stream that the DeviceBuffer last used (e.g. at construction). The
destination device ID and stream are optional parameters.
>>> import rmm
>>> rmm.reinitialize(managed_memory=True)
>>> from rmm.pylibrmm.stream import Stream
>>> stream = Stream()
>>> buf = rmm.DeviceBuffer(size=100, stream=stream)
>>> buf.prefetch(device=3, stream=stream) # prefetch to device on stream.
DeviceBuffer.prefetch() is a no-op if the DeviceBuffer is not backed
by migratable memory.
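For example, with the default (non-managed) resource the call simply does nothing; a minimal sketch:
>>> import rmm
>>> rmm.reinitialize()  # default resource allocates with cudaMalloc (not migratable)
>>> buf = rmm.DeviceBuffer(size=100)
>>> buf.prefetch()  # no-op: the underlying memory cannot be migrated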
rmm.pylibrmm.stream.Stream implements the CUDA Stream Protocol, so it can be used with
cuda.core.
>>> from cuda.core.experimental import Device
>>> import rmm.pylibrmm.stream
>>> device = Device()
>>> device.set_current()
>>> rmm_stream = rmm.pylibrmm.stream.Stream()
>>> cuda_stream = device.create_stream(rmm_stream)
MemoryResource objects
MemoryResource objects are used to configure how device memory allocations are made by
RMM.
By default, if a MemoryResource is not set explicitly, RMM uses the CudaMemoryResource, which
uses cudaMalloc for allocating device memory.
rmm.reinitialize() provides an easy way to initialize RMM with specific memory resource options
across multiple devices. See help(rmm.reinitialize) for full details.
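For example, here is a minimal sketch (keyword arguments as documented by help(rmm.reinitialize)) that enables a memory pool on devices 0 and 1:
>>> import rmm
>>> rmm.reinitialize(
...     pool_allocator=True,      # use a pool memory resource
...     initial_pool_size=2**30,  # 1 GiB
...     devices=[0, 1],           # configure both devices
... )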
For lower-level control, the rmm.mr.set_current_device_resource() function can be
used to set a different MemoryResource for the current CUDA device. For
example, enabling the ManagedMemoryResource tells RMM to use
cudaMallocManaged instead of cudaMalloc for allocating memory:
>>> import rmm
>>> rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
:warning: The default resource must be set for any device before allocating any device memory on that device. Setting or changing the resource after device allocations have been made can lead to unexpected behaviour or crashes.
As another example, PoolMemoryResource allows you to allocate a
large “pool” of device memory up-front. Subsequent allocations will
draw from this pool of already allocated memory. The example
below shows how to construct a PoolMemoryResource with an initial size
of 1 GiB and a maximum size of 4 GiB. The pool uses
CudaMemoryResource as its underlying ("upstream") memory resource:
>>> import rmm
>>> pool = rmm.mr.PoolMemoryResource(
... rmm.mr.CudaMemoryResource(),
... initial_pool_size="1GiB", # equivalent to initial_pool_size=2**30
... maximum_pool_size="4GiB"
... )
>>> rmm.mr.set_current_device_resource(pool)
Similarly, to use a pool of managed memory:
>>> import rmm
>>> pool = rmm.mr.PoolMemoryResource(
... rmm.mr.ManagedMemoryResource(),
... initial_pool_size="1GiB",
... maximum_pool_size="4GiB"
... )
>>> rmm.mr.set_current_device_resource(pool)
Other MemoryResources include:
- FixedSizeMemoryResource for allocating fixed blocks of memory
- BinningMemoryResource for allocating blocks within specified "bin" sizes from different memory resources
MemoryResources are highly configurable and can be composed together in different ways.
See help(rmm.mr) for more information.
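As an illustrative sketch of composition (constructor arguments assumed; check help(rmm.mr.BinningMemoryResource) for your version), a BinningMemoryResource can be layered on top of a CudaMemoryResource so that small allocations are served from fixed-size bins while larger ones fall through to the upstream resource:
>>> import rmm
>>> upstream = rmm.mr.CudaMemoryResource()
>>> # Bins for power-of-two sizes from 2**9 (512 B) up to 2**18 (256 KiB)
>>> binning = rmm.mr.BinningMemoryResource(upstream, 9, 18)
>>> rmm.mr.set_current_device_resource(binning)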
Using RMM with third-party libraries
A number of libraries provide hooks to control their device
allocations. RMM provides implementations of these for
CuPy,
Numba, and PyTorch in the
rmm.allocators submodule. All these approaches configure the library
to use the current RMM memory resource for device
allocations.
Using RMM with CuPy
You can configure CuPy to use RMM for memory
allocations by setting the CuPy CUDA allocator to
rmm.allocators.cupy.rmm_cupy_allocator:
>>> from rmm.allocators.cupy import rmm_cupy_allocator
>>> import cupy
>>> cupy.cuda.set_allocator(rmm_cupy_allocator)
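CuPy allocations made after this point are served by the current RMM memory resource; for example, continuing the snippet above:
>>> a = cupy.arange(100)  # device memory for this array comes from RMM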
Using RMM with Numba
You can configure Numba to use RMM for memory allocations using the Numba EMM Plugin.
This can be done in two ways:
1. Setting the environment variable NUMBA_CUDA_MEMORY_MANAGER:
$ NUMBA_CUDA_MEMORY_MANAGER=rmm.allocators.numba python (args)
2. Using the set_memory_manager() function provided by Numba:
>>> from numba import cuda
>>> from rmm.allocators.numba import RMMNumbaManager
>>> cuda.set_memory_manager(RMMNumbaManager)
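Either way, Numba device allocations then come from the current RMM memory resource; for example, continuing the snippet above:
>>> import numpy as np
>>> d_arr = cuda.to_device(np.arange(10))  # allocated via the current RMM resource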
Using RMM with PyTorch
You can configure PyTorch to use RMM for memory allocations by configuring the current allocator:
>>> from rmm.allocators.torch import rmm_torch_allocator
>>> import torch
>>> torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
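Note that PyTorch requires the allocator to be changed before any CUDA memory has been allocated in the process. Once set, subsequent GPU tensor allocations are served by RMM; for example, continuing the snippet above:
>>> t = torch.zeros(100, device="cuda")  # tensor storage allocated via RMM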
Memory statistics and profiling
RMM can profile memory usage and track memory statistics by using either of the following:
- Use the context manager rmm.statistics.statistics() to enable statistics tracking for a specific code block.
- Call rmm.statistics.enable_statistics() to enable statistics tracking globally.
Both approaches modify the currently active RMM memory resource. The current device resource is wrapped with a StatisticsResourceAdaptor, which must remain the topmost resource throughout the statistics tracking:
>>> import rmm
>>> import rmm.statistics
>>> # We start with the default cuda memory resource
>>> rmm.mr.get_current_device_resource()
<rmm.pylibrmm.memory_resource.CudaMemoryResource object at 0x7fa0da48a8e0>
>>> # When using statistics, we get a StatisticsResourceAdaptor with the context
>>> with rmm.statistics.statistics():
... rmm.mr.get_current_device_resource()
<rmm.pylibrmm.memory_resource.StatisticsResourceAdaptor object at 0x7fa0dd6e4a40>
>>> # We can also enable statistics globally
>>> rmm.statistics.enable_statistics()
>>> print(rmm.mr.get_current_device_resource())
<rmm.pylibrmm.memory_resource.StatisticsResourceAdaptor object at 0x7f9a11340a40>
With statistics enabled, you can query statistics of the current and peak bytes and number of allocations performed by the current RMM memory resource:
>>> buf = rmm.DeviceBuffer(size=10)
>>> rmm.statistics.get_statistics()
Statistics(current_bytes=16, current_count=1, peak_bytes=16, peak_count=1, total_bytes=16, total_count=1)
Memory Profiler
To profile a specific block of code, first enable memory statistics by calling rmm.statistics.enable_statistics(). To profile a function, use rmm.statistics.profiler as a function decorator:
>>> @rmm.statistics.profiler()
... def f(size):
... rmm.DeviceBuffer(size=size)
>>> f(1000)
>>> # By default, the profiler writes to rmm.statistics.default_profiler_records
>>> print(rmm.statistics.default_profiler_records.report())
Memory Profiling
================
Legends:
ncalls - number of times the function or code block was called
memory_peak - peak memory allocated in function or code block (in bytes)
memory_total - total memory allocated in function or code block (in bytes)
Ordered by: memory_peak
ncalls memory_peak memory_total filename:lineno(function)
1 1,008 1,008 <ipython-input-11-5fc63161ac29>:1(f)
To profile a code block, use profiler as a context manager:
>>> with rmm.statistics.profiler(name="my code block"):
... rmm.DeviceBuffer(size=20)
>>> print(rmm.statistics.default_profiler_records.report())
Memory Profiling
================
Legends:
ncalls - number of times the function or code block was called
memory_peak - peak memory allocated in function or code block (in bytes)
memory_total - total memory allocated in function or code block (in bytes)
Ordered by: memory_peak
ncalls memory_peak memory_total filename:lineno(function)
1 1,008 1,008 <ipython-input-11-5fc63161ac29>:1(f)
1 32 32 my code block
The profiler supports nesting:
>>> with rmm.statistics.profiler(name="outer"):
... buf1 = rmm.DeviceBuffer(size=10)
... with rmm.statistics.profiler(name="inner"):
... buf2 = rmm.DeviceBuffer(size=10)
>>> print(rmm.statistics.default_profiler_records.report())
Memory Profiling
================
Legends:
ncalls - number of times the function or code block was called
memory_peak - peak memory allocated in function or code block (in bytes)
memory_total - total memory allocated in function or code block (in bytes)
Ordered by: memory_peak
ncalls memory_peak memory_total filename:lineno(function)
1 1,008 1,008 <ipython-input-4-865fbe04e29f>:1(f)
1 32 32 my code block
1 32 32 outer
1 16 16 inner