rapidsmpf::rrun Namespace Reference

Classes

struct  bind_options
 Options controlling which topology-based resource bindings to apply.

struct  resource_binding
 Live resource binding configuration collected from the running process.

struct  expected_binding
 Expected resource binding derived from topology information.

struct  binding_validation
 Results of validating actual vs. expected resource bindings.

class  ScopedEnvVar
 RAII guard that saves, optionally modifies, and restores an environment variable.
 

Functions

resource_binding check_binding (int gpu_id_hint=-1)
 Collect the live resource binding of the calling process.

std::optional< expected_binding > get_expected_binding (cucascade::memory::system_topology_info const &topology, int gpu_id)
 Obtain the expected binding for a GPU from pre-discovered topology.

binding_validation validate_binding (resource_binding const &actual, expected_binding const &expected)
 Validate an actual resource binding against an expected one.

void bind (std::optional< unsigned int > gpu_id=std::nullopt, bind_options const &options={})
 Bind the calling process to resources topologically close to a GPU.

void bind (cucascade::memory::system_topology_info const &topology, std::optional< unsigned int > gpu_id=std::nullopt, bind_options const &options={})
 Bind using pre-discovered topology information.
 

Detailed Description

SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-License-Identifier: Apache-2.0

Function Documentation

◆ bind() [1/2]

void rapidsmpf::rrun::bind ( cucascade::memory::system_topology_info const &  topology,
std::optional< unsigned int >  gpu_id = std::nullopt,
bind_options const &  options = {} 
)

Bind using pre-discovered topology information.

Same as the other overload, but skips the topology discovery step by reusing a previously obtained system_topology_info. Useful when the caller has already performed discovery (e.g., in a parent process before forking).

GPU resolution follows the same order as the other overload (explicit gpu_id, then CUDA_VISIBLE_DEVICES).

Warning
This function is not thread-safe. It mutates process-wide state (CPU affinity, NUMA memory policy, and the UCX_NET_DEVICES environment variable). It should be called exactly once per process, ideally early in initialization and before other threads are spawned.
Parameters
topology  Pre-discovered system topology.
gpu_id  GPU device index to bind for. When std::nullopt, the first GPU in CUDA_VISIBLE_DEVICES is used instead.
options  Controls which resource bindings to apply.
Exceptions
std::runtime_error  if no GPU ID can be determined, the resolved GPU is not found in topology, an enabled binding (CPU affinity, NUMA memory policy, network devices) could not be applied, or post-bind verification detects a mismatch between the requested and actual binding state.

◆ bind() [2/2]

void rapidsmpf::rrun::bind ( std::optional< unsigned int >  gpu_id = std::nullopt,
bind_options const &  options = {} 
)

Bind the calling process to resources topologically close to a GPU.

Discovers the system topology via cucascade::memory::topology_discovery, then applies CPU affinity, NUMA memory binding, and/or network device configuration as requested in options.

This is the self-contained entry point intended for external libraries that do not launch through the rrun CLI.

Warning
This function is not thread-safe. It temporarily modifies the CUDA_VISIBLE_DEVICES environment variable during topology discovery and mutates process-wide state (CPU affinity, NUMA memory policy, and the UCX_NET_DEVICES environment variable). It should be called exactly once per process, ideally early in initialization and before other threads are spawned.
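
The temporary change to CUDA_VISIBLE_DEVICES mentioned in the warning follows a save/modify/restore pattern. The sketch below illustrates that pattern with a hypothetical EnvGuard class. It is not the library's ScopedEnvVar, whose interface is not shown on this page and may differ. It assumes a POSIX system providing setenv and unsetenv.

```cpp
#include <cassert>
#include <cstdlib>
#include <optional>
#include <string>
#include <utility>
#include <stdlib.h>  // setenv, unsetenv (POSIX)

// Hypothetical RAII guard for illustration only; rapidsmpf's ScopedEnvVar
// may expose a different interface.
class EnvGuard {
  public:
    // Save the current value of `name`; if `new_value` is given, set it.
    explicit EnvGuard(std::string name,
                      std::optional<std::string> new_value = std::nullopt)
        : name_{std::move(name)} {
        assert(!name_.empty());
        if (char const* old = std::getenv(name_.c_str())) {
            saved_ = std::string{old};
        }
        if (new_value) {
            ::setenv(name_.c_str(), new_value->c_str(), /*overwrite=*/1);
        }
    }

    // Restore the saved value, or unset the variable if it was not set before.
    ~EnvGuard() {
        if (saved_) {
            ::setenv(name_.c_str(), saved_->c_str(), /*overwrite=*/1);
        } else {
            ::unsetenv(name_.c_str());
        }
    }

    EnvGuard(EnvGuard const&) = delete;
    EnvGuard& operator=(EnvGuard const&) = delete;

  private:
    std::string name_;
    std::optional<std::string> saved_;
};
```

Because the environment is process-wide, any other thread reading the variable while such a guard is alive would observe the temporary value, which is one reason to call bind() before other threads are spawned.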

GPU resolution order:

  1. Use gpu_id if provided.
  2. Otherwise, parse the first entry of the CUDA_VISIBLE_DEVICES environment variable.
  3. If neither is available, throw std::runtime_error.
Parameters
gpu_id  GPU device index (as reported by nvidia-smi) to bind for. When std::nullopt, the first GPU in CUDA_VISIBLE_DEVICES is used instead.
options  Controls which resource bindings to apply.
Exceptions
std::runtime_error  if no GPU ID can be determined, topology discovery fails, the resolved GPU is not found in the discovered topology, an enabled binding (CPU affinity, NUMA memory policy, network devices) could not be applied, or post-bind verification detects a mismatch between the requested and actual binding state.
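
The GPU resolution order described for this overload can be sketched as a standalone function. resolve_gpu_id is a hypothetical name used only for illustration and is not part of rapidsmpf. The sketch also assumes that CUDA_VISIBLE_DEVICES entries are plain integer indices; how the library treats other entry forms (such as UUIDs) is not stated on this page.

```cpp
#include <cassert>
#include <cstdlib>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of rapidsmpf: mirrors only the documented
// resolution order (explicit ID, then CUDA_VISIBLE_DEVICES, then error).
inline unsigned int resolve_gpu_id(std::optional<unsigned int> gpu_id) {
    // 1. An explicit GPU ID always wins.
    if (gpu_id.has_value()) {
        return *gpu_id;
    }
    // 2. Otherwise take the first comma-separated entry of CUDA_VISIBLE_DEVICES.
    if (char const* env = std::getenv("CUDA_VISIBLE_DEVICES")) {
        std::string const value{env};
        std::string const first = value.substr(0, value.find(','));
        // Assumption: accept only short, purely numeric entries so the
        // conversion below cannot overflow or fail.
        bool const numeric =
            !first.empty() && first.size() <= 9 &&
            first.find_first_not_of("0123456789") == std::string::npos;
        if (numeric) {
            return static_cast<unsigned int>(std::stoul(first));
        }
    }
    // 3. Nothing usable was found.
    throw std::runtime_error(
        "no GPU ID: pass gpu_id or set CUDA_VISIBLE_DEVICES");
}
```

With CUDA_VISIBLE_DEVICES set to "3,1", calling resolve_gpu_id(std::nullopt) in this sketch yields 3, while resolve_gpu_id(7u) yields 7 regardless of the environment.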

◆ check_binding()

resource_binding rapidsmpf::rrun::check_binding ( int  gpu_id_hint = -1)

Collect the live resource binding of the calling process.

Queries the current CPU affinity, NUMA memory nodes, UCX network device configuration, process rank, and GPU information. Fields that cannot be determined (e.g. rank when no launcher environment is set, or GPU ID when CUDA_VISIBLE_DEVICES is absent and no hint is given) are left at their default value of -1.

Parameters
gpu_id_hint  GPU device index hint. When >= 0, the value is stored directly; otherwise the GPU ID is read from CUDA_VISIBLE_DEVICES. When a valid GPU ID is available, the PCI bus ID is also queried.
Returns
The collected resource binding.

◆ get_expected_binding()

std::optional<expected_binding> rapidsmpf::rrun::get_expected_binding ( cucascade::memory::system_topology_info const &  topology,
int  gpu_id 
)

Obtain the expected binding for a GPU from pre-discovered topology.

Looks up gpu_id in topology and returns the expected CPU affinity, memory binding, and network devices.

Parameters
topology  Pre-discovered system topology.
gpu_id  GPU device index to look up.
Returns
The expected binding, or std::nullopt if gpu_id is not found.

◆ validate_binding()

binding_validation rapidsmpf::rrun::validate_binding ( resource_binding const &  actual,
expected_binding const &  expected 
)

Validate an actual resource binding against an expected one.

Compares the live actual binding with expected and reports per-resource pass/fail status.

Parameters
actual  Live resource binding (from check_binding()).
expected  Expected binding (from topology or a JSON file).
Returns
Validation results.
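
To make "per-resource pass/fail status" concrete, the sketch below compares two bindings field by field. All types and names here (ActualBinding, ExpectedBinding, ValidationResult, compare_bindings) are hypothetical stand-ins, since the members of resource_binding, expected_binding, and binding_validation are not listed on this page. Exact set equality is also an assumption; the library may apply a different matching rule.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-ins for illustration; the real rapidsmpf structs
// may have different members.
struct ActualBinding {
    std::set<int> cpus;                 // CPU cores the process may run on
    std::set<int> numa_nodes;           // NUMA nodes memory is bound to
    std::set<std::string> net_devices;  // configured network devices
};

struct ExpectedBinding {
    std::set<int> cpus;
    std::set<int> numa_nodes;
    std::set<std::string> net_devices;
};

struct ValidationResult {
    bool cpu_ok;
    bool numa_ok;
    bool net_ok;

    // True only if every resource matched.
    bool all_passed() const { return cpu_ok && numa_ok && net_ok; }
};

// Compare each resource independently so that one mismatch does not hide
// the status of the others. Assumption: a resource passes on exact equality.
inline ValidationResult compare_bindings(ActualBinding const& actual,
                                         ExpectedBinding const& expected) {
    return ValidationResult{actual.cpus == expected.cpus,
                            actual.numa_nodes == expected.numa_nodes,
                            actual.net_devices == expected.net_devices};
}
```

Reporting each resource separately lets a caller see, for example, that CPU affinity is correct while the NUMA memory policy is not, instead of receiving a single combined failure.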