Multi-GPU Nearest Neighbors#

Multi-GPU support in cuVS enables scaling ANN (Approximate Nearest Neighbors) algorithms across multiple GPUs on a single node, providing improved performance and the ability to handle larger datasets.

Overview#

The multi-GPU implementations extend the single-GPU algorithms to work across multiple GPUs using two main distribution strategies:

  • Replicated Mode: The entire index is replicated across all GPUs. This mode provides higher query throughput by distributing queries across GPUs while maintaining the full index on each GPU.

  • Sharded Mode: The index is partitioned (sharded) across GPUs. This mode allows handling larger datasets that don’t fit on a single GPU by distributing the data across multiple GPUs.

Important Notes#

Warning

Memory Requirements: Multi-GPU algorithms require all data to be in host memory (CPU). This is different from single-GPU algorithms that typically work with device memory.
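In practice this means datasets and queries should be passed as host arrays such as NumPy ndarrays. The following minimal sketch illustrates this; the commented-out CuPy calls are an illustrative assumption for data that currently lives on a single GPU and are not part of the multi-GPU API.

import numpy as np

# Host (CPU) data, e.g. a NumPy ndarray, is what the multi-GPU builders expect
host_dataset = np.random.random_sample((10000, 128)).astype(np.float32)

# If the data currently lives on a single GPU (for example as a CuPy array),
# copy it back to host memory first. The CuPy calls below are illustrative
# only and are not part of the cuVS multi-GPU API.
# import cupy as cp
# device_dataset = cp.asarray(host_dataset)
# host_dataset = cp.asnumpy(device_dataset)  # device -> host copy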

Note

Supported Algorithms: Currently, multi-GPU support is available for:

  • CAGRA (Graph-based ANN)

  • IVF-Flat (Inverted File with Flat storage)

  • IVF-PQ (Inverted File with Product Quantization)
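Each algorithm is exposed through its own multi-GPU module. The basic usage example below imports mg_cagra; the commented-out module names for IVF-Flat and IVF-PQ are assumptions that mirror that naming pattern, so check the algorithm-specific pages for the exact import paths.

# Multi-GPU CAGRA, as used in the basic example below
from cuvs.neighbors import mg_cagra

# IVF-Flat and IVF-PQ are also supported; the module names below are assumed
# to follow the same mg_* naming pattern -- check the algorithm-specific
# documentation for the exact import paths.
# from cuvs.neighbors import mg_ivf_flat, mg_ivf_pq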

Configuration Options#

Distribution Modes#

  • Replicated Mode

    In replicated mode, the complete index is stored on each GPU. This approach:

    • Maximizes query throughput by processing queries in parallel across all GPUs

    • Requires each GPU to have enough memory to store the entire index

    • Is ideal for scenarios where query throughput is more important than index size limitations

  • Sharded Mode

    In sharded mode, the index is distributed across GPUs. This approach:

    • Enables handling of larger datasets by partitioning across GPUs

    • Requires coordination between GPUs during search operations

    • Is ideal for scenarios where the dataset is too large for a single GPU
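The distribution mode is chosen at build time through the index parameters, as in the usage example below. A minimal sketch for both modes follows; the "sharded" value is taken from that example, while the "replicated" string is an assumption mirroring the mode name above.

from cuvs.neighbors import mg_cagra

# Sharded: partition the index across GPUs
# (this value appears in the usage example below)
sharded_params = mg_cagra.IndexParams(
    distribution_mode="sharded",
    metric="sqeuclidean"
)

# Replicated: keep a full copy of the index on every GPU. The "replicated"
# string is an assumption mirroring the mode name above.
replicated_params = mg_cagra.IndexParams(
    distribution_mode="replicated",
    metric="sqeuclidean"
)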

Search Modes#

  • Load Balancer

    Divides each query across multiple GPUs, distributing workload efficiently to maximize performance and throughput.

  • Round Robin

    Distributes queries evenly across GPUs in a rotating sequence, ensuring balanced workload allocation. This mode is best suited for frequent, small-scale search operations (a parameter sketch covering both search and merge modes follows the merge modes below).

Merge Modes#

  • Merge on Root Rank

    Results from all GPUs are collected and merged on the root rank (typically GPU 0).

  • Tree Merge

    Results are merged in a tree-like fashion across GPUs to reduce communication overhead.
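Both the search mode and the merge mode are selected when constructing the search parameters, as in the usage example below. In the sketch that follows, only the "load_balancer" and "merge_on_root_rank" values appear in that example; the "round_robin" and "tree_merge" strings are assumptions mirroring the mode names above.

from cuvs.neighbors import mg_cagra

# Throughput-oriented configuration
# (values taken from the usage example below)
throughput_params = mg_cagra.SearchParams(
    search_mode="load_balancer",
    merge_mode="merge_on_root_rank"
)

# Configuration for frequent, small query batches. The "round_robin" and
# "tree_merge" strings are assumptions mirroring the mode names above.
small_batch_params = mg_cagra.SearchParams(
    search_mode="round_robin",
    merge_mode="tree_merge"
)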

Usage Examples#

Basic Multi-GPU Usage#

import numpy as np
from cuvs.neighbors import mg_cagra

# Create dataset in host memory
n_samples = 100000
n_features = 128
dataset = np.random.random_sample((n_samples, n_features)).astype(np.float32)

# Build multi-GPU index
build_params = mg_cagra.IndexParams(
    distribution_mode="sharded",
    metric="sqeuclidean"
)
index = mg_cagra.build(build_params, dataset)

# Search with multi-GPU
queries = np.random.random_sample((1000, n_features)).astype(np.float32)
search_params = mg_cagra.SearchParams(
    search_mode="load_balancer",
    merge_mode="merge_on_root_rank"
)
distances, neighbors = mg_cagra.search(search_params, index, queries, k=10)

Algorithm-Specific Documentation#