Multi-GPU Nearest Neighbors#

Multi-GPU support in cuVS enables scaling ANN (Approximate Nearest Neighbors) algorithms across multiple GPUs on a single node, providing improved performance and the ability to handle larger datasets.

Overview#

The multi-GPU implementations extend the single-GPU algorithms to work across multiple GPUs using two main distribution strategies:

  • Replicated Mode: The entire index is replicated across all GPUs. This mode provides higher query throughput by distributing queries across GPUs while maintaining the full index on each GPU.

  • Sharded Mode: The index is partitioned (sharded) across GPUs. This mode allows handling larger datasets that don’t fit on a single GPU by distributing the data across multiple GPUs.

Important Notes#

Warning

Memory Requirements: Multi-GPU algorithms require all data to be in host memory (CPU). This is different from single-GPU algorithms that typically work with device memory.
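In practice this means datasets and queries should be passed as host arrays such as NumPy ndarrays. The following minimal sketch illustrates this; the commented-out CuPy calls are an illustrative assumption for data that currently lives on a single GPU and are not part of the multi-GPU API.

import numpy as np

# Host (CPU) data, e.g. a NumPy ndarray, is what the multi-GPU builders expect
host_dataset = np.random.random_sample((10000, 128)).astype(np.float32)

# If the data currently lives on a single GPU (for example as a CuPy array),
# copy it back to host memory first. The CuPy calls below are illustrative
# only and are not part of the cuVS multi-GPU API.
# import cupy as cp
# device_dataset = cp.asarray(host_dataset)
# host_dataset = cp.asnumpy(device_dataset)  # device -> host copy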

Note

Supported Algorithms: Currently, multi-GPU support is available for:

  • CAGRA (Graph-based ANN)

  • IVF-Flat (Inverted File with Flat storage)

  • IVF-PQ (Inverted File with Product Quantization)
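Each algorithm is exposed through its own multi-GPU module. The basic usage example below imports mg_cagra; the commented-out module names for IVF-Flat and IVF-PQ are assumptions that mirror that naming pattern, so check the algorithm-specific pages for the exact import paths.

# Multi-GPU CAGRA, as used in the basic example below
from cuvs.neighbors import mg_cagra

# IVF-Flat and IVF-PQ are also supported; the module names below are assumed
# to follow the same mg_* naming pattern -- check the algorithm-specific
# documentation for the exact import paths.
# from cuvs.neighbors import mg_ivf_flat, mg_ivf_pq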

Configuration Options#

Distribution Modes#

  • Replicated Mode

    In replicated mode, the complete index is stored on each GPU. This approach:

    • Maximizes query throughput by processing queries in parallel across all GPUs

    • Requires each GPU to have enough memory to store the entire index

    • Is ideal for scenarios where query throughput is more important than index size limitations

  • Sharded Mode

    In sharded mode, the index is distributed across GPUs. This approach:

    • Enables handling of larger datasets by partitioning across GPUs

    • Requires coordination between GPUs during search operations

    • Is ideal for scenarios where the dataset is too large for a single GPU
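The distribution mode is chosen at build time through the index parameters, as in the usage example below. A minimal sketch for both modes follows; the "sharded" value is taken from that example, while the "replicated" string is an assumption mirroring the mode name above.

from cuvs.neighbors import mg_cagra

# Sharded: partition the index across GPUs
# (this value appears in the usage example below)
sharded_params = mg_cagra.IndexParams(
    distribution_mode="sharded",
    metric="sqeuclidean"
)

# Replicated: keep a full copy of the index on every GPU. The "replicated"
# string is an assumption mirroring the mode name above.
replicated_params = mg_cagra.IndexParams(
    distribution_mode="replicated",
    metric="sqeuclidean"
)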

Search Modes#

  • Load Balancer

    Divides each query across multiple GPUs, distributing workload efficiently to maximize performance and throughput.

  • Round Robin

    Distributes queries evenly across GPUs in a rotating sequence, ensuring balanced workload allocation. This mode is best suited for frequent, small-scale search operations (a parameter sketch covering both search and merge modes follows the merge modes below).

Merge Modes#

  • Merge on Root Rank

    Results from all GPUs are collected and merged on the root rank (typically GPU 0).

  • Tree Merge

    Results are merged in a tree-like fashion across GPUs to reduce communication overhead.
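Both the search mode and the merge mode are selected when constructing the search parameters, as in the usage example below. In the sketch that follows, only the "load_balancer" and "merge_on_root_rank" values appear in that example; the "round_robin" and "tree_merge" strings are assumptions mirroring the mode names above.

from cuvs.neighbors import mg_cagra

# Throughput-oriented configuration
# (values taken from the usage example below)
throughput_params = mg_cagra.SearchParams(
    search_mode="load_balancer",
    merge_mode="merge_on_root_rank"
)

# Configuration for frequent, small query batches. The "round_robin" and
# "tree_merge" strings are assumptions mirroring the mode names above.
small_batch_params = mg_cagra.SearchParams(
    search_mode="round_robin",
    merge_mode="tree_merge"
)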

Usage Examples#

Basic Multi-GPU Usage#

import numpy as np
from cuvs.neighbors import mg_cagra

# Create dataset in host memory
n_samples = 100000
n_features = 128
dataset = np.random.random_sample((n_samples, n_features)).astype(np.float32)

# Build multi-GPU index
build_params = mg_cagra.IndexParams(
    distribution_mode="sharded",
    metric="sqeuclidean"
)
index = mg_cagra.build(build_params, dataset)

# Search with multi-GPU
queries = np.random.random_sample((1000, n_features)).astype(np.float32)
search_params = mg_cagra.SearchParams(
    search_mode="load_balancer",
    merge_mode="merge_on_root_rank"
)
distances, neighbors = mg_cagra.search(search_params, index, queries, k=10)

Algorithm-Specific Documentation#