Vector Search in C++ Tutorial#

Table of Contents#

RAFT has several important algorithms for performing vector search on the GPU and this tutorial walks through the primary vector search APIs from start to finish to provide a reference for quick setup and C++ API usage.

This tutorial assumes RAFT has been installed and/or added to your build so that you are able to compile and run RAFT code. If not done already, please follow the build and install instructions and consider taking a look at the example c++ template project for ready-to-go examples that you can immediately build and start playing with. Also take a look at RAFT’s library of reproducible vector search benchmarks to run benchmarks that compare RAFT against other state-of-the-art nearest neighbors algorithms at scale.

For more information about the various APIs demonstrated in this tutorial, along with comprehensive usage examples of all the APIs offered by RAFT, please refer to the RAFT’s C++ API Documentation.

Step 1: Starting off with RAFT#

CUDA Development?#

If you are reading this tuturial then you probably know about CUDA and its relationship to general-purpose GPU computing (GPGPU). You probably also know about Nvidia GPUs but might not necessarily be familiar with the programming model nor GPU computing. The good news is that extensive knowledge of CUDA and GPUs are not needed in order to get started with or build applications with RAFT. RAFT hides away most of the complexities behind simple single-threaded stateless functions that are inherently asynchronous, meaning the result of a computation isn’t necessarily read to be used when the function executes and control is given back to the user. The functions are, however, allowed to be chained together in a sequence of calls that don’t need to wait for subsequent computations to complete in order to continue execution. In fact, the only time you need to wait for the computation to complete is when you are ready to use the result.

A common structure you will encounter when using RAFT is a raft::device_resources object. This object is a container for important resources for a single GPU that might be needed during computation. If communicating with multiple GPUs, multiple device_resources might be needed, one for each GPU. device_resources contains several methods for managing its state but most commonly, you’ll call the sync_stream() to guarantee all recently submitted computation has completed (as mentioned above.)

A simple example of using raft::device_resources in RAFT:

#include <raft/core/device_resources.hpp>

raft::device_resources res;
// Call a bunch of RAFT functions in sequence...

Host vs Device Memory#

We differentiate between two different types of memory. host memory is your traditional RAM memory that is primarily accessible by applications on the CPU. device memory, on the other hand, is what we call the special memory on the GPU, which is not accessible from the CPU. In order to access host memory from the GPU, it needs to be explicitly copied to the GPU and in order to access device memory by the CPU, it needs to be explicitly copied there. We have several mechanisms available for allocating and managing the lifetime of device memory on the stack so that we don’t need to explicitly allocate and free pointers on the heap. For example, instead of a std::vector for host memory, we can use rmm::device_uvector on the device. The following function will copy an array from host memory to device memory:

#include <raft/util/cudart_utils.hpp>
#include <rmm/device_uvector.hpp>
#include <vector>

raft::device_resources res;

std::vector<int> my_host_vector = {0, 1, 2, 3, 4};
rmm::device_uvector<int> my_device_vector(my_host_vector.size(), res.get_stream());

raft::copy(,, my_host_vector.size(), res.get_stream());

Since a stream is involved in the copy operation above, RAFT functions can be invoked immediately so long as the same device_resources instances is used (or, more specifically, the same main stream from the devices_resources.) As you might notice in the example above, res.get_stream() can be used to extract the main stream from a device_resources instance.

Multi-dimensional data representation#

rmm::device_uvector is a great mechanism for allocating and managing a chunk of device memory. While it’s possible to use a single array to represent objects in higher dimensions like matrices, it lacks the means to pass that information along. For example, in addition to knowing that we have a 2d structure, we would need to know the number of rows, the number of columns, and even whether we read the columns or rows first (referred to as column- or row-major respectively).

For this reason, RAFT relies on the mdspan standard, which was composed specifically for this purpose. To be even more, mdspan itself doesn’t actually allocate or own any data on host or device because it’s just a view over an existing memory on host device. The mdspan simply gives us a way to represent multi-dimensional data so we can pass along the needed metadata to our APIs. Even more powerful is that we can design functions that only accept a matrix of float in device memory that is laid out in row-major format.

The memory-owning counterpart to the mdspan is the mdarray and the mdarray can allocate memory on device or host and carry along with it the metadata about its shape and layout. An mdspan can be produced from an mdarray for invoking RAFT APIs with mdarray.view(). They also follow similar paradigms to the STL, where we represent an immutable mdspan of int using mdspan<const int> instead of const mdspan<int> to ensure it’s the type carried along by the mdspan that’s not allowed to change.

Many RAFT functions require mdspan<const T> to represent immutable input data and there’s no implicit conversion between mdspan<T> and mdspan<const T> we use raft::make_const_mdspan() to alleviate the pain of constructing a new mdspan to invoke these functions.

The following example demonstrates how to create mdarray matrices in both device and host memory, copy one to the other, and create mdspans out of them:

#include <raft/core/device_mdarray.hpp>
#include <raft/core/host_mdarray.hpp>
#include <raft/core/copy.hpp>

raft::device_resources res;

int n_rows = 10;
int n_cols = 10;

auto device_matrix = raft::make_device_matrix<float>(res, n_rows, n_cols);
auto host_matrix = raft::make_host_matrix<float>(res, n_rows, n_cols);

// Set the diagonal to 1
for(int i = 0; i < n_rows; i++) {
    host_matrix(i, i) = 1;

raft::copy(res, device_matrix.view(), host_matrix.view());

Step 2: Generate some data#

Let’s build upon the fundamentals from the prior section and actually invoke some of RAFT’s computational APIs on the device. A good starting point is data generation.

#include <raft/core/device_mdarray.hpp>
#include <raft/random/make_blobs.cuh>

raft::device_resources res;

int n_rows = 10000;
int n_cols = 10000;

auto dataset = raft::make_device_matrix<float, int64_t>(res, n_rows, n_cols);
auto labels = raft::make_device_vector<int64_t, int64_t>(res, n_rows);

raft::random::make_blobs(res, dataset.view(), labels.view());

That’s it. We’ve now generated a random 10kx10k matrix with points that cleanly separate into Gaussian clusters, along with a vector of cluster labels for each of the data points. Notice the cuh extension in the header file include for make_blobs. This signifies to us that this file contains CUDA device functions like kernel code so the CUDA compiler, nvcc is needed in order to compile any code that uses it. Generally, any source files that include headers with a cuh extension use the .cu extension instead of .cpp. The rule here is that cpp source files contain code which can be compiled with a C++ compiler like g++ while cu files require the CUDA compiler.

Since the make_blobs code generates the random dataset on the GPU device, we didn’t need to do any host to device copies in this one. make_blobs is also asynchronous, so if we don’t need to copy and use the data in host memory right away, we can continue calling RAFT functions with the device_resources instance and the data transformations will all be scheduled on the same stream.

Step 3: Using brute-force indexes#

Build brute-force index#

Consider the (10k, 10k) shaped random matrix we generated in the previous step. We want to be able to find the k-nearest neighbors for all points of the matrix, or what we refer to as the all-neighbors graph, which means finding the neighbors of all data points within the same matrix.

#include <raft/neighbors/brute_force.cuh>

raft::device_resources res;

// set number of neighbors to search for
int const k = 64;

auto bfknn_index = raft::neighbors::brute_force::build(res,

Query brute-force index#

// using matrix `dataset` from previous example
auto search = raft::make_const_mdspan(dataset.view());

// Indices and Distances are of dimensions (n, k)
// where n is number of rows in the search matrix
auto reference_indices = raft::make_device_matrix<int, int64_t>(res, search.extent(0), k); // stores index of neighbors
auto reference_distances = raft::make_device_matrix<float, int64_t>(res, search.extent(0), k); // stores distance to neighbors


We have established several things here by building a flat index. Now we know the exact 64 neighbors of all points in the matrix, and this algorithm can be generally useful in several ways:

  1. Creating a baseline to compare against when building an approximate nearest neighbors index.

  2. Directly using the brute-force algorithm when accuracy is more important than speed of computation. Don’t worry, our implementation is still the best in-class and will provide not only significant speedups over other brute force methods, but also be quick relatively when the matrices are small!

Step 4: Using the ANN indexes#

Build a CAGRA index#

Next we’ll train an ANN index. We’ll use our graph-based CAGRA algorithm for this example but the other index types use a very similar pattern.

#include <raft/neighbors/cagra.cuh>

raft::device_resources res;

// use default index parameters
raft::neighbors::cagra::index_params index_params;

auto index = raft::neighbors::cagra::build<float, uint32_t>(res, index_params, raft::make_const_mdspan(dataset.view()));

Query the CAGRA index#

Now that we’ve trained a CAGRA index, we can query it by first allocating our output mdarray objects and passing the trained index model into the search function.

// create output arrays
auto indices = raft::make_device_matrix<uint32_t>(res, n_rows, k);
auto distances = raft::make_device_matrix<float>(res, n_rows, k);

// use default search parameters
raft::neighbors::cagra::search_params search_params;

// search K nearest neighbors
raft::neighbors::cagra::search<float, uint32_t>(
res, search_params, index, search, indices.view(), distances.view());

Step 5: Evaluate neighborhood quality#

In step 3 we built a flat index and queried for exact neighbors while in step 4 we build an ANN index and queried for approximate neighbors. How do you quickly figure out the quality of our approximate neighbors and whether it’s in an acceptable range based on your needs? Just compute the neighborhood_recall which gives a single value in the range [0, 1]. Closer the value to 1, higher the quality of the approximation.

#include <raft/stats/neighborhood_recall.cuh>

raft::device_resources res;

// Assuming matrices as type raft::device_matrix_view and variables as
// indices : approximate neighbor indices
// reference_indices : exact neighbor indices
// distances : approximate neighbor distances
// reference_distances : exact neighbor distances

// We want our `neighborhood_recall` value in host memory
float const recall_scalar = 0.0;
auto recall_value = raft::make_host_scalar(recall_scalar);



Notice we can run invoke the functions for index build and search for both algorithms, one right after the other, because we don’t need to access any outputs from the algorithms in host memory. We will need to synchronize the stream on the raft::device_resources instance before we can read the result of the neighborhood_recall computation, though.

Similar to a Numpy array, when we use a host_scalar, we are really using a multi-dimensional structure that contains only a single dimension, and further a single element. We can use element indexing to access the resulting element directly.

std::cout << recall_value(0) << std::endl;

While it may seem like unnecessary additional work to wrap the result in a host_scalar mdspan, this API choice is made intentionally to support the possibility of also receiving the result as a device_scalar so that it can be used directly on the device for follow-on computations without having to incur the synchronization or transfer cost of bringing the result to host. This pattern becomes even more important when the result is being computed in a loop, such as an iterative solver, and the cost of synchronization and device-to-host (d2h) transfer becomes very expensive.

Advanced features#

The following sections present some advanced features that we have found can be useful for squeezing more utilization out of GPU hardware. As you’ve seen in this tutorial, RAFT provides several very useful tools and building blocks for developing accelerated applications beyond vector search capabilities.


Most of the indexes in raft::neighbors can be serialized to/from streams and files on disk. The index types that support this feature have include files with the naming convention <index_type>_serialize.cuh. The serialization functions are similar across the different index types, with the primary difference being that some index types require a pointer to all the training data for search. Since the original training dataset can be quite large, the serialize() function for these index types includes an argument include_dataset, which allows the user to specify whether the dataset should be included in the serialized form. The index types that allow for this also include a method update_datasets() to allow for the dataset to be re-attached to the index after it is deserialized.

The following example demonstrates serializing and deserializing a CAGRA index to and from a file. For index types that don’t require the training data, you can remove the include_dataset and update_dataset() parts. We will assume the CAGRA index has been built using the code from Step 4 above:

#include <raft/neighbors/cagra.cuh>
#include <raft/neighbors/cagra_serialize.cuh>

using namespace raft::neighbors;

raft::neighbors::cagra::serialize(res, "cagra_serialized.dat", index, false);

auto index_deser = raft::neighbors::cagra::deserialize(res, "cagra_serialized.dat");


As of RAFT 23.10, support for pre-filtering of neighbors has been added to ANN index. This search feature can enable multiple use-cases, such as filtering a vector based on it’s attributes (hybrid searches), the removal of vectors already added to the index, or the control of access in searches for security purposes. The filtering is available through the search_with_filtering() function of the ANN index, and is done by applying a predicate function on the GPU, which usually have the signature (uint32_t query_ix, uint32_t sample_ix) -> bool.

One of the most commonly used mechanism for filtering is the bitset: the bitset is a data structure that allows to test the presence of a value in a set through a fast lookup, and is implemented as a bit array so that every element contains a 0 or a 1 (respectively false and true in boolean logic). RAFT provides a raft::core::bitset class that can be used to create and manipulate bitsets on the GPU, and a raft::core::bitset_view class that can be used to pass bitsets to filtering functions.

The following example demonstrates how to use the filtering API (assume the CAGRA index is built using the code from Step 4 above:

#include <raft/neighbors/cagra.cuh>
#include <raft/neighbors/sample_filter.cuh>

using namespace raft::neighbors;

cagra::search_params search_params;

// create a bitset to filter the search
auto removed_indices = raft::make_device_vector<IdxT>(res, n_removed_indices);
raft::core::bitset<std::uint32_t, IdxT> removed_indices_bitset(
  res, removed_indices.view(), dataset.extent(0));

// ... Populate the bitset ... 

// search K nearest neighbours according to a bitset filter
auto neighbors = raft::make_device_matrix<uint32_t>(res, n_queries, k);
auto distances = raft::make_device_matrix<float>(res, n_queries, k);
cagra::search_with_filtering(res, search_params, index, queries, neighbors, distances,

Stream pools#

Within each CPU thread, CUDA uses streams to submit asynchronous work. You can think of a stream as a queue. Each stream can submit work to the GPU independently of other streams but work submitted within each stream is queued and executed in the order in which it was submitted. Similar to how we can use thread pools to bound the parallelism of CPU threads, we can use CUDA stream pools to bound the amount of concurrent asynchronous work that can be scheduled on a GPU. Each instance of device_resources has a main stream, but can also create a stream pool. For a single CPU thread, multiple different instances of device_resources can be created with different main streams and used to invoke a series of RAFT functions concurrently on the same or different GPU devices, so long as the target devices have available resources to perform the work. Once a device is saturated, queued work on streams will be scheduled and wait for a chance to do more work. During this time the streams are waiting, the CPU thread will still continue its own execution asynchronously unless sync_stream_pool() is called, causing the thread to block and wait for the thread pools to complete.

Also, beware that before splitting GPU work onto multiple different concurrent streams, it can often be important to wait for the main stream in the device_resources. This can be done with wait_stream_pool_on_stream().

To summarize, if wanting to execute multiple different streams in parallel, we would often use a stream pool like this:

#include <raft/core/device_resources.hpp>

#include <rmm/cuda_stream_pool.hpp>
#include <rmm/cuda_stream.hpp>

int n_streams = 5;

rmm::cuda_stream stream;
std::shared_ptr<rmm::cuda_stream_pool> stream_pool(5)
raft::device_resources res(stream.view(), stream_pool);

// Submit some work on the main stream...

for(int i = 0; i < n_streams; ++i) {
    rmm::cuda_stream_view stream_from_pool = res.get_next_usable_stream();
    raft::device_resources pool_res(stream_from_pool);
    // Submit some work with pool_res...


Device resources manager#

In multi-threaded applications, it is often useful to create a set of raft::device_resources objects on startup to avoid the overhead of re-initializing underlying resources every time a raft::device_resources object is needed. To help simplify this common initialization logic, RAFT provides a raft::device_resources_manager to handle this for downstream applications. On startup, the application can specify certain limits on the total resource consumption of the raft::device_resources objects that will be generated:

#include <raft/core/device_resources_manager.hpp>

void initialize_application() {
  // Set the total number of CUDA streams to use on each GPU across all CPU
  // threads. If this method is not called, the default stream per thread
  // will be used.

  // Create a memory pool with given max size in bytes. Passing std::nullopt will allow
  // the pool to grow to the available memory of the device.

  // Set the initial size of the memory pool in bytes.

  // If neither of the above methods are called, no memory pool will be used

While this example shows some commonly used settings, raft::device_resources_manager provides support for several other resource options and constraints, including options to initialize entire stream pools that can be used by an individual raft::device_resources object. After this initialization method is called, the following function could be called from any CPU thread:

void foo() {
  raft::device_resources const& res = raft::device_resources_manager::get_device_resources();
  // Submit some work with res

If any raft::device_resources_manager setters are called after the first call to raft::device_resources_manager::get_device_resources(), these new settings are ignored, and a warning will be logged. If a thread calls raft::device_resources_manager::get_device_resources() multiple times, it is guaranteed to access the same underlying raft::device_resources object every time. This can be useful for chaining work in different calls on the same thread without keeping a persistent reference to the resources object.

Device memory resources#

The RAPIDS software ecosystem makes heavy use of the RAPIDS Memory Manager (RMM) to enable zero-copy sharing of device memory across various GPU-enabled libraries such as PyTorch, Jax, Tensorflow, and FAISS. A really powerful feature of RMM is the ability to set a memory resource, such as a pooled memory resource that allocates a block of memory up front to speed up subsequent smaller allocations, and have all the libraries in the GPU ecosystem recognize and use that same memory resource for all of their memory allocations.

As an example, the following code snippet creates a pool_memory_resource and sets it as the default memory resource, which means all other libraries that use RMM will now allocate their device memory from this same pool:

#include <rmm/mr/device/pool_memory_resource.hpp>

rmm::mr::cuda_memory_resource cuda_mr;
// Construct a resource that uses a coalescing best-fit pool allocator
// set the initial size to half of the free device memory
auto init_size = rmm::percent_of_free_device_memory(50);
rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{&cuda_mr, init_size};
rmm::mr::set_current_device_resource(&pool_mr); // Updates the current device resource pointer to `pool_mr`

The raft::device_resources object will now also use the rmm::current_device_resource. This isn’t limited to C++, however. Often a user will be interacting with PyTorch, RAPIDS, or Tensorflow through Python and so they can set and use RMM’s current_device_resource right in Python.

Workspace memory resource#

As mentioned above, raft::device_resources will use rmm::current_device_resource by default for all memory allocations. However, there are times when a particular algorithm might benefit from using a different memory resource such as a managed_memory_resource, which creates a unified memory space between device and host memory, paging memory in and out of device as needed. Most of RAFT’s algorithms allocate temporary memory as needed to perform their computations and we can control the memory resource used for these temporary allocations through the workspace_resource in the raft::device_resources instance.

For some applications, the managed_memory_resource, can enable a memory space that is larger than the GPU, thus allowing a natural spilling to host memory when needed. This isn’t always the best way to use managed memory, though, as it can quickly lead to thrashing and severely impact performance. Still, when it can be used, it provides a very powerful tool that can also avoid out of memory errors when enough host memory is available.

The following creates a managed memory allocator and set it as the workspace_resource of the raft::device_resources instance:

#include <raft/core/device_resources.hpp>
#include <rmm/mr/device/managed_memory_resource.hpp>

std::shared_ptr<rmm::mr::managed_memory_resource> managed_resource;
raft::device_resource res(managed_resource);

The workspace_resource uses an rmm::mr::limiting_resource_adaptor, which limits the total amount of allocation possible. This allows RAFT algorithms to work within the confines of the memory constraints imposed by the user so that things like batch sizes can be automatically set to reasonable values without exceeding the allotted memory. By default, this limit restricts the memory allocation space for temporary workspace buffers to the memory available on the device.

The below example specifies the total number of bytes that RAFT can use for temporary workspace allocations to 3GB:

#include <raft/core/device_resources.hpp>
#include <rmm/mr/device/managed_memory_resource.hpp>

#include <optional>

std::shared_ptr<rmm::mr::managed_memory_resource> managed_resource;
raft::device_resource res(managed_resource, std::make_optional<std::size_t>(3 * 1024^3));