RAFT ANN Benchmarks#
This project provides a benchmark program for various ANN search implementations. It is especially suitable for comparing GPU implementations with each other and against CPU implementations.
Installing the benchmarks#
The easiest way to install these benchmarks is through conda. We provide packages for GPU enabled systems, as well for systems without a GPU. We suggest using mamba as it generally leads to a faster install time:
mamba create --name raft_ann_benchmarks
conda activate raft_ann_benchmarks
# to install GPU package:
mamba install -c rapidsai -c conda-forge -c nvidia raft-ann-bench cuda-version=11.8*
# to install CPU package for usage in CPU-only systems:
mamba install -c rapidsai -c conda-forge raft-ann-bench-cpu
The rapidsai channel can be substituted with rapidsai-nightly if nightly benchmarks are desired. The CPU package currently only supports running the HNSW benchmarks.
Please see the build instructions to build the benchmarks from source.
Running the benchmarks#
Usage#
There are 4 general steps to running the benchmarks and visualizing the results:
Prepare Dataset
Build Index and Search Index
Data Export
Plot Results
We provide a collection of lightweight Python scripts that wrap lower-level scripts and executables to run our benchmarks. Benchmarks can be run either through the Python scripts or by invoking the low-level scripts and executables directly; however, plots are only provided through the Python scripts.
End-to-end example: Million-scale#
The steps below demonstrate how to download, install, and run benchmarks on a subset of 10M vectors from the Yandex Deep-1B dataset. By default, the datasets will be stored in and read from the folder indicated by the RAPIDS_DATASET_ROOT_DIR environment variable if defined, otherwise from a datasets sub-folder of the directory from which the script is called:
# (1) prepare dataset.
python -m raft-ann-bench.get_dataset --dataset deep-image-96-angular --normalize
# (2) build and search index
python -m raft-ann-bench.run --dataset deep-image-96-inner
# (3) export data
python -m raft-ann-bench.data_export --dataset deep-image-96-inner
# (4) plot results
python -m raft-ann-bench.plot --dataset deep-image-96-inner
Configuration files already exist for the following million-scale datasets. Please refer to ann-benchmarks datasets for more information, including the actual train and test sizes. These all work out-of-the-box with the --dataset argument. Other million-scale datasets from ann-benchmarks.com will also work, but require a JSON configuration file to be created in $CONDA_PREFIX/lib/python3.xx/site-packages/raft-ann-bench/run/conf, or the --configuration option can be used to point to a specific file.
deep-image-96-angular
fashion-mnist-784-euclidean
glove-50-angular
glove-100-angular
lastfm-65-angular
mnist-784-euclidean
nytimes-256-angular
sift-128-euclidean
End-to-end example: Billion-scale#
raft-ann-bench.get_dataset cannot be used to download the billion-scale datasets because of their size. You should instead use our billion-scale datasets guide to download and prepare them. All other Python scripts mentioned below work as intended once the billion-scale dataset has been downloaded. To download billion-scale datasets, visit big-ann-benchmarks.
The steps below demonstrate how to download, install, and run benchmarks on a subset of 100M vectors from the Yandex Deep-1B dataset. Please note that datasets of this scale are recommended for GPUs with larger amounts of memory, such as the A100 or H100.
mkdir -p datasets/deep-1B
# (1) prepare dataset
# download manually "Ground Truth" file of "Yandex DEEP"
# suppose the file name is deep_new_groundtruth.public.10K.bin
python -m raft-ann-bench.split_groundtruth --groundtruth datasets/deep-1B/deep_new_groundtruth.public.10K.bin
# two files 'groundtruth.neighbors.ibin' and 'groundtruth.distances.fbin' should be produced
# (2) build and search index
python -m raft-ann-bench.run --dataset deep-1B
# (3) export data
python -m raft-ann-bench.data_export --dataset deep-1B
# (4) plot results
python -m raft-ann-bench.plot --dataset deep-1B
The usage of python -m raft-ann-bench.split_groundtruth is:
usage: split_groundtruth.py [-h] --groundtruth GROUNDTRUTH
options:
-h, --help show this help message and exit
--groundtruth GROUNDTRUTH
Path to billion-scale dataset groundtruth file (default: None)
Step 1: Prepare Dataset#
The script raft-ann-bench.get_dataset will download and unpack the dataset into a directory that the user provides. As of now, only million-scale datasets are supported by this script. For more information, see the guide on datasets and formats. The usage of this script is:
usage: get_dataset.py [-h] [--dataset DATASET] [--dataset-path DATASET_PATH] [--normalize]
options:
-h, --help show this help message and exit
--dataset DATASET dataset to download (default: glove-100-angular)
--dataset-path DATASET_PATH
path to download dataset (default: ${RAPIDS_DATASET_ROOT_DIR})
--normalize normalize cosine distance to inner product (default: False)
When the option --normalize is provided to the script, any dataset that uses cosine (angular) distance will be normalized so that inner product can be used instead. So, for example, the dataset glove-100-angular will be written to the location datasets/glove-100-inner/.
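The reason this substitution works: once vectors are scaled to unit length, their inner product equals their cosine similarity, so an inner-product index produces the same ranking. A minimal sketch in plain Python (illustrative only, not part of the benchmark tooling):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner(a, b):
    """Plain inner (dot) product."""
    return sum(x * y for x, y in zip(a, b))

a, b = [3.0, 4.0], [1.0, 2.0]
cos_sim = inner(a, b) / (math.sqrt(inner(a, a)) * math.sqrt(inner(b, b)))
an, bn = l2_normalize(a), l2_normalize(b)

# After normalization, inner product and cosine similarity coincide.
assert abs(inner(an, bn) - cos_sim) < 1e-12
```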
Step 2: Build and Search Index#
The script raft-ann-bench.run will build and search indices for a given dataset and its specified configuration. To configure which algorithms are available, we use algos.yaml. To configure building/searching indices for a dataset, look at index configuration. An entry in algos.yaml looks like:
raft_ivf_pq:
executable: RAFT_IVF_PQ_ANN_BENCH
requires_gpu: true
executable: specifies the name of the binary that will build/search the index. It is assumed to be available in raft/cpp/build/.
requires_gpu: denotes whether an algorithm requires a GPU to run.
The usage of the script raft-ann-bench.run is:
usage: run.py [-h] [-k COUNT] [-bs BATCH_SIZE] [--configuration CONFIGURATION] [--dataset DATASET] [--dataset-path DATASET_PATH] [--build] [--search] [--algorithms ALGORITHMS] [--indices INDICES]
[-f]
options:
-h, --help show this help message and exit
-k COUNT, --count COUNT
the number of nearest neighbors to search for (default: 10)
-bs BATCH_SIZE, --batch-size BATCH_SIZE
number of query vectors to use in each query trial (default: 10000)
--configuration CONFIGURATION
path to configuration file for a dataset (default: None)
--dataset DATASET dataset whose configuration file will be used (default: glove-100-inner)
--dataset-path DATASET_PATH
path to dataset folder (default: ${RAPIDS_DATASET_ROOT_DIR})
--build
--search
--algorithms ALGORITHMS
run only comma separated list of named algorithms (default: None)
--indices INDICES run only comma separated list of named indices. parameter `algorithms` is ignored (default: None)
-f, --force re-run algorithms even if their results already exist (default: False)
configuration and dataset: configuration is a path to a configuration file for a given dataset. The configuration file should be named <dataset>.json. It is optional if the name of the dataset is provided with the dataset argument, in which case a configuration file will be searched for at python/raft-ann-bench/src/raft-ann-bench/run/conf/<dataset>.json.
For every algorithm run by this script, it outputs an index build statistics JSON file at <dataset-path>/<dataset>/result/build/<algo>-k{k}-batch_size{batch_size}.json and an index search statistics JSON file at <dataset-path>/<dataset>/result/search/<algo>-k{k}-batch_size{batch_size}.json.
dataset-path
:
data is read from
<dataset-path>/<dataset>
indices are built in
<dataset-path>/<dataset>/index
build/search results are stored in
<dataset-path>/<dataset>/result
build and search: if neither parameter is supplied to the script, then both are assumed to be True.
indices and algorithms: these parameters ensure that the algorithm specified for an index is available in algos.yaml and not disabled, and that it has an associated executable.
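To make the output layout above concrete, here is a tiny helper that reproduces the documented file-name pattern (the function name and example values are ours, purely for illustration):

```python
def result_path(dataset_path, dataset, algo, k, batch_size, stage="search"):
    """Follow the documented pattern:
    <dataset-path>/<dataset>/result/<stage>/<algo>-k{k}-batch_size{batch_size}.json"""
    return f"{dataset_path}/{dataset}/result/{stage}/{algo}-k{k}-batch_size{batch_size}.json"

print(result_path("datasets", "glove-100-inner", "raft_cagra", 10, 10000))
# datasets/glove-100-inner/result/search/raft_cagra-k10-batch_size10000.json
```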
Step 3: Data Export#
The script raft-ann-bench.data_export will convert the intermediate JSON outputs produced by raft-ann-bench.run into more easily readable CSV files, which are needed to build the charts made by raft-ann-bench.plot.
usage: data_export.py [-h] [--dataset DATASET] [--dataset-path DATASET_PATH]
options:
-h, --help show this help message and exit
--dataset DATASET dataset to download (default: glove-100-inner)
--dataset-path DATASET_PATH
path to dataset folder (default: ${RAPIDS_DATASET_ROOT_DIR})
The build statistics CSV file is stored at <dataset-path>/<dataset>/result/build/<algo>-k{k}-batch_size{batch_size}.csv and the index search statistics CSV file at <dataset-path>/<dataset>/result/search/<algo>-k{k}-batch_size{batch_size}.csv.
Step 4: Plot Results#
The script raft-ann-bench.plot will plot results for all algorithms found in the index search statistics CSV files at <dataset-path>/<dataset>/result/search/<algo>-k{k}-batch_size{batch_size}.csv. The usage of this script is:
usage: plot.py [-h] [--dataset DATASET] [--dataset-path DATASET_PATH] [--output-filepath OUTPUT_FILEPATH] [--algorithms ALGORITHMS] [-k COUNT] [-bs BATCH_SIZE] [--build] [--search]
[--x-scale X_SCALE] [--y-scale {linear,log,symlog,logit}] [--raw]
options:
-h, --help show this help message and exit
--dataset DATASET dataset to download (default: glove-100-inner)
--dataset-path DATASET_PATH
path to dataset folder (default: ${RAPIDS_DATASET_ROOT_DIR})
--output-filepath OUTPUT_FILEPATH
directory for PNG to be saved (default: os.getcwd())
--algorithms ALGORITHMS
plot only comma separated list of named algorithms (default: None)
-k COUNT, --count COUNT
the number of nearest neighbors to search for (default: 10)
-bs BATCH_SIZE, --batch-size BATCH_SIZE
number of query vectors to use in each query trial (default: 10000)
--build
--search
--x-scale X_SCALE Scale to use when drawing the X-axis. Typically linear, logit or a2 (default: linear)
--y-scale {linear,log,symlog,logit}
Scale to use when drawing the Y-axis (default: linear)
--raw Show raw results (not just Pareto frontier) in faded colours (default: False)
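By default only the Pareto frontier is plotted: a (recall, QPS) point is dropped when some other point achieves at least the same recall with strictly higher throughput. A minimal sketch of that filtering idea (plain Python; this is not the plot script's actual implementation):

```python
def pareto_frontier(points):
    """Keep only (recall, qps) points that are not dominated, i.e. no other
    point has recall >= this one and strictly higher qps."""
    # Scan from highest recall down; a point survives only if it beats the
    # best throughput seen among all points with higher-or-equal recall.
    pts = sorted(points, key=lambda p: (-p[0], -p[1]))
    frontier, best_qps = [], float("-inf")
    for recall, qps in pts:
        if qps > best_qps:
            frontier.append((recall, qps))
            best_qps = qps
    return sorted(frontier)

points = [(0.80, 50000), (0.85, 25000), (0.90, 30000), (0.90, 20000), (0.95, 10000)]
print(pareto_frontier(points))
# [(0.8, 50000), (0.9, 30000), (0.95, 10000)]
```

Note that (0.85, 25000) is dropped because (0.90, 30000) has both higher recall and higher throughput.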
The figure below is the resulting plot of running our benchmarks as of August 2023 for a batch size of 10, on an NVIDIA H100 GPU and an Intel Xeon Platinum 8480CL CPU. It presents the throughput (in Queries-Per-Second) performance for every level of recall.
Creating and customizing dataset configurations#
A single configuration file will often define a set of algorithms, with associated index and search parameters, for a specific dataset. A configuration file uses JSON format with 4 major parts:
Dataset information
Algorithm information
Index parameters
Search parameters
Below is a simple example configuration file for the 1M-scale sift-128-euclidean
dataset:
{
"dataset": {
"name": "sift-128-euclidean",
"base_file": "sift-128-euclidean/base.fbin",
"query_file": "sift-128-euclidean/query.fbin",
"subset_size": 1000000,
"groundtruth_neighbors_file": "sift-128-euclidean/groundtruth.neighbors.ibin",
"distance": "euclidean"
},
"index": []
}
The index
section will contain a list of index objects, each of which will have the following form:
{
"name": "algo_name.unique_index_name",
"algo": "algo_name",
"file": "sift-128-euclidean/algo_name/param1_val1-param2_val2",
"build_param": { "param1": "val1", "param2": "val2" },
"search_params": [{ "search_param1": "search_val1" }]
}
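Since these entries are plain JSON, a quick sanity check with standard tooling can catch a missing key before a long benchmark run. A small illustration (the required key set is just the one shown above; this check is ours, not part of the tooling):

```python
import json

entry = json.loads("""
{
  "name": "algo_name.unique_index_name",
  "algo": "algo_name",
  "file": "sift-128-euclidean/algo_name/param1_val1-param2_val2",
  "build_param": { "param1": "val1", "param2": "val2" },
  "search_params": [{ "search_param1": "search_val1" }]
}
""")

# Every index object carries these keys; search_params is a list of objects,
# one per search-parameter combination to benchmark.
required = {"name", "algo", "file", "build_param", "search_params"}
assert required <= entry.keys()
assert isinstance(entry["search_params"], list)
```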
The table below contains the possible settings for the algo
field. Each unique algorithm will have its own set of build_param
and search_params
settings. The ANN Algorithm Parameter Tuning Guide contains detailed instructions on choosing build and search parameters for each supported algorithm.
Library | Algorithms
---|---
FAISS | faiss_gpu_ivf_flat, faiss_gpu_ivf_pq
GGNN | ggnn
HNSWlib | hnswlib
RAFT | raft_cagra, raft_ivf_flat, raft_ivf_pq
By default, the index will be placed in bench/ann/data/<dataset_name>/index/<name>
. Using sift-128-euclidean
for the dataset with the algo
example above, the indexes would be placed in bench/ann/data/sift-128-euclidean/index/algo_name/param1_val1-param2_val2
.
Adding a new ANN algorithm#
Implementation and Configuration#
Implementation of a new algorithm should be a C++ class that inherits class ANN (defined in cpp/bench/ann/src/ann.h) and implements all the pure virtual functions. In addition, it should define two structs for the build and search parameters. The search parameter struct should inherit struct ANN<T>::AnnSearchParam. Taking class HnswLib as an example, its definition is:
template<typename T>
class HnswLib : public ANN<T> {
 public:
  struct BuildParam {
    int M;
    int ef_construction;
    int num_threads;
  };

  using typename ANN<T>::AnnSearchParam;
  struct SearchParam : public AnnSearchParam {
    int ef;
    int num_threads;
  };

  // ...
};
The benchmark program uses a JSON configuration file to specify the indexes to build, along with their build and search parameters. To add the new algorithm to the benchmark, it must be possible to specify build_param, whose value is a JSON object, and search_params, whose value is an array of JSON objects, for this algorithm in the configuration file. The build_param and search_params arguments will vary depending on the algorithm. Take the configuration for HnswLib as an example:
{
  "name": "hnswlib.M12.ef500.th32",
  "algo": "hnswlib",
  "build_param": {"M": 12, "efConstruction": 500, "numThreads": 32},
  "file": "/path/to/file",
  "search_params": [
    {"ef": 10, "numThreads": 1},
    {"ef": 20, "numThreads": 1},
    {"ef": 40, "numThreads": 1}
  ],
  "search_result_file": "/path/to/file"
},
How to interpret these JSON objects is entirely left to the implementation and should be specified in cpp/bench/ann/src/factory.cuh:
First, add two functions for parsing the JSON object into struct BuildParam and struct SearchParam, respectively:

template<typename T>
void parse_build_param(const nlohmann::json& conf,
                       typename cuann::HnswLib<T>::BuildParam& param)
{
  param.ef_construction = conf.at("efConstruction");
  param.M               = conf.at("M");
  if (conf.contains("numThreads")) { param.num_threads = conf.at("numThreads"); }
}

template<typename T>
void parse_search_param(const nlohmann::json& conf,
                        typename cuann::HnswLib<T>::SearchParam& param)
{
  param.ef = conf.at("ef");
  if (conf.contains("numThreads")) { param.num_threads = conf.at("numThreads"); }
}
Next, add the corresponding if case to the functions create_algo() and create_search_param() (in cpp/bench/ann/) by calling the parsing functions. The string literal in the if condition must be the same as the value of algo in the configuration file. For example:

// JSON configuration file contains a line like: "algo" : "hnswlib"
if (algo == "hnswlib") {
  // ...
}
Adding a CMake Target#
In raft/cpp/bench/ann/CMakeLists.txt, we provide a CMake function to configure a new benchmark target with the following signature:
ConfigureAnnBench(
NAME <algo_name>
PATH </path/to/algo/benchmark/source/file>
INCLUDES <additional_include_directories>
CXXFLAGS <additional_cxx_flags>
LINKS <additional_link_library_targets>
)
To add a target for HNSWLIB, we would call the function as:
ConfigureAnnBench(
NAME HNSWLIB PATH bench/ann/src/hnswlib/hnswlib_benchmark.cpp INCLUDES
${CMAKE_CURRENT_BINARY_DIR}/_deps/hnswlib-src/hnswlib CXXFLAGS "${HNSW_CXX_FLAGS}"
)
This will create an executable called HNSWLIB_ANN_BENCH, which can then be used to run HNSWLIB benchmarks.