Neighbors#

This page provides pylibraft class references for the publicly-exposed elements of the neighbors package.

Brute Force#

pylibraft.neighbors.brute_force.knn(dataset, queries, k=None, indices=None, distances=None, metric='sqeuclidean', metric_arg=2.0, global_id_offset=0, handle=None)[source]#

Perform a brute-force nearest neighbors search.

Parameters:
datasetarray interface compliant matrix, row-major layout,

shape (n_samples, dim). Supported dtype [float]

queriesarray interface compliant matrix, row-major layout,

shape (n_queries, dim) Supported dtype [float]

kint

Number of neighbors to search (k <= 2048). Optional if indices or distances arrays are given (in which case their second dimension is k).

indicesOptional array interface compliant matrix shape

(n_queries, k), dtype int64_t. If supplied, neighbor indices will be written here in-place. (default None)

Supported dtype uint64

distancesOptional array interface compliant matrix shape

(n_queries, k), dtype float. If supplied, neighbor indices will be written here in-place. (default None)

handleOptional RAFT resource handle for reusing expensive CUDA

resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.

Returns:
indices: array interface compliant object containing resulting indices

shape (n_queries, k)

distances: array interface compliant object containing resulting distances

shape (n_queries, k)

Examples

>>> import cupy as cp
>>> from pylibraft.common import DeviceResources
>>> from pylibraft.neighbors.brute_force import knn
>>> n_samples = 50000
>>> n_features = 50
>>> n_queries = 1000
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> # Search using the built index
>>> queries = cp.random.random_sample((n_queries, n_features),
...                                   dtype=cp.float32)
>>> k = 40
>>> distances, neighbors = knn(dataset, queries, k)
>>> distances = cp.asarray(distances)
>>> neighbors = cp.asarray(neighbors)

CAGRA#

class pylibraft.neighbors.cagra.IndexParams(metric='sqeuclidean', *, intermediate_graph_degree=128, graph_degree=64, add_data_on_build=True)#
Attributes:
add_data_on_build
graph_degree
intermediate_graph_degree
metric
pylibraft.neighbors.cagra.build(IndexParams index_params, dataset, handle=None)[source]#

Build the CAGRA index from the dataset for efficient search.

The build performs two different steps- first an intermediate knn-graph is constructed, then it’s optimized it to create the final graph. The index_params object controls the node degree of these graphs.

It is required that both the dataset and the optimized graph fit the GPU memory.

The following distance metrics are supported: - L2

Parameters:
index_paramsIndexParams object
datasetCUDA array interface compliant matrix shape (n_samples, dim)

Supported dtype [float, int8, uint8]

handleOptional RAFT resource handle for reusing expensive CUDA

resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.

Returns:
index: cagra.Index

Examples

>>> import cupy as cp
>>> from pylibraft.common import DeviceResources
>>> from pylibraft.neighbors import cagra
>>> n_samples = 50000
>>> n_features = 50
>>> n_queries = 1000
>>> k = 10
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> handle = DeviceResources()
>>> build_params = cagra.IndexParams(metric="sqeuclidean")
>>> index = cagra.build(build_params, dataset, handle=handle)
>>> distances, neighbors = cagra.search(cagra.SearchParams(),
...                                      index, dataset,
...                                      k, handle=handle)
>>> # pylibraft functions are often asynchronous so the
>>> # handle needs to be explicitly synchronized
>>> handle.sync()
>>> distances = cp.asarray(distances)
>>> neighbors = cp.asarray(neighbors)
class pylibraft.neighbors.cagra.SearchParams(max_queries=0, *, itopk_size=64, max_iterations=0, algo='auto', team_size=0, search_width=1, min_iterations=0, thread_block_size=0, hashmap_mode='auto', hashmap_min_bitlen=0, hashmap_max_fill_rate=0.5, num_random_samplings=1, rand_xor_mask=0x128394)#
Attributes:
algo
hashmap_max_fill_rate
hashmap_min_bitlen
hashmap_mode
itopk_size
max_iterations
max_queries
min_iterations
num_random_samplings
rand_xor_mask
search_width
team_size
thread_block_size
pylibraft.neighbors.cagra.search(SearchParams search_params, Index index, queries, k, neighbors=None, distances=None, handle=None)[source]#

Find the k nearest neighbors for each query.

Parameters:
search_paramsSearchParams
indexIndex

Trained CAGRA index.

queriesCUDA array interface compliant matrix shape (n_samples, dim)

Supported dtype [float, int8, uint8]

kint

The number of neighbors.

neighborsOptional CUDA array interface compliant matrix shape

(n_queries, k), dtype int64_t. If supplied, neighbor indices will be written here in-place. (default None)

distancesOptional CUDA array interface compliant matrix shape

(n_queries, k) If supplied, the distances to the neighbors will be written here in-place. (default None)

handleOptional RAFT resource handle for reusing expensive CUDA

resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.

Examples

>>> import cupy as cp
>>> from pylibraft.common import DeviceResources
>>> from pylibraft.neighbors import cagra
>>> n_samples = 50000
>>> n_features = 50
>>> n_queries = 1000
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> # Build index
>>> handle = DeviceResources()
>>> index = cagra.build(cagra.IndexParams(), dataset, handle=handle)
>>> # Search using the built index
>>> queries = cp.random.random_sample((n_queries, n_features),
...                                   dtype=cp.float32)
>>> k = 10
>>> search_params = cagra.SearchParams(
...     max_queries=100,
...     itopk_size=64
... )
>>> # Using a pooling allocator reduces overhead of temporary array
>>> # creation during search. This is useful if multiple searches
>>> # are performad with same query size.
>>> distances, neighbors = cagra.search(search_params, index, queries,
...                                     k, handle=handle)
>>> # pylibraft functions are often asynchronous so the
>>> # handle needs to be explicitly synchronized
>>> handle.sync()
>>> neighbors = cp.asarray(neighbors)
>>> distances = cp.asarray(distances)

IVF-Flat#

class pylibraft.neighbors.ivf_flat.IndexParams(n_lists=1024, *, metric=u'sqeuclidean', kmeans_n_iters=20, kmeans_trainset_fraction=0.5, add_data_on_build=True, bool adaptive_centers=False)#
Attributes:
adaptive_centers
add_data_on_build
kmeans_n_iters
kmeans_trainset_fraction
metric
n_lists
pylibraft.neighbors.ivf_flat.build(IndexParams index_params, dataset, handle=None)[source]#

Builds an IVF-FLAT index that can be used for nearest neighbor search.

Parameters:
index_paramsIndexParams object
datasetCUDA array interface compliant matrix shape (n_samples, dim)

Supported dtype [float, int8, uint8]

handleOptional RAFT resource handle for reusing expensive CUDA

resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.

Returns:
index: ivf_flat.Index

Examples

>>> import cupy as cp
>>> from pylibraft.common import DeviceResources
>>> from pylibraft.neighbors import ivf_flat
>>> n_samples = 50000
>>> n_features = 50
>>> n_queries = 1000
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> handle = DeviceResources()
>>> index_params = ivf_flat.IndexParams(
...     n_lists=1024,
...     metric="sqeuclidean")
>>> index = ivf_flat.build(index_params, dataset, handle=handle)
>>> # Search using the built index
>>> queries = cp.random.random_sample((n_queries, n_features),
...                                   dtype=cp.float32)
>>> k = 10
>>> distances, neighbors = ivf_flat.search(ivf_flat.SearchParams(), index,
...                                      queries, k, handle=handle)
>>> distances = cp.asarray(distances)
>>> neighbors = cp.asarray(neighbors)
>>> # pylibraft functions are often asynchronous so the
>>> # handle needs to be explicitly synchronized
>>> handle.sync()
pylibraft.neighbors.ivf_flat.extend(Index index, new_vectors, new_indices, handle=None)[source]#

Extend an existing index with new vectors.

Parameters:
indexivf_flat.Index

Trained ivf_flat object.

new_vectorsCUDA array interface compliant matrix shape (n_samples, dim)

Supported dtype [float, int8, uint8]

new_indicesCUDA array interface compliant vector shape (n_samples)

Supported dtype [int64]

handleOptional RAFT resource handle for reusing expensive CUDA

resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.

Returns:
index: ivf_flat.Index

Examples

>>> import cupy as cp
>>> from pylibraft.common import DeviceResources
>>> from pylibraft.neighbors import ivf_flat
>>> n_samples = 50000
>>> n_features = 50
>>> n_queries = 1000
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> handle = DeviceResources()
>>> index = ivf_flat.build(ivf_flat.IndexParams(), dataset, handle=handle)
>>> n_rows = 100
>>> more_data = cp.random.random_sample((n_rows, n_features),
...                                     dtype=cp.float32)
>>> indices = index.size + cp.arange(n_rows, dtype=cp.int64)
>>> index = ivf_flat.extend(index, more_data, indices)
>>> # Search using the built index
>>> queries = cp.random.random_sample((n_queries, n_features),
...                                   dtype=cp.float32)
>>> k = 10
>>> distances, neighbors = ivf_flat.search(ivf_flat.SearchParams(),
...                                      index, queries,
...                                      k, handle=handle)
>>> # pylibraft functions are often asynchronous so the
>>> # handle needs to be explicitly synchronized
>>> handle.sync()
>>> distances = cp.asarray(distances)
>>> neighbors = cp.asarray(neighbors)
class pylibraft.neighbors.ivf_flat.SearchParams(n_probes=20, *)#
Attributes:
n_probes
pylibraft.neighbors.ivf_flat.search(SearchParams search_params, Index index, queries, k, neighbors=None, distances=None, handle=None)[source]#

Find the k nearest neighbors for each query.

Parameters:
search_paramsSearchParams
indexIndex

Trained IVF-FLAT index.

queriesCUDA array interface compliant matrix shape (n_samples, dim)

Supported dtype [float, int8, uint8]

kint

The number of neighbors.

neighborsOptional CUDA array interface compliant matrix shape

(n_queries, k), dtype int64_t. If supplied, neighbor indices will be written here in-place. (default None)

distancesOptional CUDA array interface compliant matrix shape

(n_queries, k) If supplied, the distances to the neighbors will be written here in-place. (default None)

handleOptional RAFT resource handle for reusing expensive CUDA

resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.

Examples

>>> import cupy as cp
>>> from pylibraft.common import DeviceResources
>>> from pylibraft.neighbors import ivf_flat
>>> n_samples = 50000
>>> n_features = 50
>>> n_queries = 1000
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> # Build index
>>> handle = DeviceResources()
>>> index = ivf_flat.build(ivf_flat.IndexParams(), dataset, handle=handle)
>>> # Search using the built index
>>> queries = cp.random.random_sample((n_queries, n_features),
...                                   dtype=cp.float32)
>>> k = 10
>>> search_params = ivf_flat.SearchParams(
...     n_probes=20
... )
>>> distances, neighbors = ivf_flat.search(search_params, index, queries,
...                                      k, handle=handle)
>>> # pylibraft functions are often asynchronous so the
>>> # handle needs to be explicitly synchronized
>>> handle.sync()
>>> neighbors = cp.asarray(neighbors)
>>> distances = cp.asarray(distances)

IVF-PQ#

class pylibraft.neighbors.ivf_pq.IndexParams(n_lists=1024, *, metric='sqeuclidean', kmeans_n_iters=20, kmeans_trainset_fraction=0.5, pq_bits=8, pq_dim=0, codebook_kind='subspace', force_random_rotation=False, add_data_on_build=True, conservative_memory_allocation=False)#
Attributes:
add_data_on_build
codebook_kind
conservative_memory_allocation
force_random_rotation
kmeans_n_iters
kmeans_trainset_fraction
metric
n_lists
pq_bits
pq_dim
pylibraft.neighbors.ivf_pq.build(IndexParams index_params, dataset, handle=None)[source]#

Builds an IVF-PQ index that can be later used for nearest neighbor search.

The input array can be either CUDA array interface compliant matrix or array interface compliant matrix in host memory.

Parameters:
index_paramsIndexParams object
datasetarray interface compliant matrix shape (n_samples, dim)

Supported dtype [float, int8, uint8]

handleOptional RAFT resource handle for reusing expensive CUDA

resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.

Returns:
index: ivf_pq.Index

Examples

>>> import cupy as cp
>>> from pylibraft.common import DeviceResources
>>> from pylibraft.neighbors import ivf_pq
>>> n_samples = 50000
>>> n_features = 50
>>> n_queries = 1000
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> handle = DeviceResources()
>>> index_params = ivf_pq.IndexParams(
...     n_lists=1024,
...     metric="sqeuclidean",
...     pq_dim=10)
>>> index = ivf_pq.build(index_params, dataset, handle=handle)
>>> # Search using the built index
>>> queries = cp.random.random_sample((n_queries, n_features),
...                                   dtype=cp.float32)
>>> k = 10
>>> distances, neighbors = ivf_pq.search(ivf_pq.SearchParams(), index,
...                                      queries, k, handle=handle)
>>> distances = cp.asarray(distances)
>>> neighbors = cp.asarray(neighbors)
>>> # pylibraft functions are often asynchronous so the
>>> # handle needs to be explicitly synchronized
>>> handle.sync()
pylibraft.neighbors.ivf_pq.extend(Index index, new_vectors, new_indices, handle=None)[source]#

Extend an existing index with new vectors.

The input array can be either CUDA array interface compliant matrix or array interface compliant matrix in host memory.

Parameters:
indexivf_pq.Index

Trained ivf_pq object.

new_vectorsarray interface compliant matrix shape (n_samples, dim)

Supported dtype [float, int8, uint8]

new_indicesarray interface compliant vector shape (n_samples)

Supported dtype [int64]

handleOptional RAFT resource handle for reusing expensive CUDA

resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.

Returns:
index: ivf_pq.Index

Examples

>>> import cupy as cp
>>> from pylibraft.common import DeviceResources
>>> from pylibraft.neighbors import ivf_pq
>>> n_samples = 50000
>>> n_features = 50
>>> n_queries = 1000
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> handle = DeviceResources()
>>> index = ivf_pq.build(ivf_pq.IndexParams(), dataset, handle=handle)
>>> n_rows = 100
>>> more_data = cp.random.random_sample((n_rows, n_features),
...                                     dtype=cp.float32)
>>> indices = index.size + cp.arange(n_rows, dtype=cp.int64)
>>> index = ivf_pq.extend(index, more_data, indices)
>>> # Search using the built index
>>> queries = cp.random.random_sample((n_queries, n_features),
...                                   dtype=cp.float32)
>>> k = 10
>>> distances, neighbors = ivf_pq.search(ivf_pq.SearchParams(),
...                                      index, queries,
...                                      k, handle=handle)
>>> # pylibraft functions are often asynchronous so the
>>> # handle needs to be explicitly synchronized
>>> handle.sync()
>>> distances = cp.asarray(distances)
>>> neighbors = cp.asarray(neighbors)
class pylibraft.neighbors.ivf_pq.SearchParams(n_probes=20, *, lut_dtype=np.float32, internal_distance_dtype=np.float32)#
Attributes:
internal_distance_dtype
lut_dtype
n_probes
pylibraft.neighbors.ivf_pq.search(SearchParams search_params, Index index, queries, k, neighbors=None, distances=None, DeviceMemoryResource memory_resource=None, handle=None)[source]#

Find the k nearest neighbors for each query.

Parameters:
search_paramsSearchParams
indexIndex

Trained IVF-PQ index.

queriesCUDA array interface compliant matrix shape (n_samples, dim)

Supported dtype [float, int8, uint8]

kint

The number of neighbors.

neighborsOptional CUDA array interface compliant matrix shape

(n_queries, k), dtype int64_t. If supplied, neighbor indices will be written here in-place. (default None)

distancesOptional CUDA array interface compliant matrix shape

(n_queries, k) If supplied, the distances to the neighbors will be written here in-place. (default None)

memory_resourceRMM DeviceMemoryResource object, optional

This can be used to explicitly manage the temporary memory allocation during search. Passing a pooling allocator can reduce memory allocation overhead. If not specified, then the memory resource from the raft handle is used.

handleOptional RAFT resource handle for reusing expensive CUDA

resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.

Examples

>>> import cupy as cp
>>> from pylibraft.common import DeviceResources
>>> from pylibraft.neighbors import ivf_pq
>>> n_samples = 50000
>>> n_features = 50
>>> n_queries = 1000
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> # Build index
>>> handle = DeviceResources()
>>> index = ivf_pq.build(ivf_pq.IndexParams(), dataset, handle=handle)
>>> # Search using the built index
>>> queries = cp.random.random_sample((n_queries, n_features),
...                                   dtype=cp.float32)
>>> k = 10
>>> search_params = ivf_pq.SearchParams(
...     n_probes=20,
...     lut_dtype=cp.float16,
...     internal_distance_dtype=cp.float32
... )
>>> # Using a pooling allocator reduces overhead of temporary array
>>> # creation during search. This is useful if multiple searches
>>> # are performad with same query size.
>>> import rmm
>>> mr = rmm.mr.PoolMemoryResource(
...     rmm.mr.CudaMemoryResource(),
...     initial_pool_size=2**29,
...     maximum_pool_size=2**31
... )
>>> distances, neighbors = ivf_pq.search(search_params, index, queries,
...                                      k, memory_resource=mr,
...                                      handle=handle)
>>> # pylibraft functions are often asynchronous so the
>>> # handle needs to be explicitly synchronized
>>> handle.sync()
>>> neighbors = cp.asarray(neighbors)
>>> distances = cp.asarray(distances)

Candidate Refinement#

pylibraft.neighbors.refine(dataset, queries, candidates, k=None, indices=None, distances=None, metric='sqeuclidean', handle=None)[source]#

Refine nearest neighbor search.

Refinement is an operation that follows an approximate NN search. The approximate search has already selected n_candidates neighbor candidates for each query. We narrow it down to k neighbors. For each query, we calculate the exact distance between the query and its n_candidates neighbor candidate, and select the k nearest ones.

Input arrays can be either CUDA array interface compliant matrices or array interface compliant matrices in host memory. All array must be in the same memory space.

Parameters:
index_paramsIndexParams object
datasetarray interface compliant matrix, shape (n_samples, dim)

Supported dtype [float, int8, uint8]

queriesarray interface compliant matrix, shape (n_queries, dim)

Supported dtype [float, int8, uint8]

candidatesarray interface compliant matrix, shape (n_queries, k0)

dtype int64

kint

Number of neighbors to search (k <= k0). Optional if indices or distances arrays are given (in which case their second dimension is k).

indicesOptional array interface compliant matrix shape

(n_queries, k), dtype int64. If supplied, neighbor indices will be written here in-place. (default None)

Supported dtype int64

distancesOptional array interface compliant matrix shape

(n_queries, k), dtype float. If supplied, neighbor indices will be written here in-place. (default None)

handleOptional RAFT resource handle for reusing expensive CUDA

resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.

Returns:
index: ivf_pq.Index

Examples

>>> import cupy as cp
>>> from pylibraft.common import DeviceResources
>>> from pylibraft.neighbors import ivf_pq, refine
>>> n_samples = 50000
>>> n_features = 50
>>> n_queries = 1000
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> handle = DeviceResources()
>>> index_params = ivf_pq.IndexParams(n_lists=1024, metric="sqeuclidean",
...                                   pq_dim=10)
>>> index = ivf_pq.build(index_params, dataset, handle=handle)
>>> # Search using the built index
>>> queries = cp.random.random_sample((n_queries, n_features),
...                                   dtype=cp.float32)
>>> k = 40
>>> _, candidates = ivf_pq.search(ivf_pq.SearchParams(), index,
...                               queries, k, handle=handle)
>>> k = 10
>>> distances, neighbors = refine(dataset, queries, candidates, k,
...                               handle=handle)
>>> distances = cp.asarray(distances)
>>> neighbors = cp.asarray(neighbors)
>>> # pylibraft functions are often asynchronous so the
>>> # handle needs to be explicitly synchronized
>>> handle.sync()