Neighbors#
This page provides pylibraft class references for the publicly-exposed elements of the neighbors package.
Brute Force#
- pylibraft.neighbors.brute_force.knn(dataset, queries, k=None, indices=None, distances=None, metric='sqeuclidean', metric_arg=2.0, global_id_offset=0, handle=None)[source]#
Perform a brute-force nearest neighbors search.
- Parameters:
- datasetarray interface compliant matrix, row-major layout,
shape (n_samples, dim). Supported dtype [float]
- queriesarray interface compliant matrix, row-major layout,
shape (n_queries, dim) Supported dtype [float]
- kint
Number of neighbors to search (k <= 2048). Optional if indices or distances arrays are given (in which case their second dimension is k).
- indicesOptional array interface compliant matrix shape
(n_queries, k), dtype int64_t. If supplied, neighbor indices will be written here in-place. (default None)
Supported dtype uint64
- distancesOptional array interface compliant matrix shape
(n_queries, k), dtype float. If supplied, neighbor indices will be written here in-place. (default None)
- handleOptional RAFT resource handle for reusing expensive CUDA
resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling
handle.sync()
before accessing the output.
- Returns:
- indices: array interface compliant object containing resulting indices
shape (n_queries, k)
- distances: array interface compliant object containing resulting distances
shape (n_queries, k)
Examples
>>> import cupy as cp
>>> from pylibraft.common import DeviceResources >>> from pylibraft.neighbors.brute_force import knn
>>> n_samples = 50000 >>> n_features = 50 >>> n_queries = 1000
>>> dataset = cp.random.random_sample((n_samples, n_features), ... dtype=cp.float32) >>> # Search using the built index >>> queries = cp.random.random_sample((n_queries, n_features), ... dtype=cp.float32) >>> k = 40 >>> distances, neighbors = knn(dataset, queries, k) >>> distances = cp.asarray(distances) >>> neighbors = cp.asarray(neighbors)
CAGRA#
- class pylibraft.neighbors.cagra.IndexParams(metric='sqeuclidean', *, intermediate_graph_degree=128, graph_degree=64, add_data_on_build=True)#
- Attributes:
- add_data_on_build
- graph_degree
- intermediate_graph_degree
- metric
- pylibraft.neighbors.cagra.build(IndexParams index_params, dataset, handle=None)[source]#
Build the CAGRA index from the dataset for efficient search.
The build performs two different steps- first an intermediate knn-graph is constructed, then it’s optimized it to create the final graph. The index_params object controls the node degree of these graphs.
It is required that both the dataset and the optimized graph fit the GPU memory.
The following distance metrics are supported: - L2
- Parameters:
- index_paramsIndexParams object
- datasetCUDA array interface compliant matrix shape (n_samples, dim)
Supported dtype [float, int8, uint8]
- handleOptional RAFT resource handle for reusing expensive CUDA
resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling
handle.sync()
before accessing the output.
- Returns:
- index: cagra.Index
Examples
>>> import cupy as cp
>>> from pylibraft.common import DeviceResources >>> from pylibraft.neighbors import cagra
>>> n_samples = 50000 >>> n_features = 50 >>> n_queries = 1000 >>> k = 10
>>> dataset = cp.random.random_sample((n_samples, n_features), ... dtype=cp.float32)
>>> handle = DeviceResources() >>> build_params = cagra.IndexParams(metric="sqeuclidean")
>>> index = cagra.build(build_params, dataset, handle=handle)
>>> distances, neighbors = cagra.search(cagra.SearchParams(), ... index, dataset, ... k, handle=handle)
>>> # pylibraft functions are often asynchronous so the >>> # handle needs to be explicitly synchronized >>> handle.sync()
>>> distances = cp.asarray(distances) >>> neighbors = cp.asarray(neighbors)
- class pylibraft.neighbors.cagra.SearchParams(max_queries=0, *, itopk_size=64, max_iterations=0, algo='auto', team_size=0, search_width=1, min_iterations=0, thread_block_size=0, hashmap_mode='auto', hashmap_min_bitlen=0, hashmap_max_fill_rate=0.5, num_random_samplings=1, rand_xor_mask=0x128394)#
- Attributes:
- algo
- hashmap_max_fill_rate
- hashmap_min_bitlen
- hashmap_mode
- itopk_size
- max_iterations
- max_queries
- min_iterations
- num_random_samplings
- rand_xor_mask
- search_width
- team_size
- thread_block_size
- pylibraft.neighbors.cagra.search(SearchParams search_params, Index index, queries, k, neighbors=None, distances=None, handle=None)[source]#
Find the k nearest neighbors for each query.
- Parameters:
- search_paramsSearchParams
- indexIndex
Trained CAGRA index.
- queriesCUDA array interface compliant matrix shape (n_samples, dim)
Supported dtype [float, int8, uint8]
- kint
The number of neighbors.
- neighborsOptional CUDA array interface compliant matrix shape
(n_queries, k), dtype int64_t. If supplied, neighbor indices will be written here in-place. (default None)
- distancesOptional CUDA array interface compliant matrix shape
(n_queries, k) If supplied, the distances to the neighbors will be written here in-place. (default None)
- handleOptional RAFT resource handle for reusing expensive CUDA
resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling
handle.sync()
before accessing the output.
Examples
>>> import cupy as cp
>>> from pylibraft.common import DeviceResources >>> from pylibraft.neighbors import cagra
>>> n_samples = 50000 >>> n_features = 50 >>> n_queries = 1000 >>> dataset = cp.random.random_sample((n_samples, n_features), ... dtype=cp.float32)
>>> # Build index >>> handle = DeviceResources() >>> index = cagra.build(cagra.IndexParams(), dataset, handle=handle)
>>> # Search using the built index >>> queries = cp.random.random_sample((n_queries, n_features), ... dtype=cp.float32) >>> k = 10 >>> search_params = cagra.SearchParams( ... max_queries=100, ... itopk_size=64 ... )
>>> # Using a pooling allocator reduces overhead of temporary array >>> # creation during search. This is useful if multiple searches >>> # are performad with same query size. >>> distances, neighbors = cagra.search(search_params, index, queries, ... k, handle=handle)
>>> # pylibraft functions are often asynchronous so the >>> # handle needs to be explicitly synchronized >>> handle.sync()
>>> neighbors = cp.asarray(neighbors) >>> distances = cp.asarray(distances)
IVF-Flat#
- class pylibraft.neighbors.ivf_flat.IndexParams(n_lists=1024, *, metric=u'sqeuclidean', kmeans_n_iters=20, kmeans_trainset_fraction=0.5, add_data_on_build=True, bool adaptive_centers=False)#
- Attributes:
- adaptive_centers
- add_data_on_build
- kmeans_n_iters
- kmeans_trainset_fraction
- metric
- n_lists
- pylibraft.neighbors.ivf_flat.build(IndexParams index_params, dataset, handle=None)[source]#
Builds an IVF-FLAT index that can be used for nearest neighbor search.
- Parameters:
- index_paramsIndexParams object
- datasetCUDA array interface compliant matrix shape (n_samples, dim)
Supported dtype [float, int8, uint8]
- handleOptional RAFT resource handle for reusing expensive CUDA
resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling
handle.sync()
before accessing the output.
- Returns:
- index: ivf_flat.Index
Examples
>>> import cupy as cp
>>> from pylibraft.common import DeviceResources >>> from pylibraft.neighbors import ivf_flat
>>> n_samples = 50000 >>> n_features = 50 >>> n_queries = 1000
>>> dataset = cp.random.random_sample((n_samples, n_features), ... dtype=cp.float32) >>> handle = DeviceResources() >>> index_params = ivf_flat.IndexParams( ... n_lists=1024, ... metric="sqeuclidean")
>>> index = ivf_flat.build(index_params, dataset, handle=handle)
>>> # Search using the built index >>> queries = cp.random.random_sample((n_queries, n_features), ... dtype=cp.float32) >>> k = 10 >>> distances, neighbors = ivf_flat.search(ivf_flat.SearchParams(), index, ... queries, k, handle=handle)
>>> distances = cp.asarray(distances) >>> neighbors = cp.asarray(neighbors)
>>> # pylibraft functions are often asynchronous so the >>> # handle needs to be explicitly synchronized >>> handle.sync()
- pylibraft.neighbors.ivf_flat.extend(Index index, new_vectors, new_indices, handle=None)[source]#
Extend an existing index with new vectors.
- Parameters:
- indexivf_flat.Index
Trained ivf_flat object.
- new_vectorsCUDA array interface compliant matrix shape (n_samples, dim)
Supported dtype [float, int8, uint8]
- new_indicesCUDA array interface compliant vector shape (n_samples)
Supported dtype [int64]
- handleOptional RAFT resource handle for reusing expensive CUDA
resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling
handle.sync()
before accessing the output.
- Returns:
- index: ivf_flat.Index
Examples
>>> import cupy as cp
>>> from pylibraft.common import DeviceResources >>> from pylibraft.neighbors import ivf_flat
>>> n_samples = 50000 >>> n_features = 50 >>> n_queries = 1000
>>> dataset = cp.random.random_sample((n_samples, n_features), ... dtype=cp.float32) >>> handle = DeviceResources() >>> index = ivf_flat.build(ivf_flat.IndexParams(), dataset, handle=handle)
>>> n_rows = 100 >>> more_data = cp.random.random_sample((n_rows, n_features), ... dtype=cp.float32) >>> indices = index.size + cp.arange(n_rows, dtype=cp.int64) >>> index = ivf_flat.extend(index, more_data, indices)
>>> # Search using the built index >>> queries = cp.random.random_sample((n_queries, n_features), ... dtype=cp.float32) >>> k = 10 >>> distances, neighbors = ivf_flat.search(ivf_flat.SearchParams(), ... index, queries, ... k, handle=handle)
>>> # pylibraft functions are often asynchronous so the >>> # handle needs to be explicitly synchronized >>> handle.sync()
>>> distances = cp.asarray(distances) >>> neighbors = cp.asarray(neighbors)
- class pylibraft.neighbors.ivf_flat.SearchParams(n_probes=20, *)#
- Attributes:
- n_probes
- pylibraft.neighbors.ivf_flat.search(SearchParams search_params, Index index, queries, k, neighbors=None, distances=None, handle=None)[source]#
Find the k nearest neighbors for each query.
- Parameters:
- search_paramsSearchParams
- indexIndex
Trained IVF-FLAT index.
- queriesCUDA array interface compliant matrix shape (n_samples, dim)
Supported dtype [float, int8, uint8]
- kint
The number of neighbors.
- neighborsOptional CUDA array interface compliant matrix shape
(n_queries, k), dtype int64_t. If supplied, neighbor indices will be written here in-place. (default None)
- distancesOptional CUDA array interface compliant matrix shape
(n_queries, k) If supplied, the distances to the neighbors will be written here in-place. (default None)
- handleOptional RAFT resource handle for reusing expensive CUDA
resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling
handle.sync()
before accessing the output.
Examples
>>> import cupy as cp
>>> from pylibraft.common import DeviceResources >>> from pylibraft.neighbors import ivf_flat
>>> n_samples = 50000 >>> n_features = 50 >>> n_queries = 1000 >>> dataset = cp.random.random_sample((n_samples, n_features), ... dtype=cp.float32)
>>> # Build index >>> handle = DeviceResources() >>> index = ivf_flat.build(ivf_flat.IndexParams(), dataset, handle=handle)
>>> # Search using the built index >>> queries = cp.random.random_sample((n_queries, n_features), ... dtype=cp.float32) >>> k = 10 >>> search_params = ivf_flat.SearchParams( ... n_probes=20 ... ) >>> distances, neighbors = ivf_flat.search(search_params, index, queries, ... k, handle=handle)
>>> # pylibraft functions are often asynchronous so the >>> # handle needs to be explicitly synchronized >>> handle.sync()
>>> neighbors = cp.asarray(neighbors) >>> distances = cp.asarray(distances)
IVF-PQ#
- class pylibraft.neighbors.ivf_pq.IndexParams(n_lists=1024, *, metric='sqeuclidean', kmeans_n_iters=20, kmeans_trainset_fraction=0.5, pq_bits=8, pq_dim=0, codebook_kind='subspace', force_random_rotation=False, add_data_on_build=True, conservative_memory_allocation=False)#
- Attributes:
- add_data_on_build
- codebook_kind
- conservative_memory_allocation
- force_random_rotation
- kmeans_n_iters
- kmeans_trainset_fraction
- metric
- n_lists
- pq_bits
- pq_dim
- pylibraft.neighbors.ivf_pq.build(IndexParams index_params, dataset, handle=None)[source]#
Builds an IVF-PQ index that can be later used for nearest neighbor search.
The input array can be either CUDA array interface compliant matrix or array interface compliant matrix in host memory.
- Parameters:
- index_paramsIndexParams object
- datasetarray interface compliant matrix shape (n_samples, dim)
Supported dtype [float, int8, uint8]
- handleOptional RAFT resource handle for reusing expensive CUDA
resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling
handle.sync()
before accessing the output.
- Returns:
- index: ivf_pq.Index
Examples
>>> import cupy as cp
>>> from pylibraft.common import DeviceResources >>> from pylibraft.neighbors import ivf_pq
>>> n_samples = 50000 >>> n_features = 50 >>> n_queries = 1000
>>> dataset = cp.random.random_sample((n_samples, n_features), ... dtype=cp.float32) >>> handle = DeviceResources() >>> index_params = ivf_pq.IndexParams( ... n_lists=1024, ... metric="sqeuclidean", ... pq_dim=10) >>> index = ivf_pq.build(index_params, dataset, handle=handle)
>>> # Search using the built index >>> queries = cp.random.random_sample((n_queries, n_features), ... dtype=cp.float32) >>> k = 10 >>> distances, neighbors = ivf_pq.search(ivf_pq.SearchParams(), index, ... queries, k, handle=handle)
>>> distances = cp.asarray(distances) >>> neighbors = cp.asarray(neighbors)
>>> # pylibraft functions are often asynchronous so the >>> # handle needs to be explicitly synchronized >>> handle.sync()
- pylibraft.neighbors.ivf_pq.extend(Index index, new_vectors, new_indices, handle=None)[source]#
Extend an existing index with new vectors.
The input array can be either CUDA array interface compliant matrix or array interface compliant matrix in host memory.
- Parameters:
- indexivf_pq.Index
Trained ivf_pq object.
- new_vectorsarray interface compliant matrix shape (n_samples, dim)
Supported dtype [float, int8, uint8]
- new_indicesarray interface compliant vector shape (n_samples)
Supported dtype [int64]
- handleOptional RAFT resource handle for reusing expensive CUDA
resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling
handle.sync()
before accessing the output.
- Returns:
- index: ivf_pq.Index
Examples
>>> import cupy as cp
>>> from pylibraft.common import DeviceResources >>> from pylibraft.neighbors import ivf_pq
>>> n_samples = 50000 >>> n_features = 50 >>> n_queries = 1000
>>> dataset = cp.random.random_sample((n_samples, n_features), ... dtype=cp.float32) >>> handle = DeviceResources() >>> index = ivf_pq.build(ivf_pq.IndexParams(), dataset, handle=handle)
>>> n_rows = 100 >>> more_data = cp.random.random_sample((n_rows, n_features), ... dtype=cp.float32) >>> indices = index.size + cp.arange(n_rows, dtype=cp.int64) >>> index = ivf_pq.extend(index, more_data, indices)
>>> # Search using the built index >>> queries = cp.random.random_sample((n_queries, n_features), ... dtype=cp.float32) >>> k = 10 >>> distances, neighbors = ivf_pq.search(ivf_pq.SearchParams(), ... index, queries, ... k, handle=handle)
>>> # pylibraft functions are often asynchronous so the >>> # handle needs to be explicitly synchronized >>> handle.sync()
>>> distances = cp.asarray(distances) >>> neighbors = cp.asarray(neighbors)
- class pylibraft.neighbors.ivf_pq.SearchParams(n_probes=20, *, lut_dtype=np.float32, internal_distance_dtype=np.float32)#
- Attributes:
- internal_distance_dtype
- lut_dtype
- n_probes
- pylibraft.neighbors.ivf_pq.search(SearchParams search_params, Index index, queries, k, neighbors=None, distances=None, DeviceMemoryResource memory_resource=None, handle=None)[source]#
Find the k nearest neighbors for each query.
- Parameters:
- search_paramsSearchParams
- indexIndex
Trained IVF-PQ index.
- queriesCUDA array interface compliant matrix shape (n_samples, dim)
Supported dtype [float, int8, uint8]
- kint
The number of neighbors.
- neighborsOptional CUDA array interface compliant matrix shape
(n_queries, k), dtype int64_t. If supplied, neighbor indices will be written here in-place. (default None)
- distancesOptional CUDA array interface compliant matrix shape
(n_queries, k) If supplied, the distances to the neighbors will be written here in-place. (default None)
- memory_resourceRMM DeviceMemoryResource object, optional
This can be used to explicitly manage the temporary memory allocation during search. Passing a pooling allocator can reduce memory allocation overhead. If not specified, then the memory resource from the raft handle is used.
- handleOptional RAFT resource handle for reusing expensive CUDA
resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling
handle.sync()
before accessing the output.
Examples
>>> import cupy as cp
>>> from pylibraft.common import DeviceResources >>> from pylibraft.neighbors import ivf_pq
>>> n_samples = 50000 >>> n_features = 50 >>> n_queries = 1000 >>> dataset = cp.random.random_sample((n_samples, n_features), ... dtype=cp.float32)
>>> # Build index >>> handle = DeviceResources() >>> index = ivf_pq.build(ivf_pq.IndexParams(), dataset, handle=handle)
>>> # Search using the built index >>> queries = cp.random.random_sample((n_queries, n_features), ... dtype=cp.float32) >>> k = 10 >>> search_params = ivf_pq.SearchParams( ... n_probes=20, ... lut_dtype=cp.float16, ... internal_distance_dtype=cp.float32 ... )
>>> # Using a pooling allocator reduces overhead of temporary array >>> # creation during search. This is useful if multiple searches >>> # are performad with same query size. >>> import rmm >>> mr = rmm.mr.PoolMemoryResource( ... rmm.mr.CudaMemoryResource(), ... initial_pool_size=2**29, ... maximum_pool_size=2**31 ... ) >>> distances, neighbors = ivf_pq.search(search_params, index, queries, ... k, memory_resource=mr, ... handle=handle)
>>> # pylibraft functions are often asynchronous so the >>> # handle needs to be explicitly synchronized >>> handle.sync()
>>> neighbors = cp.asarray(neighbors) >>> distances = cp.asarray(distances)
Candidate Refinement#
- pylibraft.neighbors.refine(dataset, queries, candidates, k=None, indices=None, distances=None, metric='sqeuclidean', handle=None)[source]#
Refine nearest neighbor search.
Refinement is an operation that follows an approximate NN search. The approximate search has already selected n_candidates neighbor candidates for each query. We narrow it down to k neighbors. For each query, we calculate the exact distance between the query and its n_candidates neighbor candidate, and select the k nearest ones.
Input arrays can be either CUDA array interface compliant matrices or array interface compliant matrices in host memory. All array must be in the same memory space.
- Parameters:
- index_paramsIndexParams object
- datasetarray interface compliant matrix, shape (n_samples, dim)
Supported dtype [float, int8, uint8]
- queriesarray interface compliant matrix, shape (n_queries, dim)
Supported dtype [float, int8, uint8]
- candidatesarray interface compliant matrix, shape (n_queries, k0)
dtype int64
- kint
Number of neighbors to search (k <= k0). Optional if indices or distances arrays are given (in which case their second dimension is k).
- indicesOptional array interface compliant matrix shape
(n_queries, k), dtype int64. If supplied, neighbor indices will be written here in-place. (default None)
Supported dtype int64
- distancesOptional array interface compliant matrix shape
(n_queries, k), dtype float. If supplied, neighbor indices will be written here in-place. (default None)
- handleOptional RAFT resource handle for reusing expensive CUDA
resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling
handle.sync()
before accessing the output.
- Returns:
- index: ivf_pq.Index
Examples
>>> import cupy as cp
>>> from pylibraft.common import DeviceResources >>> from pylibraft.neighbors import ivf_pq, refine
>>> n_samples = 50000 >>> n_features = 50 >>> n_queries = 1000
>>> dataset = cp.random.random_sample((n_samples, n_features), ... dtype=cp.float32) >>> handle = DeviceResources() >>> index_params = ivf_pq.IndexParams(n_lists=1024, metric="sqeuclidean", ... pq_dim=10) >>> index = ivf_pq.build(index_params, dataset, handle=handle)
>>> # Search using the built index >>> queries = cp.random.random_sample((n_queries, n_features), ... dtype=cp.float32) >>> k = 40 >>> _, candidates = ivf_pq.search(ivf_pq.SearchParams(), index, ... queries, k, handle=handle)
>>> k = 10 >>> distances, neighbors = refine(dataset, queries, candidates, k, ... handle=handle) >>> distances = cp.asarray(distances) >>> neighbors = cp.asarray(neighbors)
>>> # pylibraft functions are often asynchronous so the >>> # handle needs to be explicitly synchronized >>> handle.sync()