NearestNeighbors#

class cuml.neighbors.NearestNeighbors(*, n_neighbors=5, radius=1.0, algorithm='auto', metric='euclidean', p=2, algo_params=None, metric_params=None, n_jobs=None, verbose=False, output_type=None)#

NearestNeighbors is an queries neighborhoods from a given set of datapoints. Currently, cuML supports k-NN queries, which define the neighborhood as the closest k neighbors to each query point.

Parameters:
n_neighborsint (default=5)

Default number of neighbors to query

radiusfloat (default=1.0)

Range of parameter space to use by default for radius_neighbors queries.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

algorithmstring (default=’auto’)

The query algorithm to use. Valid options are:

  • 'auto': to automatically select brute-force or random ball cover based on data shape and metric

  • 'rbc': for the random ball algorithm, which partitions the data space and uses the triangle inequality to lower the number of potential distances. Currently, this algorithm supports Haversine (2d) and Euclidean in 2d and 3d.

  • 'brute': for brute-force, slow but produces exact results

  • 'ivfflat': for inverted file, divide the dataset in partitions and perform search on relevant partitions only

  • 'ivfpq': for inverted file and product quantization, same as inverted list, in addition the vectors are broken in n_features/M sub-vectors that will be encoded thanks to intermediary k-means clusterings. This encoding provide partial information allowing faster distances calculations

metricstring (default=’euclidean’).

Distance metric to use. Supported metrics include: ‘l1’, ‘cityblock’, ‘taxicab’, ‘manhattan’, ‘euclidean’, ‘l2’, ‘sqeuclidean’, ‘canberra’, ‘minkowski’, ‘lp’, ‘chebyshev’, ‘linf’, ‘jensenshannon’, ‘cosine’, ‘braycurtis’, ‘jaccard’, ‘hellinger’, ‘correlation’, ‘inner_product’. The 'ivfflat' and 'ivfpq' algorithms only support: ‘euclidean’, ‘l2’, ‘sqeuclidean’, ‘cosine’, ‘correlation’, ‘inner_product’, whereas the 'rbc' algorithm only supports ‘euclidean’, ‘l2’, and ‘haversine’ (≤3 dimensions only). For sparse inputs, only the 'brute' algorithm is supported, with metrics: ‘l1’, ‘cityblock’, ‘taxicab’, ‘manhattan’, ‘euclidean’, ‘l2’, ‘canberra’, ‘minkowski’, ‘lp’, ‘chebyshev’, ‘linf’, ‘cosine’, ‘inner_product’, ‘jaccard’, ‘hellinger’.

pfloat (default=2)

Parameter for the Minkowski metric. When p = 1, this is equivalent to manhattan distance (l1), and euclidean distance (l2) for p = 2. For arbitrary p, minkowski distance (lp) is used.

algo_paramsdict, optional (default=None)

Used to configure the nearest neighbor algorithm to be used. If set to None, parameters will be generated automatically. Parameters for algorithm 'brute' when inputs are sparse:

  • batch_size_index : (int) number of rows in each batch of index array

  • batch_size_query : (int) number of rows in each batch of query array

Parameters for algorithm 'ivfflat':

  • nlist: (int) number of cells to partition dataset into

  • nprobe: (int) at query time, number of cells used for search

Parameters for algorithm 'ivfpq':

  • nlist: (int) number of cells to partition dataset into

  • nprobe: (int) at query time, number of cells used for search

  • M: (int) number of subquantizers

  • n_bits: (int) bits allocated per subquantizer

  • usePrecomputedTables : (bool) whether to use precomputed tables

metric_paramsdict, optional (default = None)

Additional keyword arguments for the metric function.

n_jobsint (default = None)

Ignored, here for scikit-learn API compatibility.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

Methods

radius_neighbors_graph(self[, X, radius])

Compute the (weighted) graph of neighbors within a radius.

Notes

For an additional example see the NearestNeighbors notebook.

For additional docs, see scikit-learn’s NearestNeighbors.

Pickling NearestNeighbors instances is supported for all algorithms. However, for RBC, IVFPQ or IVFFlat the index will currently be rebuilt upon load rather than serialized as part of the pickled binary. For approximate indices like IVFPQ or IVFFlat this may result in small differences between the original and reloaded models, as the generated indices may differ.

Examples

>>> import cudf
>>> from cuml.neighbors import NearestNeighbors
>>> from cuml.datasets import make_blobs

>>> X, _ = make_blobs(n_samples=5, centers=5,
...                   n_features=10, random_state=42)

>>> # build a cudf Dataframe
>>> X_cudf = cudf.DataFrame(X)

>>> # fit model
>>> model = NearestNeighbors(n_neighbors=3)
>>> model.fit(X)
NearestNeighbors()

>>> # get 3 nearest neighbors
>>> distances, indices = model.kneighbors(X_cudf)

>>> # print results
>>> print(indices)
0  1  2
0  0  3  1
1  1  3  0
2  2  4  0
3  3  0  1
4  4  2  0
>>> print(distances)
        0          1          2
0  0.007812  24.786566  26.399996
1  0.000000  24.786566  30.045017
2  0.007812   5.458400  27.051241
3  0.000000  26.399996  27.543869
4  0.000000   5.458400  29.583437
radius_neighbors_graph(self, X=None, radius=None) SparseCumlArray[source]#

Compute the (weighted) graph of neighbors within a radius.

Parameters:
Xarray-like, default=None

The query point or points. If not provided, neighbors of each indexed point are returned. In this case, the query point is not considered its own neighbor.

radiusfloat, default=None

Radius of neighborhoods. The default is the value passed to the constructor.

Returns:
Asparse-matrix of shape (n_queries, n_samples_fit)

The neighborhood graph, in CSR format.

Notes

This method is most efficient when the instance is fit with algorithm="rbc". Other algorithms will build a temporary RBC index per-call, which adds a small overhead.

Only euclidean/l2 metrics and dense inputs are currently supported.

Examples

>>> import cupy as cp
>>> from cuml.neighbors import NearestNeighbors
>>> X = cp.array([[0], [3], [1]])
>>> nn = NearestNeighbors().fit(X)
>>> A = nn.radius_neighbors_graph(X, radius=1.5)
>>> A.toarray()
array([[1., 0., 1.],
       [0., 1., 0.],
       [1., 0., 1.]])