NearestNeighbors#
- class cuml.neighbors.NearestNeighbors(*, n_neighbors=5, radius=1.0, algorithm='auto', metric='euclidean', p=2, algo_params=None, metric_params=None, n_jobs=None, verbose=False, output_type=None)#
NearestNeighbors is an queries neighborhoods from a given set of datapoints. Currently, cuML supports k-NN queries, which define the neighborhood as the closest
kneighbors to each query point.- Parameters:
- n_neighborsint (default=5)
Default number of neighbors to query
- radiusfloat (default=1.0)
Range of parameter space to use by default for
radius_neighborsqueries.- verboseint or boolean, default=False
Sets logging level. It must be one of
cuml.common.logger.level_*. See Verbosity Levels for more info.- algorithmstring (default=’auto’)
The query algorithm to use. Valid options are:
'auto': to automatically select brute-force or random ball cover based on data shape and metric'rbc': for the random ball algorithm, which partitions the data space and uses the triangle inequality to lower the number of potential distances. Currently, this algorithm supports Haversine (2d) and Euclidean in 2d and 3d.'brute': for brute-force, slow but produces exact results'ivfflat': for inverted file, divide the dataset in partitions and perform search on relevant partitions only'ivfpq': for inverted file and product quantization, same as inverted list, in addition the vectors are broken in n_features/M sub-vectors that will be encoded thanks to intermediary k-means clusterings. This encoding provide partial information allowing faster distances calculations
- metricstring (default=’euclidean’).
Distance metric to use. Supported metrics include: ‘l1’, ‘cityblock’, ‘taxicab’, ‘manhattan’, ‘euclidean’, ‘l2’, ‘sqeuclidean’, ‘canberra’, ‘minkowski’, ‘lp’, ‘chebyshev’, ‘linf’, ‘jensenshannon’, ‘cosine’, ‘braycurtis’, ‘jaccard’, ‘hellinger’, ‘correlation’, ‘inner_product’. The
'ivfflat'and'ivfpq'algorithms only support: ‘euclidean’, ‘l2’, ‘sqeuclidean’, ‘cosine’, ‘correlation’, ‘inner_product’, whereas the'rbc'algorithm only supports ‘euclidean’, ‘l2’, and ‘haversine’ (≤3 dimensions only). For sparse inputs, only the'brute'algorithm is supported, with metrics: ‘l1’, ‘cityblock’, ‘taxicab’, ‘manhattan’, ‘euclidean’, ‘l2’, ‘canberra’, ‘minkowski’, ‘lp’, ‘chebyshev’, ‘linf’, ‘cosine’, ‘inner_product’, ‘jaccard’, ‘hellinger’.- pfloat (default=2)
Parameter for the Minkowski metric. When p = 1, this is equivalent to manhattan distance (l1), and euclidean distance (l2) for p = 2. For arbitrary p, minkowski distance (lp) is used.
- algo_paramsdict, optional (default=None)
Used to configure the nearest neighbor algorithm to be used. If set to None, parameters will be generated automatically. Parameters for algorithm
'brute'when inputs are sparse:batch_size_index : (int) number of rows in each batch of index array
batch_size_query : (int) number of rows in each batch of query array
Parameters for algorithm
'ivfflat':nlist: (int) number of cells to partition dataset into
nprobe: (int) at query time, number of cells used for search
Parameters for algorithm
'ivfpq':nlist: (int) number of cells to partition dataset into
nprobe: (int) at query time, number of cells used for search
M: (int) number of subquantizers
n_bits: (int) bits allocated per subquantizer
usePrecomputedTables : (bool) whether to use precomputed tables
- metric_paramsdict, optional (default = None)
Additional keyword arguments for the metric function.
- n_jobsint (default = None)
Ignored, here for scikit-learn API compatibility.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (
cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
Methods
radius_neighbors_graph(self[, X, radius])Compute the (weighted) graph of neighbors within a radius.
Notes
For an additional example see the NearestNeighbors notebook.
For additional docs, see scikit-learn’s NearestNeighbors.
Pickling
NearestNeighborsinstances is supported for all algorithms. However, for RBC, IVFPQ or IVFFlat the index will currently be rebuilt upon load rather than serialized as part of the pickled binary. For approximate indices like IVFPQ or IVFFlat this may result in small differences between the original and reloaded models, as the generated indices may differ.Examples
>>> import cudf >>> from cuml.neighbors import NearestNeighbors >>> from cuml.datasets import make_blobs >>> X, _ = make_blobs(n_samples=5, centers=5, ... n_features=10, random_state=42) >>> # build a cudf Dataframe >>> X_cudf = cudf.DataFrame(X) >>> # fit model >>> model = NearestNeighbors(n_neighbors=3) >>> model.fit(X) NearestNeighbors() >>> # get 3 nearest neighbors >>> distances, indices = model.kneighbors(X_cudf) >>> # print results >>> print(indices) 0 1 2 0 0 3 1 1 1 3 0 2 2 4 0 3 3 0 1 4 4 2 0 >>> print(distances) 0 1 2 0 0.007812 24.786566 26.399996 1 0.000000 24.786566 30.045017 2 0.007812 5.458400 27.051241 3 0.000000 26.399996 27.543869 4 0.000000 5.458400 29.583437
- radius_neighbors_graph(self, X=None, radius=None) SparseCumlArray[source]#
Compute the (weighted) graph of neighbors within a radius.
- Parameters:
- Xarray-like, default=None
The query point or points. If not provided, neighbors of each indexed point are returned. In this case, the query point is not considered its own neighbor.
- radiusfloat, default=None
Radius of neighborhoods. The default is the value passed to the constructor.
- Returns:
- Asparse-matrix of shape (n_queries, n_samples_fit)
The neighborhood graph, in CSR format.
Notes
This method is most efficient when the instance is fit with
algorithm="rbc". Other algorithms will build a temporary RBC index per-call, which adds a small overhead.Only euclidean/l2 metrics and dense inputs are currently supported.
Examples
>>> import cupy as cp >>> from cuml.neighbors import NearestNeighbors >>> X = cp.array([[0], [3], [1]]) >>> nn = NearestNeighbors().fit(X) >>> A = nn.radius_neighbors_graph(X, radius=1.5) >>> A.toarray() array([[1., 0., 1.], [0., 1., 0.], [1., 0., 1.]])