NearestNeighbors

class cuml.dask.neighbors.NearestNeighbors(*, client=None, streams_per_handle=0, **kwargs)

Multi-node Multi-GPU NearestNeighbors Model.

Parameters:
n_neighbors: int (default=5)

Default number of neighbors to query

batch_size: int (optional, default 2000000)

Maximum number of query rows processed at once. This parameter can greatly affect the throughput of the algorithm. The optimal value varies with the data layout and the index-to-query ratio. Whatever value is chosen, each worker hosting index partitions needs batch_size * n_features * 4 bytes of additional memory.

verbose: int or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
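The batch_size memory requirement described above can be worked through with plain arithmetic. The following is a minimal sketch, not part of cuML: the helper name batch_memory_bytes and the choice of 128 features are assumptions made for illustration, and the factor of 4 is taken directly from the formula in the parameter description.

```python
# Hypothetical helper (not a cuML function): estimate the additional memory
# needed on each worker hosting index partitions, using the documented
# formula batch_size * n_features * 4 bytes.
def batch_memory_bytes(batch_size: int, n_features: int, bytes_per_value: int = 4) -> int:
    return batch_size * n_features * bytes_per_value


default_batch_size = 2_000_000  # documented default for batch_size
n_features = 128                # assumed dataset width, for illustration only

extra = batch_memory_bytes(default_batch_size, n_features)
print(f"{extra} bytes ({extra / 2**30:.2f} GiB) per worker")
```

With these assumed numbers the default batch size costs roughly 1 GB per worker, so lowering batch_size is one way to trade throughput for memory headroom.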

Methods

fit(X)

Fit a multi-node multi-GPU Nearest Neighbors index

get_neighbors(n_neighbors)

Returns the default n_neighbors, initialized from the constructor, if n_neighbors is None.

kneighbors([X, n_neighbors, ...])

Query the distributed nearest neighbors index

fit(X)

Fit a multi-node multi-GPU Nearest Neighbors index

Parameters:
X: dask_cudf.DataFrame

Returns:
self: NearestNeighbors model

get_neighbors(n_neighbors)

Returns the default n_neighbors, initialized from the constructor, if n_neighbors is None.

Parameters:
n_neighbors: int

Number of neighbors

Returns:
n_neighbors: int

Default n_neighbors if the n_neighbors parameter is None
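The fallback behaviour of get_neighbors can be illustrated with a small stand-in class. This is a sketch under assumptions, not the cuML implementation: the class name NeighborsDefault is hypothetical, and passing a non-None argument straight through is assumed rather than stated in the description above.

```python
from typing import Optional


class NeighborsDefault:
    """Hypothetical stand-in, not cuml.dask.neighbors.NearestNeighbors."""

    def __init__(self, n_neighbors: int = 5):
        # Default stored at construction time.
        self.n_neighbors = n_neighbors

    def get_neighbors(self, n_neighbors: Optional[int] = None) -> int:
        # Fall back to the constructor default only when nothing is passed.
        # Assumption: an explicit value is returned unchanged.
        return self.n_neighbors if n_neighbors is None else n_neighbors


model = NeighborsDefault(n_neighbors=8)
print(model.get_neighbors(None))  # constructor default
print(model.get_neighbors(3))     # explicit value
```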

kneighbors(X=None, n_neighbors=None, return_distance=True, _return_futures=False)

Query the distributed nearest neighbors index

Parameters:
X: dask_cudf.DataFrame

Vectors to query. If not provided, neighbors of each indexed point are returned.

n_neighbors: int

Number of neighbors to query for each row in X. If not provided, the model's default n_neighbors is used.

return_distance: boolean (default=True)

If False, only indices are returned.

Returns:
ret: tuple (dask_cudf.DataFrame, dask_cudf.DataFrame)

First dask-cuDF DataFrame contains distances, second contains the indices.
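The calling convention of kneighbors (distances first, indices second, and indices only when return_distance is False) can be mimicked on the CPU. The following is a single-process sketch in pure Python, not the distributed GPU algorithm cuML uses: the function brute_force_kneighbors is hypothetical, Euclidean distance is an assumption, and in this sketch a point queried against its own index is returned as its own nearest neighbor.

```python
import math


def brute_force_kneighbors(index, queries=None, n_neighbors=5, return_distance=True):
    """Hypothetical CPU sketch that mirrors the documented return shape."""
    # With no query vectors, look up the neighbors of each indexed point.
    if queries is None:
        queries = index
    all_distances, all_indices = [], []
    for q in queries:
        # Rank every indexed point by (assumed) Euclidean distance to q.
        ranked = sorted((math.dist(q, p), i) for i, p in enumerate(index))
        ranked = ranked[:n_neighbors]
        all_distances.append([d for d, _ in ranked])
        all_indices.append([i for _, i in ranked])
    if return_distance:
        return all_distances, all_indices  # distances first, then indices
    return all_indices


index = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
distances, indices = brute_force_kneighbors(index, queries=[(0.0, 0.1)], n_neighbors=2)
print(indices)    # row positions in the index, nearest first
print(distances)  # matching distances, same shape as indices
```

Each output row corresponds to one query row, and column j holds that query's j-th nearest neighbor.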