DBSCAN
- class cuml.dask.cluster.DBSCAN(*, client=None, verbose=False, **kwargs)
Multi-Node Multi-GPU implementation of DBSCAN.
The whole dataset is copied to all the workers, but the work is divided by giving each worker “ownership” of a subset of the points. Each worker computes a clustering by considering the relationships between the points it owns and the rest of the dataset, and the partial results are merged at the end to obtain the final clustering.
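The ownership scheme described above can be illustrated with a small pure-Python sketch. It is a toy model, not cuML's implementation: the helper names (`owned_ranges`, `worker_pass`, `merge`, `dbscan_by_ownership`), the contiguous split of indices, and the union-find merge are all assumptions made for illustration, and the "workers" run sequentially in one process.

```python
import math


def owned_ranges(n_points, n_workers):
    """Split indices 0..n_points-1 into contiguous, near-equal ranges (assumed split)."""
    base, extra = divmod(n_points, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        stop = start + base + (1 if w < extra else 0)
        ranges.append(range(start, stop))
        start = stop
    return ranges


def worker_pass(points, owned, eps, min_samples):
    """One worker: compare each owned point against the FULL dataset."""
    result = {}
    for i in owned:
        neighbors = [j for j, q in enumerate(points)
                     if math.dist(points[i], q) <= eps]
        # A point is a core point if its neighborhood (itself included)
        # holds at least min_samples points.
        result[i] = (len(neighbors) >= min_samples, neighbors)
    return result


def merge(n_points, partials):
    """Combine per-worker results into one label per point (-1 means noise)."""
    info = {}
    for partial in partials:
        info.update(partial)
    core = [info[i][0] for i in range(n_points)]
    parent = list(range(n_points))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Core points within eps of each other belong to the same cluster.
    for i in range(n_points):
        if core[i]:
            for j in info[i][1]:
                if core[j]:
                    parent[find(i)] = find(j)

    labels, cluster_ids = [-1] * n_points, {}
    for i in range(n_points):
        if core[i]:
            labels[i] = cluster_ids.setdefault(find(i), len(cluster_ids))
    # Border points take the label of the first neighboring core point.
    for i in range(n_points):
        if not core[i]:
            for j in info[i][1]:
                if core[j]:
                    labels[i] = labels[j]
                    break
    return labels


def dbscan_by_ownership(points, eps, min_samples, n_workers):
    partials = [worker_pass(points, owned, eps, min_samples)
                for owned in owned_ranges(len(points), n_workers)]
    return merge(len(points), partials)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan_by_ownership(points, eps=1.5, min_samples=3, n_workers=3))
# prints [0, 0, 0, 1, 1, 1, -1]
```

Because every worker sees the full dataset, the merged labels in this sketch do not depend on how many workers the points are split across; only the share of neighborhood computations per worker changes.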
- Parameters:
- client: dask.distributed.Client
Dask client to use.
- verbose: int or boolean, default=False
Sets the logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- min_samples: int (default = 5)
The number of samples in a neighborhood (including the point itself) required for a point to be considered a core point.
- max_mbytes_per_batch: int64, optional
Calculate the batch size using no more than this number of megabytes for the pairwise distance computation. This trades runtime against memory usage, making the N^2 pairwise distance computations more tractable for large numbers of samples. If you are experiencing out-of-memory errors when running DBSCAN, you can set this value based on the memory size of your device. Note: this option does not limit the total memory used in the DBSCAN computation, so it cannot be set to the total memory available on the device.
- output_type: {‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
- calc_core_sample_indices: boolean, optional (default = True)
Indicates whether the indices of the core samples should be calculated. If the attribute core_sample_indices_ will not be used, setting this to False will avoid unnecessary kernel launches.
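The runtime/memory trade-off behind max_mbytes_per_batch can be made concrete with a rough calculation. The sketch below is not cuML's actual batch-size formula, which this page does not give. It assumes that each batch row needs one entry per dataset sample, that an entry takes 4 bytes, and that one megabyte is 10^6 bytes; `rows_per_batch` and `n_batches` are hypothetical helper names.

```python
def rows_per_batch(n_samples, max_mbytes_per_batch, bytes_per_entry=4):
    """Rows of the N x N pairwise computation that fit in the budget.

    Assumptions (not taken from cuML): one entry of bytes_per_entry bytes
    per (batch row, dataset sample) pair, and 1 MB = 10**6 bytes.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    budget_bytes = max_mbytes_per_batch * 10**6
    bytes_per_row = n_samples * bytes_per_entry
    # At least one row per batch, and never more rows than samples.
    return max(1, min(n_samples, budget_bytes // bytes_per_row))


def n_batches(n_samples, batch_rows):
    """Number of batches needed to cover all rows (ceiling division)."""
    return -(-n_samples // batch_rows)


rows = rows_per_batch(n_samples=1_000_000, max_mbytes_per_batch=1000)
print(rows, n_batches(1_000_000, rows))  # prints 250 4000
```

Under these assumptions, halving the budget halves the rows per batch and doubles the number of batches, which is the trade-off the parameter description refers to.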
Methods
fit(X[, out_dtype]): Fit a multi-node multi-GPU DBSCAN model.
fit_predict(X[, out_dtype]): Performs clustering on X and returns cluster labels.
Notes
For additional docs, see the documentation of the single-GPU DBSCAN model
- fit(X, out_dtype='int32')
Fit a multi-node multi-GPU DBSCAN model
- Parameters:
- X: array-like (device or host)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- out_dtype: dtype, default=“int32”
Determines the precision of the output labels array. Valid values are {“int32”, np.int32, “int64”, np.int64}.
- fit_predict(X, out_dtype='int32')
Performs clustering on X and returns cluster labels.
- Parameters:
- X: array-like (device or host)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- out_dtype: dtype, default=“int32”
Determines the precision of the output labels array. Valid values are {“int32”, np.int32, “int64”, np.int64}.
- Returns:
- labels: array-like (device or host)
Integer array of labels
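A minimal usage sketch of the class and fit_predict follows. It needs at least one CUDA GPU with cuML, Dask and dask-cuda installed, so it is wrapped in a function that is only defined here, not called. The wrapper name `run_dbscan_example`, the synthetic data, and the chosen values are assumptions for illustration. The `eps` argument is not listed on this page; it is assumed to be forwarded through **kwargs as in the single-GPU DBSCAN model.

```python
def run_dbscan_example():
    """Cluster two small synthetic blobs on a local CUDA cluster (requires a GPU)."""
    # Imports are kept inside the function so that merely defining it
    # does not require a GPU environment.
    import numpy as np
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster
    from cuml.dask.cluster import DBSCAN

    cluster = LocalCUDACluster()  # one Dask worker per visible GPU
    client = Client(cluster)
    try:
        rng = np.random.default_rng(0)
        # Host array: the whole dataset is copied to every worker.
        X = np.concatenate([
            rng.normal(0.0, 0.1, size=(100, 2)),
            rng.normal(5.0, 0.1, size=(100, 2)),
        ]).astype(np.float32)

        # eps is assumed to be passed through **kwargs (see lead-in).
        model = DBSCAN(client=client, eps=0.5, min_samples=5)
        labels = model.fit_predict(X, out_dtype="int32")
        return labels
    finally:
        client.close()
        cluster.close()
```

Calling `run_dbscan_example()` on a machine with a GPU should return one integer label per row of X.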