DBSCAN

class cuml.cluster.DBSCAN(*, eps=0.5, min_samples=5, metric='euclidean', algorithm='brute', verbose=False, max_mbytes_per_batch=None, output_type=None, calc_core_sample_indices=True)

DBSCAN is a fast density-based clustering technique that finds clusters in regions where data points are concentrated. It generalizes well to problems in which the data points tend to congregate in dense groups.

cuML’s DBSCAN expects an array-like object or cuDF DataFrame, and constructs an adjacency graph to compute the distances between close neighbours.
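To make the adjacency-graph idea concrete, the following pure-Python sketch builds an eps-adjacency list with brute-force Euclidean distances, marks core points, and labels the resulting connected groups. It is an illustration only, not cuML's GPU implementation: the function names `eps_adjacency` and `dbscan_labels` are hypothetical, and the inclusive comparison `distance <= eps` is an assumption.

```python
import math
from collections import deque


def eps_adjacency(points, eps):
    """Return, for each point, the indices of all points within eps (itself included)."""
    n = len(points)
    return [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]


def dbscan_labels(points, eps, min_samples):
    """Label points by expanding clusters from core points; noise is -1."""
    adj = eps_adjacency(points, eps)
    is_core = [len(neighbors) >= min_samples for neighbors in adj]
    labels = [-1] * len(points)
    cluster = 0
    for start in range(len(points)):
        if labels[start] != -1 or not is_core[start]:
            continue
        labels[start] = cluster
        queue = deque([start])
        while queue:
            i = queue.popleft()
            if not is_core[i]:
                continue  # border points join a cluster but do not expand it
            for j in adj[i]:
                if labels[j] == -1:
                    labels[j] = cluster
                    queue.append(j)
        cluster += 1
    return labels


# Rows of the DataFrame used in the Examples section of this page.
points = [(1.0, 4.0, 4.0), (2.0, 2.0, 2.0), (5.0, 1.0, 1.0)]
labels = dbscan_labels(points, eps=1.0, min_samples=1)
print(labels)  # [0, 1, 2]
```

With `min_samples=1` every point is a core point on its own, and no two rows lie within `eps=1.0` of each other, so each row forms its own cluster.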

Parameters:
eps: float (default = 0.5)

The maximum distance between two points for them to be considered part of the same neighborhood.

min_samples: int (default = 5)

The number of samples in a neighborhood (including the point itself) required for a point to be considered a core point.

metric: {‘euclidean’, ‘cosine’, ‘precomputed’}, default = ‘euclidean’

The metric to use when calculating distances between points. If metric is ‘precomputed’, X is assumed to be a distance matrix and must be square. The input will be modified temporarily when cosine distance is used and the restored input matrix might not match completely due to numerical rounding.

algorithm: {‘brute’, ‘rbc’}, default = ‘brute’

The algorithm to use for nearest neighbor computations.

verbose: int or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

max_mbytes_per_batch: int64 (optional)

Calculate the batch size using no more than this number of megabytes for the pairwise distance computation. This trades runtime for memory usage, making the N^2 pairwise distance computations more tractable for large numbers of samples. If you are experiencing out-of-memory errors when running DBSCAN, you can set this value based on the memory size of your device. Note: this option does not cap the total memory used by the DBSCAN computation, so it cannot be set to the total memory available on the device.

output_type: {‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

calc_core_sample_indices: boolean (optional, default = True)

Indicates whether the indices of the core samples should be calculated. If True (the default), core_sample_indices_ and components_ will be computed and stored as fitted attributes. Set to False to avoid computing these attributes, removing a small amount of overhead.
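As a rough aid for choosing max_mbytes_per_batch, the back-of-envelope sketch below estimates how many rows of an n_samples x n_samples pairwise block fit in a given budget. This is not cuML's actual batching formula: the helper `estimate_batches`, the cost of 4 bytes per pairwise entry, and the convention 1 MB = 10**6 bytes are all assumptions made for illustration.

```python
import math


def estimate_batches(n_samples, max_mbytes_per_batch, bytes_per_entry=4):
    """Estimate rows per batch and batch count for an n_samples x n_samples block.

    Assumes each pairwise entry costs bytes_per_entry bytes and 1 MB = 10**6 bytes.
    """
    bytes_per_row = n_samples * bytes_per_entry
    budget = max_mbytes_per_batch * 10**6
    # Always process at least one row, and never more rows than exist.
    rows_per_batch = max(1, min(n_samples, budget // bytes_per_row))
    n_batches = math.ceil(n_samples / rows_per_batch)
    return rows_per_batch, n_batches


rows, batches = estimate_batches(n_samples=100_000, max_mbytes_per_batch=1000)
print(rows, batches)  # 2500 40
```

Under these assumptions the full 100,000 x 100,000 block would need about 40,000 MB, whereas a 1,000 MB budget processes it in 40 batches of 2,500 rows each.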

Attributes:
labels_: array-like or cuDF series

Which cluster each datapoint belongs to. Noisy samples are labeled as -1. Format depends on cuml global output type and estimator output_type.

core_sample_indices_: array-like or cuDF series

The indices of the core samples. Only calculated if calc_core_sample_indices=True.

components_: array-like or cuDF series

Copy of each core sample found by training. Only calculated if calc_core_sample_indices=True.
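The relationship between these attributes can be sketched with plain Python lists standing in for the fitted arrays. The values below are hypothetical and the helper `summarize_clustering` is not part of cuML; the sketch only assumes what is stated above, namely that noise is labeled -1 and that the core samples are the rows of the input selected by the core sample indices.

```python
def summarize_clustering(labels, core_sample_indices, X):
    """Count clusters and noise points, and gather the core samples from X."""
    n_clusters = len(set(labels) - {-1})  # -1 marks noise, not a cluster
    n_noise = sum(1 for label in labels if label == -1)
    components = [X[i] for i in core_sample_indices]
    return n_clusters, n_noise, components


# Hypothetical fitted values for five 2-D samples.
X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [9.0, 0.0], [5.1, 5.0]]
labels = [0, 0, 1, -1, 1]
core_sample_indices = [0, 1, 2, 4]

n_clusters, n_noise, components = summarize_clustering(labels, core_sample_indices, X)
print(n_clusters, n_noise)  # 2 1
```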

Methods

fit(self, X[, y, sample_weight, out_dtype, ...])

Perform DBSCAN clustering from features.

fit_predict(self, X[, y, sample_weight, ...])

Performs clustering on X and returns cluster labels.

Notes

DBSCAN is very sensitive to the distance metric it is used with, and it assumes that datapoints are concentrated in groups; without such concentration, no clusters can be constructed.
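A small pure-Python sketch shows why the choice of metric matters: the same pair of points can be far apart under one metric and identical under another. The helper `cosine_distance` is written here for illustration (defined as one minus the cosine similarity) and is not a cuML function.

```python
import math


def cosine_distance(a, b):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


a, b, c = (1.0, 0.0), (10.0, 0.0), (0.0, 1.0)

# Euclidean: a is much closer to c than to b.
euclid_ab = math.dist(a, b)
euclid_ac = math.dist(a, c)

# Cosine: a and b point the same way, so their distance is zero,
# while a and c are orthogonal.
cos_ab = cosine_distance(a, b)
cos_ac = cosine_distance(a, c)

print(euclid_ab > euclid_ac, cos_ab < cos_ac)  # True True
```

With an eps of, say, 0.5, the points a and b would be neighbors under cosine distance but not under Euclidean distance, so the two metrics would produce different clusterings of the same data.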

Applications of DBSCAN

DBSCAN’s main benefit is that the number of clusters is not a hyperparameter, and that it can find non-linearly shaped clusters. This also allows DBSCAN to be robust to noise. DBSCAN has been applied to analyzing particle collisions in the Large Hadron Collider, customer segmentation in marketing analyses, and much more.

For additional docs, see scikit-learn’s DBSCAN.

Examples

>>> # Both import methods supported
>>> from cuml import DBSCAN
>>> from cuml.cluster import DBSCAN
>>>
>>> import cudf
>>> import numpy as np
>>>
>>> gdf_float = cudf.DataFrame()
>>> gdf_float['0'] = np.asarray([1.0,2.0,5.0], dtype = np.float32)
>>> gdf_float['1'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
>>> gdf_float['2'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
>>>
>>> dbscan_float = DBSCAN(eps = 1.0, min_samples = 1)
>>> dbscan_float.fit(gdf_float)
DBSCAN()
>>> dbscan_float.labels_
0    0
1    1
2    2
dtype: int32

fit(self, X, y=None, sample_weight=None, *, out_dtype='int32', convert_dtype=True) -> 'DBSCAN'

Perform DBSCAN clustering from features.

Parameters:
X: array-like (device or host) shape = (n_samples, n_features)

Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

y: array-like (device or host) shape = (n_samples, 1)

Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

sample_weight: array-like (device or host) shape = (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtype: bool, optional (default = True)

When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

out_dtype: dtype

Determines the precision of the output labels array. Options are int32 or int64, defaults to int32.

sample_weight: array-like of shape (n_samples,), default=None

Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. default: None (which is equivalent to weight 1 for all samples).
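The weighting rule described above can be sketched in pure Python for 1-D points. The sketch assumes the convention used by scikit-learn's DBSCAN, where a point is core when the weights inside its eps-neighborhood (its own weight included) sum to at least min_samples; the helper `weighted_core_mask` is hypothetical, and cuML's internal handling may differ in detail.

```python
def weighted_core_mask(points, weights, eps, min_samples):
    """Mark a point as core when the weights in its eps-neighborhood sum to >= min_samples."""
    mask = []
    for p in points:
        total = sum(w for q, w in zip(points, weights) if abs(p - q) <= eps)
        mask.append(total >= min_samples)
    return mask


points = [0.0, 0.5, 5.0]  # the first two points are eps-neighbors; the third is isolated

# Equal weights: no neighborhood reaches min_samples=3.
uniform = weighted_core_mask(points, [1, 1, 1], eps=1.0, min_samples=3)

# A weight of at least min_samples makes the isolated point core by itself.
heavy = weighted_core_mask(points, [1, 1, 3], eps=1.0, min_samples=3)

# A negative weight on the second point stops its heavy neighbor from being core.
inhibited = weighted_core_mask(points, [3, -1, 1], eps=1.0, min_samples=3)

print(uniform, heavy, inhibited)
# [False, False, False] [False, False, True] [False, False, False]
```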

fit_predict(self, X, y=None, sample_weight=None, *, out_dtype='int32', convert_dtype=True) -> CumlArray

Performs clustering on X and returns cluster labels.

Parameters:
X: array-like (device or host) shape = (n_samples, n_features)

Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

y: array-like (device or host) shape = (n_samples, 1)

Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

sample_weight: array-like (device or host) shape = (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtype: bool, optional (default = True)

When set to True, the fit_predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

out_dtype: dtype

Determines the precision of the output labels array. Options are int32 or int64, defaults to int32.

sample_weight: array-like of shape (n_samples,), default=None

Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. default: None (which is equivalent to weight 1 for all samples).

Returns:
preds: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Cluster labels

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.