HDBSCAN

class cuml.cluster.hdbscan.HDBSCAN(*, min_cluster_size=5, min_samples=None, cluster_selection_epsilon=0.0, max_cluster_size=0, metric='euclidean', alpha=1.0, p=None, cluster_selection_method='eom', allow_single_cluster=False, gen_min_span_tree=False, verbose=False, output_type=None, prediction_data=False, build_algo='brute_force', build_kwds=None, device_ids=None)

HDBSCAN Clustering

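Performs DBSCAN over varying epsilon values and integrates the result to find the clustering that is most stable over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN) and makes it more robust to parameter selection.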

Note that while the algorithm is generally deterministic and should provide matching results between RAPIDS and the Scikit-learn Contrib versions, the construction of the k-nearest neighbors graph and minimum spanning tree can introduce differences between the two algorithms, especially when several nearest neighbors around a point might have the same distance. While the differences in the minimum spanning trees alone might be subtle, they can (and often will) lead to some points being assigned different cluster labels between the two implementations.

Parameters:
alphafloat, optional (default=1.0)

A distance scaling parameter as used in robust single linkage.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

min_cluster_sizeint, optional (default = 5)

The minimum number of samples in a group for that group to be considered a cluster; groupings smaller than this size will be left as noise.

min_samplesint, optional (default=None)

The number of samples in a neighborhood for a point to be considered as a core point. This includes the point itself. If ‘None’, it defaults to the min_cluster_size.

cluster_selection_epsilonfloat, optional (default=0.0)

A distance threshold. Clusters below this value will be merged. Do not use this if you want to predict cluster labels for new points in the future (e.g. using approximate_predict), as the approximate_predict function is not aware of this argument.

max_cluster_sizeint, optional (default=0)

A limit on the size of clusters returned by the eom algorithm. Has no effect when using leaf clustering (where clusters are usually small regardless) and can also be overridden in rare cases by a high value for cluster_selection_epsilon. Do not use this if you want to predict cluster labels for new points in the future (e.g. using approximate_predict), as the approximate_predict function is not aware of this argument.

metricstring, optional (default=’euclidean’)

The metric to use when calculating distance between instances in a feature array. Allowed values: ‘euclidean’.

pint, optional (default=None)

p value to use if using the minkowski metric.

cluster_selection_methodstring, optional (default=’eom’)

The method used to select clusters from the condensed tree. The standard approach for HDBSCAN* is to use an Excess of Mass algorithm to find the most persistent clusters. Alternatively, you can select the clusters at the leaves of the tree, which provides the most fine-grained and homogeneous clusters. Options are:

  • eom

  • leaf

allow_single_clusterbool, optional (default=False)

By default, HDBSCAN* will not produce a single cluster. Setting this to True overrides that behavior and allows single-cluster results if you consider them valid for your dataset.

gen_min_span_treebool, optional (default=False)

Whether to populate the minimum_spanning_tree_ member for utilizing plotting tools. This requires the hdbscan CPU Python package to be installed.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

prediction_databool, optional (default=False)

Whether to generate extra cached data for predicting labels or membership vectors for new, unseen points later. If you wish to persist the clustering object for later re-use, you probably want to set this to True.

build_algo: string (default=’brute_force’)

How to build the knn graph. Supported build algorithms are [‘brute_force’, ‘nn_descent’]. The ‘nn_descent’ algorithm is typically faster, but may result in a slight accuracy drop compared to ‘brute_force’.

build_kwds: dict (optional, default=None)

Dictionary of parameters to configure the build algorithm. Default values:

  • knn_n_clusters (int, default=1): Number of clusters for data partitioning. Higher values reduce memory usage at the cost of accuracy. When knn_n_clusters > 1, HDBSCAN can process data larger than device memory.

  • knn_overlap_factor (int, default=2): Number of clusters each data point belongs to. Valid only when knn_n_clusters > 1. Must be < ‘knn_n_clusters’.

  • nnd_graph_degree (int, default=64): Graph degree used for NN Descent when build_algo=nn_descent. Must be ≥ min_samples+1.

  • nnd_intermediate_graph_degree (int, default=128): Intermediate graph degree for NN Descent. Must be > nnd_graph_degree.

  • nnd_max_iterations (int, default=20): Max NN Descent iterations when build_algo=nn_descent.

  • nnd_termination_threshold (float, default=0.0001): Stricter threshold leads to better convergence but longer runtime.

Hints:

  • Increasing nnd_graph_degree and nnd_max_iterations may improve accuracy when build_algo=nn_descent.

  • The ratio knn_overlap_factor / knn_n_clusters impacts memory usage. Approximately (knn_overlap_factor / knn_n_clusters) * num_rows_in_entire_data rows will be loaded onto device memory at once. E.g., 2/20 uses less device memory than 2/10.

  • A larger knn_overlap_factor results in better accuracy of the final knn graph. E.g., while using a similar amount of device memory, (knn_overlap_factor / knn_n_clusters) = 4/20 will have better accuracy than 2/10, at the cost of performance.

  • Start with knn_overlap_factor = 2 and gradually increase (2->3->4 …) for better accuracy.

  • Start with knn_n_clusters = 4 and increase (4 → 8 → 16…) for less GPU memory usage. This is independent from knn_overlap_factor as long as ‘knn_overlap_factor’ < ‘knn_n_clusters’.
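
The constraints and the memory hint listed above can be checked before an estimator is constructed. The following is a minimal sketch: `check_build_kwds` and `approx_rows_on_device` are hypothetical helpers, not part of cuML, and they encode only the rules and default values stated in this section.

```python
# Documented defaults for build_kwds, copied from the list above.
BUILD_KWDS_DEFAULTS = {
    "knn_n_clusters": 1,
    "knn_overlap_factor": 2,
    "nnd_graph_degree": 64,
    "nnd_intermediate_graph_degree": 128,
    "nnd_max_iterations": 20,
    "nnd_termination_threshold": 0.0001,
}


def check_build_kwds(build_kwds, min_samples, build_algo="brute_force"):
    """Fill in defaults and validate the documented constraints.

    Hypothetical helper (not a cuML function). Raises ValueError when a
    rule from the documentation is violated; returns the completed dict.
    """
    unknown = set(build_kwds) - set(BUILD_KWDS_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown build_kwds keys: {sorted(unknown)}")
    kw = {**BUILD_KWDS_DEFAULTS, **build_kwds}

    # knn_overlap_factor is only valid when the data is partitioned,
    # and must then be strictly smaller than knn_n_clusters.
    if kw["knn_n_clusters"] > 1 and kw["knn_overlap_factor"] >= kw["knn_n_clusters"]:
        raise ValueError("knn_overlap_factor must be < knn_n_clusters")

    if build_algo == "nn_descent":
        if kw["nnd_graph_degree"] < min_samples + 1:
            raise ValueError("nnd_graph_degree must be >= min_samples + 1")
        if kw["nnd_intermediate_graph_degree"] <= kw["nnd_graph_degree"]:
            raise ValueError(
                "nnd_intermediate_graph_degree must be > nnd_graph_degree"
            )
    return kw


def approx_rows_on_device(num_rows, knn_n_clusters, knn_overlap_factor):
    """Approximate number of rows resident on the device at once.

    Implements (knn_overlap_factor / knn_n_clusters) * num_rows from the
    hint above, using integer arithmetic.
    """
    return num_rows * knn_overlap_factor // knn_n_clusters
```

For a dataset of one million rows, the estimate is 100,000 rows at a 2/20 ratio and 200,000 rows at 2/10, which matches the hint that 2/20 uses less device memory. A validated dictionary could then be passed as `HDBSCAN(build_algo="nn_descent", build_kwds=kw)`.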

device_idslist[int], “all”, or None, default=None

The device IDs to use during fitting (only used when build_algo=nn_descent and knn_n_clusters > 1). May be a list of ids, "all" (to use all available devices), or None (to fit using a single GPU only). Default is None.

Attributes:
labels_ndarray, shape (n_samples, )

Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.

probabilities_ndarray, shape (n_samples, )

The strength with which each sample is a member of its assigned cluster. Noise points have probability zero; points in clusters have values assigned proportional to the degree that they persist as part of the cluster.

cluster_persistence_ndarray, shape (n_clusters, )

A score of how persistent each cluster is. A score of 1.0 represents a perfectly stable cluster that persists over all distance scales, while a score of 0.0 represents a perfectly ephemeral cluster. These scores can be used to gauge the relative coherence of the clusters output by the algorithm.

condensed_tree_CondensedTree object

HDBSCAN.condensed_tree_(self)

single_linkage_tree_SingleLinkageTree object

HDBSCAN.single_linkage_tree_(self)

minimum_spanning_tree_MinimumSpanningTree object

HDBSCAN.minimum_spanning_tree_(self)
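
Because noisy samples are labeled -1 in labels_, the number of clusters and the amount of noise can be read directly from the label array. The sketch below uses a hypothetical helper, `summarize_labels`, which is not part of cuML. It accepts any host-side iterable of integer labels, so it assumes the estimator output has been moved to the host first (for example by constructing the estimator with output_type='numpy').

```python
from collections import Counter


def summarize_labels(labels):
    """Count clusters, noise points, and per-cluster sizes.

    Hypothetical helper (not a cuML function). `labels` is any iterable
    of integer cluster labels in which -1 marks noise.
    """
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # remove the noise label from the tally
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "sizes": dict(counts),
    }
```

Given the labels `[0, 0, 1, -1, 1, 1, -1]`, the helper reports two clusters of sizes 2 and 3 and two noise points.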

Methods

fit(self, X[, y, convert_dtype])

Fit HDBSCAN model from features.

fit_predict(self, X[, y])

Fit the HDBSCAN model from features and return cluster labels.

generate_prediction_data(self)

Create data that caches intermediate results used for predicting the label of new/unseen points.

property condensed_tree_
fit(self, X, y=None, *, convert_dtype=True) → 'HDBSCAN'

Fit HDBSCAN model from features.

Parameters:
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

fit_predict(self, X, y=None) → CumlArray

Fit the HDBSCAN model from features and return cluster labels.

Parameters:
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

Returns:
predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Cluster indexes

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

generate_prediction_data(self)

Create data that caches intermediate results used for predicting the label of new/unseen points. This data is only useful if you are intending to use functions from hdbscan.prediction.
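
A minimal sketch of how the cached prediction data might be used to label unseen points. It assumes that `approximate_predict` can be imported from `cuml.cluster.hdbscan` and returns a pair of labels and probabilities, and that a CUDA-capable GPU with cuML and NumPy is available. The function name `fit_then_predict` and the synthetic data are illustrative only. Imports are done inside the function so that defining it has no dependencies.

```python
def fit_then_predict(n_train=1000, n_new=10, seed=0):
    """Fit HDBSCAN on two synthetic blobs, then label unseen points.

    Illustrative sketch; requires cuML, NumPy, and a CUDA-capable GPU
    at call time.
    """
    import numpy as np
    from cuml.cluster.hdbscan import HDBSCAN, approximate_predict

    rng = np.random.default_rng(seed)

    # Two well-separated Gaussian blobs (synthetic data).
    centers = np.array([[0.0, 0.0], [10.0, 10.0]], dtype=np.float32)
    X_train = np.concatenate(
        [c + rng.normal(size=(n_train // 2, 2)) for c in centers]
    ).astype(np.float32)

    # New points drawn around the first center.
    X_new = (centers[0] + rng.normal(size=(n_new, 2))).astype(np.float32)

    # prediction_data=True caches the data needed for later prediction.
    # cluster_selection_epsilon and max_cluster_size stay at their
    # defaults because approximate_predict is not aware of them.
    clusterer = HDBSCAN(min_cluster_size=25, prediction_data=True)
    clusterer.fit(X_train)

    new_labels, new_probs = approximate_predict(clusterer, X_new)
    return new_labels, new_probs
```

If the estimator was fitted with prediction_data=False, calling generate_prediction_data() on it afterwards should produce the same cached data before prediction.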

property minimum_spanning_tree_
property prediction_data_
property single_linkage_tree_