AgglomerativeClustering

class cuml.cluster.AgglomerativeClustering(n_clusters=2, *, metric='euclidean', connectivity='knn', linkage='single', c=15, verbose=False, output_type=None)

Agglomerative Clustering

Recursively merges the pair of clusters that minimally increases a given linkage distance.

Parameters:
n_clusters : int, default=2

The number of clusters to find.

metric : str, default=“euclidean”

Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, or “cosine”. If connectivity is “knn”, only “euclidean” is accepted.

connectivity : {“pairwise”, “knn”}, default=“knn”

The type of connectivity matrix to compute.
  • ‘pairwise’ computes the fully connected graph of pairwise distances between every pair of points. This is the fastest option to compute and works well for smaller datasets, but it requires O(n^2) space.

  • ‘knn’ sparsifies the fully connected connectivity matrix to save memory and enable much larger inputs. Use c to influence the number of neighbors used.

linkage : {“single”}, default=“single”

Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. At each step, the algorithm merges the pair of clusters that minimizes this criterion.

  • ‘single’ uses the minimum of the distances between all observations of the two sets.
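
To make the ‘single’ criterion concrete, the following is a minimal pure-Python sketch of single-linkage merging. It is a naive illustration under stated assumptions, not cuML's GPU implementation: the helper name single_linkage and the sample points are hypothetical, and the loop recomputes distances at every step for clarity rather than speed.

```python
import math


def single_linkage(points, n_clusters):
    """Naive single-linkage agglomerative clustering (illustration only)."""
    # Start with every point in its own cluster (lists of point indices).
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # 'single' linkage: the minimum distance over all cross-cluster pairs.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Merge cluster j into cluster i (i < j, so index i stays valid).
        clusters[i].extend(clusters[j])
        del clusters[j]

    # Convert the list of clusters into one label per point.
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Two well-separated groups of three points each (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (5.0, 5.3)]
labels = single_linkage(points, n_clusters=2)
```

With this data, the first three points end up in one cluster and the last three in the other.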

c : int, default=15

Indirectly influences the number of neighbors to use when connectivity="knn", with n_neighbors = log(n_samples) + c. The default of 15 should suffice for most problems.
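
The formula above can be sketched as follows to show how slowly the neighbor count grows with the dataset size. The base of the logarithm and the rounding are not stated here, so the natural logarithm and truncation to an integer are assumptions, and knn_neighbors is a hypothetical helper rather than a cuML function.

```python
import math


def knn_neighbors(n_samples, c=15, log=math.log):
    """Evaluate n_neighbors = log(n_samples) + c.

    Assumptions: natural logarithm by default, result truncated to an int.
    Pass log=math.log2 to see the effect of a different base.
    """
    return int(log(n_samples)) + c


k_thousand = knn_neighbors(1_000)                     # small dataset
k_million = knn_neighbors(1_000_000)                  # 1000x more samples
k_million_log2 = knn_neighbors(1_000_000, log=math.log2)
```

Under these assumptions, growing the dataset a thousandfold adds only a handful of neighbors, which is why a fixed c tends to dominate the result.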

verbose : int or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type : {‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

Attributes:
n_clusters_ : int

The number of clusters found by the algorithm.

labels_ : array, shape (n_samples,)

Cluster labels for each point.

n_leaves_ : int

Number of leaves in the hierarchical tree.

n_connected_components_ : int

The estimated number of connected components in the graph.

children_ : array, shape (n_samples - 1, 2)

The children of each non-leaf node.
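
As an illustration of how such an array can be read, the sketch below walks a small hand-written children array. It assumes the scikit-learn convention, which this page does not state: values smaller than n_samples are leaves (original samples), and row i describes the node with index n_samples + i. The array contents and the helper leaves_under are hypothetical.

```python
def leaves_under(node, children, n_samples):
    """Return the sample indices contained in the subtree rooted at `node`."""
    if node < n_samples:
        # Leaf node: it is an original sample.
        return [node]
    # Internal node: row (node - n_samples) holds its two children.
    left, right = children[node - n_samples]
    return (leaves_under(left, children, n_samples)
            + leaves_under(right, children, n_samples))


# Hypothetical children array for 4 samples (3 merges):
#   node 4 = merge(0, 1), node 5 = merge(2, 3), node 6 = merge(4, 5)
n_samples = 4
children = [[0, 1], [2, 3], [4, 5]]

first_merge = leaves_under(4, children, n_samples)
root = leaves_under(6, children, n_samples)
```

Here first_merge contains samples 0 and 1, and root contains all four samples.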

Methods

fit(self, X[, y, convert_dtype])

Fit the hierarchical clustering from features.

fit_predict(self, X[, y])

Fit and return the assigned cluster labels.

fit(self, X, y=None, *, convert_dtype=True) -> 'AgglomerativeClustering'

Fit the hierarchical clustering from features.

Parameters:
X : array-like (device or host), shape = (n_samples, n_features)

Dense matrix. If the data type is not float or double, the data is converted to float, which increases memory use. To avoid this, set convert_dtype to False; the method then raises an error instead. Acceptable formats: CUDA array interface compliant objects such as CuPy arrays, cuDF DataFrame/Series, NumPy ndarray, and pandas DataFrame/Series.

y : array-like (device or host), shape = (n_samples, 1)

Dense matrix. If the data type is not float or double, the data is converted to float, which increases memory use. To avoid this, set convert_dtype to False; the method then raises an error instead. Acceptable formats: CUDA array interface compliant objects such as CuPy arrays, cuDF DataFrame/Series, NumPy ndarray, and pandas DataFrame/Series.

convert_dtype : bool, optional (default = True)

When set to True, the fit method will, when necessary, convert y to the same data type as X if they differ. This increases the memory used by the method.

fit_predict(self, X, y=None) -> CumlArray

Fit and return the assigned cluster labels.

Parameters:
X : array-like (device or host), shape = (n_samples, n_features)

Dense matrix. If the data type is not float or double, the data is converted to float, which increases memory use. To avoid this, set convert_dtype to False; the method then raises an error instead. Acceptable formats: CUDA array interface compliant objects such as CuPy arrays, cuDF DataFrame/Series, NumPy ndarray, and pandas DataFrame/Series.

y : array-like (device or host), shape = (n_samples, 1)

Dense matrix. If the data type is not float or double, the data is converted to float, which increases memory use. To avoid this, set convert_dtype to False; the method then raises an error instead. Acceptable formats: CUDA array interface compliant objects such as CuPy arrays, cuDF DataFrame/Series, NumPy ndarray, and pandas DataFrame/Series.

Returns:
preds : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Cluster indices.

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
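
Examples

A minimal usage sketch built from the signature and parameters documented on this page. It assumes a working cuML installation with a CUDA-capable GPU and has not been verified against a specific cuML release. The function name cluster_two_blobs and the synthetic data are hypothetical, and the imports are kept inside the function so that defining it does not require the GPU stack.

```python
def cluster_two_blobs(n_per_blob=100, seed=0):
    """Cluster two well-separated 2-D blobs with cuML (requires a GPU)."""
    # Lazy imports: only needed when the function is actually called.
    import numpy as np
    from cuml.cluster import AgglomerativeClustering

    # Synthetic float32 data: two Gaussian blobs far apart from each other.
    rng = np.random.default_rng(seed)
    X = np.concatenate([
        rng.normal(loc=0.0, scale=0.5, size=(n_per_blob, 2)),
        rng.normal(loc=10.0, scale=0.5, size=(n_per_blob, 2)),
    ]).astype(np.float32)

    model = AgglomerativeClustering(
        n_clusters=2,
        metric="euclidean",
        connectivity="pairwise",  # small dataset, so O(n^2) space is acceptable
        linkage="single",
        output_type="numpy",      # return NumPy arrays instead of device arrays
    )

    # Fit the model and get one cluster label per row of X.
    labels = model.fit_predict(X)
    return labels, model.children_
```

On a machine with cuML installed, calling labels, children = cluster_two_blobs() should return one label per sample along with the merge tree. For much larger inputs, connectivity="knn" is the memory-saving choice described in the parameter list.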