AgglomerativeClustering
- class cuml.cluster.AgglomerativeClustering(n_clusters=2, *, metric='euclidean', connectivity='knn', linkage='single', c=15, verbose=False, output_type=None)
Agglomerative Clustering
Recursively merges the pair of clusters that minimally increases a given linkage distance.
- Parameters:
- n_clustersint, default=2
The number of clusters to find.
- metricstr, default=”euclidean”
Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, or “cosine”. If connectivity is “knn” only “euclidean” is accepted.
- connectivity{“pairwise”, “knn”}, default=”knn”
- The type of connectivity matrix to compute.
‘pairwise’ will compute the entire fully-connected graph of pairwise distances between all points. This is the faster option to compute, particularly for smaller datasets, but it requires O(n^2) space.
‘knn’ will sparsify the fully-connected connectivity matrix to save memory and enable much larger inputs. You can use c to influence the number of neighbors used.
- linkage{“single”}, default=”single”
Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pairs of clusters that minimize this criterion.
‘single’ uses the minimum of the distances between all observations of the two sets.
- cint, default=15
Indirectly influences the number of neighbors to use when connectivity="knn", with n_neighbors = log(n_samples) + c. The default of 15 should suffice for most problems.
- verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
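The trade-off between the two connectivity modes, and the effect of c, can be sketched with a few lines of arithmetic. This is an illustration only, not cuML code: the helper names are hypothetical, the logarithm is assumed to be natural and the result truncated to an integer (the text above does not specify either), and 4 bytes per stored distance assumes float32 data.

```python
import math


def estimate_n_neighbors(n_samples, c=15):
    # Formula from the description of `c`: n_neighbors = log(n_samples) + c.
    # Assumptions: natural logarithm, result truncated to an integer.
    return int(math.log(n_samples) + c)


def pairwise_bytes(n_samples, bytes_per_value=4):
    # Dense n x n distance matrix, hence O(n^2) space.
    return n_samples * n_samples * bytes_per_value


def knn_bytes(n_samples, c=15, bytes_per_value=4):
    # One stored distance per (point, neighbor) pair; index storage is ignored.
    return n_samples * estimate_n_neighbors(n_samples, c) * bytes_per_value


for n in (1_000, 100_000, 1_000_000):
    print(n, estimate_n_neighbors(n), pairwise_bytes(n), knn_bytes(n))
```

Under these assumptions the neighbor count grows only logarithmically with the number of samples, which is why the knn graph stays small where the pairwise matrix does not.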
- Attributes:
- n_clusters_int
The number of clusters found by the algorithm.
- labels_array, shape (n_samples,)
Cluster labels for each point.
- n_leaves_int
Number of leaves in the hierarchical tree.
- n_connected_components_int
The estimated number of connected components in the graph.
- children_array, shape (n_samples - 1, 2)
The children of each non-leaf node.
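To make the shape of these attributes concrete, here is a naive pure-Python sketch of single-linkage agglomerative clustering. It is not how cuML computes its result (it runs on the CPU in roughly cubic time), and the function name is hypothetical. It assumes the scikit-learn numbering convention for the merge table, in which leaves are numbered 0 to n_samples - 1 and the node created by merge i gets id n_samples + i; this page does not confirm that convention for cuML.

```python
import math


def single_linkage(points, n_clusters):
    """Naive single-linkage clustering on a list of coordinate tuples."""
    n = len(points)
    clusters = {i: [i] for i in range(n)}  # node id -> member point indices
    children = []                          # one [left, right] row per merge
    final = None                           # flat clustering at n_clusters
    next_id = n
    while len(clusters) > 1:
        if len(clusters) == n_clusters:
            final = [list(m) for m in clusters.values()]
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Single linkage: minimum distance over all cross-cluster pairs.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        children.append([a, b])
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    if final is None:  # n_clusters == 1: everything ends up in one cluster
        final = [list(m) for m in clusters.values()]
    labels = [0] * n
    for label, members in enumerate(sorted(final, key=min)):
        for i in members:
            labels[i] = label
    return labels, children


points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9)]
labels, children = single_linkage(points, n_clusters=2)
print(labels)
print(children)
```

The merge table has n_samples - 1 rows because each merge reduces the number of clusters by one, which matches the documented shape of children_.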
Methods
fit(self, X[, y, convert_dtype])Fit the hierarchical clustering from features.
fit_predict(self, X[, y])Fit and return the assigned cluster labels.
- fit(self, X, y=None, *, convert_dtype=True) -> 'AgglomerativeClustering'
Fit the hierarchical clustering from features.
- Parameters:
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the data type is anything other than float or double, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this conversion; the method will then raise an error instead. Acceptable formats: CUDA array interface compliant objects such as CuPy arrays, cuDF DataFrame/Series, NumPy ndarray, and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If the data type is anything other than float or double, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this conversion; the method will then raise an error instead. Acceptable formats: CUDA array interface compliant objects such as CuPy arrays, cuDF DataFrame/Series, NumPy ndarray, and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the fit method will, when necessary, convert y to the same data type as X. This increases the memory used by the method.
- fit_predict(self, X, y=None) -> CumlArray
Fit and return the assigned cluster labels.
- Parameters:
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the data type is anything other than float or double, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this conversion; the method will then raise an error instead. Acceptable formats: CUDA array interface compliant objects such as CuPy arrays, cuDF DataFrame/Series, NumPy ndarray, and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If the data type is anything other than float or double, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this conversion; the method will then raise an error instead. Acceptable formats: CUDA array interface compliant objects such as CuPy arrays, cuDF DataFrame/Series, NumPy ndarray, and Pandas DataFrame/Series.
- Returns:
- predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Cluster indexes
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
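A minimal usage sketch follows, built only from the constructor signature, methods, and attributes documented on this page. The wrapper function cluster_with_cuml is a hypothetical helper, not part of cuML, and running it requires a working cuML installation with a supported GPU. The import is placed inside the function so that the snippet can be loaded on machines without cuML.

```python
def cluster_with_cuml(X, n_clusters=2, c=15):
    """Cluster X with cuML and return (labels, children, n_clusters_).

    X is expected to be a dense float array of shape (n_samples, n_features),
    for example a NumPy or CuPy array.
    """
    # Lazy import: only needed when the function is actually called.
    from cuml.cluster import AgglomerativeClustering

    model = AgglomerativeClustering(
        n_clusters=n_clusters,
        metric="euclidean",    # the only metric accepted with connectivity="knn"
        connectivity="knn",    # sparse graph; use "pairwise" for small inputs
        linkage="single",      # the only linkage listed on this page
        c=c,                   # shifts n_neighbors = log(n_samples) + c
        output_type="numpy",   # return results and attributes as NumPy objects
    )
    labels = model.fit_predict(X)
    return labels, model.children_, model.n_clusters_
```

Passing output_type="numpy" is one choice among the documented options; leaving it as None defers to the module-level setting described above.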