KMeans
- class cuml.cluster.KMeans(*, n_clusters=8, max_iter=300, tol=0.0001, verbose=False, random_state=None, init='scalable-k-means++', n_init='auto', oversampling_factor=2.0, max_samples_per_batch=32768, output_type=None)
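KMeans is a basic but powerful clustering method that is optimized via Expectation Maximization (EM). In its simplest form, it randomly selects K data points in X as initial centroids and assigns every sample to its closest centroid. For every resulting cluster of points, a mean is computed (hence the name), and this becomes the new centroid. The two steps repeat until the centroids stop changing by more than tol or max_iter is reached.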
cuML’s KMeans expects an array-like object or cuDF DataFrame, and supports the scalable KMeans++ initialization method. This method is more stable than randomly selecting K points.
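The assign-and-update loop described above can be sketched on the CPU with NumPy. This is a minimal illustration of plain Lloyd iteration, not cuML's GPU implementation. The function name `lloyd_kmeans`, its `init_idx` argument, and the convergence test (total centroid shift compared with `tol`) are assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(X, n_clusters, init_idx=None, max_iter=300, tol=1e-4, seed=None):
    """Plain Lloyd iteration (hypothetical helper, not part of cuML)."""
    X = np.asarray(X, dtype=float)
    if init_idx is None:
        # Random initialization: n_clusters distinct rows of X become the centroids.
        rng = np.random.default_rng(seed)
        init_idx = rng.choice(len(X), size=n_clusters, replace=False)
    centers = X[np.asarray(init_idx)].copy()
    for _ in range(max_iter):
        # Assignment step: squared distance from every sample to every centroid.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned samples.
        new_centers = centers.copy()
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centers[k] = members.mean(axis=0)
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift <= tol:  # assumed stopping rule for this sketch
            break
    return centers, labels

X = np.array([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]])
# Fixed starting rows keep the example reproducible.
centers, labels = lloyd_kmeans(X, n_clusters=2, init_idx=[0, 1])
print(centers)  # [[1.  1.5] [3.5 2.5]]
print(labels)   # [0 0 1 1]
```

Starting the same data from rows 2 and 3 instead converges to a worse local optimum, which is one reason a more careful initialization than random selection is useful.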
- Parameters:
- n_clusters: int (default = 8)
The number of centroids or clusters you want.
- max_iter: int (default = 300)
The maximum number of EM iterations. More iterations can give a more accurate result, but take longer.
- tol: float64 (default = 1e-4)
Stopping criterion when centroid means do not change much.
- verbose: int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- random_state: int or None (default = None)
If you want results to be the same when you restart Python, select a state.
- init: {‘scalable-k-means++’, ‘k-means||’, ‘k-means++’, ‘random’} or an ndarray (default = ‘scalable-k-means++’)
'scalable-k-means++' or 'k-means||': Uses fast and stable scalable k-means++ initialization. k-means++ is the constrained case of k-means|| with oversampling_factor=0.
'random': Choose n_clusters observations (rows) at random from the data for the initial centroids.
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
- n_init: ‘auto’ or int (default = ‘auto’)
Number of times the k-means algorithm will be run with different seeds. The final result is taken from the run that produces the lowest inertia out of the n_init runs.
When n_init='auto', the number of runs depends on the value of init: 1 if using init='k-means||' or init='scalable-k-means++'; 10 otherwise.
Added in version 25.02: Added ‘auto’ option for n_init.
Changed in version 25.04: Default value for n_init changed from 1 to 'auto'.
- oversampling_factor: float64 (default = 2.0)
The number of points to sample in scalable k-means++ initialization for potential centroids. Increasing this value can lead to better initial centroids at the cost of memory. The total number of centroids sampled in scalable k-means++ is oversampling_factor * n_clusters * 8.
- max_samples_per_batch: int (default = 32768)
The number of data samples to use for batches of the pairwise distance computation. This computation is done throughout both fit and predict. The default should suit most cases. The total number of elements in the batched pairwise distance computation is max_samples_per_batch * n_clusters. It might become necessary to lower this number when n_clusters becomes prohibitively large.
- output_type: {‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
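The sizing rules stated in the n_init, oversampling_factor and max_samples_per_batch descriptions can be checked with a few lines of plain Python. The helper names below are hypothetical and only restate the formulas from the text; they are not cuML functions, and `n_init_runs` handles string values of init only.

```python
def n_init_runs(n_init, init):
    """Number of k-means runs implied by n_init, following the 'auto' rule above."""
    if n_init != "auto":
        return int(n_init)
    return 1 if init in ("k-means||", "scalable-k-means++") else 10

def sampled_candidates(oversampling_factor, n_clusters):
    """Candidate centroids sampled by scalable k-means++ (formula from the text)."""
    return oversampling_factor * n_clusters * 8

def batch_elements(max_samples_per_batch, n_clusters):
    """Elements in one batched pairwise distance computation (formula from the text)."""
    return max_samples_per_batch * n_clusters

print(n_init_runs("auto", "scalable-k-means++"))  # 1
print(n_init_runs("auto", "random"))              # 10
print(sampled_candidates(2.0, 8))                 # 128.0
print(batch_elements(32768, 8))                   # 262144
```

With the defaults, one distance batch holds 262144 elements. The product grows linearly with n_clusters, which is why lowering max_samples_per_batch can help when n_clusters is very large.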
- Attributes:
- cluster_centers_: array
The coordinates of the final clusters. This represents the “mean” of each data cluster.
- labels_: array
Which cluster each datapoint belongs to.
Methods
fit(self, X[, y, sample_weight, convert_dtype]): Compute k-means clustering with X.
fit_predict(self, X[, y, sample_weight]): Compute cluster centers and predict cluster index for each sample.
fit_transform(self, X[, y, sample_weight, ...]): Compute clustering and transform X to cluster-distance space.
predict(self, X, *[, convert_dtype]): Predict the closest cluster each sample in X belongs to.
score(self, X[, y, sample_weight, convert_dtype]): Opposite of the value of X on the K-means objective.
transform(self, X, *[, convert_dtype]): Transform X to a cluster-distance space.
Notes
KMeans requires n_clusters to be specified. This means one needs to approximately guess or know how many clusters a dataset has. If one is not sure, one can start with a small number of clusters, visualize the resulting clusters with PCA, UMAP or T-SNE, and verify that they look appropriate.
Applications of KMeans
The biggest advantage of KMeans is its speed and simplicity. That is why KMeans is many practitioners’ first choice of clustering algorithm. KMeans has been used extensively when the number of clusters is approximately known, such as in big data clustering tasks, image segmentation and medical clustering.
For additional docs, see scikit-learn’s KMeans.
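The Notes suggest checking a chosen n_clusters by projecting the data to two dimensions and inspecting the clusters. A minimal CPU sketch of the PCA variant is shown below, using NumPy's SVD. The helper `pca_project_2d` is hypothetical, the data are synthetic, and `labels` is a stand-in for the labels_ attribute of a fitted model.

```python
import numpy as np

def pca_project_2d(X):
    """Project rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)  # center each feature
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

rng = np.random.default_rng(0)
# Two synthetic, well-separated blobs in 5 dimensions (made-up data).
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 5)),
    rng.normal(4.0, 0.5, size=(50, 5)),
])
labels = np.repeat([0, 1], 50)  # stand-in for kmeans.labels_

coords = pca_project_2d(X)
print(coords.shape)  # (100, 2)
```

Plotting `coords[:, 0]` against `coords[:, 1]`, colored by `labels`, gives a quick visual check of whether the clusters look appropriate.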
Examples
>>> # Both import methods supported
>>> from cuml import KMeans
>>> from cuml.cluster import KMeans
>>> import cudf
>>> import numpy as np
>>> import pandas as pd
>>>
>>> a = np.asarray([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]],
...                dtype=np.float32)
>>> b = cudf.DataFrame(a)
>>> # Input:
>>> b
     0    1
0  1.0  1.0
1  1.0  2.0
2  3.0  2.0
3  4.0  3.0
>>>
>>> # Calling fit
>>> kmeans_float = KMeans(n_clusters=2, n_init="auto", random_state=1)
>>> kmeans_float.fit(b)
KMeans()
>>>
>>> # Labels:
>>> kmeans_float.labels_
0    0
1    0
2    1
3    1
dtype: int32
>>> # cluster_centers:
>>> kmeans_float.cluster_centers_
     0    1
0  1.0  1.5
1  3.5  2.5
- fit(self, X, y=None, sample_weight=None, *, convert_dtype=True) -> 'KMeans'
Compute k-means clustering with X.
- Parameters:
- X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- y: array-like (device or host) shape = (n_samples, 1)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- sample_weight: array-like (device or host) shape = (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtype: bool, optional (default = True)
When set to True, the fit method will, when necessary, convert y to the same data type as X if they differ. This increases the memory used by the method.
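The effect of sample_weight can be illustrated with the usual weighted k-means convention, in which a centroid is the weighted mean of the samples assigned to it. The sketch below assumes cuML follows that convention. It uses NumPy on the CPU for a single cluster and is not cuML's implementation.

```python
import numpy as np

# Three samples that all belong to one cluster (made-up data).
X = np.array([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0]])
sample_weight = np.array([1.0, 1.0, 2.0])  # the last sample counts twice

unweighted = X.mean(axis=0)
weighted = np.average(X, axis=0, weights=sample_weight)

print(unweighted)  # approximately [1.667 1.667]
print(weighted)    # [2.   1.75]
```

The doubled weight on the last sample pulls the centroid toward it, exactly as if that row appeared twice in X.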
- fit_predict(self, X, y=None, sample_weight=None) -> CumlArray
Compute cluster centers and predict cluster index for each sample.
- Parameters:
- X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- y: array-like (device or host) shape = (n_samples, 1)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- sample_weight: array-like (device or host) shape = (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- Returns:
- preds: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Cluster indexes
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
- fit_transform(self, X, y=None, sample_weight=None, *, convert_dtype=False) -> CumlArray
Compute clustering and transform X to cluster-distance space.
- Parameters:
- X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- y: array-like (device or host) shape = (n_samples, 1)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- sample_weight: array-like (device or host) shape = (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtype: bool, optional (default = False)
When set to True, the fit_transform method will, when necessary, convert the input to the data type which was used to train the model. This increases the memory used by the method.
- Returns:
- X_new: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_clusters)
Transformed data
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
- predict(self, X, *, convert_dtype=True) -> CumlArray
Predict the closest cluster each sample in X belongs to.
- Parameters:
- X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtype: bool, optional (default = True)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This increases the memory used by the method.
- Returns:
- preds: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Cluster indexes
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
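Predicting the closest cluster amounts to a nearest-centroid lookup. The NumPy sketch below shows that rule on the CPU, assuming squared Euclidean distance. The helper `nearest_center` is hypothetical and only mirrors what predict is described as returning.

```python
import numpy as np

def nearest_center(X, centers):
    """Index of the closest centroid for every row of X (squared Euclidean distance)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1).astype(np.int32)

# Centroids taken from the class-level example; the new samples are made up.
centers = np.array([[1.0, 1.5], [3.5, 2.5]])
X_new = np.array([[0.0, 0.0], [4.0, 4.0], [2.0, 2.0]])

new_labels = nearest_center(X_new, centers)
print(new_labels)  # [0 1 0]
```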
- score(self, X, y=None, sample_weight=None, *, convert_dtype=True)
Opposite of the value of X on the K-means objective.
- Parameters:
- X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- y: array-like (device or host) shape = (n_samples, 1)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- sample_weight: array-like (device or host) shape = (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtype: bool, optional (default = True)
When set to True, the score method will, when necessary, convert the input to the data type which was used to train the model. This increases the memory used by the method.
- Returns:
- score: float
Opposite of the value of X on the K-means objective.
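The "opposite of the K-means objective" can be made concrete under the common assumption that the objective is the inertia, that is, the (optionally weighted) sum of squared distances from each sample to its nearest centroid. The NumPy sketch below follows that assumption. `kmeans_score` is a hypothetical helper, not the cuML method.

```python
import numpy as np

def kmeans_score(X, centers, sample_weight=None):
    """Negative inertia: minus the weighted sum of squared nearest-centroid distances."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.min(axis=1)  # squared distance to the closest centroid
    if sample_weight is None:
        sample_weight = np.ones(len(X))
    return -float((sample_weight * nearest).sum())

# Data and centroids from the class-level example.
X = np.array([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]])
centers = np.array([[1.0, 1.5], [3.5, 2.5]])

value = kmeans_score(X, centers)
print(value)  # -1.5
```

Because the value is negated, a score closer to zero means a tighter clustering.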
- transform(self, X, *, convert_dtype=True) -> CumlArray
Transform X to a cluster-distance space.
- Parameters:
- X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtype: bool, optional (default = True)
When set to True, the transform method will, when necessary, convert the input to the data type which was used to train the model. This increases the memory used by the method.
- Returns:
- X_new: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_clusters)
Transformed data
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
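A cluster-distance space has one column per centroid, with entry (i, k) holding the distance from sample i to centroid k. The NumPy sketch below assumes Euclidean distance, as in scikit-learn's KMeans. The helper name is hypothetical and the code is not cuML's implementation.

```python
import numpy as np

def to_cluster_distance_space(X, centers):
    """Return an (n_samples, n_clusters) matrix of Euclidean distances to each centroid."""
    diff = X[:, None, :] - centers[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Made-up centroids and samples chosen so the distances are whole numbers.
centers = np.array([[0.0, 0.0], [3.0, 4.0]])
X = np.array([[0.0, 0.0], [3.0, 0.0]])

D = to_cluster_distance_space(X, centers)
print(D)  # [[0. 5.] [3. 4.]]
```

Taking the argmin along each row of this matrix gives the cluster index of the closest centroid for that sample.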