KMeans

class cuml.dask.cluster.KMeans(*, client=None, verbose=False, **kwargs)

Multi-Node Multi-GPU implementation of KMeans.

This version minimizes data transfer by sharing only the centroids between workers in each iteration.

Predictions are embarrassingly parallel and use cuML’s single-GPU version.

For more information on this implementation, refer to the documentation for single-GPU K-Means.
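The centroid-sharing idea above can be illustrated with a small, CPU-only sketch. This is not cuML's implementation: plain Python lists stand in for the data held by each worker, and the helper names `partial_stats` and `update_centroids` are hypothetical. It only shows why each iteration needs to exchange per-cluster summaries rather than the data itself.

```python
# Conceptual sketch only: plain Python lists stand in for GPU workers.
from math import dist


def partial_stats(partition, centroids):
    """Per-worker step: sum and count the points assigned to each centroid."""
    n_features = len(centroids[0])
    sums = [[0.0] * n_features for _ in centroids]
    counts = [0] * len(centroids)
    for point in partition:
        # Index of the nearest centroid for this point.
        k = min(range(len(centroids)), key=lambda j: dist(point, centroids[j]))
        counts[k] += 1
        for f, value in enumerate(point):
            sums[k][f] += value
    return sums, counts


def update_centroids(partitions, centroids):
    """Combine the small per-worker summaries into new centroids."""
    n_features = len(centroids[0])
    total_sums = [[0.0] * n_features for _ in centroids]
    total_counts = [0] * len(centroids)
    for partition in partitions:
        sums, counts = partial_stats(partition, centroids)
        for k in range(len(centroids)):
            total_counts[k] += counts[k]
            for f in range(n_features):
                total_sums[k][f] += sums[k][f]
    # Keep the old centroid if a cluster received no points.
    return [
        [s / total_counts[k] for s in total_sums[k]]
        if total_counts[k]
        else list(centroids[k])
        for k in range(len(centroids))
    ]


partitions = [
    [(0.0, 0.0), (0.0, 2.0)],      # data held by hypothetical worker 0
    [(10.0, 10.0), (10.0, 12.0)],  # data held by hypothetical worker 1
]
new_centroids = update_centroids(partitions, [(1.0, 1.0), (9.0, 9.0)])
```

Only the `sums` and `counts` summaries, whose size depends on `n_clusters` and the number of features, cross the partition boundary in this sketch; the points themselves never move.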

Parameters:

n_clusters: int (default = 8): The number of centroids or clusters you want.
max_iter: int (default = 300): The maximum number of EM iterations. More iterations can improve accuracy at the cost of run time.
tol: float (default = 1e-4): Stopping criterion when centroid means do not change much.
verbose: int or boolean (default = False): Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
random_state: int or None (default = None): If you want results to be the same when you restart Python, select a state.
init: {‘scalable-k-means++’, ‘k-means||’, ‘random’ or an ndarray} (default = ‘scalable-k-means++’): ‘scalable-k-means++’ or ‘k-means||’: Uses fast and stable scalable k-means++ initialization. ‘random’: Choose ‘n_clusters’ observations (rows) at random from the data for the initial centroids. If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
oversampling_factor: int (default = 2): The number of points to sample in scalable k-means++ initialization for potential centroids. Increasing this value can lead to better initial centroids at the cost of memory. The total number of centroids sampled in scalable k-means++ is oversampling_factor * n_clusters * 8.
max_samples_per_batch: int (default = 32768): The number of data samples to use for batches of the pairwise distance computation. This computation is done throughout both fit and predict. The default should suit most cases. The total number of elements in the batched pairwise distance computation is max_samples_per_batch * n_clusters. It might become necessary to lower this number when n_clusters becomes prohibitively large.
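The two sizing formulas quoted for oversampling_factor and max_samples_per_batch are easy to evaluate before choosing values. The sketch below is plain arithmetic on those formulas; the function names are hypothetical helpers, not part of cuML.

```python
def init_candidate_count(n_clusters, oversampling_factor=2):
    """Centroid candidates sampled by scalable k-means++ (formula from the text)."""
    return oversampling_factor * n_clusters * 8


def batch_distance_elements(n_clusters, max_samples_per_batch=32768):
    """Elements in one batched pairwise distance computation (formula from the text)."""
    return max_samples_per_batch * n_clusters


# With the documented defaults (n_clusters=8):
default_candidates = init_candidate_count(n_clusters=8)
default_elements = batch_distance_elements(n_clusters=8)

# A very large n_clusters inflates the batch; lowering max_samples_per_batch shrinks it.
large_k_elements = batch_distance_elements(n_clusters=100_000)
reduced_elements = batch_distance_elements(n_clusters=100_000, max_samples_per_batch=1024)
```

With the defaults this gives 128 candidates and 262,144 distance elements per batch, while 100,000 clusters would require about 3.3 billion elements per batch unless max_samples_per_batch is reduced.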

Attributes:

cluster_centers_: cuDF DataFrame or CuPy ndarray: The coordinates of the final clusters. This represents the “mean” of each data cluster.

Methods

`fit`(X[, sample_weight])	Fit a multi-node multi-GPU KMeans model
`fit_predict`(X[, sample_weight, delayed])	Compute cluster centers and predict cluster index for each sample.
`fit_transform`(X[, sample_weight, delayed])	Calls fit followed by transform using a distributed KMeans model
`predict`(X[, delayed])	Predict labels for the input
`score`(X[, sample_weight])	Computes the inertia score for the trained KMeans centroids.
`transform`(X[, delayed])	Transforms the input into the learned centroid space
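A minimal end-to-end sketch of the methods listed above might look like the following. It assumes a machine with at least one CUDA GPU and the `dask_cuda`, `dask.distributed`, and `cuml` packages installed; the synthetic data comes from `cuml.dask.datasets.make_blobs`, and the sizes chosen are arbitrary. The wrapper function `run_kmeans_demo` is a hypothetical name used here so that nothing GPU-specific runs at import time.

```python
def run_kmeans_demo():
    """Fit and predict on synthetic data; requires at least one CUDA GPU."""
    # Imports stay inside the function so this module can be loaded without a GPU.
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster
    from cuml.dask.cluster import KMeans
    from cuml.dask.datasets import make_blobs

    cluster = LocalCUDACluster()  # starts one Dask worker per visible GPU
    client = Client(cluster)
    try:
        # CuPy backed Dask Array spread over 4 partitions (arbitrary sizes).
        X, _ = make_blobs(
            n_samples=10_000, n_features=8, centers=5, n_parts=4, random_state=0
        )
        model = KMeans(n_clusters=5, random_state=0, client=client)
        model.fit(X)
        labels = model.predict(X)  # lazy by default (delayed=True)
        # compute() gathers the distributed labels to the client process.
        return model.cluster_centers_, labels.compute()
    finally:
        client.close()
        cluster.close()
```

Calling `run_kmeans_demo()` on a suitable machine should return the fitted centers and one label per sample; on a machine without a GPU the call is expected to fail at the imports or at cluster start-up.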

fit(X, sample_weight=None)

Fit a multi-node multi-GPU KMeans model

Parameters:

X: Dask cuDF DataFrame or CuPy backed Dask Array: Training data to cluster.
sample_weight: Dask cuDF DataFrame or CuPy backed Dask Array, shape = (n_samples,) (default = None): The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
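To make the effect of sample_weight concrete, the sketch below computes a single centroid as a weighted mean, which is how weights enter the standard k-means update. It is a CPU-only illustration with a hypothetical helper name, not cuML code.

```python
def weighted_centroid(points, sample_weight=None):
    """Weighted mean of the points; None means every point has equal weight."""
    if sample_weight is None:
        sample_weight = [1.0] * len(points)
    total = sum(sample_weight)
    n_features = len(points[0])
    return [
        sum(w * p[f] for p, w in zip(points, sample_weight)) / total
        for f in range(n_features)
    ]


points = [(0.0, 0.0), (4.0, 8.0)]
unweighted = weighted_centroid(points)            # plain mean of the two points
weighted = weighted_centroid(points, [3.0, 1.0])  # pulled toward the first point
```

Giving the first observation three times the weight of the second moves the centroid from the midpoint toward that observation.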

fit_predict(X, sample_weight=None, delayed=True)

Compute cluster centers and predict cluster index for each sample.

Parameters:

X: Dask cuDF DataFrame or CuPy backed Dask Array: Data to fit and predict

Returns:

result: Dask cuDF DataFrame or CuPy backed Dask Array: Distributed object containing predictions

fit_transform(X, sample_weight=None, delayed=True)

Calls fit followed by transform using a distributed KMeans model

Parameters:

X: Dask cuDF DataFrame or CuPy backed Dask Array: Data to fit and transform
delayed: bool (default = True): Whether to execute as a delayed task or eagerly.

Returns:

result: Dask cuDF DataFrame or CuPy backed Dask Array: Distributed object containing the transformed data

predict(X, delayed=True)

Predict labels for the input

Parameters:

X: Dask cuDF DataFrame or CuPy backed Dask Array: Data to predict
delayed: bool (default = True): Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.

Returns:

result: Dask cuDF DataFrame or CuPy backed Dask Array: Distributed object containing predictions
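Prediction can run independently on each partition because a label depends only on the point and the fitted centroids. The CPU-only sketch below illustrates that property with plain lists standing in for partitions; `predict_partition` is a hypothetical helper, not a cuML function.

```python
from math import dist


def predict_partition(partition, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        for p in partition
    ]


centroids = [(0.0, 0.0), (10.0, 10.0)]
partitions = [
    [(1.0, 1.0), (9.0, 8.0)],
    [(0.5, -1.0)],
    [(11.0, 10.0), (2.0, 3.0)],
]

# Each partition is labeled on its own; no partition needs data from another.
per_partition = [predict_partition(part, centroids) for part in partitions]
labels = [label for part in per_partition for label in part]

# Labeling all points in one pass gives the same result.
all_at_once = predict_partition([p for part in partitions for p in part], centroids)
```

Because the per-partition results match the single-pass result, the work can be split across workers without any communication beyond distributing the centroids.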

score(X, sample_weight=None)

Computes the inertia score for the trained KMeans centroids.

Parameters:

X: dask_cudf.DataFrame: DataFrame to compute the score on

Returns:

Inertia score
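As a reminder of what inertia measures, the sketch below computes it in the usual k-means sense: the (optionally weighted) sum of squared distances from each point to its nearest centroid. This is a CPU-only illustration with a hypothetical helper; the sign and scaling of the value cuML returns are not stated here, so treat this as the underlying quantity only.

```python
from math import dist


def inertia(points, centroids, sample_weight=None):
    """Sum of squared distances from each point to its nearest centroid."""
    if sample_weight is None:
        sample_weight = [1.0] * len(points)
    return sum(
        w * min(dist(p, c) for c in centroids) ** 2
        for p, w in zip(points, sample_weight)
    )


centroids = [(0.0, 0.0), (10.0, 0.0)]
points = [(3.0, 4.0), (10.0, 2.0)]

value = inertia(points, centroids)                       # 5**2 + 2**2
weighted_value = inertia(points, centroids, [2.0, 1.0])  # 2 * 5**2 + 2**2
```

Lower inertia means points sit closer to their assigned centroids.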

transform(X, delayed=True)

Transforms the input into the learned centroid space

Parameters:

X: Dask cuDF DataFrame or CuPy backed Dask Array: Data to transform
delayed: bool (default = True): Whether to execute as a delayed task or eagerly.

Returns:

result: Dask cuDF DataFrame or CuPy backed Dask Array: Distributed object containing the transformed data
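One common reading of "centroid space", and the one scikit-learn uses for its KMeans.transform, is a matrix of distances from each sample to each centroid. Assuming the same convention applies here, a CPU-only sketch looks like this; `to_centroid_space` is a hypothetical helper, not a cuML function.

```python
from math import dist


def to_centroid_space(points, centroids):
    """Return one row per point holding its distance to every centroid."""
    return [[dist(p, c) for c in centroids] for p in points]


centroids = [(0.0, 0.0), (3.0, 4.0)]
points = [(0.0, 0.0), (3.0, 0.0)]

# Result has shape (n_samples, n_clusters): here 2 rows by 2 columns.
transformed = to_centroid_space(points, centroids)
```

Under that assumption the output has one column per cluster, so its width is n_clusters regardless of the number of input features.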