K-Means

K-Means Parameters

class cuvs.cluster.kmeans.KMeansParams(
metric=None,
*,
n_clusters=None,
init_method=None,
max_iter=None,
tol=None,
n_init=None,
oversampling_factor=None,
batch_samples=None,
batch_centroids=None,
inertia_check=None,
streaming_batch_size=None,
hierarchical=None,
hierarchical_n_iters=None,
)

Hyper-parameters for the k-means algorithm.

Parameters:
metric : str

String denoting the metric type.

n_clusters : int

The number of clusters to form as well as the number of centroids to generate.

init_method : str

Method for initializing clusters. One of:

“KMeansPlusPlus”: Use the scalable k-means++ algorithm to select initial cluster centers.
“Random”: Choose ‘n_clusters’ observations at random from the input data.
“Array”: Use centroids as the initial cluster centers.

max_iter : int

Maximum number of iterations of the k-means algorithm for a single run.

tol : float

Relative tolerance with regard to inertia to declare convergence.

n_init : int

Number of times the k-means algorithm will be run with different seeds.

oversampling_factor : double

Oversampling factor for use in the k-means|| algorithm.

batch_samples : int

Number of samples to process in each batch for tiled 1NN computation. Useful to optimize/control memory footprint. Default tile is [batch_samples x n_clusters].

batch_centroids : int

Number of centroids to process in each batch. If 0, uses n_clusters.

inertia_check : bool

If True, check inertia during iterations for early convergence.

streaming_batch_size : int

Number of samples to process per GPU batch when fitting with host (numpy) data. Only used by the batched (host-data) code path. A value of 0 means n_samples, so all data is processed at once. Reducing streaming_batch_size lowers GPU memory pressure but adds overhead, because centroid adjustments are computed more often.

Default: 0 (process all data at once).

hierarchical : bool

Whether to use hierarchical (balanced) k-means.

hierarchical_n_iters : int

For hierarchical k-means, the number of training iterations.
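The “KMeansPlusPlus” and “Random” options of init_method can be pictured with a small NumPy sketch. This is an illustration only: it uses the classic sequential k-means++ seeding, not the scalable k-means|| variant the parameter description refers to, and the helper names kmeans_plus_plus_init and random_init are hypothetical, not part of cuVS.

```python
import numpy as np


def kmeans_plus_plus_init(X, n_clusters, seed=0):
    """Classic sequential k-means++ seeding (hypothetical helper, not cuVS)."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    # First center: a uniform draw from the data.
    centers = [X[rng.integers(n_samples)]]
    # Squared distance from every sample to its nearest chosen center so far.
    closest_sq = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(1, n_clusters):
        # Draw the next center with probability proportional to that distance.
        probs = closest_sq / closest_sq.sum()
        idx = rng.choice(n_samples, p=probs)
        centers.append(X[idx])
        new_sq = ((X - X[idx]) ** 2).sum(axis=1)
        closest_sq = np.minimum(closest_sq, new_sq)
    return np.stack(centers)


def random_init(X, n_clusters, seed=0):
    """Pick n_clusters distinct observations uniformly at random."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=n_clusters, replace=False)
    return X[idx]


rng = np.random.default_rng(42)
X = rng.random((500, 8))
init_pp = kmeans_plus_plus_init(X, n_clusters=3)
init_rand = random_init(X, n_clusters=3)
```

Both helpers return rows of X, so either result has the (n_clusters, k) shape that the “Array” option expects for user-supplied centroids.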

Attributes:
batch_centroids
batch_samples
hierarchical
hierarchical_n_iters
inertia_check
init_method
max_iter
metric
n_clusters
n_init
oversampling_factor
streaming_batch_size
tol
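The batch_samples and batch_centroids parameters describe a tiled nearest-centroid (1NN) search. The NumPy sketch below shows the general idea under the assumption of squared Euclidean distance: only one [batch_samples x batch_centroids] tile of distances exists at a time, and a running minimum is kept per sample. The function tiled_nearest_centroid is a hypothetical helper written for this illustration and is not how cuVS implements the search on the GPU.

```python
import numpy as np


def tiled_nearest_centroid(X, centroids, batch_samples, batch_centroids=0):
    """Nearest centroid per sample, computed one distance tile at a time."""
    n_samples, n_clusters = X.shape[0], centroids.shape[0]
    if batch_centroids == 0:
        # Mirror the documented convention: 0 means "use n_clusters".
        batch_centroids = n_clusters
    labels = np.empty(n_samples, dtype=np.int64)
    best_sq = np.full(n_samples, np.inf)
    for s in range(0, n_samples, batch_samples):
        xs = X[s:s + batch_samples]
        rows = np.arange(s, s + len(xs))
        for c in range(0, n_clusters, batch_centroids):
            cs = centroids[c:c + batch_centroids]
            # One tile of squared distances: at most
            # batch_samples x batch_centroids entries live in memory.
            tile = ((xs[:, None, :] - cs[None, :, :]) ** 2).sum(axis=2)
            tile_arg = tile.argmin(axis=1)
            tile_min = tile[np.arange(len(xs)), tile_arg]
            # Keep the running best distance and label for each sample.
            better = tile_min < best_sq[rows]
            best_sq[rows[better]] = tile_min[better]
            labels[rows[better]] = c + tile_arg[better]
    return labels, best_sq


rng = np.random.default_rng(0)
X = rng.random((1000, 16))
centroids = rng.random((7, 16))
labels, best_sq = tiled_nearest_centroid(
    X, centroids, batch_samples=128, batch_centroids=3
)
```

Smaller tiles lower peak memory at the cost of more loop iterations, which is the trade-off these two parameters expose.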

K-Means Fit

cuvs.cluster.kmeans.fit(
KMeansParams params,
X,
centroids=None,
sample_weights=None,
resources=None,
)

Find clusters with the k-means algorithm

When X is a device array (CUDA array interface), standard on-device k-means is used. When X is a host array (numpy ndarray or __array_interface__), data is streamed to the GPU in batches controlled by params.streaming_batch_size. For large host datasets, consider reducing streaming_batch_size to reduce GPU memory usage.

Parameters:
params : KMeansParams

Parameters to use to fit the KMeans model. For host data, params.streaming_batch_size controls how many samples are sent to the GPU per batch.

X : array-like

Training instances, shape (m, k). Accepts both device arrays (cupy / CUDA array interface) and host arrays (numpy).

centroids : Optional writable CUDA array interface compliant matrix

Shape (n_clusters, k).

sample_weights : Optional weights per observation

Must reside in the same memory space as X (device or host). Default: None.

resources : Optional cuVS Resource handle for reusing CUDA resources

If resources aren’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.

Returns:
centroids : raft.device_ndarray

The computed centroids for each cluster.

inertia : float

Sum of squared distances of samples to their closest cluster center.

n_iter : int

The number of iterations used to fit the model.
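To make the three return values concrete, here is a minimal NumPy sketch of Lloyd's algorithm that produces the same kind of (centroids, inertia, n_iter) tuple. It is not the cuVS implementation. The stopping rule (relative drop in inertia at or below tol) is an assumption based on the description of tol, and lloyd_kmeans is a hypothetical helper that expects max_iter of at least 1.

```python
import numpy as np


def assign(X, centroids):
    """Return the nearest-centroid label per sample and the total inertia."""
    sq = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = sq.argmin(axis=1)
    inertia = sq[np.arange(len(X)), labels].sum()
    return labels, inertia


def lloyd_kmeans(X, init_centroids, max_iter=300, tol=1e-4):
    """Plain Lloyd iterations (hypothetical helper, assumes max_iter >= 1)."""
    centroids = init_centroids.astype(np.float64).copy()
    labels, inertia = assign(X, centroids)
    for n_iter in range(1, max_iter + 1):
        # Update step: move each non-empty cluster to the mean of its members.
        for k in range(len(centroids)):
            members = X[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
        labels, new_inertia = assign(X, centroids)
        # Assumed convergence test: relative improvement in inertia <= tol.
        converged = (inertia - new_inertia) <= tol * inertia
        inertia = new_inertia
        if converged:
            break
    return centroids, float(inertia), n_iter


rng = np.random.default_rng(0)
# Two well-separated blobs around (0, 0) and (5, 5).
X = np.vstack([
    rng.normal(0.0, 0.1, (100, 2)),
    rng.normal(5.0, 0.1, (100, 2)),
])
centroids, inertia, n_iter = lloyd_kmeans(X, X[[0, 100]])
```

The returned inertia always corresponds to the returned centroids, because it is recomputed after the final update step.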

Examples

>>> import cupy as cp
>>>
>>> from cuvs.cluster.kmeans import fit, KMeansParams
>>>
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>>
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> params = KMeansParams(n_clusters=n_clusters)
>>> centroids, inertia, n_iter = fit(params, X)

Host-data (batched) example:

>>> import numpy as np
>>> # Generate float32 directly to avoid a float64 intermediate copy
>>> rng = np.random.default_rng(0)
>>> X_host = rng.random((10_000_000, 128), dtype=np.float32)
>>> params = KMeansParams(n_clusters=1000, streaming_batch_size=1_000_000)
>>> centroids, inertia, n_iter = fit(params, X_host)
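One way to picture the batched host-data path is a centroid update that accumulates per-cluster sums and counts batch by batch. The NumPy sketch below is an assumption about the general technique and not a description of the cuVS code path. The function streamed_centroid_update is hypothetical, and slicing the host array stands in for the host-to-device copy.

```python
import numpy as np


def streamed_centroid_update(X_host, centroids, streaming_batch_size=0):
    """One centroid update computed from batches of host data."""
    n_samples = X_host.shape[0]
    if streaming_batch_size == 0:
        # Mirror the documented convention: 0 means "all samples at once".
        streaming_batch_size = n_samples
    sums = np.zeros_like(centroids, dtype=np.float64)
    counts = np.zeros(centroids.shape[0], dtype=np.int64)
    for start in range(0, n_samples, streaming_batch_size):
        # Stand-in for copying one batch to the GPU.
        batch = X_host[start:start + streaming_batch_size]
        sq = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq.argmin(axis=1)
        # Accumulate per-cluster sums and counts across batches.
        np.add.at(sums, labels, batch)
        counts += np.bincount(labels, minlength=len(centroids))
    new_centroids = centroids.astype(np.float64).copy()
    nonempty = counts > 0
    new_centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_centroids


rng = np.random.default_rng(0)
X_host = rng.random((1000, 4))
centroids = X_host[:5].copy()
all_at_once = streamed_centroid_update(X_host, centroids)
in_batches = streamed_centroid_update(
    X_host, centroids, streaming_batch_size=100
)
```

In this sketch the batched and all-at-once updates agree up to floating-point rounding, so the batch size only affects peak memory and loop overhead.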

K-Means Predict

cuvs.cluster.kmeans.predict(
KMeansParams params,
X,
centroids,
sample_weights=None,
labels=None,
normalize_weight=True,
resources=None,
)

Predict clusters with the k-means algorithm

Parameters:
params : KMeansParams

Parameters used in fitting the KMeans model.

X : Input CUDA array interface compliant matrix, shape (m, k)

centroids : CUDA array interface compliant matrix, calculated by fit

Shape (n_clusters, k).

sample_weights : Optional input CUDA array interface compliant matrix

Shape (n_clusters, 1). Default: None.

labels : Optional preallocated CUDA array interface matrix

Shape (m, 1), to hold the output.

normalize_weight : bool

True if the weights should be normalized.

resources : Optional cuVS Resource handle for reusing CUDA resources

If resources aren’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.

Returns:
labels : raft.device_ndarray

The label for each datapoint in X.

inertia : float

Sum of squared distances of samples to their closest cluster center.

Examples

>>> import cupy as cp
>>>
>>> from cuvs.cluster.kmeans import fit, predict, KMeansParams
>>>
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>>
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> params = KMeansParams(n_clusters=n_clusters)
>>> centroids, inertia, n_iter = fit(params, X)
>>>
>>> labels, inertia = predict(params, X, centroids)

K-Means Cluster Cost

cuvs.cluster.kmeans.cluster_cost(X, centroids, resources=None)

Compute cluster cost given an input matrix and existing centroids

Parameters:
X : Input CUDA array interface compliant matrix, shape (m, k)

centroids : Input CUDA array interface compliant matrix

Shape (n_clusters, k).

resources : Optional cuVS Resource handle for reusing CUDA resources

If resources aren’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.

Returns:
inertia : float

The cluster cost between the input matrix and existing centroids.

Examples

>>> import cupy as cp
>>>
>>> from cuvs.cluster.kmeans import cluster_cost
>>>
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>>
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> centroids = cp.random.random_sample((n_clusters, n_features),
...                                      dtype=cp.float32)
>>> inertia = cluster_cost(X, centroids)
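For intuition about the returned value, the cost can be worked by hand on a tiny input. The NumPy sketch below assumes the cost is the sum of squared Euclidean distances from each sample to its closest centroid, matching the inertia definition given for fit. The function cluster_cost_reference is a hypothetical CPU helper, not part of cuVS.

```python
import numpy as np


def cluster_cost_reference(X, centroids):
    """Sum of squared distances from each sample to its nearest centroid."""
    sq = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(sq.min(axis=1).sum())


X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
centroids = np.array([[0.0, 0.0], [9.0, 0.0]])

# Nearest squared distances per sample: 0, 1 and 1.
cost = cluster_cost_reference(X, centroids)
print(cost)  # 2.0
```

A CPU reference like this can serve as a sanity check on small inputs, with the usual caveat that float32 GPU results will only match approximately.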