K-Means
K-Means Parameters
- class cuvs.cluster.kmeans.KMeansParams(metric=None, *, n_clusters=None, init_method=None, max_iter=None, tol=None, n_init=None, oversampling_factor=None, batch_samples=None, batch_centroids=None, inertia_check=None, streaming_batch_size=None, hierarchical=None, hierarchical_n_iters=None)
Hyper-parameters for the kmeans algorithm
- Parameters:
- metric: str
String denoting the metric type.
- n_clusters: int
The number of clusters to form, which is also the number of centroids to generate.
- init_method: str
Method for initializing clusters. One of:
“KMeansPlusPlus”: use the scalable k-means++ algorithm to select initial cluster centers.
“Random”: choose ‘n_clusters’ observations at random from the input data.
“Array”: use centroids as initial cluster centers.
- max_iter: int
Maximum number of iterations of the k-means algorithm for a single run.
- tol: float
Relative tolerance with regard to inertia to declare convergence.
- n_init: int
Number of times the k-means algorithm will be run with different seeds.
- oversampling_factor: double
Oversampling factor for use in the k-means|| algorithm.
- batch_samples: int
Number of samples to process in each batch for tiled 1NN computation. Useful to optimize/control memory footprint. Default tile is [batch_samples x n_clusters].
- batch_centroids: int
Number of centroids to process in each batch. If 0, uses n_clusters.
- inertia_check: bool
If True, check inertia during iterations for early convergence.
- streaming_batch_size: int
Number of samples to process per GPU batch when fitting with host (numpy) data. Only used by the batched (host-data) code path. Reducing streaming_batch_size can help reduce GPU memory pressure, but it increases overhead because centroid adjustments are computed more times.
Default: 0, which uses n_samples (process all data at once).
- hierarchical: bool
Whether to use hierarchical (balanced) k-means.
- hierarchical_n_iters: int
For hierarchical k-means, defines the number of training iterations.
- Attributes:
- batch_centroids
- batch_samples
- hierarchical
- hierarchical_n_iters
- inertia_check
- init_method
- max_iter
- metric
- n_clusters
- n_init
- oversampling_factor
- streaming_batch_size
- tol
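The tiling controlled by batch_samples and batch_centroids can be pictured with a small NumPy sketch. This is an illustration only, not the cuVS implementation: the function name `tiled_nearest_centroid` is hypothetical, and squared Euclidean distance is assumed. It shows how a nearest-centroid (1NN) search can be computed one [batch_samples x batch_centroids] tile at a time, so the full distance matrix never has to be held in memory.

```python
import numpy as np


def tiled_nearest_centroid(X, centroids, batch_samples, batch_centroids=0):
    """Nearest-centroid search computed tile by tile (hypothetical helper)."""
    n_samples, n_clusters = X.shape[0], centroids.shape[0]
    if batch_centroids == 0:
        # Mirrors the documented convention: 0 means "use n_clusters".
        batch_centroids = n_clusters

    best_dist = np.full(n_samples, np.inf)
    best_idx = np.zeros(n_samples, dtype=np.int64)

    for s in range(0, n_samples, batch_samples):
        xs = X[s:s + batch_samples]
        rows = slice(s, s + len(xs))
        for c in range(0, n_clusters, batch_centroids):
            cs = centroids[c:c + batch_centroids]
            # Squared distances for a single tile only:
            # shape (<= batch_samples, <= batch_centroids).
            tile = ((xs[:, None, :] - cs[None, :, :]) ** 2).sum(axis=2)
            local_idx = tile.argmin(axis=1)
            local_dist = tile[np.arange(len(xs)), local_idx]
            # Keep a candidate only where it beats the best seen so far.
            better = local_dist < best_dist[rows]
            cur_dist, cur_idx = best_dist[rows], best_idx[rows]  # views
            cur_dist[better] = local_dist[better]
            cur_idx[better] = local_idx[better] + c
    return best_idx, best_dist
```

Smaller tiles lower the peak memory of the distance computation at the cost of more loop iterations, which is the trade-off the two parameters expose.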
K-Means Fit
- cuvs.cluster.kmeans.fit(KMeansParams params, X, centroids=None, sample_weights=None, resources=None)
Find clusters with the k-means algorithm
When X is a device array (CUDA array interface), standard on-device k-means is used. When X is a host array (numpy ndarray or __array_interface__), data is streamed to the GPU in batches controlled by params.streaming_batch_size. For large host datasets, consider reducing streaming_batch_size to reduce GPU memory usage.
- Parameters:
- params: KMeansParams
Parameters to use to fit the KMeans model. For host data, params.streaming_batch_size controls how many samples are sent to the GPU per batch.
- X: array-like
Training instances, shape (m, k). Accepts both device arrays (cupy / CUDA array interface) and host arrays (numpy).
- centroids: Optional writable CUDA array interface compliant matrix, shape (n_clusters, k)
- sample_weights: Optional weights per observation. Must reside in the same memory space as X (device or host). Default: None
- resources: Optional cuVS Resource handle for reusing CUDA resources.
If resources aren’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.
- Returns:
- centroids: raft.device_ndarray
The computed centroids for each cluster
- inertia: float
Sum of squared distances of samples to their closest cluster center
- n_iter: int
The number of iterations used to fit the model
Examples
>>> import cupy as cp
>>>
>>> from cuvs.cluster.kmeans import fit, KMeansParams
>>>
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>>
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> params = KMeansParams(n_clusters=n_clusters)
>>> centroids, inertia, n_iter = fit(params, X)
Host-data (batched) example:
>>> import numpy as np
>>> X_host = np.random.random((10_000_000, 128)).astype(np.float32)
>>> params = KMeansParams(n_clusters=1000, streaming_batch_size=1_000_000)
>>> centroids, inertia, n_iter = fit(params, X_host)
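To make the host-data path more concrete, here is a CPU-only NumPy sketch of one way a single k-means update could be accumulated over batches. It is an assumption about the general technique, not a description of what cuVS does internally: the function name `streamed_lloyd_step` is hypothetical, squared Euclidean distance is assumed, and the "copy to GPU" step is only marked by a comment.

```python
import numpy as np


def streamed_lloyd_step(X_host, centroids, streaming_batch_size=0):
    """One centroid update over host data, visited batch by batch (hypothetical)."""
    n_samples = X_host.shape[0]
    if streaming_batch_size == 0:
        # Documented convention: 0 means "process all samples at once".
        streaming_batch_size = n_samples

    n_clusters, n_features = centroids.shape
    sums = np.zeros((n_clusters, n_features), dtype=np.float64)
    counts = np.zeros(n_clusters, dtype=np.int64)
    inertia = 0.0

    for start in range(0, n_samples, streaming_batch_size):
        # In a GPU implementation, this slice is what would be transferred.
        batch = X_host[start:start + streaming_batch_size]
        d2 = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        inertia += float(d2[np.arange(len(batch)), labels].sum())
        # Accumulate per-cluster sums and counts across batches.
        np.add.at(sums, labels, batch)
        counts += np.bincount(labels, minlength=n_clusters)

    # Empty clusters keep their previous centroid.
    new_centroids = centroids.astype(np.float64)
    nonempty = counts > 0
    new_centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_centroids, inertia
```

In this sketch the batch size changes only the peak memory and the number of partial accumulations, not the resulting centroids, which is consistent with the memory-versus-overhead trade-off described for streaming_batch_size.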
K-Means Predict
- cuvs.cluster.kmeans.predict(KMeansParams params, X, centroids, sample_weights=None, labels=None, normalize_weight=True, resources=None)
Predict clusters with the k-means algorithm
- Parameters:
- params: KMeansParams
Parameters used in fitting the KMeans model
- X: Input CUDA array interface compliant matrix, shape (m, k)
- centroids: CUDA array interface compliant matrix, calculated by fit, shape (n_clusters, k)
- sample_weights: Optional input CUDA array interface compliant matrix, shape (n_clusters, 1). Default: None
- labels: Optional preallocated CUDA array interface matrix, shape (m, 1), to hold the output
- normalize_weight: bool
True if the weights should be normalized
- resources: Optional cuVS Resource handle for reusing CUDA resources.
If resources aren’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.
- Returns:
- labels: raft.device_ndarray
The label for each datapoint in X
- inertia: float
Sum of squared distances of samples to their closest cluster center
Examples
>>> import cupy as cp
>>>
>>> from cuvs.cluster.kmeans import fit, predict, KMeansParams
>>>
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>>
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> params = KMeansParams(n_clusters=n_clusters)
>>> centroids, inertia, n_iter = fit(params, X)
>>>
>>> labels, inertia = predict(params, X, centroids)
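For intuition about the two return values, the following CPU-only NumPy sketch computes labels and inertia for a tiny input that can be checked by hand. It assumes unweighted samples and squared Euclidean distance; `reference_predict` is a hypothetical helper, not part of cuVS.

```python
import numpy as np


def reference_predict(X, centroids):
    """Label each row of X with its nearest centroid (hypothetical helper)."""
    # Squared Euclidean distance from every sample to every centroid.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # Inertia: sum of squared distances to the assigned centroid.
    inertia = float(d2[np.arange(X.shape[0]), labels].sum())
    return labels, inertia


X = np.array([[0.0, 0.0], [1.0, 0.0], [9.0, 0.0]])
centroids = np.array([[0.0, 0.0], [10.0, 0.0]])

labels, inertia = reference_predict(X, centroids)
print(labels, inertia)  # [0 0 1] 2.0
```

The first two points are closest to the first centroid (squared distances 0 and 1) and the third to the second (squared distance 1), giving an inertia of 2.0. A helper like this can serve as a cross-check for small inputs under the same assumptions.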
K-Means Cluster Cost
- cuvs.cluster.kmeans.cluster_cost(X, centroids, resources=None)
Compute cluster cost given an input matrix and existing centroids
- Parameters:
- X: Input CUDA array interface compliant matrix, shape (m, k)
- centroids: Input CUDA array interface compliant matrix, shape (n_clusters, k)
- resources: Optional cuVS Resource handle for reusing CUDA resources.
If resources aren’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.
- Returns:
- inertia: float
The cluster cost between the input matrix and existing centroids
Examples
>>> import cupy as cp
>>>
>>> from cuvs.cluster.kmeans import cluster_cost
>>>
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>>
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> centroids = cp.random.random_sample((n_clusters, n_features),
...                                     dtype=cp.float32)
>>> inertia = cluster_cost(X, centroids)
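One use of a cluster cost is to compare candidate centroid sets for the same data and keep the cheaper one. The sketch below shows this on the CPU with NumPy. The helper `reference_cluster_cost` and the candidate names are hypothetical, and squared Euclidean distance is assumed.

```python
import numpy as np


def reference_cluster_cost(X, centroids):
    """Sum of squared distances from each sample to its nearest centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())


# Four points forming two tight pairs, at x = 0 and x = 10.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# Two hypothetical candidate centroid sets for the same data.
candidates = {
    "split_left_right": np.array([[0.0, 1.0], [10.0, 1.0]]),
    "split_bottom_top": np.array([[5.0, 0.0], [5.0, 2.0]]),
}

costs = {name: reference_cluster_cost(X, c) for name, c in candidates.items()}
best = min(costs, key=costs.get)
print(costs, best)
```

Here every point is at squared distance 1 from its centroid in the left/right split (cost 4.0) and at squared distance 25 in the bottom/top split (cost 100.0), so the left/right candidate is selected.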