K-Means#

Parameters#

#include <cuvs/cluster/kmeans.h>

enum cuvsKMeansInitMethod#

Values:

enumerator KMeansPlusPlus#: Sample the centroids using the kmeans++ strategy

enumerator Random#: Sample the centroids uniformly at random

enumerator Array#: User provides the array of initial centroids

enum cuvsKMeansType#

Type of k-means algorithm.

Values:

enumerator CUVS_KMEANS_TYPE_KMEANS#

enumerator CUVS_KMEANS_TYPE_KMEANS_BALANCED#

typedef struct cuvsKMeansParams *cuvsKMeansParams_t#

typedef struct cuvsKMeansParams_v2 *cuvsKMeansParams_v2_t#

cuvsError_t cuvsKMeansParamsCreate(cuvsKMeansParams_t *params)#

Allocate KMeans params, and populate with default values.

Note

In cuVS 26.08 (next ABI major version) this signature will be replaced by cuvsKMeansParamsCreate_v2.

Parameters:: params – [in] cuvsKMeansParams_t to allocate
Returns:: cuvsError_t

cuvsError_t cuvsKMeansParamsDestroy(cuvsKMeansParams_t params)#

De-allocate KMeans params.

Note

In cuVS 26.08 (next ABI major version) this signature will be replaced by cuvsKMeansParamsDestroy_v2.

Parameters:: params – [in] cuvsKMeansParams_t to de-allocate
Returns:: cuvsError_t

cuvsError_t cuvsKMeansParamsCreate_v2(cuvsKMeansParams_v2_t *params)#

Allocate KMeans params.

Mirrors cuvsKMeansParamsCreate but operates on cuvsKMeansParams_v2. Will become the unsuffixed cuvsKMeansParamsCreate in cuVS 26.08.

Parameters:: params – [in] cuvsKMeansParams_v2_t to allocate
Returns:: cuvsError_t

cuvsError_t cuvsKMeansParamsDestroy_v2(cuvsKMeansParams_v2_t params)#

De-allocate KMeans params allocated by cuvsKMeansParamsCreate_v2.

Parameters:: params – [in] cuvsKMeansParams_v2_t to de-allocate
Returns:: cuvsError_t

struct cuvsKMeansParams#

#include <cuvs/cluster/kmeans.h>

Hyper-parameters for the k-means algorithm. Note: the inertia_check field is kept only for ABI compatibility and is removed in cuvsKMeansParams_v2. The replacement is planned for cuVS 26.08.

Public Members

int n_clusters#: The number of clusters to form as well as the number of centroids to generate (default: 8).

cuvsKMeansInitMethod init#

Method for initialization, defaults to k-means++:

cuvsKMeansInitMethod::KMeansPlusPlus (k-means++): Use the scalable k-means++ algorithm to select the initial cluster centers.
cuvsKMeansInitMethod::Random (random): Choose ‘n_clusters’ observations (rows) at random from the input data for the initial centroids.
cuvsKMeansInitMethod::Array (ndarray): Use ‘centroids’ as initial cluster centers.

int max_iter#: Maximum number of iterations of the k-means algorithm for a single run.

double tol#: Relative tolerance with respect to inertia used to declare convergence.

int n_init#: Number of times the k-means algorithm is run with different seeds.

double oversampling_factor#: Oversampling factor for use in the k-means|| algorithm.

int batch_samples#: batch_samples and batch_centroids are used to tile the 1-nearest-neighbor (1NN) computation, which helps control the memory footprint. The default tile is [batch_samples x n_clusters]; that is, when batch_centroids is 0 the centroids are not tiled.

int batch_centroids#: If 0, batch_centroids is set to n_clusters.
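The tiling rule for batch_samples and batch_centroids can be sketched in plain C. This is an illustration only: resolve_tile, tile_shape, and tile_bytes are hypothetical helpers, not part of cuVS, and the one-float-per-pair footprint is an assumption made for the arithmetic.

```c
#include <stddef.h>

/* Hypothetical helper types, not part of cuVS. */
typedef struct {
  int rows; /* samples per tile */
  int cols; /* centroids per tile */
} tile_shape;

/* Resolve the tile shape: a batch_centroids of 0 means the centroids are
   not tiled, so the tile spans all n_clusters centroids. */
static tile_shape resolve_tile(int batch_samples, int batch_centroids, int n_clusters)
{
  tile_shape t;
  t.rows = batch_samples;
  t.cols = (batch_centroids == 0) ? n_clusters : batch_centroids;
  return t;
}

/* Bytes for one tile of pairwise distances, assuming (for this sketch only)
   one float per sample/centroid pair. */
static size_t tile_bytes(tile_shape t)
{
  return (size_t)t.rows * (size_t)t.cols * sizeof(float);
}
```

For example, resolve_tile(1024, 0, 8) yields a 1024 x 8 tile, while resolve_tile(1024, 4, 8) yields a 1024 x 4 tile and halves the per-tile footprint under the stated assumption.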

bool inertia_check#: Deprecated, ignored. Kept for ABI compatibility.

bool hierarchical#: Whether to use hierarchical (balanced) k-means.

int hierarchical_n_iters#: For hierarchical k-means, the number of training iterations.

int64_t streaming_batch_size#: Number of samples to process per GPU batch for the batched (host-data) API. When set to 0, defaults to n_samples (process all at once).
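To make the streaming_batch_size default concrete, here is a minimal sketch of how the effective batch size and the resulting batch count could be computed. Both functions are hypothetical helpers, not cuVS functions, and the ceiling division assumes that a trailing partial batch is processed as its own batch.

```c
#include <stdint.h>

/* Hypothetical helper, not part of cuVS: a streaming_batch_size of 0
   falls back to n_samples (process all samples at once). */
static int64_t effective_batch_size(int64_t streaming_batch_size, int64_t n_samples)
{
  return (streaming_batch_size == 0) ? n_samples : streaming_batch_size;
}

/* Number of batches needed to cover n_samples, assuming the last partial
   batch is processed separately (ceiling division). */
static int64_t num_batches(int64_t streaming_batch_size, int64_t n_samples)
{
  int64_t b = effective_batch_size(streaming_batch_size, n_samples);
  if (b <= 0) return 0; /* nothing to process, or invalid batch size */
  return (n_samples + b - 1) / b;
}
```

With 1000 samples, a streaming_batch_size of 256 gives four batches under this assumption (three full, one partial), and a value of 0 gives a single batch.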

int64_t init_size#: Number of samples to draw for KMeansPlusPlus initialization. When set to 0, uses the heuristic min(3 * n_clusters, n_samples) for host data, or n_samples for device data.

struct cuvsKMeansParams_v2#

#include <cuvs/cluster/kmeans.h>

Hyper-parameters for the k-means algorithm. This struct is slated for removal once cuvsKMeansParams is replaced in ABI 2.0.

Public Members

int n_clusters#: The number of clusters to form as well as the number of centroids to generate (default: 8).

cuvsKMeansInitMethod init#

Method for initialization, defaults to k-means++:

cuvsKMeansInitMethod::KMeansPlusPlus (k-means++): Use the scalable k-means++ algorithm to select the initial cluster centers.
cuvsKMeansInitMethod::Random (random): Choose ‘n_clusters’ observations (rows) at random from the input data for the initial centroids.
cuvsKMeansInitMethod::Array (ndarray): Use ‘centroids’ as initial cluster centers.

int max_iter#: Maximum number of iterations of the k-means algorithm for a single run.

double tol#: Relative tolerance with respect to inertia used to declare convergence.

int n_init#: Number of times the k-means algorithm is run with different seeds.

double oversampling_factor#: Oversampling factor for use in the k-means|| algorithm.

int batch_samples#: batch_samples and batch_centroids are used to tile the 1-nearest-neighbor (1NN) computation, which helps control the memory footprint. The default tile is [batch_samples x n_clusters]; that is, when batch_centroids is 0 the centroids are not tiled.

int batch_centroids#: If 0, batch_centroids is set to n_clusters.

bool hierarchical#: Whether to use hierarchical (balanced) k-means.

int hierarchical_n_iters#: For hierarchical k-means, the number of training iterations.

int64_t streaming_batch_size#: Number of samples to process per GPU batch for the batched (host-data) API. When set to 0, defaults to n_samples (process all at once).

int64_t init_size#: Number of samples to draw for KMeansPlusPlus initialization. When set to 0, uses the heuristic min(3 * n_clusters, n_samples) for host data, or n_samples for device data.
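The init_size default can be written out as a small function. This is a sketch of the rule as documented here, not the library's code; effective_init_size and its x_on_host flag are hypothetical names introduced for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of cuVS: resolves the number of samples
   drawn for KMeansPlusPlus initialization from the documented rule. */
static int64_t effective_init_size(int64_t init_size, int64_t n_clusters,
                                   int64_t n_samples, bool x_on_host)
{
  if (init_size != 0) return init_size; /* explicit value wins */
  if (!x_on_host) return n_samples;     /* device data: use every sample */
  int64_t heuristic = 3 * n_clusters;   /* host data: min(3 * k, n) */
  return (heuristic < n_samples) ? heuristic : n_samples;
}
```

For host data with n_clusters = 8 and n_samples = 1000, the rule selects 24 samples; with only 10 samples available it selects all 10.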

Functions#

#include <cuvs/cluster/kmeans.h>

cuvsError_t cuvsKMeansFit( cuvsResources_t res, cuvsKMeansParams_t params, DLManagedTensor *X, DLManagedTensor *sample_weight, DLManagedTensor *centroids, double *inertia, int *n_iter )#

Find clusters with k-means algorithm.

Initial centroids are chosen with the k-means++ algorithm. Empty clusters are reinitialized by choosing new centroids with the k-means++ algorithm.

X may reside in either host (CPU) or device (GPU) memory. When X is on the host, the data is streamed to the GPU in batches controlled by params->streaming_batch_size.

Note

In cuVS 26.08 (next ABI major version) this signature will be replaced by cuvsKMeansFit_v2.

Parameters:

res – [in] opaque C handle
params – [in] Parameters for KMeans model.
X – [in] Training instances to cluster. The data must be in row-major format. May be on host or device memory. [dim = n_samples x n_features]
sample_weight – [in] Optional weights for each observation in X. Must be on the same memory space as X. [len = n_samples]
centroids – [inout] On input, when init is cuvsKMeansInitMethod::Array, the initial cluster centers. On output, the centroids generated by the k-means algorithm, stored at the address pointed to by ‘centroids’. Must be on device. [dim = n_clusters x n_features]
inertia – [out] Sum of squared distances of samples to their closest cluster center.
n_iter – [out] Number of iterations run.
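The inertia output is defined as the sum of squared distances of samples to their closest cluster center. A host-side reference computation over row-major arrays might look like the sketch below, which can be handy for sanity-checking small inputs. It assumes squared Euclidean distance, float data, no sample weights, and at least one centroid; sq_dist and reference_inertia are hypothetical names, not cuVS functions.

```c
#include <float.h>
#include <stddef.h>

/* Squared Euclidean distance between two vectors of length n_features. */
static double sq_dist(const float *a, const float *b, size_t n_features)
{
  double s = 0.0;
  for (size_t j = 0; j < n_features; ++j) {
    double d = (double)a[j] - (double)b[j];
    s += d * d;
  }
  return s;
}

/* Hypothetical CPU reference: sums, over all samples, the squared distance
   to the closest centroid. X is row-major [n_samples x n_features] and
   centroids is row-major [n_clusters x n_features]; n_clusters must be >= 1. */
static double reference_inertia(const float *X, size_t n_samples,
                                const float *centroids, size_t n_clusters,
                                size_t n_features)
{
  double total = 0.0;
  for (size_t i = 0; i < n_samples; ++i) {
    double best = DBL_MAX;
    for (size_t c = 0; c < n_clusters; ++c) {
      double d = sq_dist(&X[i * n_features], &centroids[c * n_features], n_features);
      if (d < best) best = d;
    }
    total += best;
  }
  return total;
}
```

Row i of X starts at X[i * n_features], which is what the row-major requirement on X means in practice.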

cuvsError_t cuvsKMeansFit_v2( cuvsResources_t res, cuvsKMeansParams_v2_t params, DLManagedTensor *X, DLManagedTensor *sample_weight, DLManagedTensor *centroids, double *inertia, int *n_iter )#

Find clusters with k-means algorithm (v2 params layout).

Mirrors cuvsKMeansFit but takes cuvsKMeansParams_v2_t. Will become the unsuffixed cuvsKMeansFit in cuVS 26.08.

Parameters:

res – [in] opaque C handle
params – [in] Parameters for KMeans model (v2 layout).
X – [in] Training instances to cluster. The data must be in row-major format. May be on host or device memory. [dim = n_samples x n_features]
sample_weight – [in] Optional weights for each observation in X. Must be on the same memory space as X. [len = n_samples]
centroids – [inout] On input, when init is cuvsKMeansInitMethod::Array, the initial cluster centers. On output, the centroids generated by the k-means algorithm, stored at the address pointed to by ‘centroids’. Must be on device. [dim = n_clusters x n_features]
inertia – [out] Sum of squared distances of samples to their closest cluster center.
n_iter – [out] Number of iterations run.

cuvsError_t cuvsKMeansPredict( cuvsResources_t res, cuvsKMeansParams_t params, DLManagedTensor *X, DLManagedTensor *sample_weight, DLManagedTensor *centroids, DLManagedTensor *labels, bool normalize_weight, double *inertia )#

Predict the closest cluster each sample in X belongs to.

Note

In cuVS 26.08 (next ABI major version) this signature will be replaced by cuvsKMeansPredict_v2.

Parameters:

res – [in] opaque C handle
params – [in] Parameters for KMeans model.
X – [in] New data to predict. [dim = n_samples x n_features]
sample_weight – [in] Optional weights for each observation in X. [len = n_samples]
centroids – [in] Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features]
labels – [out] Index of the cluster each sample in X belongs to. [len = n_samples]
normalize_weight – [in] True if the weights should be normalized.
inertia – [out] Sum of squared distances of samples to their closest cluster center.
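Conceptually, prediction assigns each sample the index of its closest centroid. The following host-side sketch shows that assignment for row-major inputs. It assumes squared Euclidean distance, int labels, and lowest-index tie-breaking, none of which is stated in this reference; reference_predict is a hypothetical helper, not a cuVS function.

```c
#include <stddef.h>

/* Hypothetical CPU reference: writes into labels[i] the index of the
   centroid closest to sample i. X is row-major [n_samples x n_features],
   centroids is row-major [n_clusters x n_features]. Ties go to the
   lowest centroid index (an assumption of this sketch). */
static void reference_predict(const float *X, size_t n_samples,
                              const float *centroids, size_t n_clusters,
                              size_t n_features, int *labels)
{
  for (size_t i = 0; i < n_samples; ++i) {
    int best_c = 0;
    double best_d = -1.0; /* negative means "no candidate yet" */
    for (size_t c = 0; c < n_clusters; ++c) {
      double d = 0.0;
      for (size_t j = 0; j < n_features; ++j) {
        double diff = (double)X[i * n_features + j] -
                      (double)centroids[c * n_features + j];
        d += diff * diff;
      }
      if (best_d < 0.0 || d < best_d) {
        best_d = d;
        best_c = (int)c;
      }
    }
    labels[i] = best_c;
  }
}
```

Summing the winning distances over all samples gives the inertia value reported alongside the labels, under the same Euclidean assumption.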

cuvsError_t cuvsKMeansPredict_v2( cuvsResources_t res, cuvsKMeansParams_v2_t params, DLManagedTensor *X, DLManagedTensor *sample_weight, DLManagedTensor *centroids, DLManagedTensor *labels, bool normalize_weight, double *inertia )#

Predict the closest cluster each sample in X belongs to (v2 params layout).

Mirrors cuvsKMeansPredict but takes cuvsKMeansParams_v2_t. Will become the unsuffixed cuvsKMeansPredict in cuVS 26.08.

Parameters:

res – [in] opaque C handle
params – [in] Parameters for KMeans model (v2 layout).
X – [in] New data to predict. [dim = n_samples x n_features]
sample_weight – [in] Optional weights for each observation in X. [len = n_samples]
centroids – [in] Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features]
labels – [out] Index of the cluster each sample in X belongs to. [len = n_samples]
normalize_weight – [in] True if the weights should be normalized.
inertia – [out] Sum of squared distances of samples to their closest cluster center.

cuvsError_t cuvsKMeansClusterCost( cuvsResources_t res, DLManagedTensor *X, DLManagedTensor *centroids, double *cost )#

Compute cluster cost.

Parameters:

res – [in] opaque C handle
X – [in] Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features]
centroids – [in] Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features]
cost – [out] Resulting cluster cost