K-Means

Parameters

#include <cuvs/cluster/kmeans.h>

enum cuvsKMeansInitMethod

Values:

enumerator KMeansPlusPlus

Sample the centroids using the k-means++ strategy.

enumerator Random

Sample the centroids uniformly at random.

enumerator Array

User provides the array of initial centroids.

enum cuvsKMeansType

Type of k-means algorithm.

Values:

enumerator CUVS_KMEANS_TYPE_KMEANS

enumerator CUVS_KMEANS_TYPE_KMEANS_BALANCED

typedef struct cuvsKMeansParams *cuvsKMeansParams_t

cuvsError_t cuvsKMeansParamsCreate(cuvsKMeansParams_t *params)

Allocate KMeans params, and populate with default values.

Parameters:

params[in] cuvsKMeansParams_t to allocate

Returns:

cuvsError_t

cuvsError_t cuvsKMeansParamsDestroy(cuvsKMeansParams_t params)

De-allocate KMeans params.

Parameters:

params[in] cuvsKMeansParams_t to de-allocate

Returns:

cuvsError_t

struct cuvsKMeansParams
#include <cuvs/cluster/kmeans.h>

Hyper-parameters for the kmeans algorithm.

Public Members

int n_clusters

The number of clusters to form as well as the number of centroids to generate (default: 8).

cuvsKMeansInitMethod init

Method for initialization; defaults to k-means++:

  • KMeansPlusPlus (k-means++): Use the scalable k-means++ algorithm to select the initial cluster centers.

  • Random (random): Choose ‘n_clusters’ observations (rows) at random from the input data for the initial centroids.

  • Array (array): Use ‘centroids’ as the initial cluster centers.
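To make the k-means++ option more concrete, here is a minimal host-only sketch of one seeding step of the classic k-means++ strategy: the next center is a sample drawn with probability proportional to its squared distance to the nearest center chosen so far. This is not cuVS code, and the scalable variant named above (k-means||) differs in that it oversamples several candidates per round. The function name `kmeanspp_pick_next` and the use of `double` data are assumptions made for illustration only.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Squared distance from point x to its nearest center (centers: k x d, row-major). */
static double nearest_sq_dist(const double *x, const double *centers,
                              size_t k, size_t d) {
    double best = DBL_MAX;
    for (size_t c = 0; c < k; ++c) {
        double s = 0.0;
        for (size_t j = 0; j < d; ++j) {
            double diff = x[j] - centers[c * d + j];
            s += diff * diff;
        }
        if (s < best) best = s;
    }
    return best;
}

/*
 * Hypothetical helper: one seeding step of classic k-means++.
 * X       : n x d samples, row-major
 * centers : k x d centers already chosen (k >= 1), row-major
 * u       : uniform random draw in [0, 1), supplied by the caller
 * Returns the index of the sample selected as the next center.
 */
size_t kmeanspp_pick_next(const double *X, size_t n, size_t d,
                          const double *centers, size_t k, double u) {
    assert(n > 0 && d > 0 && k > 0 && u >= 0.0 && u < 1.0);

    /* First pass: total squared-distance mass over all samples. */
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += nearest_sq_dist(X + i * d, centers, k, d);

    /* Second pass: walk the cumulative sum until it exceeds u * total. */
    double target = u * total;
    double cum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        cum += nearest_sq_dist(X + i * d, centers, k, d);
        if (cum > target) return i;
    }
    /* Reached only when every sample coincides with a center (total == 0). */
    return n - 1;
}
```

Passing the random draw `u` in as an argument keeps the sketch deterministic, so the selection rule can be checked by hand on small inputs.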

int max_iter

Maximum number of iterations of the k-means algorithm for a single run.

double tol

Relative tolerance with regard to inertia to declare convergence.
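One plausible reading of a relative tolerance on inertia is sketched below: iteration stops once the change in inertia between two consecutive iterations is at most `tol` times the previous inertia. The exact stopping rule used by cuVS is not stated here, so treat this formula and the name `inertia_converged` as assumptions rather than the library's behavior.

```c
#include <stdbool.h>

/*
 * Hypothetical convergence test (an assumption, not the cuVS rule):
 * converged when |prev - cur| <= tol * prev.
 */
bool inertia_converged(double prev_inertia, double cur_inertia, double tol) {
    /* A non-positive previous inertia leaves nothing to improve. */
    if (prev_inertia <= 0.0) return true;

    double diff = prev_inertia - cur_inertia;
    if (diff < 0.0) diff = -diff;
    return diff <= tol * prev_inertia;
}
```

Because the threshold scales with the previous inertia, the same `tol` behaves consistently on datasets whose distances differ by orders of magnitude.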

int n_init

Number of times the k-means algorithm will be run with different seeds.

double oversampling_factor

Oversampling factor for use in the k-means|| algorithm.

int batch_samples

batch_samples and batch_centroids are used to tile the 1NN computation, which is useful for controlling the memory footprint. The default tile is [batch_samples x n_clusters]; that is, when batch_centroids is 0, the centroids are not tiled.

int batch_centroids

If 0, batch_centroids is set to n_clusters.
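The tiling described for batch_samples and batch_centroids comes down to simple arithmetic, sketched below. The helper names (`tile_shape`, `resolve_tile`, `tile_count`) are hypothetical. Treating a batch_samples of 0 as "use n_samples", and clamping oversized batch values to the matrix extent, are assumptions; only the batch_centroids = 0 rule is stated above.

```c
#include <assert.h>

/* Hypothetical helper type: shape of one distance tile. */
typedef struct {
    int rows; /* samples per tile   */
    int cols; /* centroids per tile */
} tile_shape;

/* Assumption: 0 (or a value larger than the extent) means "do not tile". */
static int resolve_batch(int batch, int extent) {
    return (batch <= 0 || batch > extent) ? extent : batch;
}

/* Resolve the effective tile shape from the two batch parameters. */
tile_shape resolve_tile(int n_samples, int n_clusters,
                        int batch_samples, int batch_centroids) {
    assert(n_samples > 0 && n_clusters > 0);
    tile_shape t;
    t.rows = resolve_batch(batch_samples, n_samples);
    t.cols = resolve_batch(batch_centroids, n_clusters);
    return t;
}

/* Tiles needed to cover the full n_samples x n_clusters distance matrix. */
int tile_count(int n_samples, int n_clusters, tile_shape t) {
    int row_tiles = (n_samples + t.rows - 1) / t.rows; /* ceiling division */
    int col_tiles = (n_clusters + t.cols - 1) / t.cols;
    return row_tiles * col_tiles;
}
```

For example, 100000 samples, 8 clusters, batch_samples = 32768 and batch_centroids = 0 resolve to a 32768 x 8 tile, which takes four tiles to cover all samples.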

bool inertia_check

Check inertia during iterations for early convergence.

bool hierarchical

Whether to use hierarchical (balanced) k-means or not.

int hierarchical_n_iters

For hierarchical k-means, defines the number of training iterations.

int64_t streaming_batch_size

Number of samples to process per GPU batch for the batched (host-data) API. When set to 0, defaults to n_samples (process all at once).
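The effect of streaming_batch_size on host data can be sketched as a partition of the rows into consecutive batches. The helpers below are hypothetical and only illustrate the arithmetic. The rule that 0 means n_samples comes from the description above; clamping a value larger than n_samples is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Effective batch size: 0 means "all samples at once" (clamping is an assumption). */
int64_t effective_batch_size(int64_t n_samples, int64_t streaming_batch_size) {
    assert(n_samples > 0 && streaming_batch_size >= 0);
    if (streaming_batch_size == 0 || streaming_batch_size > n_samples)
        return n_samples;
    return streaming_batch_size;
}

/* Number of batches needed to stream all samples. */
int64_t num_batches(int64_t n_samples, int64_t streaming_batch_size) {
    int64_t b = effective_batch_size(n_samples, streaming_batch_size);
    return (n_samples + b - 1) / b; /* ceiling division */
}

/* Rows in batch `batch_index`; only the last batch may be smaller. */
int64_t batch_rows(int64_t n_samples, int64_t streaming_batch_size,
                   int64_t batch_index) {
    int64_t b = effective_batch_size(n_samples, streaming_batch_size);
    int64_t begin = batch_index * b;
    assert(batch_index >= 0 && begin < n_samples);
    int64_t remaining = n_samples - begin;
    return remaining < b ? remaining : b;
}
```

With 1000 samples and a batch size of 300, this yields four batches of 300, 300, 300 and 100 rows.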

Functions

#include <cuvs/cluster/kmeans.h>

cuvsError_t cuvsKMeansFit(
    cuvsResources_t res,
    cuvsKMeansParams_t params,
    DLManagedTensor *X,
    DLManagedTensor *sample_weight,
    DLManagedTensor *centroids,
    double *inertia,
    int *n_iter
)

Find clusters with k-means algorithm.

By default, initial centroids are chosen with the k-means++ algorithm (see params->init). Empty clusters are reinitialized by choosing new centroids with the k-means++ algorithm.

X may reside on either host (CPU) or device (GPU) memory. When X is on the host the data is streamed to the GPU in batches controlled by params->streaming_batch_size.

Parameters:
  • res[in] opaque C handle

  • params[in] Parameters for KMeans model.

  • X[in] Training instances to cluster. The data must be in row-major format. May be on host or device memory. [dim = n_samples x n_features]

  • sample_weight[in] Optional weights for each observation in X. Must be on the same memory space as X. [len = n_samples]

  • centroids[inout] [in] When init is Array, used as the initial cluster centers. [out] The centroids generated by the k-means algorithm are stored at the address pointed to by ‘centroids’. Must be on device. [dim = n_clusters x n_features]

  • inertia[out] Sum of squared distances of samples to their closest cluster center.

  • n_iter[out] Number of iterations run.
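To illustrate how X, sample_weight and centroids relate in the update step of a k-means iteration, the following host-only sketch recomputes each centroid as the weighted mean of the samples assigned to it, using the row-major layouts listed above. It is not the cuVS implementation. The function name, the `float` element type, the caller-supplied `labels` array, and the choice to leave empty clusters unchanged (cuVS reinitializes them, per the description above) are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical reference update step.
 * X             : n_samples x n_features, row-major
 * sample_weight : len n_samples, or NULL for unit weights
 * labels        : len n_samples, each in [0, n_clusters)
 * centroids     : n_clusters x n_features, row-major, updated in place
 * Returns 0 on success, -1 if scratch memory cannot be allocated.
 */
int update_centroids(const float *X, const float *sample_weight,
                     const int *labels, size_t n_samples, size_t n_features,
                     size_t n_clusters, float *centroids) {
    double *sums = calloc(n_clusters * n_features, sizeof *sums);
    double *wsum = calloc(n_clusters, sizeof *wsum);
    if (!sums || !wsum) {
        free(sums);
        free(wsum);
        return -1;
    }

    /* Accumulate weighted feature sums and total weight per cluster. */
    for (size_t i = 0; i < n_samples; ++i) {
        assert(labels[i] >= 0 && (size_t)labels[i] < n_clusters);
        size_t c = (size_t)labels[i];
        double w = sample_weight ? sample_weight[i] : 1.0;
        wsum[c] += w;
        for (size_t j = 0; j < n_features; ++j)
            sums[c * n_features + j] += w * X[i * n_features + j];
    }

    /* Divide by the total weight; empty clusters keep their old centroid. */
    for (size_t c = 0; c < n_clusters; ++c) {
        if (wsum[c] <= 0.0) continue;
        for (size_t j = 0; j < n_features; ++j)
            centroids[c * n_features + j] =
                (float)(sums[c * n_features + j] / wsum[c]);
    }

    free(sums);
    free(wsum);
    return 0;
}
```

Element (i, j) of a row-major matrix sits at offset `i * n_features + j`, which is the indexing used throughout the sketch.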

cuvsError_t cuvsKMeansPredict(
    cuvsResources_t res,
    cuvsKMeansParams_t params,
    DLManagedTensor *X,
    DLManagedTensor *sample_weight,
    DLManagedTensor *centroids,
    DLManagedTensor *labels,
    bool normalize_weight,
    double *inertia
)

Predict the closest cluster each sample in X belongs to.

Parameters:
  • res[in] opaque C handle

  • params[in] Parameters for KMeans model.

  • X[in] New data to predict. [dim = n_samples x n_features]

  • sample_weight[in] Optional weights for each observation in X. [len = n_samples]

  • centroids[in] Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features]

  • labels[out] Index of the cluster each sample in X belongs to. [len = n_samples]

  • normalize_weight[in] True if the weights should be normalized.

  • inertia[out] Sum of squared distances of samples to their closest cluster center.
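The two outputs, labels and inertia, can be pinned down with a small host-only reference that assigns every sample to its nearest centroid and sums the squared distances. This is a sketch for checking results on small inputs, not the cuVS implementation. The function name and the `float` element type are assumptions, as is multiplying each squared distance by its sample weight. The normalize_weight option is not modeled.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/*
 * Hypothetical reference for nearest-centroid assignment.
 * X             : n_samples x n_features, row-major
 * sample_weight : len n_samples, or NULL for unit weights
 * centroids     : n_clusters x n_features, row-major
 * labels        : [out] len n_samples, index of the nearest centroid
 * Returns the inertia: the (weighted) sum of squared distances of the
 * samples to their closest centroid.
 */
double assign_labels(const float *X, const float *sample_weight,
                     const float *centroids, size_t n_samples,
                     size_t n_features, size_t n_clusters, int *labels) {
    assert(n_clusters > 0);
    double inertia = 0.0;

    for (size_t i = 0; i < n_samples; ++i) {
        double best = DBL_MAX;
        int best_c = 0;
        for (size_t c = 0; c < n_clusters; ++c) {
            double s = 0.0;
            for (size_t j = 0; j < n_features; ++j) {
                double diff = (double)X[i * n_features + j] -
                              (double)centroids[c * n_features + j];
                s += diff * diff;
            }
            /* Strict comparison: ties go to the lowest centroid index. */
            if (s < best) {
                best = s;
                best_c = (int)c;
            }
        }
        labels[i] = best_c;
        inertia += (sample_weight ? sample_weight[i] : 1.0) * best;
    }
    return inertia;
}
```

Comparing this reference against the library's labels and inertia on a small dataset is a cheap way to confirm that tensor shapes and memory layout were passed as intended.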

cuvsError_t cuvsKMeansClusterCost(
    cuvsResources_t res,
    DLManagedTensor *X,
    DLManagedTensor *centroids,
    double *cost
)

Compute cluster cost.

Parameters:
  • res[in] opaque C handle

  • X[in] Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features]

  • centroids[in] Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features]

  • cost[out] Resulting cluster cost