K-Means#
Parameters#
#include <cuvs/cluster/kmeans.h>
-
enum cuvsKMeansInitMethod#
Values:
-
enumerator KMeansPlusPlus#
Sample the centroids using the kmeans++ strategy
-
enumerator Random#
Sample the centroids uniformly at random
-
enumerator Array#
User provides the array of initial centroids
-
enum cuvsKMeansType#
Type of k-means algorithm.
Values:
-
enumerator CUVS_KMEANS_TYPE_KMEANS#
-
enumerator CUVS_KMEANS_TYPE_KMEANS_BALANCED#
-
typedef struct cuvsKMeansParams *cuvsKMeansParams_t#
-
cuvsError_t cuvsKMeansParamsCreate(cuvsKMeansParams_t *params)#
Allocate KMeans params, and populate with default values.
- Parameters:
params – [in] cuvsKMeansParams_t to allocate
- Returns:
-
cuvsError_t cuvsKMeansParamsDestroy(cuvsKMeansParams_t params)#
De-allocate KMeans params.
- Parameters:
params – [in] cuvsKMeansParams_t to de-allocate
- Returns:
-
struct cuvsKMeansParams#
- #include <cuvs/cluster/kmeans.h>
Hyper-parameters for the kmeans algorithm.
Public Members
-
int n_clusters#
The number of clusters to form as well as the number of centroids to generate (default:8).
-
cuvsKMeansInitMethod init#
Method for initialization. Defaults to k-means++.
KMeansPlusPlus (k-means++): Use the scalable k-means++ algorithm to select the initial cluster centers.
Random (random): Choose n_clusters observations (rows) at random from the input data as the initial centroids.
Array (ndarray): Use the caller-supplied centroids as the initial cluster centers.
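To make the k-means++ option more concrete: the strategy draws each new center with probability proportional to a sample's squared distance from the nearest center already chosen. The sketch below computes those unnormalized sampling weights on the CPU. `kmeanspp_weights` is a hypothetical helper written for illustration, not part of cuVS, and the scalable GPU variant cuVS uses may differ in detail.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Hypothetical helper, not part of cuVS. For each row of X (row-major,
 * n_samples x n_features), store the squared Euclidean distance to the
 * nearest of the n_chosen centers selected so far. k-means++ samples the
 * next center with probability proportional to these values. */
void kmeanspp_weights(const float *X, size_t n_samples, size_t n_features,
                      const float *chosen, size_t n_chosen, double *weights)
{
    assert(X && chosen && weights && n_chosen > 0);
    for (size_t i = 0; i < n_samples; ++i) {
        double best = DBL_MAX;
        for (size_t c = 0; c < n_chosen; ++c) {
            double d2 = 0.0;
            for (size_t f = 0; f < n_features; ++f) {
                double diff = (double)X[i * n_features + f] -
                              (double)chosen[c * n_features + f];
                d2 += diff * diff;
            }
            if (d2 < best) best = d2;
        }
        weights[i] = best;
    }
}
```

A sample that coincides with an existing center gets weight 0 and so is never drawn again, which is what spreads the initial centers apart.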
-
int max_iter#
Maximum number of iterations of the k-means algorithm for a single run.
-
double tol#
Relative tolerance with regards to inertia to declare convergence.
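One plausible reading of a relative tolerance on inertia is sketched below: stop when the inertia changed by at most `tol` times its previous value. This page does not spell out the exact criterion cuVS applies, so treat `inertia_converged` as a hypothetical illustration rather than the library's rule.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not part of cuVS. Declares convergence when the
 * inertia changed by at most tol relative to the previous iteration.
 * This is an assumed interpretation of "relative tolerance". */
bool inertia_converged(double prev_inertia, double cur_inertia, double tol)
{
    assert(tol >= 0.0 && prev_inertia >= 0.0 && cur_inertia >= 0.0);
    double diff = prev_inertia - cur_inertia;
    if (diff < 0.0) diff = -diff;
    return diff <= tol * prev_inertia;
}
```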
-
int n_init#
Number of times the k-means algorithm will be run with different seeds.
-
double oversampling_factor#
Oversampling factor for use in the k-means|| algorithm.
-
int batch_samples#
batch_samples and batch_centroids are used to tile the 1NN computation, which is useful for optimizing or controlling the memory footprint. The default tile is [batch_samples x n_clusters]; that is, when batch_centroids is 0, the centroids are not tiled.
-
int batch_centroids#
If 0, batch_centroids is set to n_clusters.
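The effect of tiling can be illustrated with a CPU reference: the nearest-centroid search visits the distance matrix one [batch_samples x batch_centroids] tile at a time and keeps a running minimum per sample, so only one tile of distances is live at once. `tiled_nearest_centroid` is a hypothetical sketch, not cuVS code. Treating batch_samples == 0 as "all samples" is an assumption made here for symmetry; only the batch_centroids default is documented above.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Hypothetical CPU reference, not cuVS code. X is row-major
 * [n_samples x n_features], centroids is row-major [n_clusters x n_features].
 * labels[i] receives the index of the nearest centroid and min_dist2[i]
 * the squared distance to it. */
void tiled_nearest_centroid(const float *X, size_t n_samples, size_t n_features,
                            const float *centroids, size_t n_clusters,
                            size_t batch_samples, size_t batch_centroids,
                            int *labels, double *min_dist2)
{
    assert(X && centroids && labels && min_dist2);
    if (batch_samples == 0) batch_samples = n_samples;      /* assumption */
    if (batch_centroids == 0) batch_centroids = n_clusters; /* documented */

    for (size_t i = 0; i < n_samples; ++i) {
        labels[i] = -1;
        min_dist2[i] = DBL_MAX;
    }
    for (size_t s0 = 0; s0 < n_samples; s0 += batch_samples) {
        size_t s1 = (n_samples - s0 < batch_samples) ? n_samples
                                                     : s0 + batch_samples;
        for (size_t c0 = 0; c0 < n_clusters; c0 += batch_centroids) {
            size_t c1 = (n_clusters - c0 < batch_centroids) ? n_clusters
                                                            : c0 + batch_centroids;
            /* Process one tile and fold it into the running minimum. */
            for (size_t i = s0; i < s1; ++i) {
                for (size_t c = c0; c < c1; ++c) {
                    double d2 = 0.0;
                    for (size_t f = 0; f < n_features; ++f) {
                        double diff = (double)X[i * n_features + f] -
                                      (double)centroids[c * n_features + f];
                        d2 += diff * diff;
                    }
                    if (d2 < min_dist2[i]) {
                        min_dist2[i] = d2;
                        labels[i] = (int)c;
                    }
                }
            }
        }
    }
}
```

The tile sizes change only how much intermediate data is live at once; the labels and distances are the same as for an untiled search.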
-
bool inertia_check#
Check inertia during iterations for early convergence.
-
bool hierarchical#
Whether to use hierarchical (balanced) k-means.
-
int hierarchical_n_iters#
For hierarchical k-means, defines the number of training iterations.
-
int64_t streaming_batch_size#
Number of samples to process per GPU batch for the batched (host-data) API. When set to 0, defaults to n_samples (process all at once).
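As a small worked example of the batching arithmetic, the sketch below computes how many batches a host-resident dataset would be split into, applying the documented rule that 0 means "all samples at once". `num_streaming_batches` is a hypothetical helper, not a cuVS function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of cuVS. Returns the number of batches
 * needed to cover n_samples rows, where a streaming_batch_size of 0
 * falls back to n_samples (a single batch). */
int64_t num_streaming_batches(int64_t n_samples, int64_t streaming_batch_size)
{
    assert(n_samples >= 0 && streaming_batch_size >= 0);
    if (n_samples == 0) return 0;
    if (streaming_batch_size == 0) streaming_batch_size = n_samples;
    /* Ceiling division written to avoid overflow for large batch sizes. */
    return n_samples / streaming_batch_size +
           (n_samples % streaming_batch_size != 0 ? 1 : 0);
}
```

For instance, 1000 samples with a batch size of 256 give three full batches plus one partial batch of 232 rows.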
Functions#
#include <cuvs/cluster/kmeans.h>
-
cuvsError_t cuvsKMeansFit(cuvsResources_t res, cuvsKMeansParams_t params, DLManagedTensor *X, DLManagedTensor *sample_weight, DLManagedTensor *centroids, double *inertia, int *n_iter)#
Find clusters with k-means algorithm.
Initial centroids are chosen with the k-means++ algorithm. Empty clusters are reinitialized by choosing new centroids with the k-means++ algorithm.
X may reside in either host (CPU) or device (GPU) memory. When X is on the host, the data is streamed to the GPU in batches whose size is controlled by params->streaming_batch_size.
- Parameters:
res – [in] opaque C handle
params – [in] Parameters for KMeans model.
X – [in] Training instances to cluster. The data must be in row-major format. May be on host or device memory. [dim = n_samples x n_features]
sample_weight – [in] Optional weights for each observation in X. Must be on the same memory space as X. [len = n_samples]
centroids – [inout] On input, when init is Array, the initial cluster centers. On output, the centroids generated by the k-means algorithm are stored at the address pointed to by ‘centroids’. Must be on device. [dim = n_clusters x n_features]
inertia – [out] Sum of squared distances of samples to their closest cluster center.
n_iter – [out] Number of iterations run.
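The roles of centroids (in/out), inertia and n_iter can be seen in a minimal CPU reference of Lloyd iterations that starts from caller-supplied centroids, as in the Array init case. `lloyd_fit` is a hypothetical sketch, not the cuVS implementation: it ignores sample weights, stops only when no centroid moves or max_iter is reached, and leaves empty clusters unchanged instead of reinitializing them.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical CPU reference, not cuVS code. Overwrites centroids
 * (row-major, n_clusters x n_features) in place, writes the sum of squared
 * distances from the last assignment step to *inertia, and returns the
 * number of iterations run. */
int lloyd_fit(const float *X, size_t n_samples, size_t n_features,
              float *centroids, size_t n_clusters, int max_iter, double *inertia)
{
    assert(X && centroids && inertia);
    assert(n_clusters > 0 && n_features > 0 && max_iter > 0);
    double *sums = malloc(n_clusters * n_features * sizeof *sums);
    size_t *counts = malloc(n_clusters * sizeof *counts);
    assert(sums && counts);

    int iter = 0;
    int changed = 1;
    while (changed && iter < max_iter) {
        ++iter;
        for (size_t j = 0; j < n_clusters * n_features; ++j) sums[j] = 0.0;
        for (size_t c = 0; c < n_clusters; ++c) counts[c] = 0;
        *inertia = 0.0;

        /* Assignment step: nearest centroid per sample. */
        for (size_t i = 0; i < n_samples; ++i) {
            size_t best_c = 0;
            double best = DBL_MAX;
            for (size_t c = 0; c < n_clusters; ++c) {
                double d2 = 0.0;
                for (size_t f = 0; f < n_features; ++f) {
                    double diff = (double)X[i * n_features + f] -
                                  (double)centroids[c * n_features + f];
                    d2 += diff * diff;
                }
                if (d2 < best) { best = d2; best_c = c; }
            }
            *inertia += best;
            counts[best_c]++;
            for (size_t f = 0; f < n_features; ++f)
                sums[best_c * n_features + f] += X[i * n_features + f];
        }

        /* Update step: move each non-empty cluster to the mean of its samples. */
        changed = 0;
        for (size_t c = 0; c < n_clusters; ++c) {
            if (counts[c] == 0) continue; /* simplification: keep old centroid */
            for (size_t f = 0; f < n_features; ++f) {
                float m = (float)(sums[c * n_features + f] / (double)counts[c]);
                if (m != centroids[c * n_features + f]) {
                    centroids[c * n_features + f] = m;
                    changed = 1;
                }
            }
        }
    }
    free(sums);
    free(counts);
    return iter;
}
```

With the four one-dimensional samples 0, 0.2, 10, 10.2 and initial centroids 0 and 10, the first iteration moves the centroids to about 0.1 and 10.1, and the second confirms that nothing changes.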
-
cuvsError_t cuvsKMeansPredict(cuvsResources_t res, cuvsKMeansParams_t params, DLManagedTensor *X, DLManagedTensor *sample_weight, DLManagedTensor *centroids, DLManagedTensor *labels, bool normalize_weight, double *inertia)#
Predict the closest cluster each sample in X belongs to.
- Parameters:
res – [in] opaque C handle
params – [in] Parameters for KMeans model.
X – [in] New data to predict. [dim = n_samples x n_features]
sample_weight – [in] Optional weights for each observation in X. [len = n_samples]
centroids – [in] Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features]
labels – [out] Index of the cluster each sample in X belongs to. [len = n_samples]
normalize_weight – [in] True if the weights should be normalized
inertia – [out] Sum of squared distances of samples to their closest cluster center.
-
cuvsError_t cuvsKMeansClusterCost(cuvsResources_t res, DLManagedTensor *X, DLManagedTensor *centroids, double *cost)#
Compute cluster cost.
- Parameters:
res – [in] opaque C handle
X – [in] Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features]
centroids – [in] Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features]
cost – [out] Resulting cluster cost
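Assuming the cluster cost is the same quantity described for inertia elsewhere on this page (the sum of squared distances from each sample to its closest centroid), a CPU reference looks like the sketch below. `cluster_cost` is a hypothetical helper for checking results on small inputs, not a cuVS function.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Hypothetical CPU reference, not cuVS code. X is row-major
 * [n_samples x n_features], centroids is row-major
 * [n_clusters x n_features]. Returns the sum over samples of the squared
 * Euclidean distance to the nearest centroid (assumed cost definition). */
double cluster_cost(const float *X, size_t n_samples, size_t n_features,
                    const float *centroids, size_t n_clusters)
{
    assert(X && centroids && n_clusters > 0);
    double cost = 0.0;
    for (size_t i = 0; i < n_samples; ++i) {
        double best = DBL_MAX;
        for (size_t c = 0; c < n_clusters; ++c) {
            double d2 = 0.0;
            for (size_t f = 0; f < n_features; ++f) {
                double diff = (double)X[i * n_features + f] -
                              (double)centroids[c * n_features + f];
                d2 += diff * diff;
            }
            if (d2 < best) best = d2;
        }
        cost += best;
    }
    return cost;
}
```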