Attention

The vector search and clustering algorithms in RAFT are being migrated to cuVS, a new library dedicated to vector search. We will continue to support the vector search algorithms in RAFT during this move, but will no longer update them after the RAPIDS 24.06 (June) release. We plan to complete the migration by the RAPIDS 24.10 (October) release, and the algorithms will be removed from RAFT altogether in the 24.12 (December) release.

Cluster

This page provides pylibraft class references for the publicly-exposed elements of the pylibraft.cluster package.

KMeans

Specifies hyper-parameters for the kmeans algorithm.

Parameters:

n_clusters (int, optional): The number of clusters to form, which is also the number of centroids to generate.
max_iter (int, optional): Maximum number of iterations of the k-means algorithm for a single run.
tol (float, optional): Relative tolerance with regard to inertia to declare convergence.
verbosity (int, optional)
seed (int, optional): Seed for the random number generator.
metric (str, optional): Name of the metric to use for distance computation; see pylibraft.distance.pairwise_distance() for valid values.
init (InitMethod, optional)
n_init (int, optional): Number of times the k-means algorithm will be run with different seeds.
oversampling_factor (float, optional): Oversampling factor for use in the k-means algorithm.

Attributes:

batch_centroids
batch_samples
inertia_check
init
max_iter
n_clusters
oversampling_factor
seed
tol
verbosity

pylibraft.cluster.kmeans.fit(KMeansParams params, X, centroids=None, sample_weights=None, handle=None)

Find clusters with the k-means algorithm

Parameters:

params (KMeansParams): Parameters to use to fit the KMeans model.
X: Input CUDA array interface compliant matrix, shape (m, k).
centroids: Optional writable CUDA array interface compliant matrix, shape (n_clusters, k).
sample_weights: Optional input CUDA array interface compliant matrix, shape (m, 1), holding one weight per sample in X. Default: None.
handle: Optional RAFT resource handle for reusing CUDA resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to synchronize explicitly by calling handle.sync() before accessing the output.

Returns:

centroids (raft.device_ndarray): The computed centroids for each cluster.
inertia (float): Sum of squared distances of samples to their closest cluster center.
n_iter (int): The number of iterations used to fit the model.

Examples

>>> import cupy as cp
>>> from pylibraft.cluster.kmeans import fit, KMeansParams
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)

>>> params = KMeansParams(n_clusters=n_clusters)
>>> centroids, inertia, n_iter = fit(params, X)
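
To make the three return values concrete, here is a minimal CPU-only sketch of a plain Lloyd-style k-means loop in pure Python. It is not the RAFT implementation: the function name kmeans_reference, the random initialization, and the exact stopping rule (stop when the relative drop in inertia falls to tol or below) are assumptions made for illustration only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cost(X, centroids):
    """Sum of squared distances of samples to their closest centroid."""
    return sum(min(squared_distance(row, c) for c in centroids) for row in X)


def kmeans_reference(X, n_clusters, max_iter=300, tol=1e-4, seed=0):
    """Hypothetical reference loop returning (centroids, inertia, n_iter)."""
    rng = random.Random(seed)
    # Assumption: initialize from randomly chosen samples.
    centroids = [list(row) for row in rng.sample(X, n_clusters)]
    inertia = cost(X, centroids)
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # Assignment step: index of the closest centroid for each sample.
        labels = [
            min(range(n_clusters), key=lambda c: squared_distance(row, centroids[c]))
            for row in X
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(n_clusters):
            members = [row for row, label in zip(X, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        new_inertia = cost(X, centroids)
        # Assumed stopping rule: relative improvement in inertia <= tol.
        converged = inertia - new_inertia <= tol * inertia
        inertia = new_inertia
        if converged:
            break
    return centroids, inertia, n_iter


X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, inertia, n_iter = kmeans_reference(X, n_clusters=2)
```

The returned tuple mirrors the order documented above: the centroids, the inertia with respect to those centroids, and the number of iterations that were run.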

pylibraft.cluster.kmeans.cluster_cost(X, centroids, handle=None)

Compute cluster cost given an input matrix and existing centroids

Parameters:

X: Input CUDA array interface compliant matrix, shape (m, k).
centroids: Input CUDA array interface compliant matrix, shape (n_clusters, k).
handle: Optional RAFT resource handle for reusing CUDA resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to synchronize explicitly by calling handle.sync() before accessing the output.

Examples

>>> import cupy as cp
>>> from pylibraft.cluster.kmeans import cluster_cost
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> centroids = cp.random.random_sample((n_clusters, n_features),
...                                      dtype=cp.float32)
>>> inertia = cluster_cost(X, centroids)
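
As a sanity check on what the cost represents, the following pure-Python sketch computes the quantity described for inertia above: the sum, over all samples, of the squared Euclidean distance to the closest centroid. The name cluster_cost_reference is hypothetical, and the sketch assumes the squared Euclidean metric; it is meant for checking small inputs by hand, not as a description of the GPU implementation.

```python
def cluster_cost_reference(X, centroids):
    """Sum over samples of the squared distance to the nearest centroid."""
    total = 0.0
    for row in X:
        # Squared Euclidean distance from this sample to every centroid.
        distances = [
            sum((x - c) ** 2 for x, c in zip(row, centroid))
            for centroid in centroids
        ]
        total += min(distances)
    return total


X = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]]
centroids = [[1.0, 0.0], [10.0, 0.0]]
reference_cost = cluster_cost_reference(X, centroids)
```

Here the first two samples each lie at squared distance 1 from the first centroid and the third sample coincides with the second centroid, so the cost is 2.0.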

pylibraft.cluster.kmeans.compute_new_centroids(X, centroids, labels, new_centroids, sample_weights=None, weight_per_cluster=None, handle=None)

Compute new centroids given an input matrix and existing centroids

Parameters:

X: Input CUDA array interface compliant matrix, shape (m, k).
centroids: Input CUDA array interface compliant matrix, shape (n_clusters, k).
labels: Input CUDA array interface compliant matrix, shape (m, 1).
new_centroids: Writable CUDA array interface compliant matrix, shape (n_clusters, k).
sample_weights: Optional input CUDA array interface compliant matrix, shape (m, 1), holding one weight per sample in X. Default: None.
weight_per_cluster: Optional writable CUDA array interface compliant matrix, shape (n_clusters, 1). Default: None.
batch_samples: Optional integer specifying the batch size for X to compute distances in batches. Default: m.
batch_centroids: Optional integer specifying the batch size for centroids to compute distances in batches. Default: n_clusters.
handle: Optional RAFT resource handle for reusing CUDA resources. If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to synchronize explicitly by calling handle.sync() before accessing the output.

Examples

>>> import cupy as cp
>>> from pylibraft.common import Handle
>>> from pylibraft.cluster.kmeans import compute_new_centroids
>>> # A single RAFT handle can optionally be reused across
>>> # pylibraft functions.
>>> handle = Handle()
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> centroids = cp.random.random_sample((n_clusters, n_features),
...                                      dtype=cp.float32)
>>> labels = cp.random.randint(0, high=n_clusters, size=n_samples,
...                            dtype=cp.int32)
>>> new_centroids = cp.empty((n_clusters, n_features),
...                          dtype=cp.float32)
>>> compute_new_centroids(
...     X, centroids, labels, new_centroids, handle=handle
... )
>>> # pylibraft functions are often asynchronous so the
>>> # handle needs to be explicitly synchronized
>>> handle.sync()
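
The centroid update itself can be sketched on the CPU as a weighted mean of the samples assigned to each cluster. The pure-Python function below is an illustration under stated assumptions, not the pylibraft code: the name compute_new_centroids_reference is hypothetical, it returns its results instead of writing into new_centroids and weight_per_cluster, and it assumes that a cluster with no assigned weight keeps its previous centroid.

```python
def compute_new_centroids_reference(X, centroids, labels, sample_weights=None):
    """Return (new_centroids, weight_per_cluster) for the given labels."""
    n_clusters = len(centroids)
    n_features = len(centroids[0])
    if sample_weights is None:
        # Unweighted case: every sample counts once.
        sample_weights = [1.0] * len(X)

    sums = [[0.0] * n_features for _ in range(n_clusters)]
    weight_per_cluster = [0.0] * n_clusters
    for row, label, weight in zip(X, labels, sample_weights):
        weight_per_cluster[label] += weight
        for j, value in enumerate(row):
            sums[label][j] += weight * value

    new_centroids = []
    for c in range(n_clusters):
        if weight_per_cluster[c] > 0:
            new_centroids.append([s / weight_per_cluster[c] for s in sums[c]])
        else:
            # Assumption: an empty cluster keeps its existing centroid.
            new_centroids.append(list(centroids[c]))
    return new_centroids, weight_per_cluster


X = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]]
centroids = [[1.0, 1.0], [9.0, 9.0], [5.0, 5.0]]
labels = [0, 0, 1]
new_centroids, weight_per_cluster = compute_new_centroids_reference(
    X, centroids, labels
)
weighted_centroids, _ = compute_new_centroids_reference(
    X, centroids, labels, sample_weights=[1.0, 3.0, 1.0]
)
```

With unit weights, the first cluster moves to the mean of its two samples; with weights 1 and 3, it is pulled toward the more heavily weighted sample.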