Attention

The vector search and clustering algorithms in RAFT are being migrated to cuVS, a new library dedicated to vector search. We will continue to support the vector search algorithms in RAFT during this move, but will no longer update them after the RAPIDS 24.06 (June) release. We plan to complete the migration by the RAPIDS 24.10 (October) release, and the algorithms will be removed from RAFT altogether in the 24.12 (December) release.

Cluster

This page provides pylibraft class references for the publicly-exposed elements of the pylibraft.cluster package.

KMeans

class pylibraft.cluster.kmeans.KMeansParams(n_clusters: int | None = None, max_iter: int | None = None, tol: float | None = None, verbosity: int | None = None, seed: int | None = None, metric: str | None = None, init: InitMethod | None = None, n_init: int | None = None, oversampling_factor: float | None = None, batch_samples: int | None = None, batch_centroids: int | None = None, inertia_check: bool | None = None)

Specifies hyper-parameters for the kmeans algorithm.

Parameters:
n_clusters : int, optional

The number of clusters to form as well as the number of centroids to generate

max_iter : int, optional

Maximum number of iterations of the k-means algorithm for a single run

tol : float, optional

Relative tolerance with regard to inertia to declare convergence

verbosity : int, optional

seed : int, optional

Seed for the random number generator.

metric : str, optional

Name of the metric to use for distance computation; see pylibraft.distance.pairwise_distance() for valid values.

init : InitMethod, optional

n_init : int, optional

Number of times the k-means algorithm will be run with different seeds.

oversampling_factor : float, optional

Oversampling factor for use in the k-means algorithm

Attributes:
batch_centroids
batch_samples
inertia_check
init
max_iter
n_clusters
oversampling_factor
seed
tol
verbosity
pylibraft.cluster.kmeans.fit(KMeansParams params, X, centroids=None, sample_weights=None, handle=None)

Find clusters with the k-means algorithm

Parameters:
params : KMeansParams

Parameters to use to fit KMeans model

X : Input CUDA array interface compliant matrix, shape (m, k)

centroids : Optional writable CUDA array interface compliant matrix, shape (n_clusters, k)

sample_weights : Optional input CUDA array interface compliant matrix, shape (n_clusters, 1). Default: None

handle : Optional RAFT resource handle for reusing CUDA resources.

If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.

Returns:
centroids : raft.device_ndarray

The computed centroids for each cluster

inertia : float

Sum of squared distances of samples to their closest cluster center

n_iter : int

The number of iterations used to fit the model

Examples

>>> import cupy as cp
>>> from pylibraft.cluster.kmeans import fit, KMeansParams
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> params = KMeansParams(n_clusters=n_clusters)
>>> centroids, inertia, n_iter = fit(params, X)
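
To make the roles of n_clusters, max_iter, tol and seed concrete, here is a minimal CPU sketch of Lloyd's k-means in plain NumPy that returns the same kind of (centroids, inertia, n_iter) triple. It is not pylibraft's implementation: the function name lloyd_kmeans, the random-row initialization, and the exact stopping rule (relative improvement in inertia at most tol) are assumptions chosen for illustration.

```python
import numpy as np


def lloyd_kmeans(X, n_clusters, max_iter=300, tol=1e-4, seed=0):
    """Plain NumPy Lloyd iterations (illustrative; not pylibraft's code)."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # Assumed initialization: n_clusters distinct rows of X chosen at random.
    centroids = X[rng.choice(m, size=n_clusters, replace=False)].copy()
    prev_inertia = np.inf
    inertia, n_iter = np.inf, 0
    for n_iter in range(1, max_iter + 1):
        # Assign every sample to its nearest centroid (squared Euclidean).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each non-empty cluster's centroid to the mean of its samples.
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
        # Inertia of the updated centroids.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        inertia = float(d2.min(axis=1).sum())
        # Assumed stopping rule: relative improvement in inertia at most tol.
        if np.isfinite(prev_inertia) and prev_inertia - inertia <= tol * prev_inertia:
            break
        prev_inertia = inertia
    return centroids, inertia, n_iter


rng = np.random.default_rng(0)
X = rng.random((500, 5), dtype=np.float32)
centroids, inertia, n_iter = lloyd_kmeans(X, n_clusters=3, max_iter=50, seed=42)
```

Running the sketch twice with the same seed gives identical results, which is the practical purpose of the seed hyper-parameter; a looser tol or a smaller max_iter stops the loop earlier at the price of a higher inertia.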
pylibraft.cluster.kmeans.cluster_cost(X, centroids, handle=None)

Compute cluster cost given an input matrix and existing centroids

Parameters:
X : Input CUDA array interface compliant matrix, shape (m, k)

centroids : Input CUDA array interface compliant matrix, shape (n_clusters, k)

handle : Optional RAFT resource handle for reusing CUDA resources.

If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.

Examples

>>> import cupy as cp
>>> from pylibraft.cluster.kmeans import cluster_cost
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> centroids = cp.random.random_sample((n_clusters, n_features),
...                                      dtype=cp.float32)
>>> inertia = cluster_cost(X, centroids)
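
As a rough CPU cross-check of what the cost represents, the NumPy sketch below computes the sum of squared Euclidean distances from each sample to its nearest centroid, matching the inertia definition given for fit. The helper name cluster_cost_reference is hypothetical, and the squared Euclidean metric is an assumption; this is not the pylibraft implementation.

```python
import numpy as np


def cluster_cost_reference(X, centroids):
    """Sum of squared distances to the nearest centroid (illustrative)."""
    # Squared Euclidean distances, shape (m, n_clusters).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Each sample contributes its distance to the closest centroid only.
    return float(d2.min(axis=1).sum())


# Three samples, two centroids: only the middle sample is off-centroid.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]], dtype=np.float32)
centroids = np.array([[0.0, 0.0], [10.0, 0.0]], dtype=np.float32)
cost = cluster_cost_reference(X, centroids)
```

On small inputs, a value like this can be compared against the GPU result after copying the arrays to the host, keeping in mind that float32 rounding may differ slightly.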
pylibraft.cluster.kmeans.compute_new_centroids(X, centroids, labels, new_centroids, sample_weights=None, weight_per_cluster=None, handle=None)

Compute new centroids given an input matrix and existing centroids

Parameters:
X : Input CUDA array interface compliant matrix, shape (m, k)

centroids : Input CUDA array interface compliant matrix, shape (n_clusters, k)

labels : Input CUDA array interface compliant matrix, shape (m, 1)

new_centroids : Writable CUDA array interface compliant matrix, shape (n_clusters, k)

sample_weights : Optional input CUDA array interface compliant matrix, shape (n_clusters, 1). Default: None

weight_per_cluster : Optional writable CUDA array interface compliant matrix, shape (n_clusters, 1). Default: None

batch_samples : Optional integer specifying the batch size for X to compute distances in batches. Default: m

batch_centroids : Optional integer specifying the batch size for centroids to compute distances in batches. Default: n_clusters

handle : Optional RAFT resource handle for reusing CUDA resources.

If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.

Examples

>>> import cupy as cp
>>> from pylibraft.common import Handle
>>> from pylibraft.cluster.kmeans import compute_new_centroids
>>> # A single RAFT handle can optionally be reused across
>>> # pylibraft functions.
>>> handle = Handle()
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> centroids = cp.random.random_sample((n_clusters, n_features),
...                                      dtype=cp.float32)
>>> labels = cp.random.randint(0, high=n_clusters, size=n_samples,
...                            dtype=cp.int32)
>>> new_centroids = cp.empty((n_clusters, n_features),
...                          dtype=cp.float32)
>>> compute_new_centroids(
...     X, centroids, labels, new_centroids, handle=handle
... )
>>> # pylibraft functions are often asynchronous so the
>>> # handle needs to be explicitly synchronized
>>> handle.sync()
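
For intuition about the centroid update itself, the following NumPy sketch recomputes each centroid as the unweighted mean of the samples assigned to it. The helper name new_centroids_reference is hypothetical, sample weights are ignored, and keeping the previous centroid for an empty cluster is an assumption made for this sketch; the pylibraft function may handle these cases differently.

```python
import numpy as np


def new_centroids_reference(X, centroids, labels):
    """Per-cluster mean of the assigned samples (illustrative, unweighted)."""
    n_clusters = centroids.shape[0]
    # Number of samples assigned to each cluster.
    counts = np.bincount(labels, minlength=n_clusters)
    # Sum of the samples assigned to each cluster, accumulated in float64.
    sums = np.zeros(centroids.shape, dtype=np.float64)
    np.add.at(sums, labels, X)
    # Assumption: clusters with no samples keep their previous centroid.
    new_centroids = centroids.astype(np.float64)
    nonempty = counts > 0
    new_centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_centroids.astype(centroids.dtype)


X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]], dtype=np.float32)
centroids = np.array([[1.0, 1.0], [5.0, 5.0], [7.0, 7.0]], dtype=np.float32)
labels = np.array([0, 0, 1], dtype=np.int32)
updated = new_centroids_reference(X, centroids, labels)
```

Here the third cluster receives no samples, so its row is carried over unchanged, while the first two rows become the means of their assigned samples.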