Attention
The vector search and clustering algorithms in RAFT are being migrated to cuVS, a new library dedicated to vector search. We will continue to support the vector search algorithms in RAFT during this move, but will no longer update them after the RAPIDS 24.06 (June) release. We plan to complete the migration by the RAPIDS 24.08 (August) release.
Cluster
This page provides pylibraft class references for the publicly exposed elements of the pylibraft.cluster package.
KMeans
- class pylibraft.cluster.kmeans.KMeansParams(n_clusters: int | None = None, max_iter: int | None = None, tol: float | None = None, verbosity: int | None = None, seed: int | None = None, metric: str | None = None, init: InitMethod | None = None, n_init: int | None = None, oversampling_factor: float | None = None, batch_samples: int | None = None, batch_centroids: int | None = None, inertia_check: bool | None = None)
Specifies hyper-parameters for the kmeans algorithm.
- Parameters:
- n_clusters : int, optional
The number of clusters to form, which is also the number of centroids to generate.
- max_iter : int, optional
Maximum number of iterations of the k-means algorithm for a single run.
- tol : float, optional
Relative tolerance with regard to inertia used to declare convergence.
- verbosity : int, optional
- seed : int, optional
Seed for the random number generator.
- metric : str, optional
Metric name to use for distance computation; see pylibraft.distance.pairwise_distance() for valid values.
- init : InitMethod, optional
- n_init : int, optional
Number of times the k-means algorithm will be run with different seeds.
- oversampling_factor : float, optional
Oversampling factor for use in the k-means algorithm.
- Attributes:
- batch_centroids
- batch_samples
- inertia_check
- init
- max_iter
- n_clusters
- oversampling_factor
- seed
- tol
- verbosity
- pylibraft.cluster.kmeans.fit(KMeansParams params, X, centroids=None, sample_weights=None, handle=None)
Find clusters with the k-means algorithm
- Parameters:
- params : KMeansParams
Parameters to use to fit KMeans model
- X : Input CUDA array interface compliant matrix, shape (m, k)
- centroids : Optional writable CUDA array interface compliant matrix, shape (n_clusters, k)
- sample_weights : Optional input CUDA array interface compliant matrix, shape (n_clusters, 1). Default: None
- handle : Optional RAFT resource handle for reusing CUDA resources.
If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.
- Returns:
- centroids : raft.device_ndarray
The computed centroids for each cluster
- inertia : float
Sum of squared distances of samples to their closest cluster center
- n_iter : int
The number of iterations used to fit the model
Examples
>>> import cupy as cp
>>> from pylibraft.cluster.kmeans import fit, KMeansParams
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> params = KMeansParams(n_clusters=n_clusters)
>>> centroids, inertia, n_iter = fit(params, X)
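To make the three return values of fit concrete, the following is a minimal pure-Python sketch of the textbook Lloyd iteration on the CPU. It is not pylibraft's implementation: the names lloyd_kmeans, assign, and squared_distance are hypothetical, the initial centroids are simply sampled from the data rather than chosen by an InitMethod, squared Euclidean distance is assumed, and the stopping rule (relative drop in inertia at most tol) is only one plausible reading of the tol parameter.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(X, centroids):
    """Label each sample with its nearest centroid and accumulate inertia."""
    labels, inertia = [], 0.0
    for row in X:
        dists = [squared_distance(row, c) for c in centroids]
        best = min(range(len(centroids)), key=dists.__getitem__)
        labels.append(best)
        inertia += dists[best]
    return labels, inertia


def lloyd_kmeans(X, n_clusters, max_iter=300, tol=1e-4, seed=0):
    """Hypothetical CPU reference; returns (centroids, inertia, n_iter)."""
    rng = random.Random(seed)
    # Assumption: initialize from distinct randomly chosen samples.
    centroids = [list(row) for row in rng.sample(X, n_clusters)]
    labels, inertia = assign(X, centroids)
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # Update step: each centroid becomes the mean of its members.
        for j in range(n_clusters):
            members = [row for row, lab in zip(X, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        prev_inertia = inertia
        labels, inertia = assign(X, centroids)
        # Assumed convergence test: relative decrease in inertia within tol.
        if prev_inertia - inertia <= tol * prev_inertia:
            break
    return centroids, inertia, n_iter


# Two well-separated groups of three points each.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
     [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, inertia, n_iter = lloyd_kmeans(X, n_clusters=2)
print(centroids, inertia, n_iter)
```

The returned inertia is always recomputed against the returned centroids, so the two values stay consistent with each other even when the loop stops at max_iter.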
- pylibraft.cluster.kmeans.cluster_cost(X, centroids, handle=None)
Compute cluster cost given an input matrix and existing centroids
- Parameters:
- X : Input CUDA array interface compliant matrix, shape (m, k)
- centroids : Input CUDA array interface compliant matrix, shape (n_clusters, k)
- handle : Optional RAFT resource handle for reusing CUDA resources.
If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.
Examples
>>> import cupy as cp
>>> from pylibraft.cluster.kmeans import cluster_cost
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> centroids = cp.random.random_sample((n_clusters, n_features),
...                                     dtype=cp.float32)
>>> inertia = cluster_cost(X, centroids)
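As a point of reference for what the cost represents, here is a small CPU sketch of the quantity described above as inertia: the sum, over all samples, of the squared distance to the closest centroid. The function name cluster_cost_reference is hypothetical, and squared Euclidean distance is an assumption; this is meant for checking intuition on tiny inputs, not as a description of how pylibraft computes the value.

```python
def cluster_cost_reference(X, centroids):
    """Sum of squared Euclidean distances from each sample to its nearest centroid."""
    total = 0.0
    for row in X:
        total += min(
            sum((x - c) ** 2 for x, c in zip(row, centroid))
            for centroid in centroids
        )
    return total


X = [[0.0, 0.0], [1.0, 0.0], [4.0, 4.0]]
centroids = [[0.0, 0.0], [4.0, 3.0]]

# Nearest-centroid squared distances are 0, 1, and 1 respectively.
cost = cluster_cost_reference(X, centroids)
print(cost)  # 2.0
```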
- pylibraft.cluster.kmeans.compute_new_centroids(X, centroids, labels, new_centroids, sample_weights=None, weight_per_cluster=None, handle=None)
Compute new centroids given an input matrix and existing centroids
- Parameters:
- X : Input CUDA array interface compliant matrix, shape (m, k)
- centroids : Input CUDA array interface compliant matrix, shape (n_clusters, k)
- labels : Input CUDA array interface compliant matrix, shape (m, 1)
- new_centroids : Writable CUDA array interface compliant matrix, shape (n_clusters, k)
- sample_weights : Optional input CUDA array interface compliant matrix, shape (n_clusters, 1). Default: None
- weight_per_cluster : Optional writable CUDA array interface compliant matrix, shape (n_clusters, 1). Default: None
- batch_samples : Optional integer specifying the batch size for X to compute distances in batches. Default: m
- batch_centroids : Optional integer specifying the batch size for centroids to compute distances in batches. Default: n_clusters
- handle : Optional RAFT resource handle for reusing CUDA resources.
If a handle isn’t supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If a handle is supplied, you will need to explicitly synchronize yourself by calling handle.sync() before accessing the output.
Examples
>>> import cupy as cp
>>> from pylibraft.common import Handle
>>> from pylibraft.cluster.kmeans import compute_new_centroids
>>> # A single RAFT handle can optionally be reused across
>>> # pylibraft functions.
>>> handle = Handle()
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> centroids = cp.random.random_sample((n_clusters, n_features),
...                                     dtype=cp.float32)
>>> labels = cp.random.randint(0, high=n_clusters, size=n_samples,
...                            dtype=cp.int32)
>>> new_centroids = cp.empty((n_clusters, n_features),
...                          dtype=cp.float32)
>>> compute_new_centroids(
...     X, centroids, labels, new_centroids, handle=handle
... )
>>> # pylibraft functions are often asynchronous so the
>>> # handle needs to be explicitly synchronized
>>> handle.sync()
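The centroid update itself can be pictured with a short CPU sketch: each new centroid is the mean of the samples whose label points at it. The function name new_centroids_reference is hypothetical, sample weights are ignored for simplicity, and keeping the previous centroid for a cluster with no members is an assumption about a sensible fallback rather than documented pylibraft behavior.

```python
def new_centroids_reference(X, centroids, labels):
    """Unweighted centroid update: mean of the samples assigned to each cluster."""
    n_clusters = len(centroids)
    n_features = len(centroids[0])
    sums = [[0.0] * n_features for _ in range(n_clusters)]
    counts = [0] * n_clusters
    for row, label in zip(X, labels):
        counts[label] += 1
        for d, value in enumerate(row):
            sums[label][d] += value
    new_centroids = []
    for j in range(n_clusters):
        if counts[j] > 0:
            new_centroids.append([s / counts[j] for s in sums[j]])
        else:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(list(centroids[j]))
    return new_centroids


X = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
centroids = [[1.0, 1.0], [9.0, 9.0], [5.0, 5.0]]
labels = [0, 0, 1]  # cluster 2 receives no samples

updated = new_centroids_reference(X, centroids, labels)
print(updated)  # [[1.0, 0.0], [10.0, 10.0], [5.0, 5.0]]
```

Unlike the pylibraft function, which writes into a preallocated new_centroids array, this sketch returns a new list so the inputs are left untouched.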