Attention

The vector search and clustering algorithms in RAFT are being migrated to a new library dedicated to vector search called cuVS. We will continue to support the vector search algorithms in RAFT during this move, but will no longer update them after the RAPIDS 24.06 (June) release. We plan to complete the migration by RAPIDS 24.08 (August) release.

Hierarchical Clustering#

#include <raft/cluster/single_linkage.cuh>

enum raft::cluster::hierarchy::LinkageDistance#

Determines the method for computing the minimum spanning tree (MST)

Values:

enumerator PAIRWISE#

Use a pairwise distance matrix as input to the mst. This is very fast and the best option for fairly small datasets (~50k data points)

enumerator KNN_GRAPH#

Construct a KNN graph as input to the mst and provide additional edges if the mst does not converge. This is slower but scales to very large datasets.

constexpr int raft::cluster::hierarchy::DEFAULT_CONST_C = 15#
template<typename value_t, typename idx_t, LinkageDistance dist_type = LinkageDistance::KNN_GRAPH>
void raft::cluster::hierarchy::single_linkage(raft::resources const &handle, raft::device_matrix_view<const value_t, idx_t, row_major> X, raft::device_matrix_view<idx_t, idx_t, row_major> dendrogram, raft::device_vector_view<idx_t, idx_t> labels, raft::distance::DistanceType metric, size_t n_clusters, std::optional<int> c = std::make_optional<int>(DEFAULT_CONST_C))#

Single-linkage clustering, capable of constructing a KNN graph to scale the algorithm beyond the n^2 memory consumption of implementations that use the fully-connected graph of pairwise distances by connecting a knn graph when k is not large enough to connect it.

Template Parameters:
  • value_idx

  • value_t

  • dist_type – method to use for constructing connectivities graph

Parameters:
  • handle[in] raft handle

  • X[in] dense input matrix in row-major layout

  • dendrogram[out] output dendrogram (size [n_rows - 1] * 2)

  • labels[out] output labels vector (size n_rows)

  • metric[in] distance metrix to use when constructing connectivities graph

  • n_clusters[in] number of clusters to assign data samples

  • c[in] a constant used when constructing connectivities from knn graph. Allows the indirect control of k. The algorithm will set k = log(n) + c