UMAP#
- class cuml.manifold.UMAP(*, n_neighbors=15, n_components=2, metric='euclidean', metric_kwds=None, n_epochs=None, learning_rate=1.0, min_dist=0.1, spread=1.0, set_op_mix_ratio=1.0, local_connectivity=1.0, repulsion_strength=1.0, negative_sample_rate=5, transform_queue_size=4.0, init='spectral', a=None, b=None, target_n_neighbors=-1, target_weight=0.5, target_metric='categorical', hash_input=False, random_state=None, force_serial_epochs=False, precomputed_knn=None, callback=None, build_algo='auto', build_kwds=None, device_ids=None, verbose=False, output_type=None)#
Uniform Manifold Approximation and Projection
Finds a low dimensional embedding of the data that approximates an underlying manifold.
Adapted from lmcinnes/umap (umap.py)
The UMAP algorithm is outlined in [1]. This implementation follows the GPU-accelerated version as described in [2].
- Parameters:
- n_neighbors: float (optional, default 15)
The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.
- n_components: int (optional, default 2)
The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.
- metric: string (default=’euclidean’)
Distance metric to use. Supported distances are [‘l1’, ‘cityblock’, ‘taxicab’, ‘manhattan’, ‘euclidean’, ‘l2’, ‘sqeuclidean’, ‘canberra’, ‘minkowski’, ‘chebyshev’, ‘linf’, ‘cosine’, ‘correlation’, ‘hellinger’, ‘hamming’, ‘jaccard’]. Metrics that take arguments (such as minkowski) can have arguments passed via the metric_kwds dictionary. Note: The ‘jaccard’ distance metric is only supported for sparse inputs. Note: If build_algo='brute_force_knn' and knn_n_clusters > 1, the metric must be one of [‘l2’, ‘sqeuclidean’, ‘euclidean’, ‘cosine’, ‘inner_product’].
- metric_kwds: dict (optional, default=None)
Additional keyword arguments for the metric function (e.g., p for the ‘minkowski’ metric).
- n_epochs: int (optional, default None)
The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).
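As an illustration of the selection rule above, a minimal sketch follows. Note that the exact dataset-size cutoff (here 10,000 rows) is an assumed value for illustration only; the documentation above does not state the actual threshold.

```python
def default_n_epochs(n_samples: int) -> int:
    """Pick a default epoch count: 500 for small datasets, 200 for
    large ones. The 10_000-row threshold is a hypothetical value."""
    return 500 if n_samples <= 10_000 else 200
```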
- learning_rate: float (optional, default 1.0)
The initial learning rate for the embedding optimization.
- init: string (optional, default ‘spectral’)
How to initialize the low dimensional embedding. Options are:
‘spectral’: use a spectral embedding of the fuzzy 1-skeleton
‘random’: assign initial embedding positions at random.
An array-like with initial embedding positions.
- min_dist: float (optional, default 0.1)
The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.
- spread: float (optional, default 1.0)
The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.
- set_op_mix_ratio: float (optional, default 1.0)
Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial sets. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
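The interpolation above can be sketched with NumPy on a small membership-strength matrix. This is an illustration of the formula, not cuML's internal code; the function name combine_fuzzy_sets is invented for this sketch.

```python
import numpy as np

def combine_fuzzy_sets(A, mix_ratio):
    """Symmetrize a fuzzy simplicial set A by interpolating between
    fuzzy union and fuzzy intersection (both via the product t-norm),
    as set_op_mix_ratio does. mix_ratio=1.0 -> pure union,
    mix_ratio=0.0 -> pure intersection."""
    prod = A * A.T                  # fuzzy intersection (product t-norm)
    union = A + A.T - prod          # fuzzy union (probabilistic sum)
    return mix_ratio * union + (1.0 - mix_ratio) * prod

# Directed memberships: 0 -> 1 with strength 0.8, 1 -> 0 with strength 0.5
A = np.array([[0.0, 0.8],
              [0.5, 0.0]])
pure_union = combine_fuzzy_sets(A, 1.0)         # 0.8 + 0.5 - 0.4 = 0.9
pure_intersection = combine_fuzzy_sets(A, 0.0)  # 0.8 * 0.5 = 0.4
```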
- local_connectivity: int (optional, default 1)
The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.
- repulsion_strength: float (optional, default 1.0)
Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.
- negative_sample_rate: int (optional, default 5)
The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.
- transform_queue_size: float (optional, default 4.0)
For transform operations (embedding new points using a trained model), this will control how aggressively to search for nearest neighbors. Larger values will result in slower performance but more accurate nearest neighbor evaluation.
- a: float (optional, default None)
More specific parameters controlling the embedding. If None, these values are set automatically as determined by min_dist and spread.
- b: float (optional, default None)
More specific parameters controlling the embedding. If None, these values are set automatically as determined by min_dist and spread.
- target_n_neighbors: int (optional, default=-1)
The number of nearest neighbors to use to construct the target simplicial set. If set to -1, use the n_neighbors value.
- target_metric: string (optional, default=’categorical’)
The metric used to measure distance for a target array when using supervised dimension reduction. By default this is ‘categorical’, which will measure distance in terms of whether categories match or are different. Furthermore, if semi-supervised learning is required, target values of -1 will be treated as unlabelled under the ‘categorical’ metric. If the target array takes continuous values (e.g. for a regression problem), then a metric of ‘euclidean’ or ‘l2’ is probably more appropriate.
- target_weight: float (optional, default=0.5)
Weighting factor between data topology and target topology. A value of 0.0 weights predominantly on data, a value of 1.0 places a strong emphasis on target. The default of 0.5 balances the weighting equally between data and target.
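The automatic selection of the a and b parameters above is a curve fit driven by min_dist and spread. The sketch below mirrors the reference umap-learn find_ab_params routine and assumes SciPy is available; it illustrates the mechanism and is not cuML's exact implementation.

```python
import numpy as np
from scipy.optimize import curve_fit

def find_ab_params(spread=1.0, min_dist=0.1):
    """Fit a, b of the low-dimensional membership curve
    1 / (1 + a * d^(2b)) to a piecewise target: 1 inside min_dist,
    exponential decay (scaled by spread) beyond it."""
    def curve(x, a, b):
        return 1.0 / (1.0 + a * x ** (2 * b))

    xv = np.linspace(0, spread * 3, 300)
    yv = np.zeros_like(xv)
    yv[xv < min_dist] = 1.0
    yv[xv >= min_dist] = np.exp(-(xv[xv >= min_dist] - min_dist) / spread)
    (a, b), _ = curve_fit(curve, xv, yv)
    return a, b

a, b = find_ab_params(spread=1.0, min_dist=0.1)
```

With the defaults, the fit lands near a ≈ 1.58 and b ≈ 0.90, which is why a small min_dist produces tightly clumped embeddings.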
- hash_input: bool, optional (default = False)
UMAP can hash the training input so that exact embeddings are returned when transform is called on the same data upon which the model was trained. This enables consistent behavior between calling model.fit_transform(X) and calling model.fit(X).transform(X). Note that the CPU-based UMAP reference implementation does this by default. This feature is made optional in the GPU version due to the significant overhead of copying memory to the host to compute the hash.
- precomputed_knn: array / sparse array / tuple, optional (device or host)
Either a tuple (indices, distances) of arrays of shape (n_samples, n_neighbors), a pairwise distances dense array of shape (n_samples, n_samples), or a KNN graph sparse array (preferably CSR/COO). This feature allows the precomputation of the KNN outside of UMAP and also allows the use of a custom distance function. This function should match the metric used to train the UMAP embeddings.
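To illustrate the (indices, distances) tuple form, here is a toy CPU brute-force KNN in NumPy that produces arrays of the expected shape. This is illustrative only; in practice you would precompute the KNN with a GPU nearest-neighbors routine using the same metric as the UMAP model.

```python
import numpy as np

def brute_force_knn(X, n_neighbors):
    """Toy KNN returning the (indices, distances) tuple shape that
    precomputed_knn accepts, using squared euclidean distances."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    indices = np.argsort(d2, axis=1)[:, :n_neighbors]
    distances = np.take_along_axis(d2, indices, axis=1)
    return indices, distances  # each of shape (n_samples, n_neighbors)

X = np.array([[0.0], [1.0], [10.0]])
indices, distances = brute_force_knn(X, n_neighbors=2)
# The tuple (indices, distances) could then be passed as precomputed_knn,
# with metric='sqeuclidean' to match.
```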
- random_state: int, RandomState instance or None, optional (default=None)
Seed used by the random number generator for embedding initialization and optimizer sampling. Setting a random_state enables reproducible embeddings, but at the cost of slower training and increased memory usage. This is because high parallelism during optimization involves non-deterministic floating-point addition ordering.
Note: Explicitly setting build_algo='nn_descent' will break reproducibility, as NN Descent produces non-deterministic KNN graphs.
- force_serial_epochs: bool, optional (default=False)
If True, optimization epochs will be executed with reduced GPU parallelism. This is only relevant when random_state is set. Enable this if you observe outliers in the resulting embeddings with random_state configured. This may slow the optimization step by more than 2x, but end-to-end runtime is typically similar since the optimization step is not the bottleneck. Use this to resolve rare edge cases where the default heuristics do not trigger.
- callback: an instance of the GraphBasedDimRedCallback class
Used to intercept the internal state of embeddings while they are being trained. Example of callback usage:
from cuml.internals import GraphBasedDimRedCallback

class CustomCallback(GraphBasedDimRedCallback):
    def on_preprocess_end(self, embeddings):
        print(embeddings.copy_to_host())

    def on_epoch_end(self, embeddings):
        print(embeddings.copy_to_host())

    def on_train_end(self, embeddings):
        print(embeddings.copy_to_host())
- verbose: int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- output_type: {‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
- build_algo: string (default=’auto’)
How to build the knn graph. Supported build algorithms are [‘auto’, ‘brute_force_knn’, ‘nn_descent’]. ‘auto’ runs with brute force knn if the number of data rows is smaller than or equal to 50K; otherwise, it runs with nn descent.
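The ‘auto’ rule above can be sketched as a simple dispatch (an illustration of the documented heuristic, not cuML's internal code):

```python
def resolve_build_algo(n_rows: int, build_algo: str = "auto") -> str:
    """Sketch of the 'auto' selection rule: brute force up to 50K
    rows, NN Descent beyond that; explicit choices pass through."""
    if build_algo != "auto":
        return build_algo
    return "brute_force_knn" if n_rows <= 50_000 else "nn_descent"
```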
- build_kwds: dict (optional, default=None)
Dictionary of parameters to configure the build algorithm. Default values:
- nnd_graph_degree (int, default=64): Graph degree used for NN Descent. Must be ≥ n_neighbors.
- nnd_intermediate_graph_degree (int, default=128): Intermediate graph degree for NN Descent. Must be > nnd_graph_degree.
- nnd_max_iterations (int, default=20): Max NN Descent iterations.
- nnd_termination_threshold (float, default=0.0001): A stricter threshold leads to better convergence but longer runtime.
- knn_n_clusters (int, default=1): Number of clusters for data partitioning. Higher values reduce memory usage at the cost of accuracy. When knn_n_clusters > 1, UMAP can process data larger than device memory.
- knn_overlap_factor (int, default=2): Number of clusters each data point belongs to. Valid only when knn_n_clusters > 1. Must be < knn_n_clusters.
Hints:
Increasing nnd_graph_degree and nnd_max_iterations may improve accuracy.
The ratio knn_overlap_factor / knn_n_clusters impacts memory usage. Approximately (knn_overlap_factor / knn_n_clusters) * num_rows_in_entire_data rows will be loaded onto device memory at once. E.g., 2/20 uses less device memory than 2/10.
A larger knn_overlap_factor results in better accuracy of the final knn graph. E.g., while using a similar amount of device memory, (knn_overlap_factor / knn_n_clusters) = 4/20 will have better accuracy than 2/10, at the cost of performance.
Start with knn_overlap_factor = 2 and gradually increase it (2 -> 3 -> 4 …) for better accuracy.
Start with knn_n_clusters = 4 and increase it (4 -> 8 -> 16 …) for less GPU memory usage. This is independent of knn_overlap_factor as long as knn_overlap_factor < knn_n_clusters.
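The memory estimate in the hints above is simple arithmetic; a small helper makes the trade-off concrete (illustrative only, the function name is invented for this sketch):

```python
def approx_rows_on_device(n_rows: int, knn_overlap_factor: int,
                          knn_n_clusters: int) -> float:
    """Rough estimate of rows resident in device memory at once when
    the KNN build is partitioned, per the hint above."""
    assert 1 <= knn_overlap_factor < knn_n_clusters
    return (knn_overlap_factor / knn_n_clusters) * n_rows
```

For a 1M-row dataset, 2/20 keeps roughly 100K rows on device at once, while 2/10 keeps roughly 200K, matching the hint that 2/20 uses less device memory than 2/10.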
- device_ids: list[int], “all”, or None, default=None
The device IDs to use during fitting (only used when build_algo='nn_descent' and knn_n_clusters > 1). May be a list of ids, "all" (to use all available devices), or None (to fit using a single GPU only).
- Attributes:
- embedding_: array of shape (n_samples, n_components)
The low-dimensional embedding of the training data.
Methods
fit(self, X[, y, convert_dtype, knn_graph])
    Fit X into an embedded space.
fit_transform(self, X[, y, convert_dtype, ...])
    Fit X into an embedded space and return that transformed output.
inverse_transform(self, X, *[, convert_dtype])
    Transform X in the existing embedded space back into the input data space and return that transformed output.
transform(self, X, *[, convert_dtype])
    Transform X into the existing embedded space and return that transformed output.
Notes
This module is heavily based on Leland McInnes’ reference UMAP package. However, there are a number of differences and features that are not yet implemented in cuml.umap:
Using a pre-computed pairwise distance matrix (under consideration for future releases)
Manual initialization of initial embedding positions
In addition to these missing features, you should expect the final embeddings to differ between cuml.umap and the reference UMAP.
References
- fit(self, X, y=None, *, convert_dtype=True, knn_graph=None) -> ‘UMAP’ [source]#
Fit X into an embedded space.
- Parameters:
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the method will automatically convert the inputs to np.float32.
- knn_graph: array / sparse array / tuple, optional (device or host)
Either a tuple (indices, distances) of arrays of shape (n_samples, n_neighbors), a pairwise distances dense array of shape (n_samples, n_samples), or a KNN graph sparse array (preferably CSR/COO). This feature allows the precomputation of the KNN outside of UMAP and also allows the use of a custom distance function. This function should match the metric used to train the UMAP embeddings. Takes precedence over the precomputed_knn parameter.
- fit_transform(self, X, y=None, *, convert_dtype=True, knn_graph=None) -> CumlArray [source]#
Fit X into an embedded space and return that transformed output.
There is a subtle difference between calling fit_transform(X) and calling fit(X).transform(X). Calling fit_transform(X) will train the embeddings on X and return the embeddings. Calling fit(X).transform(X) will train the embeddings on X and then run a second optimization during the transform step, so the results can differ slightly.
- Parameters:
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the method will automatically convert the inputs to np.float32.
- knn_graph: array / sparse array / tuple, optional (device or host)
Either a tuple (indices, distances) of arrays of shape (n_samples, n_neighbors), a pairwise distances dense array of shape (n_samples, n_samples), or a KNN graph sparse array (preferably CSR/COO). This feature allows the precomputation of the KNN outside of UMAP and also allows the use of a custom distance function. This function should match the metric used to train the UMAP embeddings. Takes precedence over the precomputed_knn parameter.
- Returns:
- X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_components)
Embedding of the data in low-dimensional space.
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
- inverse_transform(self, X, *, convert_dtype=True) -> CumlArray [source]#
Transform X in the existing embedded space back into the input data space and return that transformed output.
- Parameters:
- Xarray-like (device or host) shape = (n_samples, n_components)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the method will automatically convert the inputs to np.float32.
- Returns:
- X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_features)
Generated data points in data space.
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
- transform(self, X, *, convert_dtype=True) -> CumlArray [source]#
Transform X into the existing embedded space and return that transformed output.
Please refer to the reference UMAP implementation for information on the differences between fit_transform() and running fit().transform().
Specifically, the transform() function is stochastic: see lmcinnes/umap#158.
- Parameters:
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the method will automatically convert the inputs to np.float32.
- Returns:
- X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_components)
Embedding of the data in low-dimensional space.
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.