simplicial_set_embedding

cuml.manifold.umap.simplicial_set_embedding(data, graph, n_components=2, initial_alpha=1.0, a=None, b=None, gamma=1.0, negative_sample_rate=5, n_epochs=None, init='spectral', random_state=None, force_serial_epochs=None, metric='euclidean', metric_kwds=None, output_metric='euclidean', output_metric_kwds=None, convert_dtype=True, verbose=False)

Perform a fuzzy simplicial set embedding, using a specified initialization method and then minimizing the fuzzy set cross entropy between the 1-skeletons of the high- and low-dimensional fuzzy simplicial sets.
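
As a rough illustration of the objective named above, the sketch below evaluates the fuzzy set cross entropy in the form given in the UMAP paper, for two sets of edge membership strengths. The function name, the list-based inputs, and the eps clamp are hypothetical choices for this example; the optimizer minimizes this quantity by SGD with negative sampling and does not evaluate it directly like this.

```python
import math


def fuzzy_set_cross_entropy(high, low, eps=1e-12):
    """Cross entropy between two fuzzy sets over the same edges.

    `high` and `low` are equal-length sequences of membership strengths
    in [0, 1] (hypothetical inputs, for illustration only).
    """
    total = 0.0
    for v, w in zip(high, low):
        # Clamp away from 0 and 1 so the logarithms stay finite.
        v = min(max(v, eps), 1.0 - eps)
        w = min(max(w, eps), 1.0 - eps)
        # Attractive term (edge present) plus repulsive term (edge absent).
        total += v * math.log(v / w) + (1.0 - v) * math.log((1.0 - v) / (1.0 - w))
    return total


# Identical memberships give zero; any mismatch gives a positive value.
same = fuzzy_set_cross_entropy([0.9, 0.1], [0.9, 0.1])
different = fuzzy_set_cross_entropy([0.9, 0.1], [0.5, 0.5])
```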

Parameters:
data: array of shape (n_samples, n_features)

The source data to be embedded by UMAP.

graph: sparse matrix

The 1-skeleton of the high dimensional fuzzy simplicial set, represented as a graph whose (weighted) adjacency matrix is given as a sparse matrix.

Note: When force_serial_epochs is enabled (either explicitly or via the auto-default for init='spectral' with n_components <= 512), the COO matrix needs to be sorted by row for the internal CSR conversion. If it is not sorted, it will be sorted internally. To avoid the extra sort, pass a row-sorted COO matrix.
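
To make "row-sorted COO" concrete, here is a minimal pure-Python sketch that checks and, if needed, reorders COO triplets held in plain lists. The helper names are hypothetical and are not part of cuML. With a real sparse matrix the same reordering would be applied to its row, column, and data arrays. Sorting by column within each row is an extra choice made here for determinism; the note above only asks for row order.

```python
def is_row_sorted(rows):
    """Return True if the COO row indices are in non-decreasing order."""
    return all(a <= b for a, b in zip(rows, rows[1:]))


def sort_coo_by_row(rows, cols, vals):
    """Return (rows, cols, vals) reordered by row, then by column."""
    order = sorted(range(len(rows)), key=lambda i: (rows[i], cols[i]))
    return (
        [rows[i] for i in order],
        [cols[i] for i in order],
        [vals[i] for i in order],
    )


# Unsorted COO triplets for a small weighted graph.
rows, cols, vals = [2, 0, 1, 0], [1, 2, 0, 1], [0.3, 0.8, 0.5, 0.9]
if not is_row_sorted(rows):
    rows, cols, vals = sort_coo_by_row(rows, cols, vals)
```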

n_components: int

The dimensionality of the Euclidean space into which to embed the data.

initial_alpha: float

Initial learning rate for the SGD.

a: float

Parameter of the differentiable approximation of the right adjoint functor.

b: float

Parameter of the differentiable approximation of the right adjoint functor.
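
For intuition about a and b: in the UMAP formulation, the differentiable approximation is the curve 1 / (1 + a * d^(2b)), which maps a distance d in the embedding to a membership strength. The sketch below assumes that form; the function name is hypothetical, and the values a=1, b=1 are chosen only to keep the arithmetic easy to follow.

```python
def low_dim_membership(distance, a, b):
    """Membership strength of two embedded points at the given distance.

    Assumes the standard UMAP curve 1 / (1 + a * d**(2 * b)).
    """
    return 1.0 / (1.0 + a * distance ** (2.0 * b))


# Strength is 1 at distance 0 and decays as the points move apart.
at_zero = low_dim_membership(0.0, a=1.0, b=1.0)
at_one = low_dim_membership(1.0, a=1.0, b=1.0)
at_two = low_dim_membership(2.0, a=1.0, b=1.0)
```

Larger a or b makes the curve fall off faster, so points must sit closer together to count as strongly connected.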

gamma: float

Weight to apply to negative samples.

negative_sample_rate: int (optional, default 5)

The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.

n_epochs: int (optional, default None)

The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None or 0 is specified, a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).

init: string or array-like

How to initialize the low dimensional embedding. Options are:
  • ‘spectral’: use a spectral embedding of the fuzzy 1-skeleton

  • ‘random’: assign initial embedding positions at random.

  • An array-like with initial embedding positions.

Note: When init='spectral' and n_components <= 512, force_serial_epochs defaults to True because spectral initialization is more susceptible to outlier artifacts. Pass force_serial_epochs=False explicitly to disable and use the faster parallel batch kernel.

random_state: numpy RandomState or equivalent

A state capable of being used as a numpy random state.

force_serial_epochs: bool or None, optional (default=None)

Controls whether optimization epochs use the sequential (reduced GPU parallelism) kernel. When None (the default), serial epochs are enabled automatically for init='spectral' with n_components <= 512 because spectral initialization is more susceptible to outlier artifacts; for n_components > 512 the auto-default falls back to False since the serial kernel does not support that range. Pass True to force serial epochs regardless of init (only supported for n_components <= 512; otherwise a ValueError is raised), or False to disable them.
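
The decision rules described above can be summarized in a small sketch. The helper below is hypothetical and is not cuML's internal code; it only restates the documented behavior, and it assumes that the auto-default is False for any init other than 'spectral'.

```python
def resolve_force_serial_epochs(force_serial_epochs, init, n_components):
    """Return whether serial epochs would be used, per the documented rules."""
    max_serial_components = 512  # limit stated in the documentation

    if force_serial_epochs is None:
        # Auto-default: on only for spectral init within the supported range.
        # init may also be an array-like, so check that it is a string first.
        is_spectral = isinstance(init, str) and init == "spectral"
        return is_spectral and n_components <= max_serial_components

    if force_serial_epochs and n_components > max_serial_components:
        raise ValueError(
            "force_serial_epochs=True is only supported for n_components <= 512"
        )
    return bool(force_serial_epochs)


auto_small = resolve_force_serial_epochs(None, "spectral", 2)
auto_large = resolve_force_serial_epochs(None, "spectral", 1024)
opted_out = resolve_force_serial_epochs(False, "spectral", 2)
```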

metric: string (optional, default 'euclidean')

Distance metric to use. Supported distances are [‘l1’, ‘cityblock’, ‘taxicab’, ‘manhattan’, ‘euclidean’, ‘l2’, ‘sqeuclidean’, ‘canberra’, ‘minkowski’, ‘chebyshev’, ‘linf’, ‘cosine’, ‘correlation’, ‘hellinger’, ‘hamming’, ‘jaccard’]. Metrics that take arguments (such as minkowski) can have arguments passed via the metric_kwds dictionary. Note: The ‘jaccard’ distance metric is only supported for sparse inputs.

metric_kwds: dict (optional, default=None)

Arguments to pass to the metric, for metrics that take them (such as minkowski).
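
As an illustration of what a metric argument is, the sketch below computes a Minkowski distance in pure Python and unpacks a metric_kwds-style dictionary into it. The function is hypothetical, and the key name "p" is an assumption borrowed from common convention; this page does not list the keys each metric accepts.

```python
def minkowski_distance(x, y, p=2.0):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)


# Assumed key name "p"; check the library's own documentation before relying on it.
metric_kwds = {"p": 1}

manhattan_like = minkowski_distance([0, 0], [3, 4], **metric_kwds)
euclidean_like = minkowski_distance([0, 0], [3, 4], p=2)
```

With p=1 the result matches the Manhattan distance, and with p=2 it matches the Euclidean distance.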

output_metric: function

Function returning the distance between two points in embedding space and the gradient of the distance wrt the first argument.

output_metric_kwds: dict

Key word arguments to be passed to the output_metric function.

verbose: bool (optional, default False)

Whether to report information on the current progress of the algorithm.

Returns:
embedding: array of shape (n_samples, n_components)

The optimized embedding of graph into an n_components-dimensional Euclidean space.