TSNE#
- class cuml.manifold.TSNE(*, n_components=2, perplexity=30.0, early_exaggeration=12.0, late_exaggeration=1.0, learning_rate=200.0, max_iter=1000, n_iter_without_progress=300, min_grad_norm=1e-07, metric='euclidean', metric_params=None, init='random', random_state=None, method='fft', angle=0.5, n_neighbors=90, perplexity_max_iter=100, exaggeration_iter=250, pre_momentum=0.5, post_momentum=0.8, learning_rate_method='adaptive', square_distances=True, precomputed_knn=None, verbose=False, output_type=None)#
t-SNE (T-Distributed Stochastic Neighbor Embedding) is a powerful dimensionality reduction technique that aims to maintain local distances between data points. It is robust across a wide range of datasets and is used in many areas, including cancer research, music analysis and neural network weight visualization.
cuML’s t-SNE supports three algorithms: the original exact algorithm, the Barnes-Hut approximation and the fast Fourier transform interpolation approximation. The latter two are derived from CannyLabs’ open-source CUDA code and produce extremely fast embeddings when n_components = 2. The exact algorithm is more accurate, but too slow to use on large datasets.
- Parameters:
- n_components : int (default 2)
The output dimensionality size. Currently only 2 is supported.
- perplexity : float (default 30.0)
Larger datasets require a larger value. Consider trying perplexity values from 5 to 50 and comparing the resulting outputs.
- early_exaggeration : float (default 12.0)
Controls the space between clusters. Not critical to tune this.
- late_exaggeration : float (default 1.0)
Controls the space between clusters. It may be beneficial to increase this slightly to improve cluster separation. This is applied after exaggeration_iter iterations (FFT only).
- learning_rate : float (default 200.0)
The learning rate, usually in the range (10, 1000). If it is too high, the embedding may look like a cloud or ball of points.
- max_iter : int (default 1000)
The maximum number of iterations. More iterations generally produce a more stable and accurate final embedding.
- n_iter_without_progress : int (default 300)
Currently unused. Intended to terminate t-SNE early when the KL divergence becomes too small after some iterations.
- min_grad_norm : float (default 1e-07)
The minimum gradient norm for when t-SNE will terminate early. Used in the ‘exact’ and ‘fft’ algorithms. Consider reducing if the embeddings are unsatisfactory. It’s recommended to use a smaller value for smaller datasets.
- metric : str (default=’euclidean’)
Distance metric to use. Supported distances are [‘l1’, ‘cityblock’, ‘manhattan’, ‘euclidean’, ‘l2’, ‘sqeuclidean’, ‘minkowski’, ‘chebyshev’, ‘cosine’, ‘correlation’].
- init : str ‘random’ or ‘pca’ (default ‘random’)
Currently supports random or pca initialization.
- verbose : int or boolean (default=False)
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- random_state : int (default None)
Setting this can make repeated runs look more similar. Note, however, that this highly parallelized t-SNE implementation is not completely deterministic between runs, even with the same random_state.
- method : str ‘fft’, ‘barnes_hut’ or ‘exact’ (default ‘fft’)
‘barnes_hut’ and ‘fft’ are fast approximations. ‘exact’ is more accurate but slower.
- angle : float (default 0.5)
Valid values are between 0.0 and 1.0; larger values trade accuracy for speed. These values are generally set between 0.2 and 0.8. (Barnes-Hut only.)
- learning_rate_method : str ‘adaptive’, ‘none’ or None (default ‘adaptive’)
Either adaptive or None. ‘adaptive’ tunes the learning rate, early exaggeration, perplexity and n_neighbors automatically based on input size.
- n_neighbors : int (default 90)
The number of neighboring datapoints used in the attractive forces. Smaller values better preserve local structure, while larger values can improve global structure preservation. The default is 3 * perplexity (3 * 30 = 90).
- perplexity_max_iter : int (default 100)
The maximum number of iterations used in the search for the Gaussian bandwidths that match the target perplexity.
- exaggeration_iter : int (default 250)
The number of exaggeration iterations. Set this higher to promote the growth of clusters.
- pre_momentum : float (default 0.5)
The momentum used during the exaggeration iterations, when gradients are applied more forcefully.
- post_momentum : float (default 0.8)
The momentum used during the later phases, when gradients are applied less forcefully.
- square_distances : boolean (default=True)
Whether TSNE should square the distance values. When True, a kNN graph is computed internally using the provided metric and its distances are then squared. If a knn_graph is passed to the fit or fit_transform methods, all of its distances will be squared when True; for example, even if the knn_graph was obtained using the ‘sqeuclidean’ metric, the distances will still be squared. Note: this argument should likely be set to False for distance metrics other than ‘euclidean’ and ‘l2’.
- precomputed_knn : array / sparse array / tuple, optional (device or host)
Either one of a tuple (indices, distances) of arrays of shape (n_samples, n_neighbors), a pairwise distances dense array of shape (n_samples, n_samples), or a KNN graph sparse array (preferably CSR/COO). This feature allows the precomputation of the KNN outside of TSNE and also allows the use of a custom distance function. The distance function should match the metric used to train the TSNE embeddings.
- output_type : {‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
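The tuple form accepted by precomputed_knn can be illustrated with a short sketch. The helper below is not part of the cuML API; it is a hedged, brute-force NumPy construction of the (indices, distances) pair described above, assuming a squared-euclidean metric:

```python
import numpy as np

def brute_force_knn(X, n_neighbors):
    """Illustrative brute-force kNN returning the (indices, distances)
    tuple form described for precomputed_knn. Distances here are squared
    euclidean; match this to the metric you pass to TSNE."""
    # Pairwise squared euclidean distances via the expansion
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.maximum(d2, 0.0, out=d2)  # clamp tiny negatives from rounding
    idx = np.argsort(d2, axis=1)[:, :n_neighbors]
    dist = np.take_along_axis(d2, idx, axis=1)
    return idx.astype(np.int64), dist.astype(np.float32)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8)).astype(np.float32)
indices, distances = brute_force_knn(X, n_neighbors=15)
print(indices.shape, distances.shape)  # (100, 15) (100, 15)
```

In practice a GPU kNN (e.g. from cuML's own neighbors module) would be used instead of this O(n^2) loop-free but memory-hungry construction; the sketch only shows the expected shapes and ordering.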
- Attributes:
- embedding_ : array
Stores the embedding vectors.
- kl_divergence_ : float
- learning_rate_ : float
Effective learning rate.
- n_iter_ : int
Number of iterations run.
Methods
fit(self, X[, y, convert_dtype, knn_graph]): Fit X into an embedded space.
fit_transform(self, X[, y, convert_dtype, ...]): Fit X into an embedded space and return that transformed output.
References
[2]van der Maaten, L.J.P.; Hinton, G.E. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9:2579-2605, 2008.
[3]George C. Linderman, Manas Rachh, Jeremy G. Hoskins, Stefan Steinerberger, Yuval Kluger. Efficient Algorithms for t-distributed Stochastic Neighborhood Embedding.
Tip
Maaten and Linderman showed that t-SNE can be very sensitive to the starting conditions (i.e. random initialization) and that parallel versions of t-SNE can generate vastly different results between runs. You can run t-SNE multiple times to settle on the best configuration. Note that using the same random_state across runs does not guarantee similar results each time.
Note
The CUDA implementation is derived from the excellent CannyLabs open source implementation here: https://github.com/CannyLab/tsne-cuda/. The CannyLabs code is licensed according to the conditions in cuml/cpp/src/tsne/cannylabs_tsne_license.txt. A full description of their approach is available in their article t-SNE-CUDA: GPU-Accelerated t-SNE and its Applications to Modern Data (https://arxiv.org/abs/1807.11824).
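As background for the perplexity and perplexity_max_iter parameters above: t-SNE implementations typically find each point's Gaussian bandwidth (sigma) by a binary search so that the entropy of its neighbor distribution matches log(perplexity). The following NumPy sketch is illustrative only and is not cuML's implementation:

```python
import numpy as np

def find_sigma(sq_dists, target_perplexity, max_iter=100, tol=1e-5):
    """Binary-search the Gaussian bandwidth (sigma) for one point so that
    the entropy of its neighbor distribution matches log(perplexity).
    sq_dists: squared distances to the point's neighbors (self excluded)."""
    target_entropy = np.log(target_perplexity)
    beta, lo, hi = 1.0, 0.0, np.inf  # beta = 1 / (2 * sigma^2)
    for _ in range(max_iter):
        p = np.exp(-sq_dists * beta)
        p /= p.sum()
        entropy = -(p * np.log(np.maximum(p, 1e-12))).sum()
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:  # distribution too flat: sharpen it
            lo = beta
            beta = beta * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
        else:                         # distribution too peaked: widen it
            hi = beta
            beta = (lo + hi) / 2.0
    return np.sqrt(1.0 / (2.0 * beta))

rng = np.random.default_rng(0)
d2 = rng.uniform(0.5, 4.0, size=200)  # squared distances to 200 neighbors
sigma = find_sigma(d2, target_perplexity=30.0)
print(f"sigma = {sigma:.3f}")
```

This is the kind of search that perplexity_max_iter bounds: more iterations give bandwidths that match the target perplexity more precisely.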
- fit(self, X, y=None, *, convert_dtype=True, knn_graph=None) → ‘TSNE’[source]#
Fit X into an embedded space.
- Parameters:
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If the datatype is other than float or double, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then raise an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the method will automatically convert the inputs to np.float32.
- knn_graph : array / sparse array / tuple, optional (device or host)
Either one of a tuple (indices, distances) of arrays of shape (n_samples, n_neighbors), a pairwise distances dense array of shape (n_samples, n_samples), or a KNN graph sparse array (preferably CSR/COO). This feature allows the precomputation of the KNN outside of TSNE and also allows the use of a custom distance function. The distance function should match the metric used to train the TSNE embeddings. Takes precedence over the precomputed_knn parameter.
- fit_transform(self, X, y=None, *, convert_dtype=True, knn_graph=None) → CumlArray[source]#
Fit X into an embedded space and return that transformed output.
- Parameters:
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than float or double, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then raise an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If the datatype is other than float or double, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then raise an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the method will automatically convert the inputs to np.float32.
- Returns:
- X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_components)
Embedding of the data in low-dimensional space.
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
- property kl_divergence_#
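The kl_divergence_ property exposes the value of t-SNE's objective, the Kullback-Leibler divergence between the high-dimensional affinities P and the low-dimensional affinities Q. As a reference for what this quantity measures, here is a hedged NumPy sketch (not cuML's internal code):

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) = sum_ij P_ij * log(P_ij / Q_ij), the quantity t-SNE
    minimizes. P and Q are pairwise affinity matrices that each sum to 1."""
    P = np.maximum(P, eps)  # clamp to avoid log(0)
    Q = np.maximum(Q, eps)
    return float((P * np.log(P / Q)).sum())

# Toy affinities: normalize two random non-negative matrices with zero diagonals.
rng = np.random.default_rng(0)
P = rng.random((50, 50)); np.fill_diagonal(P, 0.0); P /= P.sum()
Q = rng.random((50, 50)); np.fill_diagonal(Q, 0.0); Q /= Q.sum()
print(round(kl_divergence(P, Q), 4))
```

Lower values indicate that the embedding's affinities Q match the input affinities P more closely; KL(P || P) is exactly zero.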