TSNE#
- class cuml.manifold.TSNE(*, n_components=2, perplexity=30.0, early_exaggeration=12.0, late_exaggeration=1.0, learning_rate=200.0, max_iter=1000, n_iter_without_progress=300, min_grad_norm=1e-07, metric='euclidean', metric_params=None, init='random', random_state=None, method='fft', angle=0.5, n_neighbors=90, perplexity_max_iter=100, exaggeration_iter=250, pre_momentum=0.5, post_momentum=0.8, learning_rate_method='adaptive', square_distances=True, precomputed_knn=None, verbose=False, output_type=None)#
t-SNE (T-Distributed Stochastic Neighbor Embedding) is a powerful dimensionality reduction technique that aims to maintain local distances between data points. It is robust across a wide range of datasets and is used in many areas, including cancer research, music analysis and neural network weight visualization.
cuML’s t-SNE supports three algorithms: the original exact algorithm, the Barnes-Hut approximation and the fast Fourier transform interpolation approximation. The latter two are derived from CannyLabs’ open-source CUDA code and produce extremely fast embeddings when n_components = 2. The exact algorithm is more accurate, but too slow to use on large datasets.
- Parameters:
- n_components : int (default 2)
The output dimensionality size. Currently only 2 is supported.
- perplexity : float (default 30.0)
Larger datasets require a larger value. Consider trying perplexity values from 5 to 50 and comparing the resulting outputs.
- early_exaggeration : float (default 12.0)
Controls the space between clusters. Not critical to tune this.
- late_exaggeration : float (default 1.0)
Controls the space between clusters. It may be beneficial to increase this slightly to improve cluster separation. This is applied after exaggeration_iter iterations (FFT only).
- learning_rate : float (default 200.0)
The learning rate, usually in the range (10, 1000). If it is too high, the embedding may look like a cloud or ball of points.
- max_iter : int (default 1000)
The maximum number of iterations. More iterations generally produce a more stable and accurate final embedding.
- n_iter_without_progress : int (default 300)
Currently unused. Intended to terminate t-SNE early when the KL divergence becomes too small after some iterations.
- min_grad_norm : float (default 1e-07)
The minimum gradient norm for when t-SNE will terminate early. Used in the ‘exact’ and ‘fft’ algorithms. Consider reducing if the embeddings are unsatisfactory. It’s recommended to use a smaller value for smaller datasets.
- metric : str (default=’euclidean’)
Distance metric to use. Supported distances are [‘l1’, ‘cityblock’, ‘manhattan’, ‘euclidean’, ‘l2’, ‘sqeuclidean’, ‘minkowski’, ‘chebyshev’, ‘cosine’, ‘correlation’].
- init : str ‘random’ or ‘pca’ (default ‘random’)
Currently supports random or pca initialization.
- verbose : int or boolean (default=False)
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- random_state : int (default None)
Setting this can make repeated runs look more similar. Note, however, that this highly parallelized t-SNE implementation is not completely deterministic between runs, even with the same random_state.
- method : str ‘fft’, ‘barnes_hut’ or ‘exact’ (default ‘fft’)
‘barnes_hut’ and ‘fft’ are fast approximations. ‘exact’ is more accurate but slower.
- angle : float (default 0.5)
Valid values are between 0.0 and 1.0; larger values trade accuracy for speed. These values are generally set between 0.2 and 0.8. (Barnes-Hut only.)
- learning_rate_method : str ‘adaptive’, ‘none’ or None (default ‘adaptive’)
Either adaptive or None. ‘adaptive’ tunes the learning rate, early exaggeration, perplexity and n_neighbors automatically based on input size.
- n_neighbors : int (default 90)
The number of neighboring datapoints used in the attractive forces. Smaller values better preserve local structure, while larger values can improve global structure preservation. The default is 3 * perplexity (3 * 30 = 90).
- perplexity_max_iter : int (default 100)
The maximum number of iterations used in the search for the Gaussian bandwidths that match the target perplexity.
- exaggeration_iter : int (default 250)
The number of exaggeration iterations. Set this higher to promote the growth of clusters.
- pre_momentum : float (default 0.5)
The momentum used during the exaggeration iterations, when gradients are applied more forcefully.
- post_momentum : float (default 0.8)
The momentum used during the later phases, when gradients are applied less forcefully.
- square_distances : boolean (default=True)
Whether TSNE should square the distance values. When True, a kNN graph is computed internally using the provided metric and its distances are then squared. If a knn_graph is passed to the fit or fit_transform methods, all of its distances will be squared when True; for example, even if the knn_graph was obtained using the ‘sqeuclidean’ metric, the distances will still be squared. Note: this argument should likely be set to False for distance metrics other than ‘euclidean’ and ‘l2’.
- precomputed_knn : array / sparse array / tuple, optional (device or host)
Either one of a tuple (indices, distances) of arrays of shape (n_samples, n_neighbors), a pairwise distances dense array of shape (n_samples, n_samples), or a KNN graph sparse array (preferably CSR/COO). This feature allows the precomputation of the KNN outside of TSNE and also allows the use of a custom distance function. The distance function should match the metric used to train the TSNE embeddings.
- output_type : {‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
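The tuple form accepted by precomputed_knn can be illustrated with a short sketch. The helper below is not part of the cuML API; it is a hedged, brute-force NumPy construction of the (indices, distances) pair described above, assuming a squared-euclidean metric:

```python
import numpy as np

def brute_force_knn(X, n_neighbors):
    """Illustrative brute-force kNN returning the (indices, distances)
    tuple form described for precomputed_knn. Distances here are squared
    euclidean; match this to the metric you pass to TSNE."""
    # Pairwise squared euclidean distances via the expansion
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.maximum(d2, 0.0, out=d2)  # clamp tiny negatives from rounding
    idx = np.argsort(d2, axis=1)[:, :n_neighbors]
    dist = np.take_along_axis(d2, idx, axis=1)
    return idx.astype(np.int64), dist.astype(np.float32)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8)).astype(np.float32)
indices, distances = brute_force_knn(X, n_neighbors=15)
print(indices.shape, distances.shape)  # (100, 15) (100, 15)
```

In practice a GPU kNN (e.g. from cuML's own neighbors module) would be used instead of this O(n^2) loop-free but memory-hungry construction; the sketch only shows the expected shapes and ordering.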
- Attributes:
- embedding_ : array
Stores the embedding vectors.
- kl_divergence_ : float
- learning_rate_ : float
Effective learning rate.
- n_iter_ : int
Number of iterations run.
Methods
fit(self, X[, y, convert_dtype, knn_graph]): Fit X into an embedded space.
fit_transform(self, X[, y, convert_dtype, ...]): Fit X into an embedded space and return that transformed output.
References
[2]van der Maaten, L.J.P.; Hinton, G.E. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9:2579-2605, 2008.
[3]George C. Linderman, Manas Rachh, Jeremy G. Hoskins, Stefan Steinerberger, Yuval Kluger. Efficient Algorithms for t-distributed Stochastic Neighborhood Embedding.
Tip
Maaten and Linderman showed that t-SNE can be very sensitive to the starting conditions (i.e. random initialization) and that parallel versions of t-SNE can generate vastly different results between runs. You can run t-SNE multiple times to settle on the best configuration. Note that using the same random_state across runs does not guarantee similar results each time.
Note
The CUDA implementation is derived from the excellent CannyLabs open source implementation here: https://github.com/CannyLab/tsne-cuda/. The CannyLabs code is licensed according to the conditions in cuml/cpp/src/tsne/cannylabs_tsne_license.txt. A full description of their approach is available in their article t-SNE-CUDA: GPU-Accelerated t-SNE and its Applications to Modern Data (https://arxiv.org/abs/1807.11824).
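As background for the perplexity and perplexity_max_iter parameters above: t-SNE implementations typically find each point's Gaussian bandwidth (sigma) by a binary search so that the entropy of its neighbor distribution matches log(perplexity). The following NumPy sketch is illustrative only and is not cuML's implementation:

```python
import numpy as np

def find_sigma(sq_dists, target_perplexity, max_iter=100, tol=1e-5):
    """Binary-search the Gaussian bandwidth (sigma) for one point so that
    the entropy of its neighbor distribution matches log(perplexity).
    sq_dists: squared distances to the point's neighbors (self excluded)."""
    target_entropy = np.log(target_perplexity)
    beta, lo, hi = 1.0, 0.0, np.inf  # beta = 1 / (2 * sigma^2)
    for _ in range(max_iter):
        p = np.exp(-sq_dists * beta)
        p /= p.sum()
        entropy = -(p * np.log(np.maximum(p, 1e-12))).sum()
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:  # distribution too flat: sharpen it
            lo = beta
            beta = beta * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
        else:                         # distribution too peaked: widen it
            hi = beta
            beta = (lo + hi) / 2.0
    return np.sqrt(1.0 / (2.0 * beta))

rng = np.random.default_rng(0)
d2 = rng.uniform(0.5, 4.0, size=200)  # squared distances to 200 neighbors
sigma = find_sigma(d2, target_perplexity=30.0)
print(f"sigma = {sigma:.3f}")
```

This is the kind of search that perplexity_max_iter bounds: more iterations give bandwidths that match the target perplexity more precisely.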
- fit(self, X, y=None, *, convert_dtype=True, knn_graph=None) → ‘TSNE’[source]#
Fit X into an embedded space.
- Parameters:
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If the datatype is other than float or double, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then raise an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the method will automatically convert the inputs to np.float32.
- knn_graph : array / sparse array / tuple, optional (device or host)
Either one of a tuple (indices, distances) of arrays of shape (n_samples, n_neighbors), a pairwise distances dense array of shape (n_samples, n_samples), or a KNN graph sparse array (preferably CSR/COO). This feature allows the precomputation of the KNN outside of TSNE and also allows the use of a custom distance function. The distance function should match the metric used to train the TSNE embeddings. Takes precedence over the precomputed_knn parameter.
- fit_transform(self, X, y=None, *, convert_dtype=True, knn_graph=None) → CumlArray[source]#
Fit X into an embedded space and return that transformed output.
- Parameters:
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than float or double, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then raise an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If the datatype is other than float or double, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then raise an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the method will automatically convert the inputs to np.float32.
- Returns:
- X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_components)
Embedding of the data in low-dimensional space.
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
- property kl_divergence_#
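The kl_divergence_ property exposes the value of t-SNE's objective, the Kullback-Leibler divergence between the high-dimensional affinities P and the low-dimensional affinities Q. As a reference for what this quantity measures, here is a hedged NumPy sketch (not cuML's internal code):

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) = sum_ij P_ij * log(P_ij / Q_ij), the quantity t-SNE
    minimizes. P and Q are pairwise affinity matrices that each sum to 1."""
    P = np.maximum(P, eps)  # clamp to avoid log(0)
    Q = np.maximum(Q, eps)
    return float((P * np.log(P / Q)).sum())

# Toy affinities: normalize two random non-negative matrices with zero diagonals.
rng = np.random.default_rng(0)
P = rng.random((50, 50)); np.fill_diagonal(P, 0.0); P /= P.sum()
Q = rng.random((50, 50)); np.fill_diagonal(Q, 0.0); Q /= Q.sum()
print(round(kl_divergence(P, Q), 4))
```

Lower values indicate that the embedding's affinities Q match the input affinities P more closely; KL(P || P) is exactly zero.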