PCA

class cuml.dask.decomposition.PCA(*, client=None, verbose=False, **kwargs)

PCA (Principal Component Analysis) is a fundamental dimensionality reduction technique that combines the features in X into linear combinations such that each new component captures as much of the remaining variance in the data as possible. n_components is usually small, for example 3, which makes PCA useful for data visualization, data compression and exploratory analysis.

cuML’s multi-node multi-GPU (MNMG) PCA expects a Dask cuDF object as input and provides two algorithms: Full and Jacobi. Full (default) performs a full eigendecomposition and then selects the top K eigenvectors. The Jacobi algorithm can be much faster because it iteratively refines the top K eigenvectors, but it might be less accurate.
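The idea behind the Full approach can be illustrated on a single machine with NumPy. The following is only a sketch of "eigendecompose the covariance matrix, then keep the top K eigenvectors"; it is not cuML's distributed implementation, and the helper name `top_k_components` is hypothetical.

```python
import numpy as np

def top_k_components(X, k):
    """Return the top-k principal directions and their eigenvalues.

    Hypothetical helper for illustration only; not part of cuML.
    """
    # Mean-center each column
    Xc = X - X.mean(axis=0)
    # Sample covariance matrix of the features (n_features x n_features)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Reorder to descending and keep the k largest
    order = np.argsort(eigvals)[::-1][:k]
    # Rows of the returned matrix are the principal directions
    return eigvecs[:, order].T, eigvals[order]

rng = np.random.default_rng(0)
# Five features with clearly different scales so the ordering is unambiguous
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
components, variances = top_k_components(X, k=2)
```

The selected eigenvectors agree, up to sign, with the leading right singular vectors of the centered data, which is why the parameter descriptions can speak of eigenvectors and singular vectors interchangeably.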

Parameters:
n_components : int (default = 1)

The number of top K singular vectors / values you want. Must be <= the number of columns.

svd_solver : ‘full’, ‘jacobi’, ‘auto’

‘full’: Run exact full SVD and select the components by postprocessing.
‘jacobi’: Iteratively compute SVD of the covariance matrix.
‘auto’: For compatibility with Scikit-learn. Alias for ‘jacobi’.

verbose : int or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

whiten : boolean (default = False)

If True, de-correlates the components. This is done by dividing them by the corresponding singular values then multiplying by sqrt(n_samples). Whitening allows each component to have unit variance and removes multi-collinearity. It might be beneficial for downstream tasks like LinearRegression where correlated features cause problems.
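The whitening step described above can be checked numerically. The NumPy sketch below follows the description literally (divide the projected data by the singular values, then multiply by sqrt(n_samples)); it is an illustration on a single machine, not cuML's code, and the synthetic data is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two strongly correlated features plus one independent feature
a = rng.normal(size=500)
X = np.column_stack([a, 2.0 * a + 0.1 * rng.normal(size=500), rng.normal(size=500)])

n_samples = X.shape[0]
Xc = X - X.mean(axis=0)                       # mean-center the columns
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
projected = Xc @ Vt[:k].T                     # ordinary PCA scores
# Whitening: divide by the singular values, then scale by sqrt(n_samples)
whitened = projected / S[:k] * np.sqrt(n_samples)

# Population covariance (divide by n_samples) of the whitened scores
cov_whitened = whitened.T @ whitened / n_samples
```

Under these assumptions `cov_whitened` is the identity matrix up to floating-point error: each whitened component has unit variance and the components are uncorrelated.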

Attributes:
components_ : array

The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X)

explained_variance_ : array

How much each component explains the variance in the data given by S**2

explained_variance_ratio_ : array

The fraction of the total variance explained by each component, given by S**2/sum(S**2)

singular_values_ : array

The top K singular values. Note that all singular values are >= 0.

mean_ : array

The column-wise mean of X. Used to mean-center the data first.

noise_variance_ : float

From Bishop 1999’s Textbook. Used in later tasks like calculating the estimated covariance of X.
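The relationships between these attributes and the SVD can be sketched with NumPy. This is a single-machine illustration that follows the formulas quoted above; the exact array layout and scaling of the cuML attributes may differ, and the variable names are chosen here for the example only.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

n_components = 2
mean = X.mean(axis=0)                          # analogous to mean_
U, S, VT = np.linalg.svd(X - mean, full_matrices=False)

# Top K components as written above: VT.T[:, :n_components]
# (columns are the principal directions in this layout)
components = VT.T[:, :n_components]
# Top K singular values; NumPy returns them non-negative and in descending order
singular_values = S[:n_components]
# Share of the total variance captured by each kept component
ratio = S[:n_components] ** 2 / np.sum(S ** 2)
```

Because only the first `n_components` entries are kept, `ratio` sums to at most 1; the remainder is the variance that the discarded components would have explained.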

Methods

fit(X)

Fit the model with X.

fit_transform(X)

Fit the model with X and apply the dimensionality reduction on X.

inverse_transform(X[, delayed])

Transform data back to its original space.

transform(X[, delayed])

Apply dimensionality reduction to X.

Notes

Known Limitation: The random_state parameter is not supported in the multi-node multi-GPU implementation. Results may vary slightly between runs.

PCA considers linear combinations of features, specifically those that maximize global variance structure. This means PCA is fantastic for global structure analyses, but weak for local relationships. Consider UMAP or t-SNE for an embedding that preserves local structure.

Applications of PCA

PCA is used extensively in practice for data visualization and data compression. It has been used to visualize extremely large word embeddings like Word2Vec and GloVe in 2 or 3 dimensions, to explore large datasets of everyday objects and images, and to distinguish cancerous cells from healthy cells.

For additional docs, see scikit-learn’s PCA.

Examples

>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client, wait
>>> import cupy as cp
>>> from cuml.dask.decomposition import PCA
>>> from cuml.dask.datasets import make_blobs

>>> cluster = LocalCUDACluster(threads_per_worker=1)
>>> client = Client(cluster)

>>> nrows = 6
>>> ncols = 3
>>> n_parts = 2

>>> X_cudf, _ = make_blobs(n_samples=nrows, n_features=ncols,
...                        centers=1, n_parts=n_parts,
...                        cluster_std=0.01, random_state=10,
...                        dtype=cp.float32)

>>> blobs = X_cudf.compute()
>>> print(blobs)
[[8.688037  3.122401  1.2581943]
 [8.705028  3.1070278 1.2705998]
 [8.70239   3.1102846 1.2716919]
 [8.695665  3.1042147 1.2635932]
 [8.681095  3.0980906 1.2745825]
 [8.705454  3.100002  1.2657361]]

>>> cumlModel = PCA(n_components = 1, whiten=False)
>>> XT = cumlModel.fit_transform(X_cudf)
>>> print(XT.compute())
[[-1.7516235e-02]
 [ 7.8094802e-03]
 [ 4.2757220e-03]
 [-6.7228684e-05]
 [-5.0618490e-03]
 [ 1.0557819e-02]]
>>> client.close()
>>> cluster.close()

fit(X)

Fit the model with X.

Parameters:
X : dask cuDF input

fit_transform(X)

Fit the model with X and apply the dimensionality reduction on X.

Parameters:
X : dask cuDF
Returns:
X_new : dask cuDF

inverse_transform(X, delayed=True)

Transform data back to its original space.

In other words, return an input X_original whose transform would be X.

Parameters:
X : dask cuDF
Returns:
X_original : dask cuDF

transform(X, delayed=True)

Apply dimensionality reduction to X.

X is projected on the first principal components previously extracted from a training set.

Parameters:
X : dask cuDF
Returns:
X_new : dask cuDF
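
What transform and inverse_transform compute can be sketched with NumPy on a single machine. The example assumes no whitening and uses synthetic data that lies close to a 2-dimensional subspace; it illustrates the projection and its inverse and is not a description of how cuML distributes the work across GPUs.

```python
import numpy as np

rng = np.random.default_rng(3)
# Data that lies close to a 2-dimensional subspace of a 5-dimensional space
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(300, 5))

mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
W = Vt[:2]                        # top 2 principal directions, shape (2, 5)

# "transform": center the data and project it onto the components
X_new = (X - mean) @ W.T
# "inverse transform": map the projection back to the original feature space
X_original = X_new @ W + mean

# Largest absolute reconstruction error over all entries
max_error = np.abs(X - X_original).max()
```

Since only 2 of 5 components are kept, `X_original` is an approximation of `X`; the error is small here because the discarded directions carry only the added noise. Transforming `X_original` again returns `X_new`, which matches the description of inverse_transform as returning an input whose transform would be X.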