PCA

class cuml.dask.decomposition.PCA(*, client=None, verbose=False, **kwargs)

PCA (Principal Component Analysis) is a fundamental dimensionality reduction technique that combines the features in X into linear combinations such that each new component captures as much of the remaining variance in the data as possible. n_components is usually small, say 3, which makes the result useful for data visualization, data compression and exploratory analysis.

cuML’s multi-node multi-GPU (MNMG) PCA expects a Dask cuDF object as input and provides 2 algorithms: Full and Jacobi. Full (default) uses a full eigendecomposition then selects the top K eigenvectors. The Jacobi algorithm can be much faster as it iteratively tries to correct the top K eigenvectors, but might be less accurate.
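The "full" strategy described above can be illustrated on a single machine. The following is a minimal NumPy sketch, not cuML's implementation: it assumes the covariance matrix is formed explicitly and decomposed with `numpy.linalg.eigh`, and the helper name `pca_full_eig` is hypothetical.

```python
import numpy as np


def pca_full_eig(X, k):
    """Top-k principal components via a full eigendecomposition (illustrative only)."""
    mean = X.mean(axis=0)
    Xc = X - mean                                # mean-center the data
    cov = (Xc.T @ Xc) / (X.shape[0] - 1)         # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # all eigenpairs, ascending order
    order = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
    components = eigvecs[:, order].T             # shape (k, n_features)
    return components, eigvals[order], mean


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.1])
components, variances, mean = pca_full_eig(X, k=2)
print(components.shape, variances)
```

Every eigenpair is computed and then all but the top K are discarded, which is why an iterative solver that targets only the top K can be faster.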

Parameters:

n_components : int (default = 1)
    The number of top K singular vectors / values you want. Must be <= number(columns).
svd_solver : 'full', 'jacobi', 'auto'
    'full': Run exact full SVD and select the components by postprocessing.
    'jacobi': Iteratively compute SVD of the covariance matrix.
    'auto': For compatibility with Scikit-learn. Alias for 'jacobi'.
verbose : int or boolean, default=False
    Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
whiten : boolean (default = False)
    If True, de-correlates the components. This is done by dividing them by the corresponding singular values then multiplying by sqrt(n_samples). Whitening gives each component unit variance and removes multi-collinearity. It might be beneficial for downstream tasks like LinearRegression where correlated features cause problems.
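The whitening step described for `whiten` can be checked numerically. This is a small NumPy sketch under the assumption that the projection is computed from an SVD of the mean-centered data. It is not cuML code, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing           # correlated features
Xc = X - X.mean(axis=0)

U, S, VT = np.linalg.svd(Xc, full_matrices=False)
k = 2
projected = Xc @ VT[:k].T                        # un-whitened components

# Divide by the singular values, then multiply by sqrt(n_samples).
whitened = projected / S[:k] * np.sqrt(X.shape[0])

print(np.cov(whitened.T, bias=True))             # close to the identity matrix
```

With the sqrt(n_samples) factor, the unit variance holds under the population (divide by n_samples) convention, which is why `bias=True` is passed to `np.cov` here.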

Attributes:

components_ : array
    The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X).
explained_variance_ : array
    How much of the variance in the data each component explains, given by S**2.
explained_variance_ratio_ : array
    The fraction of the total variance explained by each component, given by S**2/sum(S**2).
singular_values_ : array
    The top K singular values. Remember all singular values >= 0.
mean_ : array
    The column-wise mean of X. Used to mean-center the data first.
noise_variance_ : float
    From Bishop 1999's Textbook. Used in later tasks like calculating the estimated covariance of X.
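The formulas quoted for these attributes can be reproduced with a plain SVD. The sketch below uses NumPy and follows the expressions given above. It is an illustration only, so the shapes and scaling of the actual cuML attributes may differ.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5)) * np.array([4.0, 3.0, 2.0, 1.0, 0.5])
k = 2

mean = X.mean(axis=0)                            # column-wise mean
U, S, VT = np.linalg.svd(X - mean, full_matrices=False)

components = VT.T[:, :k]                         # top K right singular vectors
singular_values = S[:k]                          # non-negative, sorted descending
explained_variance_ratio = S[:k] ** 2 / np.sum(S ** 2)

print(singular_values)
print(explained_variance_ratio)
```

The ratios are fractions of the total variance, so they are non-increasing and sum to at most 1 when fewer than all components are kept.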

Methods

`fit`(X)	Fit the model with X.
`fit_transform`(X)	Fit the model with X and apply the dimensionality reduction on X.
`inverse_transform`(X[, delayed])	Transform data back to its original space.
`transform`(X[, delayed])	Apply dimensionality reduction to X.

Notes

Known Limitation: The random_state parameter is not supported in the multi-node multi-GPU implementation. Results may vary slightly between runs.

PCA considers linear combinations of features, specifically those that maximize global variance structure. This means PCA is well suited to analyses of global structure, but weak at capturing local relationships. Consider UMAP or t-SNE for an embedding that preserves local structure.

Applications of PCA

PCA is used extensively in practice for data visualization and data compression. It has been used to visualize extremely large word embeddings like Word2Vec and GloVe in 2 or 3 dimensions, to visualize large datasets of everyday objects and images, and to distinguish cancerous cells from healthy cells.

For additional docs, see scikit-learn’s PCA.

Examples

>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client, wait
>>> import cupy as cp
>>> from cuml.dask.decomposition import PCA
>>> from cuml.dask.datasets import make_blobs

>>> cluster = LocalCUDACluster(threads_per_worker=1)
>>> client = Client(cluster)

>>> nrows = 6
>>> ncols = 3
>>> n_parts = 2

>>> X_cudf, _ = make_blobs(n_samples=nrows, n_features=ncols,
...                        centers=1, n_parts=n_parts,
...                        cluster_std=0.01, random_state=10,
...                        dtype=cp.float32)

>>> blobs = X_cudf.compute()
>>> print(blobs)
[[8.688037  3.122401  1.2581943]
 [8.705028  3.1070278 1.2705998]
 [8.70239   3.1102846 1.2716919]
 [8.695665  3.1042147 1.2635932]
 [8.681095  3.0980906 1.2745825]
 [8.705454  3.100002  1.2657361]]

>>> cumlModel = PCA(n_components = 1, whiten=False)
>>> XT = cumlModel.fit_transform(X_cudf)
>>> print(XT.compute())
[[-1.7516235e-02]
 [ 7.8094802e-03]
 [ 4.2757220e-03]
 [-6.7228684e-05]
 [-5.0618490e-03]
 [ 1.0557819e-02]]
>>> client.close()
>>> cluster.close()

fit(X)

Fit the model with X.

Parameters:

X : dask cuDF input

fit_transform(X)

Fit the model with X and apply the dimensionality reduction on X.

Parameters:

X : dask cuDF

Returns:

X_new : dask cuDF

inverse_transform(X, delayed=True)

Transform data back to its original space.

In other words, return an input X_original whose transform would be X.

Parameters:

X : dask cuDF

Returns:

X_original : dask cuDF
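The relationship between the forward and inverse mappings can be sketched in NumPy. This assumes whiten=False and an SVD-based fit. The helpers `transform` and `inverse_transform` below are hypothetical stand-ins, not the cuML methods.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3)) * np.array([3.0, 1.0, 0.2])
mean = X.mean(axis=0)
_, _, VT = np.linalg.svd(X - mean, full_matrices=False)


def transform(X, components, mean):
    # Project mean-centered data onto the components.
    return (X - mean) @ components.T


def inverse_transform(X_t, components, mean):
    # Map reduced data back to the original feature space.
    return X_t @ components + mean


# Keeping every component makes the round trip exact (up to rounding).
X_back = inverse_transform(transform(X, VT, mean), VT, mean)

# Keeping one component gives an approximation of X.
top1 = VT[:1]
X_approx = inverse_transform(transform(X, top1, mean), top1, mean)
err = np.linalg.norm(X - X_approx)
print(np.allclose(X_back, X), err)
```

Transforming `X_approx` again yields the same reduced representation as transforming `X`, which is the sense in which the result is "an input whose transform would be X".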

transform(X, delayed=True)

Apply dimensionality reduction to X.

X is projected on the first principal components previously extracted from a training set.

Parameters:

X : dask cuDF

Returns:

X_new : dask cuDF
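Projecting new data onto components extracted from a training set can be sketched as follows. This is a NumPy illustration assuming whiten=False, with hypothetical variable names. The key point is that the mean and components come from the training data, not from the data being transformed.

```python
import numpy as np

rng = np.random.default_rng(4)
scales = np.array([3.0, 1.0, 0.2])
X_train = rng.normal(size=(100, 3)) * scales + 10.0
X_new = rng.normal(size=(5, 3)) * scales + 10.0

# "Fit": learn the mean and the top 2 components from the training set only.
train_mean = X_train.mean(axis=0)
_, _, VT = np.linalg.svd(X_train - train_mean, full_matrices=False)
components = VT[:2]

# "Transform": center new rows with the training mean, then project.
X_new_t = (X_new - train_mean) @ components.T
print(X_new_t.shape)
```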