PCA
- class cuml.dask.decomposition.PCA(*, client=None, verbose=False, **kwargs)
PCA (Principal Component Analysis) is a fundamental dimensionality reduction technique that combines the features in X into linear combinations such that each new component captures as much of the variance of the data as possible. n_components is usually small, for example 3, which makes PCA useful for data visualization, data compression and exploratory analysis.
cuML’s multi-node multi-GPU (MNMG) PCA expects a Dask cuDF object as input and provides two algorithms: Full and Jacobi. Full (default) computes a full eigendecomposition and then selects the top K eigenvectors. The Jacobi algorithm can be much faster because it iteratively refines the top K eigenvectors, but it might be less accurate.
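The "full eigendecomposition, then keep the top K eigenvectors" idea described above can be sketched on a single machine with NumPy. This is only an illustration of the technique, not cuML's distributed implementation; the helper name `pca_full_eig` and the demo data are hypothetical, and the Jacobi variant is not shown.

```python
import numpy as np

def pca_full_eig(X, n_components):
    """Illustrative helper (not part of cuML): top-K principal axes
    from a full eigendecomposition of the sample covariance matrix."""
    X = np.asarray(X, dtype=np.float64)
    mean = X.mean(axis=0)
    Xc = X - mean                                # mean-center each column
    cov = Xc.T @ Xc / (X.shape[0] - 1)           # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # indices of the K largest
    return eigvals[order], eigvecs[:, order].T, mean

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
top_vals, top_axes, col_mean = pca_full_eig(X_demo, n_components=2)
```

Each row of `top_axes` is one principal axis, and `top_vals` holds the variance captured along it.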
- Parameters:
- n_components : int (default = 1)
The number of top K singular vectors / values you want. Must be <= the number of columns in X.
- svd_solver : ‘full’, ‘jacobi’, ‘auto’
‘full’: Run exact full SVD and select the components by postprocessing.
‘jacobi’: Iteratively compute SVD of the covariance matrix.
‘auto’: For compatibility with Scikit-learn. Alias for ‘jacobi’.
- verbose : int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- whiten : boolean (default = False)
If True, de-correlates the components. This is done by dividing them by the corresponding singular values then multiplying by sqrt(n_samples). Whitening allows each component to have unit variance and removes multi-collinearity. It might be beneficial for downstream tasks like LinearRegression where correlated features cause problems.
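The whitening step described for `whiten` can be checked with a small NumPy sketch. It follows the wording above (divide by the singular values, multiply by sqrt(n_samples)) and is an assumption about the arithmetic, not a copy of cuML's code; the data and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=(500, 3))
# Mix the columns so the features are correlated.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X_corr = base @ mixing

Xc = X_corr - X_corr.mean(axis=0)                  # mean-center
U, S, VT = np.linalg.svd(Xc, full_matrices=False)

k = 2
scores = Xc @ VT[:k].T                             # projected components
whitened = scores / S[:k] * np.sqrt(Xc.shape[0])   # divide by S, scale by sqrt(n_samples)

# Covariance of the whitened components, normalized by n_samples.
whitened_cov = whitened.T @ whitened / Xc.shape[0]
```

Under this scaling `whitened_cov` is the identity matrix up to rounding: each component has unit variance and the components are uncorrelated.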
- Attributes:
- components_ : array
The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X)
- explained_variance_ : array
How much of the variance in the data each component explains, given by S**2
- explained_variance_ratio_ : array
The fraction of the total variance explained by each component, given by S**2/sum(S**2)
- singular_values_ : array
The top K singular values. All singular values are >= 0.
- mean_ : array
The column-wise mean of X. Used to mean-center the data first.
- noise_variance_ : float
From Bishop 1999’s Textbook. Used in later tasks like calculating the estimated covariance of X.
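The relationship between these attributes and the SVD can be sketched with NumPy. Two choices here are assumptions rather than statements about cuML: the explained variance is scaled by 1/(n_samples - 1) as in scikit-learn (the text above gives it only as S**2), and the noise variance is taken as the mean of the discarded variances, also following scikit-learn's convention.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
n_samples, n_components = X.shape[0], 2

mean_ = X.mean(axis=0)                                   # column-wise mean
U, S, VT = np.linalg.svd(X - mean_, full_matrices=False)

components = VT.T[:, :n_components]                      # shape (n_features, n_components)
singular_values = S[:n_components]                       # sorted in descending order
# Assumption: scikit-learn style scaling by (n_samples - 1).
explained_variance = S[:n_components] ** 2 / (n_samples - 1)
explained_variance_ratio = S[:n_components] ** 2 / np.sum(S ** 2)
# Assumption: mean variance of the discarded components.
noise_variance = np.mean(S[n_components:] ** 2 / (n_samples - 1))
```

The ratio does not depend on the scaling convention, because the same factor appears in the numerator and the denominator.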
Methods
fit(X): Fit the model with X.
fit_transform(X): Fit the model with X and apply the dimensionality reduction on X.
inverse_transform(X[, delayed]): Transform data back to its original space.
transform(X[, delayed]): Apply dimensionality reduction to X.
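What transform and inverse_transform compute can be illustrated with plain NumPy. The functions `project` and `reconstruct` below are hypothetical stand-ins written for this sketch, assuming no whitening; they are not the cuML methods and they ignore the `delayed` argument.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 3)) * np.array([4.0, 1.0, 0.1])
mean = X.mean(axis=0)
_, _, VT = np.linalg.svd(X - mean, full_matrices=False)

def project(X, axes, mean):
    """Stand-in for transform: center the data, then project onto the axes."""
    return (X - mean) @ axes.T

def reconstruct(Z, axes, mean):
    """Stand-in for inverse_transform: map back and restore the mean."""
    return Z @ axes + mean

# Keeping every component gives back the original data (up to rounding).
X_full = reconstruct(project(X, VT, mean), VT, mean)

# Keeping two of three components gives a lossy approximation.
X_k2 = reconstruct(project(X, VT[:2], mean), VT[:2], mean)
err_k2 = np.linalg.norm(X - X_k2)
```

The reconstruction error `err_k2` comes entirely from the discarded third component, which is why dropping low-variance directions costs little.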
Notes
Known Limitation: The random_state parameter is not supported in the multi-node multi-GPU implementation. Results may vary slightly between runs.
PCA considers linear combinations of features, specifically those that maximize global variance structure. This means PCA is fantastic for global structure analyses, but weak for local relationships. Consider UMAP or T-SNE for a locally important embedding.
Applications of PCA
PCA is used extensively in practice for data visualization and data compression. It has been used to visualize extremely large word embeddings like Word2Vec and GloVe in 2 or 3 dimensions, to visualize large datasets of everyday objects and images, and to distinguish cancerous cells from healthy cells.
For additional docs, see scikit-learn’s PCA.
Examples
>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client, wait
>>> import cupy as cp
>>> from cuml.dask.decomposition import PCA
>>> from cuml.dask.datasets import make_blobs
>>> cluster = LocalCUDACluster(threads_per_worker=1)
>>> client = Client(cluster)
>>> nrows = 6
>>> ncols = 3
>>> n_parts = 2
>>> X_cudf, _ = make_blobs(n_samples=nrows, n_features=ncols,
...                        centers=1, n_parts=n_parts,
...                        cluster_std=0.01, random_state=10,
...                        dtype=cp.float32)
>>> blobs = X_cudf.compute()
>>> print(blobs)
[[8.688037  3.122401  1.2581943]
 [8.705028  3.1070278 1.2705998]
 [8.70239   3.1102846 1.2716919]
 [8.695665  3.1042147 1.2635932]
 [8.681095  3.0980906 1.2745825]
 [8.705454  3.100002  1.2657361]]
>>> cumlModel = PCA(n_components = 1, whiten=False)
>>> XT = cumlModel.fit_transform(X_cudf)
>>> print(XT.compute())
[[-1.7516235e-02]
 [ 7.8094802e-03]
 [ 4.2757220e-03]
 [-6.7228684e-05]
 [-5.0618490e-03]
 [ 1.0557819e-02]]
>>> client.close()
>>> cluster.close()
- fit_transform(X)
Fit the model with X and apply the dimensionality reduction on X.
- Parameters:
- X : dask cuDF
- Returns:
- X_new : dask cuDF