PCA
- class cuml.decomposition.PCA(*, copy=True, iterated_power=15, n_components=None, svd_solver='auto', tol=1e-07, verbose=False, whiten=False, output_type=None)
PCA (Principal Component Analysis) is a fundamental dimensionality reduction technique that combines the features in X into linear combinations such that each new component captures as much of the remaining variance (information) in the data as possible. n_components is usually small, say 3, so the result can be used for data visualization, data compression and exploratory analysis.
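The idea of variance-maximizing linear combinations can be sketched with plain NumPy on the CPU. This is an illustration under stated assumptions, not cuML's GPU implementation: the variable names are made up for the sketch, and the toy data reuses the values from the Examples section.

```python
import numpy as np

# Toy data: rows are samples, columns are features.
X = np.array([[1.0, 4.0, 4.0],
              [2.0, 2.0, 2.0],
              [5.0, 1.0, 1.0]])

# Center each column, then form the sample covariance matrix.
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (X.shape[0] - 1)

# eigh returns eigenvalues in ascending order; reorder to descending variance.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the top K directions and project the centered data onto them.
k = 2
components = eigvecs[:, :k].T        # shape (k, n_features)
scores = X_centered @ components.T   # shape (n_samples, k)

# The variance of each projected column equals the matching eigenvalue.
print(scores.var(axis=0, ddof=1))
print(eigvals[:k])
```

Each row of `components` is one linear combination of the original features, ordered from most to least variance captured.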
cuML’s PCA expects an array-like object or cuDF DataFrame and provides two algorithms: Full and Jacobi. Full (default) computes a full eigendecomposition and then selects the top K eigenvectors. Jacobi is much faster because it iteratively refines the top K eigenvectors, but it might be less accurate.
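To make the contrast concrete, here is a rough CPU sketch of a Jacobi-style eigenvalue iteration compared against a direct eigendecomposition. It assumes that the Jacobi solver behaves like a classical cyclic Jacobi method, with `tol` as the convergence threshold and `max_sweeps` playing the role of `iterated_power`; `jacobi_eigh` is a hypothetical helper written for this illustration and is not part of cuML.

```python
import numpy as np

def jacobi_eigh(A, tol=1e-7, max_sweeps=15):
    """Cyclic Jacobi iteration for a symmetric matrix (illustrative helper)."""
    A = np.array(A, dtype=np.float64)
    n = A.shape[0]
    V = np.eye(n)  # accumulates the rotations; columns become eigenvectors
    for _ in range(max_sweeps):
        # Converged once the off-diagonal part is negligible.
        if np.linalg.norm(A - np.diag(np.diag(A))) < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p, q] == 0.0:
                    continue
                diff = A[q, q] - A[p, p]
                # Angle chosen so that the rotation zeroes A[p, q].
                theta = np.pi / 4 if diff == 0.0 else 0.5 * np.arctan(2.0 * A[p, q] / diff)
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p], J[q, q] = c, c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J
                V = V @ J
    order = np.argsort(np.diag(A))[::-1]  # largest eigenvalue first
    return np.diag(A)[order], V[:, order]

# Covariance matrix of the toy data used in the Examples section.
cov = np.array([[13.0, -8.5, -8.5],
                [-8.5,  7.0,  7.0],
                [-8.5,  7.0,  7.0]]) / 3.0

vals, vecs = jacobi_eigh(cov)
ref = np.linalg.eigh(cov)[0][::-1]  # direct ("full") eigendecomposition
print(vals)
print(ref)
```

A looser `tol` or a smaller `max_sweeps` stops the rotations earlier, which trades accuracy for speed in this sketch.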
- Parameters:
- copy : boolean (default = True)
If True, copies the data and then removes the mean from the copy. If False, the input data might be overwritten with its mean-centered version.
- iterated_power : int (default = 15)
Used in the Jacobi solver. More iterations give more accurate results but take longer.
- n_components : int (default = None)
The number of top K singular vectors / values to keep. Must be <= the number of columns. If n_components is not set, then all components are kept:
n_components = min(n_samples, n_features)
- svd_solver : ‘full’ or ‘jacobi’ or ‘auto’ (default = ‘auto’)
Full uses an eigendecomposition of the covariance matrix and then discards components. Jacobi is much faster because it iteratively corrects its estimate, but it is less accurate.
- tol : float (default = 1e-7)
Used if svd_solver = ‘jacobi’. A smaller tolerance can increase accuracy, but will slow down the algorithm’s convergence.
- verbose : int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- whiten : boolean (default = False)
If True, de-correlates the components. This is done by dividing them by the corresponding singular values and then multiplying by sqrt(n_samples). Whitening gives each component unit variance and removes multicollinearity. It might be beneficial for downstream tasks like LinearRegression, where correlated features cause problems.
- output_type : {‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
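The whitening step described for the whiten parameter can be sketched in NumPy. This follows the description literally (divide by the singular values, multiply by sqrt(n_samples)) and is only an illustration, not cuML's code; with this scaling, "unit variance" holds for the population variance, which divides by n_samples.

```python
import numpy as np

X = np.array([[1.0, 4.0, 4.0],
              [2.0, 2.0, 2.0],
              [5.0, 1.0, 1.0]])
n_samples = X.shape[0]
X_centered = X - X.mean(axis=0)

# Thin SVD of the centered data: X_centered = U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2
scores = X_centered @ Vt[:k].T  # unwhitened projection; column scale follows S

# Whitening: divide by the singular values, then multiply by sqrt(n_samples).
whitened = scores / S[:k] * np.sqrt(n_samples)

# Each whitened component has unit population variance and the components
# are uncorrelated, so the second print shows an identity matrix.
print(np.mean(whitened**2, axis=0))
print(whitened.T @ whitened / n_samples)
```

A convention that normalizes the variance by n_samples - 1 would use sqrt(n_samples - 1) as the scale factor instead.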
- Attributes:
- components_ : array
The top K components, stored as rows: VT[:n_components] in U, S, VT = svd(X), with shape (n_components, n_features).
- explained_variance_ : array
The variance explained by each component, given by S**2 / (n_samples - 1).
- explained_variance_ratio_ : array
The fraction of the total variance explained by each component, given by S**2/sum(S**2).
- singular_values_ : array
The top K singular values. All singular values are >= 0.
- mean_ : array
The column-wise mean of X. Used to mean-center the data first.
- noise_variance_ : float
From Bishop 1999’s Textbook. Used in later tasks like calculating the estimated covariance of X.
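As a cross-check on how these attributes relate to one another, the following NumPy sketch recomputes them from an SVD of the centered toy data used in the Examples section. It is a CPU illustration with made-up variable names, not the cuML implementation, and it assumes the variance is normalized by n_samples - 1, which is what the example outputs are consistent with.

```python
import numpy as np

X = np.array([[1.0, 4.0, 4.0],
              [2.0, 2.0, 2.0],
              [5.0, 1.0, 1.0]])
n_samples, n_components = X.shape[0], 2

# Column-wise mean, used to center the data before the SVD.
mean = X.mean(axis=0)
U, S, VT = np.linalg.svd(X - mean, full_matrices=False)

components = VT[:n_components]        # rows are the principal directions
singular_values = S[:n_components]
explained_variance = singular_values**2 / (n_samples - 1)
total_variance = (S**2).sum() / (n_samples - 1)
explained_variance_ratio = explained_variance / total_variance

print(singular_values)            # approx [4.126 0.990]
print(explained_variance)         # approx [8.510 0.490]
print(explained_variance_ratio)   # approx [0.946 0.054]
```

The sign of each row in `components` is arbitrary, so it may differ from the sign a particular solver returns.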
Methods
fit(self, X[, y, convert_dtype]): Fit the model with X.
fit_transform(self, X[, y]): Fit the model with X and apply the dimensionality reduction on X.
inverse_transform(self, X, *[, ...]): Transform data back to its original space.
transform(self, X, *[, convert_dtype]): Apply dimensionality reduction to X.
Notes
PCA considers linear combinations of features, specifically those that maximize global variance structure. This means PCA is fantastic for global structure analyses, but weak for local relationships. Consider UMAP or t-SNE for an embedding that preserves local structure.
Applications of PCA
PCA is used extensively in practice for data visualization and data compression. It has been used to visualize extremely large word embeddings like Word2Vec and GloVe in 2 or 3 dimensions, to visualize large datasets of everyday objects and images, and to distinguish cancerous cells from healthy cells.
For additional docs, see scikit-learn’s PCA.
Examples
>>> # Both import methods supported
>>> from cuml import PCA
>>> from cuml.decomposition import PCA
>>> import cudf
>>> import cupy as cp
>>> gdf_float = cudf.DataFrame()
>>> gdf_float['0'] = cp.asarray([1.0,2.0,5.0], dtype = cp.float32)
>>> gdf_float['1'] = cp.asarray([4.0,2.0,1.0], dtype = cp.float32)
>>> gdf_float['2'] = cp.asarray([4.0,2.0,1.0], dtype = cp.float32)
>>> pca_float = PCA(n_components = 2)
>>> pca_float.fit(gdf_float)
PCA()
>>> print(f'components: {pca_float.components_}')
components:            0           1           2
0  0.69225764  -0.5102837 -0.51028395
1 -0.72165036 -0.48949987  -0.4895003
>>> print(f'explained variance: {pca_float.explained_variance_}')
explained variance: 0   8.510...
1 0.489...
dtype: float32
>>> exp_var = pca_float.explained_variance_ratio_
>>> print(f'explained variance ratio: {exp_var}')
explained variance ratio: 0   0.9456...
1 0.054...
dtype: float32
>>> print(f'singular values: {pca_float.singular_values_}')
singular values: 0   4.125...
1 0.989...
dtype: float32
>>> print(f'mean: {pca_float.mean_}')
mean: 0   2.666...
1 2.333...
2 2.333...
dtype: float32
>>> trans_gdf_float = pca_float.transform(gdf_float)
>>> print(f'Transformed: {trans_gdf_float}')
Transformed:             0           1
0   -2.8547091 -0.42891636
1 -0.121316016  0.80743366
2    2.9760244 -0.37851727
>>> input_gdf_float = pca_float.inverse_transform(trans_gdf_float)
>>> print(f'Input: {input_gdf_float}')
Input:      0    1    2
0  1.0  4.0  4.0
1  2.0  2.0  2.0
2  5.0  1.0  1.0
- fit(self, X, y=None, *, convert_dtype=True) -> 'PCA'
Fit the model with X. y is currently ignored.
- Parameters:
- X : array-like (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- y : array-like (device or host) shape = (n_samples, 1)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtype : bool, optional (default = True)
When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.
- fit_transform(self, X, y=None) -> CumlArray
Fit the model with X and apply the dimensionality reduction on X.
- Parameters:
- X : array-like (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- y : array-like (device or host) shape = (n_samples, 1)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- Returns:
- trans : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, cupyx.scipy.sparse for sparse output, shape = (n_samples, n_components)
Transformed values
For more information on how to configure cuML’s dense output type, refer to: Output Data Type Configuration.
- inverse_transform(self, X, *, convert_dtype=False, return_sparse=False, sparse_tol=1e-10) -> CumlArray
Transform data back to its original space.
In other words, return an input X_original whose transform would be X.
- Parameters:
- X : array-like (device or host) shape = (n_samples, n_components)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtype : bool, optional (default = False)
When set to True, the inverse_transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
- return_sparse : bool, optional (default = False)
Ignored when the model is not fit on a sparse matrix. If True, the method will convert the result to a cupyx.scipy.sparse.csr_matrix object. NOTE: Currently, there is a loss of information when converting to csr matrix (cusolver bug). Default will be switched to True once this is solved.
- sparse_tol : float, optional (default = 1e-10)
Ignored when return_sparse=False. Otherwise, values in the inverse transform below this tolerance are clipped to 0.
- Returns:
- X_inv : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, cupyx.scipy.sparse for sparse output, shape = (n_samples, n_features)
Transformed values
For more information on how to configure cuML’s dense output type, refer to: Output Data Type Configuration.
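The round trip behind inverse_transform can be sketched in NumPy, assuming whitening is disabled: project onto the components, then map back by multiplying with the components and adding the mean. This is an illustration with made-up variable names, not cuML's implementation.

```python
import numpy as np

X = np.array([[1.0, 4.0, 4.0],
              [2.0, 2.0, 2.0],
              [5.0, 1.0, 1.0]])
mean = X.mean(axis=0)
_, S, VT = np.linalg.svd(X - mean, full_matrices=False)

# Two components: the centered toy data has rank 2, so nothing is lost.
components = VT[:2]
X_reduced = (X - mean) @ components.T          # shape (n_samples, 2)
X_restored = X_reduced @ components + mean     # shape (n_samples, n_features)

# One component: the reconstruction is only an approximation.
X_reduced_1 = (X - mean) @ VT[:1].T
X_restored_1 = X_reduced_1 @ VT[:1] + mean
error_1 = np.linalg.norm(X - X_restored_1)

print(np.allclose(X_restored, X))
print(error_1)  # equals the discarded singular value S[1]
```

In general the reconstruction is exact only when the discarded components carry no variance; otherwise the error grows with the singular values that were dropped.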
- transform(self, X, *, convert_dtype=True) -> CumlArray
Apply dimensionality reduction to X.
X is projected on the first principal components previously extracted from a training set.
- Parameters:
- X : array-like (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtype : bool, optional (default = True)
When set to True, the transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
- Returns:
- trans : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, cupyx.scipy.sparse for sparse output, shape = (n_samples, n_components)
Transformed values
For more information on how to configure cuML’s dense output type, refer to: Output Data Type Configuration.
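The projection performed by transform can be sketched in NumPy as centering with the training mean and multiplying by the stored components. The component values below are copied from the Examples section output; everything else (variable names, the extra sample) is made up for the illustration, and whitening is assumed to be disabled.

```python
import numpy as np

# Training data and fitted components from the Examples section.
X_train = np.array([[1.0, 4.0, 4.0],
                    [2.0, 2.0, 2.0],
                    [5.0, 1.0, 1.0]])
components = np.array([[ 0.69225764, -0.5102837,  -0.51028395],
                       [-0.72165036, -0.48949987, -0.4895003 ]])
mean = X_train.mean(axis=0)

# Projecting the training data reproduces the transformed values shown there.
projected = (X_train - mean) @ components.T
print(projected)

# A previously unseen sample is projected with the same mean and components.
X_new = np.array([[3.0, 3.0, 3.0]])
projected_new = (X_new - mean) @ components.T
print(projected_new.shape)
```

The important detail is that new data is centered with the mean learned during fit, not with its own mean.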