TruncatedSVD
- class cuml.decomposition.TruncatedSVD(*, algorithm='full', n_components=1, n_iter=15, random_state=None, tol=1e-07, verbose=False, output_type=None)
TruncatedSVD computes the top K singular values and vectors of a large matrix X. It is much faster when n_components is small, for example when PCA with 3 components is used for 3D visualization.
cuML’s TruncatedSVD expects an array-like object or cuDF DataFrame, and provides two algorithms: Full and Jacobi. Full (default) performs a full eigendecomposition and then selects the top K singular vectors. The Jacobi algorithm is much faster because it iteratively refines the top K singular vectors, but it might be less accurate.
- Parameters:
- algorithm: ‘full’ or ‘jacobi’ or ‘auto’ (default = ‘full’)
Full uses an eigendecomposition of the covariance matrix and then discards components. Jacobi is much faster because it corrects iteratively, but is less accurate.
- n_components: int (default = 1)
The number of top K singular vectors / values you want. Must be <= the number of columns of X.
- n_iter: int (default = 15)
Used in the Jacobi solver. More iterations give more accurate results, but are slower.
- random_state: int / None (default = None)
If you want results to be the same when you restart Python, select a state.
- tol: float (default = 1e-7)
Used if algorithm = “jacobi”. A smaller tolerance can increase accuracy, but will slow down the algorithm’s convergence.
- verbose: int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- output_type: {‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
- Attributes:
- components_: array
The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X)
- explained_variance_: array
How much of the variance in the data each component explains, given by S**2
- explained_variance_ratio_: array
The fraction of the variance that each component explains, given by S**2/sum(S**2)
- singular_values_: array
The top K singular values. Remember that all singular values are >= 0.
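How these attributes relate to a full SVD can be sketched on the host with NumPy. This is only an illustration built on numpy.linalg.svd, not cuML's GPU implementation. The matrix is the one used in the example on this page, and the one-row-per-component layout is an assumption chosen to match the components_ printed there. Singular vectors are only defined up to sign, so values may differ from the estimator's output by a sign flip.

```python
import numpy as np

# Same 3x3 matrix as in the example on this page (columns '0', '1', '2').
X = np.array([[1.0, 4.0, 4.0],
              [2.0, 2.0, 2.0],
              [5.0, 1.0, 1.0]], dtype=np.float32)
n_components = 2

# Full SVD on the host: X = U @ diag(S) @ VT, with S sorted in descending order.
U, S, VT = np.linalg.svd(X, full_matrices=False)

# Keep only the top K right singular vectors and singular values.
components = VT[:n_components]        # shape (n_components, n_features)
singular_values = S[:n_components]    # all entries are >= 0

print(components.shape)
print(singular_values)
```

The truncation step is just slicing: the full decomposition is computed first and everything past the first n_components entries is discarded.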
Methods
- fit(self, X[, y]): Fit model on training cudf DataFrame X.
- fit_transform(self, X[, y, convert_dtype]): Fit model to X and perform dimensionality reduction on X.
- inverse_transform(self, X, *[, convert_dtype]): Transform X back to its original space.
- transform(self, X, *[, convert_dtype]): Perform dimensionality reduction on X.
Notes
TruncatedSVD (the randomized version [Jacobi]) is fantastic when the number of components you want is much smaller than the number of features. The approximation to the largest singular values and vectors is very robust. However, this method loses a lot of accuracy when you want many components.
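The way n_iter and tol bound an iterative solver can be sketched with a textbook cyclic Jacobi eigenvalue iteration in NumPy. This is an assumption about the general technique, not cuML's GPU code. The helper jacobi_eigh is hypothetical, it runs on the host, and it is applied to X.T @ X for the small matrix from the example on this page. Here n_iter caps the number of sweeps and tol is the stopping threshold on the off-diagonal norm.

```python
import numpy as np

def jacobi_eigh(A, n_iter=15, tol=1e-7):
    """Hypothetical helper: cyclic Jacobi iteration for a symmetric matrix A.

    Returns eigenvalues (descending) and the matching eigenvectors as columns.
    """
    A = np.array(A, dtype=np.float64)
    n = A.shape[0]
    V = np.eye(n)
    for _ in range(n_iter):                      # at most n_iter sweeps
        off_norm = np.linalg.norm(A - np.diag(np.diag(A)))
        if off_norm < tol:                       # converged: A is nearly diagonal
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p, q] == 0.0:
                    continue
                diff = A[q, q] - A[p, p]
                # Rotation angle that zeroes the (p, q) entry.
                if diff == 0.0:
                    theta = 0.25 * np.pi
                else:
                    theta = 0.5 * np.arctan(2.0 * A[p, q] / diff)
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J                  # similarity transform keeps eigenvalues
                V = V @ J                        # accumulate eigenvectors
    order = np.argsort(np.diag(A))[::-1]
    return np.diag(A)[order], V[:, order]

X = np.array([[1.0, 4.0, 4.0],
              [2.0, 2.0, 2.0],
              [5.0, 1.0, 1.0]])
k = 2
eigenvalues, eigenvectors = jacobi_eigh(X.T @ X, n_iter=15, tol=1e-7)

# Singular values of X are the square roots of the eigenvalues of X.T @ X.
singular_values = np.sqrt(np.clip(eigenvalues[:k], 0.0, None))
components = eigenvectors[:, :k].T               # one row per component
print(singular_values)
```

Lowering n_iter or raising tol stops the sweeps earlier, which trades accuracy for speed in the same direction the parameter descriptions state.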
Applications of TruncatedSVD
TruncatedSVD is also known as Latent Semantic Indexing (LSI), which tries to find topics in a word count matrix. If X was previously centered with mean removal, TruncatedSVD is the same as TruncatedPCA. TruncatedSVD is also used in information retrieval tasks, recommendation systems and data compression.
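The link between SVD on centered data and PCA can be checked numerically. The sketch below uses NumPy on the host with synthetic data (the random matrix and seed are illustrative assumptions, not from the cuML docs). It compares the variances derived from the singular values of the centered matrix with the eigenvalues of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

# Mean removal: center every column of X.
Xc = X - X.mean(axis=0)

# SVD of the centered data; S is sorted in descending order.
_, S, VT = np.linalg.svd(Xc, full_matrices=False)
svd_variances = S**2 / (Xc.shape[0] - 1)

# PCA view: eigenvalues of the sample covariance matrix, sorted descending.
cov = np.cov(Xc, rowvar=False)
pca_variances = np.linalg.eigvalsh(cov)[::-1]

print(np.allclose(svd_variances, pca_variances))
```

Without the centering step the two sets of values generally differ, which is why the equivalence only holds after mean removal.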
For additional documentation, see scikit-learn’s TruncatedSVD docs.
Examples
>>> # Both import methods supported
>>> from cuml import TruncatedSVD
>>> from cuml.decomposition import TruncatedSVD

>>> import cudf
>>> import cupy as cp

>>> gdf_float = cudf.DataFrame()
>>> gdf_float['0'] = cp.asarray([1.0,2.0,5.0], dtype=cp.float32)
>>> gdf_float['1'] = cp.asarray([4.0,2.0,1.0], dtype=cp.float32)
>>> gdf_float['2'] = cp.asarray([4.0,2.0,1.0], dtype=cp.float32)

>>> tsvd_float = TruncatedSVD(n_components = 2, algorithm = "jacobi",
...                           n_iter = 20, tol = 1e-9)
>>> tsvd_float.fit(gdf_float)
TruncatedSVD()
>>> print(f'components: {tsvd_float.components_}')
components:           0         1         2
0  0.587259  0.572331  0.572331
1  0.809399 -0.415255 -0.415255
>>> exp_var = tsvd_float.explained_variance_
>>> print(f'explained variance: {exp_var}')
explained variance: 0    0.494...
1    5.505...
dtype: float32
>>> exp_var_ratio = tsvd_float.explained_variance_ratio_
>>> print(f'explained variance ratio: {exp_var_ratio}')
explained variance ratio: 0    0.082...
1    0.917...
dtype: float32
>>> sing_values = tsvd_float.singular_values_
>>> print(f'singular values: {sing_values}')
singular values: 0    7.439...
1    4.081...
dtype: float32

>>> trans_gdf_float = tsvd_float.transform(gdf_float)
>>> print(f'Transformed matrix: {trans_gdf_float}')
Transformed matrix:           0         1
0  5.165910 -2.512643
1  3.463844 -0.042223
2  4.080960  3.216484
>>> input_gdf_float = tsvd_float.inverse_transform(trans_gdf_float)
>>> print(f'Input matrix: {input_gdf_float}')
Input matrix:      0    1    2
0  1.0  4.0  4.0
1  2.0  2.0  2.0
2  5.0  1.0  1.0
- fit(self, X, y=None) -> 'TruncatedSVD'
Fit model on training data X. y is currently ignored.
- Parameters:
- X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the data type is not float or double, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- y: array-like (device or host) shape = (n_samples, 1)
Dense matrix. If the data type is not float or double, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- fit_transform(self, X, y=None, *, convert_dtype=True) -> CumlArray
Fit model to X and perform dimensionality reduction on X. y is currently ignored.
- Parameters:
- X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the data type is not float or double, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- y: array-like (device or host) shape = (n_samples, 1)
Dense matrix. If the data type is not float or double, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtype: bool, optional (default = True)
When set to True, the fit_transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
- Returns:
- trans: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_components)
Reduced version of X
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
- inverse_transform(self, X, *, convert_dtype=False) -> CumlArray
Transform X back to its original space. Returns X_original whose transform would be X.
- Parameters:
- X: array-like (device or host) shape = (n_samples, n_components)
Dense matrix. If the data type is not float or double, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtype: bool, optional (default = False)
When set to True, the inverse_transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
- Returns:
- X_original: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_features)
X in original space
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
- transform(self, X, *, convert_dtype=True) -> CumlArray
Perform dimensionality reduction on X.
- Parameters:
- X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the data type is not float or double, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtype: bool, optional (default = True)
When set to True, the transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
- Returns:
- X_new: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_components)
Reduced version of X
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
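The shapes involved in transform and inverse_transform can be illustrated with a host-side NumPy sketch. It assumes that transform projects X onto the components and that inverse_transform multiplies back by them, as scikit-learn's TruncatedSVD does. Whether cuML computes it in exactly this way internally is not confirmed here. The matrix is the one from the example on this page.

```python
import numpy as np

X = np.array([[1.0, 4.0, 4.0],
              [2.0, 2.0, 2.0],
              [5.0, 1.0, 1.0]])
k = 2

# Top-k right singular vectors, one row per component.
_, _, VT = np.linalg.svd(X, full_matrices=False)
components = VT[:k]                    # shape (k, n_features)

# "transform": project onto the components -> (n_samples, k).
X_reduced = X @ components.T

# "inverse_transform": map back to feature space -> (n_samples, n_features).
X_restored = X_reduced @ components

print(X_reduced.shape, X_restored.shape)
print(np.allclose(X_restored, X))
```

The round trip is exact here only because this particular X has rank 2 (two of its columns are identical), so two components capture it fully. For a general matrix with k below its rank, X_restored is an approximation of X.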