SparseRandomProjection

class cuml.random_projection.SparseRandomProjection(n_components='auto', *, density='auto', eps=0.1, dense_output=False, random_state=None, output_type=None, verbose=False)

Reduce dimensionality through sparse random projection.

A sparse random matrix is an alternative to a dense random projection matrix that guarantees similar embedding quality while being much more memory efficient and allowing faster computation of the projected data.

If we denote s = 1 / density, the components of the random matrix are drawn from:

-sqrt(s) / sqrt(n_components)   with probability 1 / (2s)
 0                              with probability 1 - 1 / s
+sqrt(s) / sqrt(n_components)   with probability 1 / (2s)

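
The distribution above can be sketched in plain Python. This is only an illustration of the sampling rule, not cuML's GPU implementation; the helper name `sample_sparse_projection` and its `seed` argument are hypothetical.

```python
import math
import random


def sample_sparse_projection(n_components, n_features, density, seed=0):
    """Draw an (n_components, n_features) matrix from the three-valued distribution.

    Hypothetical helper for illustration only; returns a list of rows.
    """
    s = 1.0 / density
    scale = math.sqrt(s) / math.sqrt(n_components)
    rng = random.Random(seed)
    values = (-scale, 0.0, scale)
    # Probabilities 1 / (2s), 1 - 1 / s and 1 / (2s), in the same order as `values`.
    weights = (1 / (2 * s), 1 - 1 / s, 1 / (2 * s))
    return [
        rng.choices(values, weights=weights, k=n_features)
        for _ in range(n_components)
    ]


matrix = sample_sparse_projection(n_components=50, n_features=1000, density=0.1)
# The fraction of non-zero entries should be close to the requested density.
nonzero_fraction = sum(v != 0.0 for row in matrix for v in row) / (50 * 1000)
print(nonzero_fraction)
```

With density = 0.1, roughly one entry in ten is non-zero, and every non-zero entry has magnitude sqrt(s) / sqrt(n_components).
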
Parameters:
n_components : int or ‘auto’, default=‘auto’

Dimensionality of the target projection space.

n_components can be automatically adjusted according to the number of samples in the dataset and the bound given by the Johnson-Lindenstrauss lemma. In that case the quality of the embedding is controlled by the eps parameter.

Note that the Johnson-Lindenstrauss lemma can yield very conservative estimates of the required number of components, as it makes no assumption about the structure of the dataset.
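
To get a feel for how conservative the bound can be, here is a minimal sketch. It assumes the bound has the same form as the one used by scikit-learn's `johnson_lindenstrauss_min_dim`, namely n_components >= 4 log(n_samples) / (eps^2 / 2 - eps^3 / 3); the function name `jl_min_dim` and the upper limit on `eps` are assumptions made for this illustration, and cuML's own computation may differ in detail.

```python
import math


def jl_min_dim(n_samples, eps=0.1):
    """Minimum number of components under the assumed Johnson-Lindenstrauss bound."""
    # The range check mirrors scikit-learn's; it is an assumption here.
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must be in the open interval (0, 1)")
    denominator = eps**2 / 2 - eps**3 / 3
    return int(4 * math.log(n_samples) / denominator)


print(jl_min_dim(1_000_000, eps=0.5))  # 663
print(jl_min_dim(1_000_000, eps=0.1))  # 11841
print(jl_min_dim(200, eps=0.1))        # 4541
```

Under this assumed bound, even 200 samples at eps = 0.1 call for several thousand components, which can exceed the original number of features. Setting n_components explicitly is often the more practical choice.
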

density : float or ‘auto’, default=‘auto’

Ratio, in the range (0, 1], of non-zero components in the random projection matrix.

If density = ‘auto’, the value is set to the minimum density as recommended by Ping Li et al.: 1 / sqrt(n_features).
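
As a quick worked example of the ‘auto’ rule, the sketch below computes the density for a given number of features and the expected number of non-zero entries in the projection matrix. The helper name `auto_density` is hypothetical, and the values of `n_features` and `n_components` are chosen only for illustration.

```python
import math


def auto_density(n_features):
    """Density given by the 1 / sqrt(n_features) rule (hypothetical helper)."""
    return 1 / math.sqrt(n_features)


n_features = 1000
n_components = 50

density = auto_density(n_features)
# Each entry is non-zero with probability `density`, so the expected count is:
expected_nonzeros = density * n_components * n_features

print(f"density = {density:.4f}")
print(f"expected non-zeros = {expected_nonzeros:.0f} of {n_components * n_features}")
```

For 1000 features the density is about 0.03, so only around 3% of the matrix entries need to be stored.
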

eps : float, default=0.1

Parameter to control the quality of the embedding according to the Johnson-Lindenstrauss lemma when n_components is set to ‘auto’. This value should be strictly positive.

Smaller values lead to a better embedding and a higher number of dimensions (n_components) in the target projection space.

dense_output : bool, default=False

If True, ensure that the output of the random projection is a dense array even if the input and random projection matrix are both sparse. If False, the projected data uses a sparse representation if the input is sparse.

random_state : int, RandomState instance or None, default=None

Controls the pseudo-random number generator used to generate the projection matrix at fit time.

output_type : {‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

verbose : int or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Attributes:
n_components_ : int

Concrete number of components computed when n_components=“auto”.

components_ : sparse matrix of shape (n_components, n_features)

Random matrix used for the projection.

density_ : float in range 0.0 - 1.0

Concrete density computed when density = “auto”.

n_features_in_ : int

Number of features seen during fit.

Notes

Inspired by Scikit-learn’s implementation: https://scikit-learn.org/stable/modules/random_projection.html

Currently passing a dense array to transform may result in close (but not exactly identical) results due to cupy/cupy#9323.

Examples

>>> from cuml.random_projection import SparseRandomProjection
>>> from cuml.datasets import make_blobs
>>> X, _ = make_blobs(n_samples=200, n_features=1000, random_state=42)
>>> model = SparseRandomProjection(n_components=50, random_state=42)
>>> X_new = model.fit_transform(X)
>>> X_new.shape
(200, 50)