API Reference#
Module Configuration#
Output Data Type Configuration#
- cuml.internals.memory_utils.set_global_output_type(output_type)
Method to set the global output type of cuML’s single-GPU estimators. It is used by all estimators unless overridden at initialization with their own output_type parameter. It can also be overridden by the context manager method using_output_type().
- Parameters
- output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’} (default = ‘input’)
Desired output type of results and attributes of the estimators.
'input' will mean that the parameters and methods will mirror the format of the data sent to the estimators/methods as much as possible. Specifically:
Input type | Output type |
---|---|
cuDF DataFrame or Series | cuDF DataFrame or Series |
NumPy arrays | NumPy arrays |
Pandas DataFrame or Series | NumPy arrays |
Numba device arrays | Numba device arrays |
CuPy arrays | CuPy arrays |
Other __cuda_array_interface__ objs | CuPy arrays |
'cudf' will return cuDF Series for single dimensional results and DataFrames for the rest.
'cupy' will return CuPy arrays.
'numpy' will return NumPy arrays.
Notes
The 'cupy' and 'numba' options (as well as 'input' when using Numba and CuPy ndarrays for input) have the least overhead. cuDF adds memory consumption and processing time needed to build the Series and DataFrames. 'numpy' has the biggest overhead due to the need to transfer data to CPU memory.
Examples
>>> import cuml
>>> import cupy as cp
>>> ary = [[1.0, 4.0, 4.0], [2.0, 2.0, 2.0], [5.0, 1.0, 1.0]]
>>> ary = cp.asarray(ary)
>>> prev_output_type = cuml.global_settings.output_type
>>> cuml.set_global_output_type('cudf')
>>> dbscan_float = cuml.DBSCAN(eps=1.0, min_samples=1)
>>> dbscan_float.fit(ary)
DBSCAN()
>>>
>>> # cuML output type
>>> dbscan_float.labels_
0    0
1    1
2    2
dtype: int32
>>> type(dbscan_float.labels_)
<class 'cudf.core.series.Series'>
>>> cuml.set_global_output_type(prev_output_type)
- cuml.internals.memory_utils.using_output_type(output_type)
Context manager method to set cuML’s global output type inside a with statement. The global output type is reset to its prior value once the with code block has executed.
- Parameters
- output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’} (default = ‘input’)
Desired output type of results and attributes of the estimators.
'input' will mean that the parameters and methods will mirror the format of the data sent to the estimators/methods as much as possible. Specifically:
Input type | Output type |
---|---|
cuDF DataFrame or Series | cuDF DataFrame or Series |
NumPy arrays | NumPy arrays |
Pandas DataFrame or Series | NumPy arrays |
Numba device arrays | Numba device arrays |
CuPy arrays | CuPy arrays |
Other __cuda_array_interface__ objs | CuPy arrays |
'cudf' will return cuDF Series for single dimensional results and DataFrames for the rest.
'cupy' will return CuPy arrays.
'numpy' will return NumPy arrays.
Examples
>>> import cuml
>>> import cupy as cp
>>> ary = [[1.0, 4.0, 4.0], [2.0, 2.0, 2.0], [5.0, 1.0, 1.0]]
>>> ary = cp.asarray(ary)
>>> with cuml.using_output_type('cudf'):
...     dbscan_float = cuml.DBSCAN(eps=1.0, min_samples=1)
...     dbscan_float.fit(ary)
...
...     print("cuML output inside 'with' context")
...     print(dbscan_float.labels_)
...     print(type(dbscan_float.labels_))
...
DBSCAN()
cuML output inside 'with' context
0    0
1    1
2    2
dtype: int32
<class 'cudf.core.series.Series'>
>>> # use cuml again outside the context manager
>>> dbscan_float2 = cuml.DBSCAN(eps=1.0, min_samples=1)
>>> dbscan_float2.fit(ary)
DBSCAN()
>>> # cuML default output
>>> dbscan_float2.labels_
array([0, 1, 2], dtype=int32)
>>> isinstance(dbscan_float2.labels_, cp.ndarray)
True
CPU / GPU Device Selection (Experimental)#
cuML provides experimental support for running selected estimators and operators on either the GPU or CPU. This document covers the set of operators for which CPU/GPU device selection capabilities are supported as of the current nightly packages. If an operator isn’t listed here, it can only be run on the GPU. Prior versions of cuML may have reduced support compared to the following table.
Category | Operator |
---|---|
Clustering | HDBSCAN |
Dimensionality Reduction and Manifold Learning | PCA |
Dimensionality Reduction and Manifold Learning | TruncatedSVD |
Dimensionality Reduction and Manifold Learning | UMAP |
Neighbors | NearestNeighbors |
Regression and Classification | ElasticNet |
Regression and Classification | Lasso |
Regression and Classification | LinearRegression |
Regression and Classification | LogisticRegression |
Regression and Classification | Ridge |
If a CUDA-enabled GPU is available on the system, cuML will default to using it. Users can configure CPU or GPU execution for supported operators via context managers or global configuration.
from cuml.linear_model import Lasso
from cuml.common.device_selection import using_device_type, set_global_device_type

with using_device_type("CPU"):  # Alternatively, using_device_type("GPU")
    model = Lasso()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

# All operators supporting CPU execution will run on the CPU after this configuration
set_global_device_type("CPU")
model = Lasso()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
For more detailed examples, please see the Execution Device Interoperability Notebook in the User Guide.
Verbosity Levels#
cuML follows a verbosity model similar to Scikit-learn’s: The verbose parameter can be a boolean, or a numeric value, and higher numeric values mean more verbosity. The exact values can be set directly, or through the cuml.common.logger module, and they are:
Numeric value | cuml.common.logger value | Verbosity level |
---|---|---|
0 | cuml.common.logger.level_off | Disables all log messages |
1 | cuml.common.logger.level_critical | Enables only critical messages |
2 | cuml.common.logger.level_error | Enables all messages up to and including errors. |
3 | cuml.common.logger.level_warn | Enables all messages up to and including warnings. |
4 or False | cuml.common.logger.level_info | Enables all messages up to and including information messages. |
5 or True | cuml.common.logger.level_debug | Enables all messages up to and including debug messages. |
6 | cuml.common.logger.level_trace | Enables all messages up to and including trace messages. |
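The mapping in the table can be written out as a small lookup. The sketch below is plain Python and does not import cuML. The level names are copied from the table, while `LEVELS` and `resolve_verbosity` are hypothetical names used only for illustration.

```python
# Numeric values and level names as listed in the table above.
LEVELS = {
    0: "level_off",
    1: "level_critical",
    2: "level_error",
    3: "level_warn",
    4: "level_info",
    5: "level_debug",
    6: "level_trace",
}


def resolve_verbosity(verbose):
    """Map a boolean or numeric verbose value to its numeric level."""
    # Booleans are checked first: bool is a subclass of int in Python,
    # so True would otherwise be read as the numeric level 1.
    if verbose is True:
        return 5
    if verbose is False:
        return 4
    if verbose not in LEVELS:
        raise ValueError(f"unknown verbosity value: {verbose!r}")
    return verbose


for value in (False, True, 3):
    level = resolve_verbosity(value)
    print(value, "->", level, LEVELS[level])
```

The explicit boolean check matters because, per the table, `True` means debug (5) and `False` means info (4), not the integers 1 and 0 that Python would otherwise equate them with.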
Preprocessing, Metrics, and Utilities#
Model Selection and Data Splitting#
- cuml.model_selection.train_test_split(X, y=None, test_size: Optional[Union[float, int]] = None, train_size: Optional[Union[float, int]] = None, shuffle: bool = True, random_state: Optional[Union[int, RandomState, RandomState]] = None, stratify=None)
Partitions device data into four collated objects, mimicking Scikit-learn’s train_test_split.
- Parameters
- Xcudf.DataFrame or cuda_array_interface compliant device array
Data to split, has shape (n_samples, n_features)
- ystr, cudf.Series or cuda_array_interface compliant device array
Set of labels for the data, either a series of shape (n_samples) or the string label of a column in X (if it is a cuDF DataFrame) containing the labels
- train_sizefloat or int, optional
If float, represents the proportion [0, 1] of the data to be assigned to the training set. If an int, represents the number of instances to be assigned to the training set. Defaults to 0.8
- shufflebool, optional
Whether or not to shuffle inputs before splitting
- random_stateint, CuPy RandomState or NumPy RandomState, optional
If shuffle is true, seeds the generator. Unseeded by default
- stratifycudf.Series or cuda_array_interface compliant device array, optional
When passed, the input is split using this as the column to stratify on. Default=None
- Returns
- X_train, X_test, y_train, y_testcudf.DataFrame or array-like objects
Partitioned dataframes if X and y were cuDF objects. If y was provided as a column name, the column was dropped from X. Partitioned Numba device arrays if X and y were Numba device arrays. Partitioned CuPy arrays for any other input.
Examples
>>> import cudf
>>> from cuml.model_selection import train_test_split
>>> # Generate some sample data
>>> df = cudf.DataFrame({'x': range(10),
...                      'y': [0, 1] * 5})
>>> print(f'Original data: {df.shape[0]} elements')
Original data: 10 elements
>>> # Suppose we want an 80/20 split
>>> X_train, X_test, y_train, y_test = train_test_split(df, 'y',
...                                                     train_size=0.8)
>>> print(f'X_train: {X_train.shape[0]} elements')
X_train: 8 elements
>>> print(f'X_test: {X_test.shape[0]} elements')
X_test: 2 elements
>>> print(f'y_train: {y_train.shape[0]} elements')
y_train: 8 elements
>>> print(f'y_test: {y_test.shape[0]} elements')
y_test: 2 elements
>>> # Alternatively, if our labels are stored separately
>>> labels = df['y']
>>> df = df.drop(['y'], axis=1)
>>> # we can also do
>>> X_train, X_test, y_train, y_test = train_test_split(df, labels,
...                                                     train_size=0.8)
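The two readings of train_size (a proportion when it is a float, a count when it is an int) can be sketched without a GPU. `split_sizes` is a hypothetical helper, and truncating the fractional count with `int()` is an assumption of this sketch; the rounding rule cuML applies is not stated above.

```python
def split_sizes(n_samples, train_size=0.8):
    """Return (n_train, n_test) for a float proportion or an int count."""
    if isinstance(train_size, float):
        if not 0.0 <= train_size <= 1.0:
            raise ValueError("a float train_size must lie in [0, 1]")
        # Assumption: the fractional count is truncated toward zero.
        n_train = int(n_samples * train_size)
    else:
        n_train = int(train_size)
        if not 0 <= n_train <= n_samples:
            raise ValueError("an int train_size must lie in [0, n_samples]")
    return n_train, n_samples - n_train


print(split_sizes(10, 0.8))  # proportion of the data
print(split_sizes(10, 7))    # absolute number of training rows
```

With 10 rows and train_size=0.8 the sketch gives the same 8/2 partition sizes as the example above.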
Feature and Label Encoding (Single-GPU)#
- class cuml.preprocessing.LabelEncoder.LabelEncoder(*, handle_unknown='error', handle=None, verbose=False, output_type=None)
An nvcategory based implementation of ordinal label encoding
- Parameters
- handle_unknown{‘error’, ‘ignore’}, default=’error’
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform or inverse transform, the resulting encoding will be null.
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
Examples
Converting a categorical implementation to a numerical one
>>> from cudf import DataFrame, Series
>>> from cuml.preprocessing import LabelEncoder
>>> data = DataFrame({'category': ['a', 'b', 'c', 'd']})

>>> # There are two functionally equivalent ways to do this
>>> le = LabelEncoder()
>>> le.fit(data.category)  # le = le.fit(data.category) also works
LabelEncoder()
>>> encoded = le.transform(data.category)

>>> print(encoded)
0    0
1    1
2    2
3    3
dtype: uint8

>>> # This method is preferred
>>> le = LabelEncoder()
>>> encoded = le.fit_transform(data.category)

>>> print(encoded)
0    0
1    1
2    2
3    3
dtype: uint8

>>> # We can assign this to a new column
>>> data = data.assign(encoded=encoded)
>>> print(data.head())
  category  encoded
0        a        0
1        b        1
2        c        2
3        d        3

>>> # We can also encode more data
>>> test_data = Series(['c', 'a'])
>>> encoded = le.transform(test_data)
>>> print(encoded)
0    2
1    0
dtype: uint8

>>> # After train, ordinal label can be inverse_transform() back to
>>> # string labels
>>> ord_label = Series([0, 0, 1, 2, 1])
>>> str_label = le.inverse_transform(ord_label)
>>> print(str_label)
0    a
1    a
2    b
3    c
4    b
dtype: object
Methods
Method | Description |
---|---|
fit(y[, _classes]) | Fit a LabelEncoder (nvcategory) instance to a set of categories |
fit_transform(y[, z]) | Simultaneously fit and transform an input |
get_param_names() | Returns a list of hyperparameter names owned by this class. |
inverse_transform(y) | Revert ordinal label to original label |
transform(y) | Transform an input into its categorical keys. |
- fit(y, _classes=None)
Fit a LabelEncoder (nvcategory) instance to a set of categories
- Parameters
- ycudf.Series, pandas.Series, cupy.ndarray or numpy.ndarray
Series containing the categories to be encoded. Its elements may or may not be unique
- _classesint or None
Passed by the dask client when dask LabelEncoder is used.
- Returns
- selfLabelEncoder
A fitted instance of itself to allow method chaining
- fit_transform(y, z=None) → Series
Simultaneously fit and transform an input
This is functionally equivalent to (but faster than) LabelEncoder().fit(y).transform(y)
- get_param_names()
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.
- inverse_transform(y: Series) → Series
Revert ordinal label to original label
- Parameters
- ycudf.Series, pandas.Series, cupy.ndarray or numpy.ndarray
dtype=int32 Ordinal labels to be reverted
- Returns
- revertedthe same type as y
Reverted labels
- transform(y) → Series
Transform an input into its categorical keys.
This is intended for use with small inputs relative to the size of the dataset. For fitting and transforming an entire dataset, prefer fit_transform.
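As an illustration of what ordinal label encoding does, here is a minimal plain-Python sketch of the fit, transform, and inverse_transform steps, including the handle_unknown option. It is not the cuML implementation. The function names are hypothetical, and assigning codes in sorted order is an assumption that happens to agree with the example above ('a' maps to 0, 'b' to 1, and so on).

```python
def fit_classes(values):
    """Collect the distinct categories; position in the list is the code."""
    return sorted(set(values))


def transform_labels(classes, values, handle_unknown="error"):
    """Replace each value by the index of its category."""
    index = {c: i for i, c in enumerate(classes)}
    encoded = []
    for value in values:
        if value in index:
            encoded.append(index[value])
        elif handle_unknown == "ignore":
            encoded.append(None)  # unknown categories become null
        else:
            raise KeyError(f"unseen category: {value!r}")
    return encoded


def inverse_transform_labels(classes, codes):
    """Map ordinal codes back to the original categories."""
    return [classes[code] for code in codes]


classes = fit_classes(["a", "b", "c", "d"])
print(transform_labels(classes, ["c", "a"]))
print(inverse_transform_labels(classes, [0, 0, 1, 2, 1]))
print(transform_labels(classes, ["z"], handle_unknown="ignore"))
```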
- class cuml.preprocessing.LabelBinarizer(*, neg_label=0, pos_label=1, sparse_output=False, handle=None, verbose=False, output_type=None)
A multi-class dummy encoder for labels.
- Parameters
- neg_labelinteger (default=0)
label to be used as the negative binary label
- pos_labelinteger (default=1)
label to be used as the positive binary label
- sparse_outputbool (default=False)
whether to return sparse arrays for transformed output
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
Examples
Create an array with labels and dummy encode them
>>> import cupy as cp
>>> import cupyx
>>> from cuml.preprocessing import LabelBinarizer
>>> labels = cp.asarray([0, 5, 10, 7, 2, 4, 1, 0, 0, 4, 3, 2, 1],
...                     dtype=cp.int32)
>>> lb = LabelBinarizer()
>>> encoded = lb.fit_transform(labels)
>>> print(str(encoded))
[[1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 1 0]
 [0 0 1 0 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 1 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 0 0 1 0 0 0 0]
 [0 0 1 0 0 0 0 0]
 [0 1 0 0 0 0 0 0]]
>>> decoded = lb.inverse_transform(encoded)
>>> print(str(decoded))
[ 0  5 10  7  2  4  1  0  0  4  3  2  1]
- Attributes
- classes_
Methods
Method | Description |
---|---|
fit(y) | Fit label binarizer |
fit_transform(y) | Fit label binarizer and transform multi-class labels to their dummy-encoded representation. |
get_param_names() | Returns a list of hyperparameter names owned by this class. |
inverse_transform(y[, threshold]) | Transform binary labels back to original multi-class labels |
transform(y) | Transform multi-class labels to their dummy-encoded representation labels. |
- fit(y) → LabelBinarizer
Fit label binarizer
- Parameters
- yarray of shape [n_samples,] or [n_samples, n_classes]
Target values. A 2-d matrix should contain only 0 and 1 and represents multilabel classification.
- Returns
- selfreturns an instance of self.
- fit_transform(y) → SparseCumlArray
Fit label binarizer and transform multi-class labels to their dummy-encoded representation.
- Parameters
- yarray of shape [n_samples,] or [n_samples, n_classes]
- Returns
- arrarray with encoded labels
- get_param_names()
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.
- cuml.preprocessing.label_binarize(y, classes, neg_label=0, pos_label=1, sparse_output=False) → SparseCumlArray
A stateless helper function to dummy encode multi-class labels.
- Parameters
- yarray-like of size [n_samples,] or [n_samples, n_classes]
- classesthe set of unique classes in the input
- neg_labelinteger the negative value for transformed output
- pos_labelinteger the positive value for transformed output
- sparse_outputbool whether to return sparse array
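To make the roles of classes, neg_label, and pos_label concrete, the following plain-Python sketch dummy-encodes a list of labels into dense rows. It only illustrates the encoding and is not how cuML computes it. `label_binarize_sketch` is a hypothetical name, and sparse output is not modelled.

```python
def label_binarize_sketch(y, classes, neg_label=0, pos_label=1):
    """Return one row per label with pos_label in the label's class column."""
    column = {c: j for j, c in enumerate(classes)}
    rows = []
    for label in y:
        row = [neg_label] * len(classes)
        row[column[label]] = pos_label
        rows.append(row)
    return rows


labels = [0, 5, 10, 7, 2]
classes = sorted(set(labels))  # one column per distinct label
encoded = label_binarize_sketch(labels, classes)
for row in encoded:
    print(row)
```

Because the function is stateless, the caller supplies classes directly instead of fitting an estimator first.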
- class cuml.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse='deprecated', sparse_output=True, dtype=<class 'numpy.float32'>, handle_unknown='error', handle=None, verbose=False, output_type=None)
Encode categorical features as a one-hot numeric array. The input to this estimator should be a cuDF.DataFrame or a cupy.ndarray, denoting the unique values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse_output parameter).
By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.
Note
A one-hot encoding of y labels should use a LabelBinarizer instead.
- Parameters
- categories‘auto’, a cupy.ndarray or a cudf.DataFrame, default=’auto’
Categories (unique values) per feature:
‘auto’ : Determine categories automatically from the training data.
DataFrame/ndarray : categories[col] holds the categories expected in the feature col.
- drop‘first’, None, a dict or a list, default=None
Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.
None : retain all features (the default).
‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.
dict/list : drop[col] is the category in feature col that should be dropped.
- sparse_outputbool, default=True
This feature is not fully supported by cupy yet, causing incorrect values when computing one hot encodings. See cupy/cupy#3223
New in version 24.06: sparse was renamed to sparse_output
- sparsebool, default=True
Will return sparse matrix if set True else will return an array.
Deprecated since version 24.06: sparse is deprecated in 24.06 and will be removed in 25.08. Use sparse_output instead.
- dtypenumber type, default=np.float32
Desired datatype of transform’s output.
- handle_unknown{‘error’, ‘ignore’}, default=’error’
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
- Attributes
- drop_idx_array of shape (n_features,)
drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature. None if all the transformed features will be retained.
Methods
Method | Description |
---|---|
fit(X[, y]) | Fit OneHotEncoder to X. |
fit_transform(X[, y]) | Fit OneHotEncoder to X, then transform X. Equivalent to fit(X).transform(X). |
get_feature_names([input_features]) | Return feature names for output features. |
get_param_names() | Returns a list of hyperparameter names owned by this class. |
inverse_transform(X) | Convert the data back to the original representation. |
transform(X) | Transform X using one-hot encoding. |
- fit(X, y=None)
Fit OneHotEncoder to X.
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yNone
Ignored. This parameter exists for compatibility only.
- fit_transform(X, y=None)
Fit OneHotEncoder to X, then transform X. Equivalent to fit(X).transform(X).
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yNone
Ignored. This parameter exists for compatibility only.
- Returns
- X_outsparse matrix if sparse=True else a 2-d array
Transformed input.
- get_feature_names(input_features=None)
Return feature names for output features.
- Parameters
- input_featureslist of str of shape (n_features,)
String names for input features if available. By default, “x0”, “x1”, … “xn_features” is used.
- Returns
- output_feature_namesndarray of shape (n_output_features,)
Array of feature names.
- get_param_names()
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.
- inverse_transform(X)
Convert the data back to the original representation. In case unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category.
The return type is the same as the type of the input used by the first call to fit on this estimator instance.
- Parameters
- Xarray-like or sparse matrix, shape [n_samples, n_encoded_features]
The transformed data.
- Returns
- X_trcudf.DataFrame or cupy.ndarray
Inverse transformed array.
- transform(X)
Transform X using one-hot encoding.
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- Returns
- X_outsparse matrix if sparse=True else a 2-d array
Transformed input.
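The effect of the drop and handle_unknown options is easier to see on a tiny dense example. The sketch below is plain Python and is not the cuML implementation. `fit_categories` and `one_hot` are hypothetical names, sorted category order is an assumption, only the ‘first’ drop mode is modelled, and the output is dense rather than sparse.

```python
def fit_categories(rows):
    """Derive the categories of each feature from the training rows."""
    n_features = len(rows[0])
    return [sorted({row[j] for row in rows}) for j in range(n_features)]


def one_hot(rows, categories, drop_first=False, handle_unknown="error"):
    """Encode each feature as one binary column per kept category."""
    encoded = []
    for row in rows:
        out = []
        for value, cats in zip(row, categories):
            if value not in cats and handle_unknown != "ignore":
                raise ValueError(f"unknown category: {value!r}")
            # drop='first' removes the first category of every feature.
            kept = cats[1:] if drop_first else cats
            # An unknown value matches no category, so its columns stay zero.
            out.extend(1.0 if value == c else 0.0 for c in kept)
        encoded.append(out)
    return encoded


train = [["red", "S"], ["blue", "M"], ["red", "M"]]
cats = fit_categories(train)
print(one_hot(train, cats))
print(one_hot(train, cats, drop_first=True))
print(one_hot([["green", "S"]], cats, handle_unknown="ignore"))
```

With drop_first=True, a feature that has a single category keeps no columns at all, which matches the description of ‘first’ above.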
- class cuml.preprocessing.TargetEncoder.TargetEncoder(n_folds=4, smooth=0, seed=42, split_method='interleaved', output_type='auto', stat='mean')
A cudf based implementation of target encoding [1], which replaces one or multiple categorical variables, ‘Xs’, with the average of the corresponding values of the target variable, ‘Y’. The input data is grouped by the columns Xs and the aggregated mean value of Y of each group is calculated to replace each value of Xs. Several optimizations are applied to prevent label leakage and parallelize the execution.
- Parameters
- n_foldsint (default=4)
Default number of folds for fitting training data. To prevent label leakage in fit, we split data into n_folds and encode one fold using the target variables of the remaining folds.
- smoothint or float (default=0)
Count of samples to smooth the encoding. 0 means no smoothing.
- seedint (default=42)
Random seed
- split_method{‘random’, ‘continuous’, ‘interleaved’}, (default=’interleaved’)
Method to split train data into n_folds. ‘random’: random split. ‘continuous’: consecutive samples are grouped into one fold. ‘interleaved’: samples are assigned to each fold in a round-robin way. ‘customize’: customize splitting by providing a fold_ids array in the fit() or fit_transform() functions.
- output_type{‘cupy’, ‘numpy’, ‘auto’}, default = ‘auto’
The data type of output. If ‘auto’, it matches input data.
- stat{‘mean’,’var’,’median’}, default = ‘mean’
The statistic used in encoding, mean, variance or median of the target.
References
Examples
Converting a categorical implementation to a numerical one
>>> from cudf import DataFrame, Series
>>> from cuml.preprocessing import TargetEncoder
>>> train = DataFrame({'category': ['a', 'b', 'b', 'a'],
...                    'label': [1, 0, 1, 1]})
>>> test = DataFrame({'category': ['a', 'c', 'b', 'a']})

>>> encoder = TargetEncoder()
>>> train_encoded = encoder.fit_transform(train.category, train.label)
>>> test_encoded = encoder.transform(test.category)
>>> print(train_encoded)
[1. 1. 0. 1.]
>>> print(test_encoded)
[1.   0.75 0.5  1.  ]
Methods
Method | Description |
---|---|
fit(x, y[, fold_ids]) | Fit a TargetEncoder instance to a set of categories |
fit_transform(x, y[, fold_ids]) | Simultaneously fit and transform an input |
get_params([deep]) | Returns a dict of all params owned by this class. |
transform(x) | Transform an input into its categorical keys. |
get_param_names | |
- fit(x, y, fold_ids=None)
Fit a TargetEncoder instance to a set of categories
- Parameters
- xcudf.Series or cudf.DataFrame or cupy.ndarray
Categories to be encoded. Its elements may or may not be unique
- ycudf.Series or cupy.ndarray
Series containing the target variable.
- fold_idscudf.Series or cupy.ndarray
Series containing the indices of the customized folds. Its values should be integers in range [0, N-1] to split data into N folds. If None, fold_ids is generated based on split_method.
- Returns
- selfTargetEncoder
A fitted instance of itself to allow method chaining
- fit_transform(x, y, fold_ids=None)
Simultaneously fit and transform an input
This is functionally equivalent to (but faster than) TargetEncoder().fit(x, y).transform(x)
- Parameters
- xcudf.Series or cudf.DataFrame or cupy.ndarray
Categories to be encoded. Its elements may or may not be unique
- ycudf.Series or cupy.ndarray
Series containing the target variable.
- fold_idscudf.Series or cupy.ndarray
Series containing the indices of the customized folds. Its values should be integers in range [0, N-1] to split data into N folds. If None, fold_ids is generated based on split_method.
- Returns
- encodedcupy.ndarray
The ordinally encoded input series
- transform(x)
Transform an input into its categorical keys.
This is intended for test data. For fitting and transforming the training data, prefer fit_transform.
- Parameters
- xcudf.Series
Input keys to be transformed. Its values don’t have to match the categories given to fit
- Returns
- encodedcupy.ndarray
The ordinally encoded input series
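The fold-based scheme described for TargetEncoder can be traced by hand with a plain-Python sketch. It covers only stat='mean', smooth=0, and the ‘interleaved’ split, and it is not the cuML implementation. The function names are hypothetical, and falling back to the overall mean for a category that was not seen is an assumption of this sketch, chosen because it agrees with the 0.75 produced for category 'c' in the example above.

```python
def target_encode_fit_transform(categories, targets, n_folds=4):
    """Encode each training row using targets from the other folds only."""
    n = len(categories)
    fold_ids = [i % n_folds for i in range(n)]  # round-robin assignment
    encoded = []
    for i in range(n):
        rest = [j for j in range(n) if fold_ids[j] != fold_ids[i]]
        same = [targets[j] for j in rest if categories[j] == categories[i]]
        # Assumption: fall back to the mean of the remaining folds when the
        # category does not occur in them.
        pool = same if same else [targets[j] for j in rest]
        if not pool:
            raise ValueError("need at least two non-empty folds")
        encoded.append(sum(pool) / len(pool))
    return encoded


def target_encode_transform(train_categories, train_targets, new_categories):
    """Encode new rows with per-category means over all training rows."""
    sums, counts = {}, {}
    for cat, target in zip(train_categories, train_targets):
        sums[cat] = sums.get(cat, 0) + target
        counts[cat] = counts.get(cat, 0) + 1
    global_mean = sum(train_targets) / len(train_targets)
    return [sums[c] / counts[c] if c in counts else global_mean
            for c in new_categories]


train_cat, train_label = ["a", "b", "b", "a"], [1, 0, 1, 1]
train_encoded = target_encode_fit_transform(train_cat, train_label)
test_encoded = target_encode_transform(train_cat, train_label, ["a", "c", "b", "a"])
print(train_encoded)
print(test_encoded)
```

Each training row is encoded without ever looking at its own target, which is how the fold split prevents label leakage.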
Feature Scaling and Normalization (Single-GPU)#
- class cuml.preprocessing.MaxAbsScaler(*args, **kwargs)
Scale each feature by its maximum absolute value.
This estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
This scaler can also be applied to sparse CSR or CSC matrices.
- Parameters
- copyboolean, optional, default is True
Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
See also
maxabs_scale
Equivalent function without the estimator API.
Notes
NaNs are treated as missing values: disregarded in fit, and maintained in transform.
Examples
>>> from cuml.preprocessing import MaxAbsScaler
>>> import cupy as cp
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X = cp.array(X)
>>> transformer = MaxAbsScaler().fit(X)
>>> transformer
MaxAbsScaler()
>>> transformer.transform(X)
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
- Attributes
- scale_ndarray, shape (n_features,)
Per feature relative scaling of the data.
- max_abs_ndarray, shape (n_features,)
Per feature maximum absolute value.
- n_samples_seen_int
The number of samples processed by the estimator. Will be reset on new calls to fit, but increments across partial_fit calls.
Methods
Method | Description |
---|---|
fit(X[, y]) | Compute the maximum absolute value to be used for later scaling. |
get_param_names() | Returns a list of hyperparameter names owned by this class. |
inverse_transform(X) | Scale back the data to the original representation |
partial_fit(X[, y]) | Online computation of max absolute value of X for later scaling. |
transform(X) | Scale the data |
- fit(X, y=None) → MaxAbsScaler
Compute the maximum absolute value to be used for later scaling.
- Parameters
- X{array-like, sparse matrix}, shape [n_samples, n_features]
The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.
- get_param_names()
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.
- inverse_transform(X) → SparseCumlArray
Scale back the data to the original representation
- Parameters
- X{array-like, sparse matrix}
The data that should be transformed back.
- partial_fit(X, y=None) → MaxAbsScaler
Online computation of max absolute value of X for later scaling.
All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to a very large number of n_samples or because X is read from a continuous stream.
- Parameters
- X{array-like, sparse matrix}, shape [n_samples, n_features]
The data used to compute the mean and standard deviation used for later scaling along the features axis.
- yNone
Ignored.
- Returns
- selfobject
Transformer instance.
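The idea behind partial_fit, keeping a running per-feature maximum absolute value across batches, can be sketched in plain Python. `MaxAbsSketch` is a hypothetical class for illustration and not the cuML estimator. Leaving all-zero columns unscaled is an assumption of the sketch, and NaN handling is not modelled.

```python
class MaxAbsSketch:
    """Running max-abs scaler over lists of rows."""

    def __init__(self):
        self.max_abs_ = None
        self.n_samples_seen_ = 0

    def partial_fit(self, batch):
        """Fold one batch into the running per-feature maximum."""
        n_features = len(batch[0])
        batch_max = [max(abs(row[j]) for row in batch) for j in range(n_features)]
        if self.max_abs_ is None:
            self.max_abs_ = batch_max
        else:
            self.max_abs_ = [max(a, b) for a, b in zip(self.max_abs_, batch_max)]
        self.n_samples_seen_ += len(batch)
        return self

    def transform(self, rows):
        """Divide every feature by its maximum absolute value."""
        # Assumption: a column whose maximum is 0 is left unscaled.
        return [[v / (m if m != 0 else 1.0) for v, m in zip(row, self.max_abs_)]
                for row in rows]


X = [[1.0, -1.0, 2.0], [2.0, 0.0, 0.0], [0.0, 1.0, -1.0]]
scaler = MaxAbsSketch().partial_fit(X[:2]).partial_fit(X[2:])
scaled = scaler.transform(X)
print(scaler.max_abs_, scaler.n_samples_seen_)
print(scaled)
```

Feeding the data in two batches gives the same scaled values as the single fit call in the MaxAbsScaler example.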
- class cuml.preprocessing.MinMaxScaler(*args, **kwargs)
Transform features by scaling each feature to a given range.
This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.
The transformation is given by:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
where min, max = feature_range.
This transformation is often used as an alternative to zero mean, unit variance scaling.
- Parameters
- feature_rangetuple (min, max), default=(0, 1)
Desired range of transformed data.
- copybool, default=True
Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
See also
minmax_scale
Equivalent function without the estimator API.
Notes
NaNs are treated as missing values: disregarded in fit, and maintained in transform.
Examples
>>> from cuml.preprocessing import MinMaxScaler
>>> import cupy as cp
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> data = cp.array(data)
>>> scaler = MinMaxScaler()
>>> print(scaler.fit(data))
MinMaxScaler()
>>> print(scaler.data_max_)
[ 1. 18.]
>>> print(scaler.transform(data))
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
>>> print(scaler.transform(cp.array([[2, 2]])))
[[1.5 0. ]]
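The X_std and X_scaled formulas given for this class can be checked by hand with a plain-Python sketch on the same data. `min_max_fit` and `min_max_transform` are hypothetical helper names. The sketch does not handle a constant column, where the denominator would be zero.

```python
def min_max_fit(rows):
    """Return the per-feature minimum and maximum of the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def min_max_transform(rows, data_min, data_max, feature_range=(0.0, 1.0)):
    """Apply X_std = (X - min) / (max - min), then rescale to feature_range."""
    lo, hi = feature_range
    return [[(x - mn) / (mx - mn) * (hi - lo) + lo
             for x, mn, mx in zip(row, data_min, data_max)]
            for row in rows]


data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
data_min, data_max = min_max_fit(data)
scaled = min_max_transform(data, data_min, data_max)
unseen = min_max_transform([[2, 2]], data_min, data_max)
print(scaled)
print(unseen)
```

The value 1.5 for the unseen row shows that inputs outside the training range are not clipped to feature_range by this formula.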
- Attributes
- min_ndarray of shape (n_features,)
Per feature adjustment for minimum. Equivalent to
min - X.min(axis=0) * self.scale_
- scale_ndarray of shape (n_features,)
Per feature relative scaling of the data. Equivalent to
(max - min) / (X.max(axis=0) - X.min(axis=0))
- data_min_ndarray of shape (n_features,)
Per feature minimum seen in the data
- data_max_ndarray of shape (n_features,)
Per feature maximum seen in the data
- data_range_ndarray of shape (n_features,)
Per feature range (data_max_ - data_min_) seen in the data.
- n_samples_seen_int
The number of samples processed by the estimator. It will be reset on new calls to fit, but increments across partial_fit calls.
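The relationship between these attributes and the transformation formula can be checked with plain NumPy. This is a sketch of the arithmetic only, under the assumption that the fitted attributes follow the "Equivalent to" expressions given above; the variable names are illustrative and no cuML call is made.

```python
import numpy as np

X = np.array([[-1.0, 2.0], [-0.5, 6.0], [0.0, 10.0], [1.0, 18.0]])
feature_min, feature_max = 0.0, 1.0  # stands in for feature_range

# Statistics that fit() is documented to store.
data_min = X.min(axis=0)
data_max = X.max(axis=0)
data_range = data_max - data_min
scale = (feature_max - feature_min) / data_range   # scale_
min_adj = feature_min - data_min * scale            # min_

def transform(A):
    """Apply the affine map implied by scale_ and min_."""
    return A * scale + min_adj

# The two-step formula from the class description.
X_std = (X - data_min) / data_range
X_scaled = X_std * (feature_max - feature_min) + feature_min
```

Both routes give the same result on the training data, and values outside the fitted range (such as [[2, 2]]) map outside feature_range, which matches the 1.5 in the example above.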
Methods
fit(X[, y]): Compute the minimum and maximum to be used for later scaling.
get_param_names(): Returns a list of hyperparameter names owned by this class.
inverse_transform(X): Undo the scaling of X according to feature_range.
partial_fit(X[, y]): Online computation of min and max on X for later scaling.
transform(X): Scale features of X according to feature_range.
- fit(X, y=None) MinMaxScaler [source]#
Compute the minimum and maximum to be used for later scaling.
- Parameters
- Xarray-like of shape (n_samples, n_features)
The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.
- yNone
Ignored.
- Returns
- selfobject
Fitted scaler.
- get_param_names()[source]#
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in turn owns. This is to simplify the implementation of the get_params and set_params methods.
- inverse_transform(X) CumlArray [source]#
Undo the scaling of X according to feature_range.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input data that will be transformed. It cannot be sparse.
- Returns
- Xtarray-like of shape (n_samples, n_features)
Transformed data.
- partial_fit(X, y=None) MinMaxScaler [source]#
Online computation of min and max on X for later scaling.
All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to a very large number of n_samples or because X is read from a continuous stream.
- Parameters
- Xarray-like of shape (n_samples, n_features)
The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.
- yNone
Ignored.
- Returns
- selfobject
Transformer instance.
- class cuml.preprocessing.Normalizer(*args, **kwargs)[source]#
Normalize samples individually to unit norm.
Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1, l2 or inf) equals one.
This transformer is able to work with both dense numpy arrays and sparse matrices.
Scaling inputs to unit norms is a common operation for text classification or clustering. For instance, the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors and is the base similarity metric for the Vector Space Model commonly used by the Information Retrieval community.
- Parameters
- norm‘l1’, ‘l2’, or ‘max’, optional (‘l2’ by default)
The norm to use to normalize each non zero sample. If norm=’max’ is used, values will be rescaled by the maximum of the absolute values.
- copyboolean, optional, default True
Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
See also
normalize
Equivalent function without the estimator API.
Notes
This estimator is stateless (besides constructor parameters), the fit method does nothing but is useful when used in a pipeline.
Examples
>>> from cuml.preprocessing import Normalizer
>>> import cupy as cp
>>> X = [[4, 1, 2, 2],
...      [1, 3, 9, 3],
...      [5, 7, 5, 1]]
>>> X = cp.array(X)
>>> transformer = Normalizer().fit(X) # fit does nothing.
>>> transformer
Normalizer()
>>> transformer.transform(X)
array([[0.8, 0.2, 0.4, 0.4],
       [0.1, 0.3, 0.9, 0.3],
       [0.5, 0.7, 0.5, 0.1]])
Methods
fit(X[, y]): Do nothing and return the estimator unchanged.
transform(X[, copy]): Scale each non zero row of X to unit norm.
- fit(X, y=None) Normalizer [source]#
Do nothing and return the estimator unchanged
This method is just there to implement the usual API and hence work in pipelines.
- Parameters
- X{array-like, CSR matrix}
- transform(X, copy=None) SparseCumlArray [source]#
Scale each non zero row of X to unit norm
- Parameters
- X{array-like, CSR matrix}, shape [n_samples, n_features]
The data to normalize, row by row.
- copybool, optional (default: None)
Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
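To make the three norm options concrete, the following NumPy sketch rescales each row by its l1, l2, or max norm. It is an illustration of the arithmetic only; normalize_rows is a hypothetical helper, not a cuML function, and it handles dense input only.

```python
import numpy as np

def normalize_rows(X, norm="l2"):
    """Rescale each non-zero row of X to unit norm (dense input only)."""
    X = np.asarray(X, dtype=float)
    if norm == "l1":
        norms = np.abs(X).sum(axis=1)
    elif norm == "l2":
        norms = np.sqrt((X ** 2).sum(axis=1))
    elif norm == "max":
        norms = np.abs(X).max(axis=1)
    else:
        raise ValueError(f"unsupported norm: {norm!r}")
    norms[norms == 0.0] = 1.0  # all-zero rows are left unchanged
    return X / norms[:, None]

X = np.array([[4, 1, 2, 2], [1, 3, 9, 3], [5, 7, 5, 1]])
X_l2 = normalize_rows(X, "l2")
X_l1 = normalize_rows(X, "l1")
X_max = normalize_rows(X, "max")
```

With norm="l2" this reproduces the values in the Normalizer example above; with "l1" each row's absolute values sum to one, and with "max" the largest absolute entry in each row becomes one.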
- class cuml.preprocessing.RobustScaler(*args, **kwargs)[source]#
Scale features using statistics that are robust to outliers.
This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform method.
Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.
- Parameters
- with_centeringboolean, default=True
If True, center the data before scaling. This will cause transform to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
- with_scalingboolean, default=True
If True, scale the data to interquartile range.
- quantile_rangetuple (q_min, q_max), 0.0 < q_min < q_max < 100.0
Default: (25.0, 75.0) = (1st quartile, 3rd quartile) = IQR. Quantile range used to calculate scale_.
- copyboolean, optional, default=True
Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
See also
robust_scale
Equivalent function without the estimator API.
cuml.decomposition.PCA
Further removes the linear correlation across features with whiten=True.
Examples
>>> from cuml.preprocessing import RobustScaler
>>> import cupy as cp
>>> X = [[ 1., -2.,  2.],
...      [ -2.,  1.,  3.],
...      [ 4.,  1., -2.]]
>>> X = cp.array(X)
>>> transformer = RobustScaler().fit(X)
>>> transformer
RobustScaler()
>>> transformer.transform(X)
array([[ 0. , -2. ,  0. ],
       [-1. ,  0. ,  0.4],
       [ 1. ,  0. , -1.6]])
- Attributes
- center_array of floats
The median value for each feature in the training set.
- scale_array of floats
The (scaled) interquartile range for each feature in the training set.
Methods
fit(X[, y]): Compute the median and quantiles to be used for scaling.
get_param_names(): Returns a list of hyperparameter names owned by this class.
inverse_transform(X): Scale back the data to the original representation.
transform(X): Center and scale the data.
- fit(X, y=None) RobustScaler [source]#
Compute the median and quantiles to be used for scaling.
- Parameters
- X{array-like, CSC matrix}, shape [n_samples, n_features]
The data used to compute the median and quantiles used for later scaling along the features axis.
- get_param_names()[source]#
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in turn owns. This is to simplify the implementation of the get_params and set_params methods.
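The centering and scaling performed by RobustScaler can be reproduced with NumPy as a sanity check. This sketch assumes NumPy's default linear interpolation for percentiles; whether cuML computes quantiles in exactly the same way is an assumption, and the variable names are illustrative.

```python
import numpy as np

X = np.array([[1.0, -2.0, 2.0], [-2.0, 1.0, 3.0], [4.0, 1.0, -2.0]])
q_min, q_max = 25.0, 75.0  # stands in for quantile_range

# Per-feature median (center_) and quantile range (scale_).
center = np.median(X, axis=0)
q_low, q_high = np.percentile(X, [q_min, q_max], axis=0)
scale = q_high - q_low

X_robust = (X - center) / scale
```

On this input the result matches the transform output shown in the RobustScaler example above.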
- class cuml.preprocessing.StandardScaler(*args, **kwargs)[source]#
Standardize features by removing the mean and scaling to unit variance
The standard score of a sample x is calculated as:
z = (x - u) / s
where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform().
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance of the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data.
- Parameters
- copyboolean, optional, default True
Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
- with_meanboolean, True by default
If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
- with_stdboolean, True by default
If True, scale the data to unit variance (or equivalently, unit standard deviation).
See also
scale
Equivalent function without the estimator API.
cuml.decomposition.PCA
Further removes the linear correlation across features with ‘whiten=True’.
Notes
NaNs are treated as missing values: disregarded in fit, and maintained in transform.
We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.
Examples
>>> from cuml.preprocessing import StandardScaler
>>> import cupy as cp
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> data = cp.array(data)
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler()
>>> print(scaler.mean_)
[0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform(cp.array([[2, 2]])))
[[3. 3.]]
- Attributes
- scale_ndarray or None, shape (n_features,)
Per feature relative scaling of the data. This is calculated using sqrt(var_). Equal to None when with_std=False.
- mean_ndarray or None, shape (n_features,)
The mean value for each feature in the training set. Equal to None when with_mean=False.
- var_ndarray or None, shape (n_features,)
The variance for each feature in the training set. Used to compute scale_. Equal to None when with_std=False.
- n_samples_seen_int or array, shape (n_features,)
The number of samples processed by the estimator for each feature. If there are no missing samples, n_samples_seen_ will be an integer, otherwise it will be an array. Will be reset on new calls to fit, but increments across partial_fit calls.
Methods
fit(X[, y]): Compute the mean and std to be used for later scaling.
get_param_names(): Returns a list of hyperparameter names owned by this class.
inverse_transform(X[, copy]): Scale back the data to the original representation.
partial_fit(X[, y]): Online computation of mean and std on X for later scaling.
transform(X[, copy]): Perform standardization by centering and scaling.
- fit(X, y=None) StandardScaler [source]#
Compute the mean and std to be used for later scaling.
- Parameters
- X{array-like, sparse matrix}, shape [n_samples, n_features]
The data used to compute the mean and standard deviation used for later scaling along the features axis.
- yNone
Ignored
- get_param_names()[source]#
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in turn owns. This is to simplify the implementation of the get_params and set_params methods.
- inverse_transform(X, copy=None) SparseCumlArray [source]#
Scale back the data to the original representation
- Parameters
- X{array-like, sparse matrix}, shape [n_samples, n_features]
The data used to scale along the features axis.
- copybool, optional (default: None)
Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
- Returns
- X_tr{array-like, sparse matrix}, shape [n_samples, n_features]
Transformed array.
- partial_fit(X, y=None) StandardScaler [source]#
Online computation of mean and std on X for later scaling.
All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to a very large number of n_samples or because X is read from a continuous stream.
The algorithm for incremental mean and std is given in Equation 1.5a,b in Chan, Tony F., Gene H. Golub, and Randall J. LeVeque. “Algorithms for computing the sample variance: Analysis and recommendations.” The American Statistician 37.3 (1983): 242-247.
- Parameters
- X{array-like, sparse matrix}, shape [n_samples, n_features]
The data used to compute the mean and standard deviation used for later scaling along the features axis.
- yNone
Ignored.
- Returns
- selfobject
Transformer instance.
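The incremental update cited above can be sketched in NumPy. The code below uses the usual batch-merging form of the Chan, Golub and LeVeque update (combining counts, means, and sums of squared deviations); whether cuML arranges the computation in exactly this way is an assumption, and update_mean_var is a hypothetical helper that ignores NaN handling.

```python
import numpy as np

def update_mean_var(n_a, mean_a, var_a, X_batch):
    """Merge running statistics (n_a, mean_a, var_a) with a new batch.

    Variances are biased (ddof=0), matching the note on StandardScaler.
    """
    n_b = X_batch.shape[0]
    mean_b = X_batch.mean(axis=0)
    var_b = X_batch.var(axis=0)

    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # Sum of squared deviations of the merged set.
    m2 = var_a * n_a + var_b * n_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2 / n

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Feed the data in four batches, as repeated partial_fit calls would.
n, mean, var = 0, np.zeros(3), np.zeros(3)
for batch in np.array_split(X, 4):
    n, mean, var = update_mean_var(n, mean, var, batch)
```

After all batches have been merged, mean and var agree with X.mean(axis=0) and X.var(axis=0) up to floating-point error, without the full data ever being held at once.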
- transform(X, copy=None) SparseCumlArray [source]#
Perform standardization by centering and scaling
- Parameters
- X{array-like, sparse matrix}, shape [n_samples, n_features]
The data used to scale along the features axis.
- copybool, optional (default: None)
Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
- cuml.preprocessing.maxabs_scale(X, *, axis=0, copy=True)[source]#
Scale each feature to the [-1, 1] range without breaking the sparsity.
This estimator scales each feature individually such that the maximal absolute value of each feature in the training set will be 1.0.
This scaler can also be applied to sparse CSR or CSC matrices.
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
The data.
- axisint (0 by default)
axis used to scale along. If 0, independently scale each feature, otherwise (if 1) scale each sample.
- copyboolean, optional, default is True
Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
See also
MaxAbsScaler
Performs scaling to the [-1, 1] range using the Transformer API
Notes
NaNs are treated as missing values: disregarded to compute the statistics, and maintained during the data transformation.
- cuml.preprocessing.minmax_scale(X, feature_range=(0, 1), *, axis=0, copy=True)[source]#
Transform features by scaling each feature to a given range.
This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.
The transformation is given by (when axis=0):
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
where min, max = feature_range.
The transformation is calculated as (when axis=0):
X_scaled = scale * X + min - X.min(axis=0) * scale
where scale = (max - min) / (X.max(axis=0) - X.min(axis=0))
This transformation is often used as an alternative to zero mean, unit variance scaling.
- Parameters
- Xarray-like of shape (n_samples, n_features)
The data.
- feature_rangetuple (min, max), default=(0, 1)
Desired range of transformed data.
- axisint, default=0
Axis used to scale along. If 0, independently scale each feature, otherwise (if 1) scale each sample.
- copybool, default=True
Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
See also
MinMaxScaler
Performs scaling to a given range using the Transformer API
- cuml.preprocessing.normalize(X, norm='l2', *, axis=1, copy=True, return_norm=False)[source]#
Scale input vectors individually to unit norm (vector length).
- Parameters
- X{array-like, sparse matrix}, shape [n_samples, n_features]
The data to normalize, element by element. Please provide CSC matrix to normalize on axis 0, conversely provide CSR matrix to normalize on axis 1
- norm‘l1’, ‘l2’, or ‘max’, optional (‘l2’ by default)
The norm to use to normalize each non zero sample (or each non-zero feature if axis is 0).
- axis0 or 1, optional (1 by default)
axis used to normalize the data along. If 1, independently normalize each sample, otherwise (if 0) normalize each feature.
- copyboolean, optional, default True
Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
- return_normboolean, default False
whether to return the computed norms
- Returns
- X{array-like, sparse matrix}, shape [n_samples, n_features]
Normalized input X.
- normsarray, shape [n_samples] if axis=1 else [n_features]
An array of norms along given axis for X. When X is sparse, a NotImplementedError will be raised for norm ‘l1’ or ‘l2’.
See also
Normalizer
Performs normalization using the Transformer API
- cuml.preprocessing.robust_scale(X, *, axis=0, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)[source]#
Standardize a dataset along any axis
Center to the median and component wise scale according to the interquartile range.
- Parameters
- X{array-like, sparse matrix}
The data to center and scale.
- axisint (0 by default)
axis used to compute the medians and IQR along. If 0, independently scale each feature, otherwise (if 1) scale each sample.
- with_centeringboolean, True by default
If True, center the data before scaling.
- with_scalingboolean, True by default
If True, scale the data to interquartile range.
- quantile_rangetuple (q_min, q_max), 0.0 < q_min < q_max < 100.0
Default: (25.0, 75.0) = (1st quartile, 3rd quartile) = IQR. Quantile range used to calculate scale_.
- copyboolean, optional, default is True
Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
See also
RobustScaler
Performs centering and scaling using the Transformer API
Notes
This implementation will refuse to center sparse matrices since it would make them non-sparse and would potentially crash the program with memory exhaustion problems.
Instead the caller is expected to either set explicitly with_centering=False (in that case, only scaling will be performed on the features of the CSR matrix) or to densify the matrix if they expect the materialized dense array to fit in memory.
To avoid memory copy the caller should pass a CSR matrix.
- cuml.preprocessing.scale(X, *, axis=0, with_mean=True, with_std=True, copy=True)[source]#
Standardize a dataset along any axis
Center to the mean and component wise scale to unit variance.
- Parameters
- X{array-like, sparse matrix}
The data to center and scale.
- axisint (0 by default)
axis used to compute the means and standard deviations along. If 0, independently standardize each feature, otherwise (if 1) standardize each sample.
- with_meanboolean, True by default
If True, center the data before scaling.
- with_stdboolean, True by default
If True, scale the data to unit variance (or equivalently, unit standard deviation).
- copyboolean, optional, default True
Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
See also
StandardScaler
Performs scaling to unit variance using the Transformer API
Notes
This implementation will refuse to center sparse matrices since it would make them non-sparse and would potentially crash the program with memory exhaustion problems.
Instead the caller is expected to either set explicitly with_mean=False (in that case, only variance scaling will be performed on the features of the sparse matrix) or to densify the matrix if they expect the materialized dense array to fit in memory.
For optimal processing the caller should pass a CSC matrix.
NaNs are treated as missing values: disregarded to compute the statistics, and maintained during the data transformation.
We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.
Other preprocessing methods (Single-GPU)#
- class cuml.preprocessing.Binarizer(*args, **kwargs)[source]#
Binarize data (set feature values to 0 or 1) according to a threshold
Values greater than the threshold map to 1, while values less than or equal to the threshold map to 0. With the default threshold of 0, only positive values map to 1.
Binarization is a common operation on text count data where the analyst can decide to only consider the presence or absence of a feature rather than a quantified number of occurrences for instance.
It can also be used as a pre-processing step for estimators that consider boolean random variables (e.g. modelled using the Bernoulli distribution in a Bayesian setting).
- Parameters
- thresholdfloat, optional (0.0 by default)
Feature values below or equal to this are replaced by 0, above it by 1. Threshold may not be less than 0 for operations on sparse matrices.
- copyboolean, optional, default True
Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
See also
binarize
Equivalent function without the estimator API.
Notes
If the input is a sparse matrix, only the non-zero values are subject to update by the Binarizer class.
This estimator is stateless (besides constructor parameters), the fit method does nothing but is useful when used in a pipeline.
Examples
>>> from cuml.preprocessing import Binarizer
>>> import cupy as cp
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X = cp.array(X)
>>> transformer = Binarizer().fit(X) # fit does nothing.
>>> transformer
Binarizer()
>>> transformer.transform(X)
array([[1., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])
Methods
fit(X[, y]): Do nothing and return the estimator unchanged.
transform(X[, copy]): Binarize each element of X.
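The thresholding rule is simple enough to state in one line of NumPy. The sketch below is an illustration for dense input only; binarize here is a local helper written for this example, not the cuML function of the same name.

```python
import numpy as np

def binarize(X, threshold=0.0):
    """Map values strictly greater than threshold to 1, all others to 0."""
    X = np.asarray(X, dtype=float)
    return (X > threshold).astype(X.dtype)

X = np.array([[1.0, -1.0, 2.0], [2.0, 0.0, 0.0], [0.0, 1.0, -1.0]])
X_bin = binarize(X)               # default threshold of 0
X_bin_1 = binarize(X, threshold=1.0)
```

A value equal to the threshold maps to 0, which is why raising the threshold to 1.0 leaves only the entries equal to 2 set. The restriction on negative thresholds for sparse input follows from the same rule: implicit zeros would have to become ones, which would destroy sparsity.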
- class cuml.preprocessing.FunctionTransformer(*args, **kwargs)[source]#
Constructs a transformer from an arbitrary callable.
A FunctionTransformer forwards its X (and optionally y) arguments to a user-defined function or function object and returns the result of this function. This is useful for stateless transformations such as taking the log of frequencies, doing custom scaling, etc.
Note: If a lambda is used as the function, then the resulting transformer will not be pickleable.
- Parameters
- funccallable, default=None
The callable to use for the transformation. This will be passed the same arguments as transform, with args and kwargs forwarded. If func is None, then func will be the identity function.
- inverse_funccallable, default=None
The callable to use for the inverse transformation. This will be passed the same arguments as inverse transform, with args and kwargs forwarded. If inverse_func is None, then inverse_func will be the identity function.
- accept_sparsebool, default=False
Indicate that func accepts a sparse matrix as input. Otherwise, if accept_sparse is false, sparse matrix inputs will cause an exception to be raised.
- check_inversebool, default=True
Whether to check that func followed by inverse_func leads to the original inputs. It can be used for a sanity check, raising a warning when the condition is not fulfilled.
- kw_argsdict, default=None
Dictionary of additional keyword arguments to pass to func.
- inv_kw_argsdict, default=None
Dictionary of additional keyword arguments to pass to inverse_func.
Examples
>>> import cupy as cp
>>> from cuml.preprocessing import FunctionTransformer
>>> transformer = FunctionTransformer(func=cp.log1p)
>>> X = cp.array([[0, 1], [2, 3]])
>>> transformer.transform(X)
array([[0.       , 0.6931...],
       [1.0986..., 1.3862...]])
Methods
fit(X[, y]): Fit transformer by checking X.
inverse_transform(X): Transform X using the inverse function.
transform(X): Transform X using the forward function.
- fit(X, y=None) FunctionTransformer [source]#
Fit transformer by checking X.
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
Input array.
- Returns
- self
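The idea behind check_inverse can be illustrated without cuML: apply func, then inverse_func, and compare with the input. The sketch below uses NumPy functions in place of CuPy ones, and round_trip_ok is a hypothetical helper; how cuML actually performs the check (for example, on which rows) is not stated in this reference.

```python
import numpy as np

def round_trip_ok(func, inverse_func, X):
    """Return True when inverse_func(func(X)) reproduces X numerically."""
    return bool(np.allclose(inverse_func(func(X)), X))

X = np.array([[0.0, 1.0], [2.0, 3.0]])

forward = np.log1p(X)                             # what transform would return
ok_pair = round_trip_ok(np.log1p, np.expm1, X)    # a true inverse pair
bad_pair = round_trip_ok(np.log1p, np.exp, X)     # exp(log1p(x)) = 1 + x, not x
```

A mismatched pair such as log1p and exp fails the round trip, which is the situation the documented warning is meant to flag.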
- class cuml.preprocessing.KBinsDiscretizer(*args, **kwargs)[source]#
Bin continuous data into intervals.
- Parameters
- n_binsint or array-like, shape (n_features,) (default=5)
The number of bins to produce. Raises ValueError if n_bins < 2.
- encode{‘onehot’, ‘onehot-dense’, ‘ordinal’}, (default=’onehot’)
Method used to encode the transformed result.
- onehot
Encode the transformed result with one-hot encoding and return a sparse matrix. Ignored features are always stacked to the right.
- onehot-dense
Encode the transformed result with one-hot encoding and return a dense array. Ignored features are always stacked to the right.
- ordinal
Return the bin identifier encoded as an integer value.
- strategy{‘uniform’, ‘quantile’, ‘kmeans’}, (default=’quantile’)
Strategy used to define the widths of the bins.
- uniform
All bins in each feature have identical widths.
- quantile
All bins in each feature have the same number of points.
- kmeans
Values in each bin have the same nearest center of a 1D k-means cluster.
See also
cuml.preprocessing.Binarizer
Class used to bin values as 0 or 1 based on a parameter threshold.
Notes
In bin edges for feature i, the first and last values are used only for inverse_transform. During transform, bin edges are extended to:
np.concatenate([-np.inf, bin_edges_[i][1:-1], np.inf])
You can combine KBinsDiscretizer with cuml.compose.ColumnTransformer if you only want to preprocess part of the features.
KBinsDiscretizer might produce constant features (e.g., when encode = 'onehot' and certain bins do not contain any data). These features can be removed with feature selection algorithms (e.g., sklearn.feature_selection.VarianceThreshold).
Examples
>>> from cuml.preprocessing import KBinsDiscretizer
>>> import cupy as cp
>>> X = [[-2, 1, -4,   -1],
...      [-1, 2, -3, -0.5],
...      [ 0, 3, -2,  0.5],
...      [ 1, 4, -1,    2]]
>>> X = cp.array(X)
>>> est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
>>> est.fit(X)
KBinsDiscretizer(...)
>>> Xt = est.transform(X)
>>> Xt
array([[0, 0, 0, 0],
       [1, 1, 1, 0],
       [2, 2, 2, 1],
       [2, 2, 2, 2]], dtype=int32)
Sometimes it may be useful to convert the data back into the original feature space. The inverse_transform function converts the binned data into the original feature space. Each value will be equal to the mean of the two bin edges.
>>> est.bin_edges_[0]
array([-2., -1.,  0.,  1.])
>>> est.inverse_transform(Xt)
array([[-1.5,  1.5, -3.5, -0.5],
       [-0.5,  2.5, -2.5, -0.5],
       [ 0.5,  3.5, -1.5,  0.5],
       [ 0.5,  3.5, -1.5,  1.5]])
- Attributes
- n_bins_int array, shape (n_features,)
Number of bins per feature. Bins whose widths are too small (i.e., <= 1e-8) are removed with a warning.
- bin_edges_array of arrays, shape (n_features, )
The edges of each bin. Contains arrays of varying shapes (n_bins_, ). Ignored features will have empty arrays.
Methods
fit(X[, y]): Fit the estimator.
get_param_names(): Returns a list of hyperparameter names owned by this class.
inverse_transform(Xt): Transform discretized data back to original feature space.
transform(X): Discretize the data.
- fit(X, y=None) KBinsDiscretizer [source]#
Fit the estimator.
- Parameters
- Xnumeric array-like, shape (n_samples, n_features)
Data to be discretized.
- yNone
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- Returns
- self
- get_param_names()[source]#
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in turn owns. This is to simplify the implementation of the get_params and set_params methods.
- inverse_transform(Xt) SparseCumlArray [source]#
Transform discretized data back to original feature space.
Note that this function does not regenerate the original data due to discretization rounding.
- Parameters
- Xtnumeric array-like, shape (n_samples, n_features)
Transformed data in the binned space.
- Returns
- Xinvnumeric array-like
Data in the original feature space.
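The uniform strategy with ordinal encoding, together with the bin-center behaviour of inverse_transform, can be traced by hand for a single feature. The NumPy sketch below is an illustration under the assumption of equal-width bins between the feature's minimum and maximum; it is not cuML's implementation and omits any tolerance the library may apply at bin edges.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0])  # first feature of the example above
n_bins = 3

# Equal-width edges between the observed minimum and maximum.
edges = np.linspace(x.min(), x.max(), n_bins + 1)

# Only the interior edges are needed to assign codes; this is equivalent
# to extending the outer edges to -inf and +inf as described in the Notes.
codes = np.digitize(x, edges[1:-1])

# inverse_transform maps each code to the midpoint of its bin.
centers = (edges[:-1] + edges[1:]) / 2
x_inv = centers[codes]
```

The edges, codes, and reconstructed values match the first column of bin_edges_, Xt, and the inverse_transform output in the example above. The round trip is lossy: both 0 and 1 fall in the last bin and come back as 0.5.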
- class cuml.preprocessing.KernelCenterer(*args, **kwargs)[source]#
Center a kernel matrix
Let K(x, z) be a kernel defined by phi(x)^T phi(z), where phi is a function mapping x to a Hilbert space. KernelCenterer centers (i.e., normalizes to have zero mean) the data without explicitly computing phi(x). It is equivalent to centering phi(x) with cuml.preprocessing.StandardScaler(with_std=False).
Examples
>>> import cupy as cp
>>> from cuml.preprocessing import KernelCenterer
>>> from cuml.metrics import pairwise_kernels
>>> X = cp.array([[ 1., -2.,  2.],
...               [ -2.,  1.,  3.],
...               [ 4.,  1., -2.]])
>>> K = pairwise_kernels(X, metric='linear')
>>> K
array([[  9.,   2.,  -2.],
       [  2.,  14., -13.],
       [ -2., -13.,  21.]])
>>> transformer = KernelCenterer().fit(K)
>>> transformer
KernelCenterer()
>>> transformer.transform(K)
array([[  5.,   0.,  -5.],
       [  0.,  14., -14.],
       [ -5., -14.,  19.]])
- Attributes
- K_fit_rows_array, shape (n_samples,)
Average of each column of kernel matrix
- K_fit_all_float
Average of kernel matrix
Methods
- fit(K, y=None) KernelCenterer [source]#
Fit KernelCenterer
- Parameters
- Knumpy array of shape [n_samples, n_samples]
Kernel matrix.
- Returns
- selfreturns an instance of self.
- transform(K, copy=True) CumlArray [source]#
Center kernel matrix.
- Parameters
- Knumpy array of shape [n_samples1, n_samples2]
Kernel matrix.
- copyboolean, optional, default True
Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
- Returns
- K_newnumpy array of shape [n_samples1, n_samples2]
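The equivalence between centering the kernel matrix and centering phi(x) can be verified numerically for a linear kernel, where phi is the identity. The NumPy sketch below applies the standard double-centering formula; the variable names mirror the documented attributes, but this is an illustration, not cuML's implementation.

```python
import numpy as np

X = np.array([[1.0, -2.0, 2.0], [-2.0, 1.0, 3.0], [4.0, 1.0, -2.0]])
K = X @ X.T  # linear kernel: K[i, j] = x_i . x_j

# Statistics that fit() is documented to store.
K_fit_rows = K.mean(axis=0)   # average of each column
K_fit_all = K.mean()          # average of the whole matrix

# Centering: subtract column and row averages, add back the grand average.
K_pred_cols = K.mean(axis=1, keepdims=True)
K_centered = K - K_fit_rows - K_pred_cols + K_fit_all

# For a linear kernel, the same result comes from centering X first.
Phi = X - X.mean(axis=0)
K_from_phi = Phi @ Phi.T
```

K_centered matches the transform output in the example above and equals K_from_phi, which is the property stated in the class description.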
- class cuml.preprocessing.MissingIndicator(*args, **kwargs)[source]#
Binary indicators for missing values.
Note that this component typically should not be used in a vanilla Pipeline consisting of transformers and a classifier, but rather could be added using a FeatureUnion or ColumnTransformer.
- Parameters
- missing_valuesnumber, string, np.nan (default) or None
The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.
- featuresstr, default=None
Whether the imputer mask should represent all or a subset of features.
If “missing-only” (default), the imputer mask will only represent features containing missing values during fit time.
If “all”, the imputer mask will represent all features.
- sparseboolean or “auto”, default=None
Whether the imputer mask format should be sparse or dense.
If “auto” (default), the imputer mask will be of same type as input.
If True, the imputer mask will be a sparse matrix.
If False, the imputer mask will be a numpy array.
- error_on_newboolean, default=None
If True (default), transform will raise an error when there are features with missing values in transform that have no missing values in fit. This is applicable only when features="missing-only".
Examples
>>> import numpy as np
>>> from sklearn.impute import MissingIndicator
>>> X1 = np.array([[np.nan, 1, 3],
...                [4, 0, np.nan],
...                [8, 1, 0]])
>>> X2 = np.array([[5, 1, np.nan],
...                [np.nan, 2, 3],
...                [2, 4, 0]])
>>> indicator = MissingIndicator()
>>> indicator.fit(X1)
MissingIndicator()
>>> X2_tr = indicator.transform(X2)
>>> X2_tr
array([[False,  True],
       [ True, False],
       [False, False]])
- Attributes
- features_ndarray, shape (n_missing_features,) or (n_features,)
The feature indices which will be returned when calling transform. They are computed during fit. For features='all', it is range(n_features).
Methods
fit(X[, y]): Fit the transformer on X.
fit_transform(X[, y]): Generate missing values indicator for X.
get_param_names(): Returns a list of hyperparameter names owned by this class.
transform(X): Generate missing values indicator for X.
- fit(X, y=None) MissingIndicator [source]#
Fit the transformer on X.
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
Input data, where n_samples is the number of samples and n_features is the number of features.
- Returns
- selfobject
Returns self.
- fit_transform(X, y=None) SparseCumlArray [source]#
Generate missing values indicator for X.
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
The input data to complete.
- Returns
- Xt{ndarray or sparse matrix}, shape (n_samples, n_features) or (n_samples, n_features_with_missing)
The missing indicator for input data. The data type of Xt will be boolean.
- get_param_names()[source]#
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in turn owns. This is to simplify the implementation of get_params and set_params methods.
- transform(X) SparseCumlArray [source]#
Generate missing values indicator for X.
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
The input data to complete.
- Returns
- Xt{ndarray or sparse matrix}, shape (n_samples, n_features) or (n_samples, n_features_with_missing)
The missing indicator for input data. The data type of Xt will be boolean.
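As a rough illustration of the features="missing-only" behavior described above, the following pure-Python sketch records which columns contain NaN during fit and restricts the mask to those columns during transform. The helper names (is_missing, fit_missing_features, transform_missing) are hypothetical and are not part of cuML; the sketch does not model sparse input or error_on_new.

```python
import math


def is_missing(value):
    """Treat NaN as the missing-value placeholder (the np.nan default)."""
    return isinstance(value, float) and math.isnan(value)


def fit_missing_features(rows):
    """Return indices of columns that contain at least one missing value."""
    n_features = len(rows[0])
    return [j for j in range(n_features)
            if any(is_missing(row[j]) for row in rows)]


def transform_missing(rows, features):
    """Build the boolean mask restricted to the columns found during fit."""
    return [[is_missing(row[j]) for j in features] for row in rows]


nan = float("nan")
X1 = [[nan, 1, 3], [4, 0, nan], [8, 1, 0]]
X2 = [[5, 1, nan], [nan, 2, 3], [2, 4, 0]]

features = fit_missing_features(X1)    # columns 0 and 2 hold NaNs in X1
mask = transform_missing(X2, features)
print(features)
print(mask)
```

With the same X1 and X2 as the MissingIndicator example, the mask has two columns because only two features had missing values at fit time.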
- class cuml.preprocessing.PolynomialFeatures(*args, **kwargs)[source]#
Generate polynomial and interaction features.
Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
- Parameters
- degreeinteger
The degree of the polynomial features. Default = 2.
- interaction_onlyboolean, default = False
If true, only interaction features are produced: features that are products of at most degree distinct input features (so not x[1] ** 2, x[0] * x[2] ** 3, etc.).
- include_biasboolean
If True (default), then include a bias column, the feature in which all polynomial powers are zero (i.e. a column of ones - acts as an intercept term in a linear model).
- orderstr in {‘C’, ‘F’}, default ‘C’
Order of output array in the dense case. ‘F’ order is faster to compute, but may slow down subsequent estimators.
Notes
Be aware that the number of features in the output array scales polynomially in the number of features of the input array, and exponentially in the degree. High degrees can cause overfitting.
Examples
>>> import numpy as np
>>> from cuml.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])
>>> poly = PolynomialFeatures(interaction_only=True)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.],
       [ 1.,  2.,  3.,  6.],
       [ 1.,  4.,  5., 20.]])
- Attributes
- powers_array, shape (n_output_features, n_input_features)
powers_[i, j] is the exponent of the jth input in the ith output.
- n_input_features_int
The total number of input features.
- n_output_features_int
The total number of polynomial output features. The number of output features is computed by iterating over all suitably sized combinations of input features.
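The enumeration of output features can be sketched in plain Python with itertools. This is only an illustration of how powers_ relates to combinations of input features, not the cuML implementation; polynomial_powers and expand are hypothetical helper names.

```python
from itertools import combinations, combinations_with_replacement


def polynomial_powers(n_features, degree=2, interaction_only=False,
                      include_bias=True):
    """Enumerate exponent tuples, one tuple per output feature."""
    # interaction_only forbids repeating an input feature within a term.
    pick = combinations if interaction_only else combinations_with_replacement
    start = 0 if include_bias else 1
    powers = []
    for d in range(start, degree + 1):
        for combo in pick(range(n_features), d):
            powers.append(tuple(combo.count(j) for j in range(n_features)))
    return powers


def expand(row, powers):
    """Evaluate every monomial described by `powers` on one sample."""
    out = []
    for exps in powers:
        term = 1.0
        for x, e in zip(row, exps):
            term *= x ** e
        out.append(term)
    return out


powers = polynomial_powers(2, degree=2)
print(powers)                   # exponents for [1, a, b, a^2, ab, b^2]
print(expand([2, 3], powers))
print(expand([4, 5], polynomial_powers(2, interaction_only=True)))
```

The length of the returned list plays the role of n_output_features_ in this sketch.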
Methods
fit(X[, y]): Compute number of output features.
get_feature_names([input_features]): Return feature names for output features
get_param_names(): Returns a list of hyperparameter names owned by this class.
transform(X): Transform data to polynomial features
- fit(X, y=None) PolynomialFeatures [source]#
Compute number of output features.
- Parameters
- Xarray-like, shape (n_samples, n_features)
The data.
- Returns
- selfinstance
- get_feature_names(input_features=None)[source]#
Return feature names for output features
- Parameters
- input_featureslist of string, length n_features, optional
String names for input features if available. By default, “x0”, “x1”, … “xn_features” is used.
- Returns
- output_feature_nameslist of string, length n_output_features
- get_param_names()[source]#
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in turn owns. This is to simplify the implementation of get_params and set_params methods.
- transform(X) SparseCumlArray [source]#
Transform data to polynomial features
- Parameters
- X{array-like, sparse matrix}, shape [n_samples, n_features]
The data to transform, row by row.
Prefer CSR over CSC for sparse input (for speed), but CSC is required if the degree is 4 or higher. If the degree is less than 4 and the input format is CSC, it will be converted to CSR, have its polynomial features generated, then converted back to CSC.
If the degree is 2 or 3, the method described in “Leveraging Sparsity to Speed Up Polynomial Feature Expansions of CSR Matrices Using K-Simplex Numbers” by Andrew Nystrom and John Hughes is used, which is much faster than the method used on CSC input. For this reason, a CSC input will be converted to CSR, and the output will be converted back to CSC prior to being returned, hence the preference of CSR.
- Returns
- XP{array-like, sparse matrix}, shape [n_samples, NP]
The matrix of features, where NP is the number of polynomial features generated from the combination of inputs.
- class cuml.preprocessing.PowerTransformer(*args, **kwargs)[source]#
Apply a power transform featurewise to make data more Gaussian-like.
Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.
Currently, PowerTransformer supports the Box-Cox transform and the Yeo-Johnson transform. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.
Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive and negative data.
By default, zero-mean, unit-variance normalization is applied to the transformed data.
- Parameters
- methodstr, (default=’yeo-johnson’)
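The power transform method. Available methods are:
‘yeo-johnson’ [1], works with positive and negative values
‘box-cox’ [2], only works with strictly positive values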
- standardizeboolean, default=True
Set to True to apply zero-mean, unit-variance normalization to the transformed output.
- copyboolean, optional, default=True
Set to False to perform inplace computation during transformation.
See also
power_transform
Equivalent function without the estimator API.
QuantileTransformer
Maps data to a standard normal distribution with the parameter output_distribution='normal'.
Notes
NaNs are treated as missing values: disregarded in fit, and maintained in transform.
References
- 1
I.K. Yeo and R.A. Johnson, “A new family of power transformations to improve normality or symmetry.” Biometrika, 87(4), pp.954-959, (2000).
- 2
G.E.P. Box and D.R. Cox, “An Analysis of Transformations”, Journal of the Royal Statistical Society B, 26, 211-252 (1964).
Examples
>>> import cupy as cp
>>> from cuml.preprocessing import PowerTransformer
>>> pt = PowerTransformer()
>>> data = cp.array([[1, 2], [3, 2], [4, 5]])
>>> print(pt.fit(data))
PowerTransformer()
>>> print(pt.lambdas_)
[ 1.386... -3.100...]
>>> print(pt.transform(data))
[[-1.316... -0.707...]
 [ 0.209... -0.707...]
 [ 1.106...  1.414...]]
- Attributes
- lambdas_array of float, shape (n_features,)
The parameters of the power transformation for the selected features.
Methods
fit(X[, y]): Estimate the optimal parameter lambda for each feature.
fit_transform(X[, y]): Fit to data, then transform it.
get_param_names(): Returns a list of hyperparameter names owned by this class.
inverse_transform(X): Apply the inverse power transformation using the fitted lambdas.
transform(X): Apply the power transform to each feature using the fitted lambdas.
- fit(X, y=None) PowerTransformer [source]#
Estimate the optimal parameter lambda for each feature.
The optimal lambda parameter for minimizing skewness is estimated on each feature independently using maximum likelihood.
- Parameters
- Xarray-like, shape (n_samples, n_features)
The data used to estimate the optimal transformation parameters.
- yIgnored
- Returns
- selfobject
- fit_transform(X, y=None) CumlArray [source]#
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- X{array-like, sparse matrix, dataframe} of shape (n_samples, n_features)
- yndarray of shape (n_samples,), default=None
Target values.
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray of shape (n_samples, n_features_new)
Transformed array.
- get_param_names()[source]#
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in turn owns. This is to simplify the implementation of get_params and set_params methods.
- inverse_transform(X) CumlArray [source]#
Apply the inverse power transformation using the fitted lambdas.
The inverse of the Box-Cox transformation is given by:
if lambda_ == 0:
    X = exp(X_trans)
else:
    X = (X_trans * lambda_ + 1) ** (1 / lambda_)
The inverse of the Yeo-Johnson transformation is given by:
if X >= 0 and lambda_ == 0:
    X = exp(X_trans) - 1
elif X >= 0 and lambda_ != 0:
    X = (X_trans * lambda_ + 1) ** (1 / lambda_) - 1
elif X < 0 and lambda_ != 2:
    X = 1 - (-(2 - lambda_) * X_trans + 1) ** (1 / (2 - lambda_))
elif X < 0 and lambda_ == 2:
    X = 1 - exp(-X_trans)
- Parameters
- Xarray-like, shape (n_samples, n_features)
The transformed data.
- Returns
- Xarray-like, shape (n_samples, n_features)
The original data
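The inverse formulas above can be checked against the forward transforms with a scalar round trip. The sketch below is pure Python and is not the cuML code path; the function names are hypothetical. It branches on the sign of the transformed value, which relies on the Yeo-Johnson transform preserving the sign of its input.

```python
import math


def box_cox(x, lmbda):
    """Forward Box-Cox transform for strictly positive x."""
    return math.log(x) if lmbda == 0 else (x ** lmbda - 1) / lmbda


def box_cox_inverse(x_trans, lmbda):
    if lmbda == 0:
        return math.exp(x_trans)
    return (x_trans * lmbda + 1) ** (1 / lmbda)


def yeo_johnson(x, lmbda):
    """Forward Yeo-Johnson transform; accepts positive and negative x."""
    if x >= 0:
        if lmbda == 0:
            return math.log1p(x)
        return ((x + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log1p(-x)
    return -((1 - x) ** (2 - lmbda) - 1) / (2 - lmbda)


def yeo_johnson_inverse(x_trans, lmbda):
    # The forward transform keeps the sign of x, so the sign of x_trans
    # selects the same branch that produced it.
    if x_trans >= 0:
        if lmbda == 0:
            return math.exp(x_trans) - 1
        return (x_trans * lmbda + 1) ** (1 / lmbda) - 1
    if lmbda == 2:
        return 1 - math.exp(-x_trans)
    return 1 - (-(2 - lmbda) * x_trans + 1) ** (1 / (2 - lmbda))


for lam in (0, 0.5, 2):
    for x in (-2.5, 0.0, 4.0):
        back = yeo_johnson_inverse(yeo_johnson(x, lam), lam)
        print(lam, x, round(back, 9))
```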
- class cuml.preprocessing.QuantileTransformer(*args, **kwargs)[source]#
Transform features using quantiles information.
This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.
The transformation is applied on each feature independently. First an estimate of the cumulative distribution function of a feature is used to map the original values to a uniform distribution. The obtained values are then mapped to the desired output distribution using the associated quantile function. Feature values of new/unseen data that fall below or above the fitted range will be mapped to the bounds of the output distribution. Note that this transform is non-linear. It may distort linear correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.
- Parameters
- n_quantilesint, optional (default=1000 or n_samples)
Number of quantiles to be computed. It corresponds to the number of landmarks used to discretize the cumulative distribution function. If n_quantiles is larger than the number of samples, n_quantiles is set to the number of samples as a larger number of quantiles does not give a better approximation of the cumulative distribution function estimator.
- output_distributionstr, optional (default=’uniform’)
Marginal distribution for the transformed data. The choices are ‘uniform’ (default) or ‘normal’.
- ignore_implicit_zerosbool, optional (default=False)
Only applies to sparse matrices. If True, the sparse entries of the matrix are discarded to compute the quantile statistics. If False, these entries are treated as zeros.
- subsampleint, optional (default=1e5)
Maximum number of samples used to estimate the quantiles for computational efficiency. Note that the subsampling procedure may differ for value-identical sparse and dense matrices.
- random_stateint, RandomState instance or None, optional (default=None)
Determines random number generation for subsampling and smoothing noise. Please see subsample for more details. Pass an int for reproducible results across multiple function calls. See Glossary.
- copyboolean, optional, (default=True)
Set to False to perform inplace transformation and avoid a copy (if the input is already a numpy array).
See also
quantile_transform
Equivalent function without the estimator API.
PowerTransformer
Perform mapping to a normal distribution using a power transform.
StandardScaler
Perform standardization that is faster, but less robust to outliers.
RobustScaler
Perform robust standardization that removes the influence of outliers but does not put outliers and inliers on the same scale.
Notes
NaNs are treated as missing values: disregarded in fit, and maintained in transform.
Examples
>>> import cupy as cp
>>> from cuml.preprocessing import QuantileTransformer
>>> rng = cp.random.RandomState(0)
>>> X = cp.sort(rng.normal(loc=0.5, scale=0.25, size=(25, 1)), axis=0)
>>> qt = QuantileTransformer(n_quantiles=10, random_state=0)
>>> qt.fit_transform(X)
array([...])
- Attributes
- n_quantiles_integer
The actual number of quantiles used to discretize the cumulative distribution function.
- quantiles_ndarray, shape (n_quantiles, n_features)
The values corresponding to the quantiles of reference.
- references_ndarray, shape(n_quantiles, )
Quantiles of references.
Methods
fit(X[, y]): Compute the quantiles used for transforming.
get_param_names(): Returns a list of hyperparameter names owned by this class.
inverse_transform(X): Back-projection to the original space.
transform(X): Feature-wise transformation of the data.
- fit(X, y=None) QuantileTransformer [source]#
Compute the quantiles used for transforming.
- Parameters
- Xndarray or sparse matrix, shape (n_samples, n_features)
The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.
- Returns
- selfobject
- get_param_names()[source]#
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in turn owns. This is to simplify the implementation of get_params and set_params methods.
- inverse_transform(X) SparseCumlArray [source]#
Back-projection to the original space.
- Parameters
- Xndarray or sparse matrix, shape (n_samples, n_features)
The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.
- Returns
- Xtndarray or sparse matrix, shape (n_samples, n_features)
The projected data.
- transform(X) SparseCumlArray [source]#
Feature-wise transformation of the data.
- Parameters
- Xndarray or sparse matrix, shape (n_samples, n_features)
The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.
- Returns
- Xtndarray or sparse matrix, shape (n_samples, n_features)
The projected data.
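The mapping to a uniform distribution can be pictured as piecewise-linear interpolation between quantile landmarks, with out-of-range values clamped to the bounds. The sketch below is a deliberately simplified, single-feature, pure-Python illustration: it assumes no subsampling, no special handling of repeated values, at least two quantiles, and only the 'uniform' output. The names fit_quantiles and to_uniform are hypothetical.

```python
from bisect import bisect_right


def fit_quantiles(values, n_quantiles):
    """Return (references, quantiles) for one feature.

    references are probabilities in [0, 1]; quantiles are the matching
    empirical quantiles, computed by linear interpolation.
    """
    data = sorted(values)
    n_quantiles = min(n_quantiles, len(data))  # never more than n_samples
    references = [i / (n_quantiles - 1) for i in range(n_quantiles)]
    quantiles = []
    for p in references:
        pos = p * (len(data) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(data) - 1)
        quantiles.append(data[lo] + (pos - lo) * (data[hi] - data[lo]))
    return references, quantiles


def to_uniform(x, references, quantiles):
    """Map x to [0, 1]; values outside the fitted range hit the bounds."""
    if x <= quantiles[0]:
        return 0.0
    if x >= quantiles[-1]:
        return 1.0
    k = bisect_right(quantiles, x) - 1
    frac = (x - quantiles[k]) / (quantiles[k + 1] - quantiles[k])
    return references[k] + frac * (references[k + 1] - references[k])


sample = [3.0, 1.0, 4.0, 1.5, 9.0, 2.6, 5.0, 3.5, 8.0]
references, quantiles = fit_quantiles(sample, n_quantiles=5)
print(quantiles)
print([to_uniform(v, references, quantiles) for v in (0.0, 3.5, 100.0)])
```

In this sketch, references and quantiles correspond loosely to the references_ and quantiles_ attributes for a single feature.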
- class cuml.preprocessing.SimpleImputer(*args, **kwargs)[source]#
Imputation transformer for completing missing values.
- Parameters
- missing_valuesnumber, string, np.nan (default) or None
The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.
- strategystring, default=’mean’
The imputation strategy.
If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
If “most_frequent”, then replace missing values using the most frequent value along each column. Can be used with strings or numeric data.
If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
- fill_valuestring or numerical value, default=None
When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.
- verboseinteger, default=0
Controls the verbosity of the imputer.
- copyboolean, default=True
If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:
If X is not an array of floating values;
If X is encoded as a CSR matrix;
If add_indicator=True.
- add_indicatorboolean, default=False
If True, a MissingIndicator transform will stack onto output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.
See also
IterativeImputer
Multivariate imputation of missing values.
Notes
Columns which only contained missing values at fit() are discarded upon transform() if strategy is not “constant”.
Examples
>>> import cupy as cp
>>> from cuml.preprocessing import SimpleImputer
>>> imp_mean = SimpleImputer(missing_values=cp.nan, strategy='mean')
>>> imp_mean.fit(cp.asarray([[7, 2, 3], [4, cp.nan, 6], [10, 5, 9]]))
SimpleImputer()
>>> X = [[cp.nan, 2, 3], [4, cp.nan, 6], [10, cp.nan, 9]]
>>> print(imp_mean.transform(cp.asarray(X)))
[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
- Attributes
- statistics_array of shape (n_features,)
The imputation fill value for each feature. Computing statistics can result in np.nan values. During transform(), features corresponding to np.nan statistics will be discarded.
Methods
fit(X[, y]): Fit the imputer on X.
get_param_names(): Returns a list of hyperparameter names owned by this class.
transform(X): Impute all missing values in X.
- fit(X, y=None) SimpleImputer [source]#
Fit the imputer on X.
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
Input data, where n_samples is the number of samples and n_features is the number of features.
- Returns
- selfSimpleImputer
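The relationship between fit, statistics_ and transform for strategy='mean' can be sketched in pure Python as follows. The helpers fit_means and impute are hypothetical, and the sketch leaves out the step that discards columns whose statistic is NaN.

```python
import math


def fit_means(rows):
    """Per-column mean over the observed (non-NaN) values."""
    n_features = len(rows[0])
    stats = []
    for j in range(n_features):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        # A column with no observed values gets a NaN statistic.
        stats.append(sum(observed) / len(observed) if observed else math.nan)
    return stats


def impute(rows, statistics):
    """Replace every NaN with the fitted statistic of its column."""
    return [[statistics[j] if math.isnan(v) else v
             for j, v in enumerate(row)]
            for row in rows]


nan = math.nan
statistics = fit_means([[7, 2, 3], [4, nan, 6], [10, 5, 9]])
imputed = impute([[nan, 2, 3], [4, nan, 6], [10, nan, 9]], statistics)
print(statistics)
print(imputed)
```

The inputs mirror the SimpleImputer example, so the second column is filled with 3.5, the mean of the two values observed at fit time.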
- cuml.preprocessing.add_dummy_feature(X, value=1.0)[source]#
Augment dataset with an additional dummy feature.
This is useful for fitting an intercept term with implementations which cannot otherwise fit it directly.
- Parameters
- X{array-like, sparse matrix}, shape [n_samples, n_features]
Data.
- valuefloat
Value to use for the dummy feature.
- Returns
- X{array, sparse matrix}, shape [n_samples, n_features + 1]
Same data with dummy feature added as first column.
Examples
>>> from cuml.preprocessing import add_dummy_feature
>>> import cupy as cp
>>> add_dummy_feature(cp.array([[0, 1], [1, 0]]))
array([[1., 0., 1.],
       [1., 1., 0.]])
- cuml.preprocessing.binarize(X, *, threshold=0.0, copy=True)[source]#
Boolean thresholding of array-like or sparse matrix
- Parameters
- X{array-like, sparse matrix}, shape [n_samples, n_features]
The data to binarize, element by element.
- thresholdfloat, optional (0.0 by default)
Feature values below or equal to this are replaced by 0, above it by 1. Threshold may not be less than 0 for operations on sparse matrices.
- copyboolean, optional, default True
Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
See also
Binarizer
Performs binarization using the Transformer API
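The threshold rule (values below or equal to the threshold become 0, values above it become 1) can be sketched on nested lists. binarize_rows is a hypothetical helper, not the cuML function. A plausible reading of the sparse-matrix restriction is that a negative threshold would turn every implicit zero into a 1 and make the result dense.

```python
def binarize_rows(rows, threshold=0.0):
    """Map each value to 1 if it is strictly above threshold, else 0."""
    return [[1 if value > threshold else 0 for value in row] for row in rows]


data = [[-1.0, 0.0, 0.5],
        [2.0, -0.1, 0.0]]
print(binarize_rows(data))                  # zeros stay 0 at threshold 0.0
print(binarize_rows(data, threshold=-0.5))  # zeros become 1 below zero
```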
- class cuml.compose.ColumnTransformer(*args, **kwargs)[source]#
Applies transformers to columns of an array or dataframe.
This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.
- Parameters
- transformerslist of tuples
List of (name, transformer, columns) tuples specifying the transformer objects to be applied to subsets of the data:
- namestr
Like in Pipeline and FeatureUnion, this allows the transformer and its parameters to be set using set_params and searched in grid search.
- columnsstr, array-like of str, int, array-like of int, array-like of bool, slice or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.
- remainder{‘drop’, ‘passthrough’} or estimator, default=’drop’
By default, only the specified columns in transformers are transformed and combined in the output, and the non-specified columns are dropped (default of 'drop'). By specifying remainder='passthrough', all remaining columns that were not specified in transformers will be automatically passed through. This subset of columns is concatenated with the output of the transformers. By setting remainder to be an estimator, the remaining non-specified columns will use the remainder estimator. The estimator must support fit and transform. Note that using this feature requires that the DataFrame columns input at fit and transform have identical order.
- sparse_thresholdfloat, default=0.3
If the output of the different transformers contains sparse matrices, these will be stacked as a sparse matrix if the overall density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all dense data, the stacked result will be dense, and this keyword will be ignored.
- n_jobsint, default=None
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
- transformer_weightsdict, default=None
Multiplicative weights for features per transformer. The output of the transformer is multiplied by these weights. Keys are transformer names, values the weights.
- verbosebool, default=False
If True, the time elapsed while fitting each transformer will be printed as it is completed.
See also
make_column_transformer
Convenience function for combining the outputs of multiple transformer objects applied to column subsets of the original feature space.
make_column_selector
Convenience function for selecting columns based on datatype or the columns name with a regex pattern.
Notes
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
Examples
>>> import cupy as cp
>>> from cuml.compose import ColumnTransformer
>>> from cuml.preprocessing import Normalizer
>>> ct = ColumnTransformer(
...     [("norm1", Normalizer(norm='l1'), [0, 1]),
...      ("norm2", Normalizer(norm='l1'), slice(2, 4))])
>>> X = cp.array([[0., 1., 2., 2.],
...               [1., 1., 0., 1.]])
>>> # Normalizer scales each row of X to unit norm. A separate scaling
>>> # is applied for the two first and two last elements of each
>>> # row independently.
>>> ct.fit_transform(X)
array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])
- Attributes
- transformers_list
The collection of fitted transformers as tuples of (name, fitted_transformer, column). fitted_transformer can be an estimator, ‘drop’, or ‘passthrough’. In case there were no columns selected, this will be the unfitted transformer. If there are remaining columns, the final element is a tuple of the form: (‘remainder’, transformer, remaining_columns) corresponding to the remainder parameter. If there are remaining columns, then len(transformers_)==len(transformers)+1, otherwise len(transformers_)==len(transformers).
- named_transformers_Bunch
Access the fitted transformer by name.
- sparse_output_bool
Boolean flag indicating whether the output of transform is a sparse matrix or a dense numpy array, which depends on the output of the individual transformers and the sparse_threshold keyword.
- fit(X, y=None) ColumnTransformer [source]#
Fit all transformers using X.
- Parameters
- X{array-like, dataframe} of shape (n_samples, n_features)
Input data, of which specified subsets are used to fit the transformers.
- yarray-like of shape (n_samples,…), default=None
Targets for supervised learning.
- Returns
- selfColumnTransformer
This estimator
- fit_transform(X, y=None) SparseCumlArray [source]#
Fit all transformers, transform the data and concatenate results.
- Parameters
- X{array-like, dataframe} of shape (n_samples, n_features)
Input data, of which specified subsets are used to fit the transformers.
- yarray-like of shape (n_samples,), default=None
Targets for supervised learning.
- Returns
- X_t{array-like, sparse matrix} of shape (n_samples, sum_n_components)
hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.
- get_feature_names()[source]#
Get feature names from all transformers.
- Returns
- feature_nameslist of strings
Names of the features produced by transform.
- get_params(deep=True)[source]#
Get parameters for this estimator.
Returns the parameters given in the constructor as well as the estimators contained within the transformers of the ColumnTransformer.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
- property named_transformers_#
Access the fitted transformer by name.
Read-only attribute to access any transformer by given name. Keys are transformer names and values are the fitted transformer objects.
- set_params(**kwargs)[source]#
Set the parameters of this estimator.
Valid parameter keys can be listed with get_params(). Note that you can directly set the parameters of the estimators contained in transformers of ColumnTransformer.
- Returns
- self
- transform(X) SparseCumlArray [source]#
Transform X separately by each transformer, concatenate results.
- Parameters
- X{array-like, dataframe} of shape (n_samples, n_features)
The data to be transformed by subset.
- Returns
- X_t{array-like, sparse matrix} of shape (n_samples, sum_n_components)
hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.
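The "transform each column subset separately, then hstack" behavior can be sketched with plain lists. This is an illustration only: column_transform and l1_normalize are hypothetical helpers, transformers are modeled as plain functions, and remainder, sparse output and weights are not handled.

```python
def l1_normalize(rows):
    """Scale each row so that its absolute values sum to 1."""
    out = []
    for row in rows:
        norm = sum(abs(v) for v in row) or 1.0  # leave all-zero rows alone
        out.append([v / norm for v in row])
    return out


def column_transform(rows, transformers):
    """Apply each (name, func, columns) to its column subset, then hstack."""
    blocks = []
    for _name, func, columns in transformers:
        subset = [[row[j] for j in columns] for row in rows]
        blocks.append(func(subset))
    # Concatenate the per-transformer blocks row by row, in list order.
    return [sum((block[i] for block in blocks), [])
            for i in range(len(rows))]


X = [[0., 1., 2., 2.],
     [1., 1., 0., 1.]]
result = column_transform(X, [("norm1", l1_normalize, [0, 1]),
                              ("norm2", l1_normalize, [2, 3])])
print(result)
```

The input matches the ColumnTransformer example, so the first two and last two elements of each row are normalized independently.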
- class cuml.compose.make_column_selector(pattern=None, *, dtype_include=None, dtype_exclude=None)[source]#
Create a callable to select columns to be used with ColumnTransformer.
make_column_selector() can select columns based on datatype or the columns name with a regex. When using multiple selection criteria, all criteria must match for a column to be selected.
- Parameters
- patternstr, default=None
Names of columns containing this regex pattern will be included. If None, columns will not be selected based on pattern.
- dtype_includecolumn dtype or list of column dtypes, default=None
A selection of dtypes to include. For more details, see pandas.DataFrame.select_dtypes().
- dtype_excludecolumn dtype or list of column dtypes, default=None
A selection of dtypes to exclude. For more details, see pandas.DataFrame.select_dtypes().
- Returns
- selectorcallable
Callable for column selection to be used by a ColumnTransformer.
See also
ColumnTransformer
Class that allows combining the outputs of multiple transformer objects used on column subsets of the data into a single feature space.
Examples
>>> from cuml.preprocessing import StandardScaler, OneHotEncoder
>>> from cuml.compose import make_column_transformer
>>> from cuml.compose import make_column_selector
>>> import cupy as cp
>>> import cudf
>>> X = cudf.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw'],
...                     'rating': [5, 3, 4, 5]})
>>> ct = make_column_transformer(
...       (StandardScaler(),
...        make_column_selector(dtype_include=cp.number)),  # rating
...       (OneHotEncoder(),
...        make_column_selector(dtype_include=object)))  # city
>>> ct.fit_transform(X)
array([[ 0.90453403,  1.        ,  0.        ,  0.        ],
       [-1.50755672,  1.        ,  0.        ,  0.        ],
       [-0.30151134,  0.        ,  1.        ,  0.        ],
       [ 0.90453403,  0.        ,  0.        ,  1.        ]])
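The rule that all selection criteria must match can be sketched with a closure. In this illustration a DataFrame is replaced by a plain dict mapping column names to dtype names, which is an assumption made to keep the example self-contained; make_selector is a hypothetical stand-in and not the cuML callable.

```python
import re


def make_selector(pattern=None, dtype_include=None):
    """Return a callable that picks column names from a {name: dtype} dict."""
    def selector(columns):
        selected = []
        for name, dtype in columns.items():
            # Every criterion that is set must match for the column to pass.
            if pattern is not None and not re.search(pattern, name):
                continue
            if dtype_include is not None and dtype not in dtype_include:
                continue
            selected.append(name)
        return selected
    return selector


columns = {"city": "object", "rating": "int64", "rating_count": "int64"}
print(make_selector(pattern="rating")(columns))
print(make_selector(dtype_include=["object"])(columns))
print(make_selector(pattern="count", dtype_include=["int64"])(columns))
```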
- cuml.compose.make_column_transformer(*transformers, remainder='drop', sparse_threshold=0.3, n_jobs=None, verbose=False)[source]#
Construct a ColumnTransformer from the given transformers.
This is a shorthand for the ColumnTransformer constructor; it does not require, and does not permit, naming the transformers. Instead, they will be given names automatically based on their types. It also does not allow weighting with transformer_weights.
- Parameters
- *transformerstuples
Tuples of the form (transformer, columns) specifying the transformer objects to be applied to subsets of the data:
- transformer{‘drop’, ‘passthrough’} or estimator
Estimator must support fit and transform. Special-cased strings ‘drop’ and ‘passthrough’ are accepted as well, to indicate to drop the columns or to pass them through untransformed, respectively.
- columnsstr, array-like of str, int, array-like of int, slice, array-like of bool or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.
- remainder{‘drop’, ‘passthrough’} or estimator, default=’drop’
By default, only the specified columns in transformers are transformed and combined in the output, and the non-specified columns are dropped (default of 'drop'). By specifying remainder='passthrough', all remaining columns that were not specified in transformers will be automatically passed through. This subset of columns is concatenated with the output of the transformers. By setting remainder to be an estimator, the remaining non-specified columns will use the remainder estimator. The estimator must support fit and transform.
- sparse_thresholdfloat, default=0.3
If the transformed output consists of a mix of sparse and dense data, it will be stacked as a sparse matrix if the density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all sparse or all dense data, the stacked result will be sparse or dense, respectively, and this keyword will be ignored.
- n_jobsint, default=None
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
- verbosebool, default=False
If True, the time elapsed while fitting each transformer will be printed as it is completed.
- Returns
- ctColumnTransformer
See also
ColumnTransformer
Class that allows combining the outputs of multiple transformer objects used on column subsets of the data into a single feature space.
Examples
>>> from cuml.preprocessing import StandardScaler, OneHotEncoder
>>> from cuml.compose import make_column_transformer
>>> make_column_transformer(
...     (StandardScaler(), ['numerical_column']),
...     (OneHotEncoder(), ['categorical_column']))
ColumnTransformer(transformers=[('standardscaler', StandardScaler(...),
                                 ['numerical_column']),
                                ('onehotencoder', OneHotEncoder(...),
                                 ['categorical_column'])])
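The automatic names in the output above ('standardscaler', 'onehotencoder') are the lowercased class names of the transformers. A minimal sketch of that naming step follows; the two classes are empty hypothetical stand-ins rather than the cuML estimators, and the numeric suffix used for duplicate names is an assumption modeled on scikit-learn's behavior.

```python
from collections import Counter


class StandardScaler:  # hypothetical stand-in, not the cuML class
    pass


class OneHotEncoder:  # hypothetical stand-in, not the cuML class
    pass


def auto_names(estimators):
    """Name each estimator after its lowercased class name.

    Duplicate names get a numeric suffix so that every name is unique;
    the exact suffix scheme here is an assumption.
    """
    base = [type(est).__name__.lower() for est in estimators]
    totals = Counter(base)
    seen = Counter()
    names = []
    for name in base:
        if totals[name] > 1:
            seen[name] += 1
            names.append(f"{name}-{seen[name]}")
        else:
            names.append(name)
    return names


print(auto_names([StandardScaler(), OneHotEncoder()]))
print(auto_names([StandardScaler(), StandardScaler()]))
```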
Text Preprocessing (Single-GPU)#
- class cuml.preprocessing.text.stem.PorterStemmer(mode='NLTK_EXTENSIONS')[source]#
A word stemmer based on the Porter stemming algorithm.
Porter, M. “An algorithm for suffix stripping.” Program 14.3 (1980): 130-137.
See http://www.tartarus.org/~martin/PorterStemmer/ for the homepage of the algorithm.
Martin Porter has endorsed several modifications to the Porter algorithm since writing his original paper, and those extensions are included in the implementations on his website. Additionally, others have proposed further improvements to the algorithm, including NLTK contributors. Only the following mode is currently supported:
PorterStemmer.NLTK_EXTENSIONS
Implementation that includes further improvements devised by NLTK contributors or taken from other modified implementations found on the web.
- Parameters
- modestr, default=’NLTK_EXTENSIONS’
Stemming mode. Only ’NLTK_EXTENSIONS’ is currently supported.
Examples
>>> import cudf
>>> from cuml.preprocessing.text.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> word_str_ser = cudf.Series(['revival','singing','adjustable'])
>>> print(stemmer.stem(word_str_ser))
0     reviv
1      sing
2    adjust
dtype: object
Methods
stem(word_str_ser): Stem words using the Porter stemmer
Feature and Label Encoding (Dask-based Multi-GPU)#
- class cuml.dask.preprocessing.LabelBinarizer(*, client=None, **kwargs)[source]#
A distributed version of LabelBinarizer for one-hot encoding a collection of labels.
Examples
Create an array with labels and dummy encode them
>>> import cupy as cp
>>> import cupyx
>>> from cuml.dask.preprocessing import LabelBinarizer
>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client
>>> import dask
>>> cluster = LocalCUDACluster()
>>> client = Client(cluster)
>>> labels = cp.asarray([0, 5, 10, 7, 2, 4, 1, 0, 0, 4, 3, 2, 1],
...                     dtype=cp.int32)
>>> labels = dask.array.from_array(labels)
>>> lb = LabelBinarizer()
>>> encoded = lb.fit_transform(labels)
>>> print(encoded.compute())
[[1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 1 0]
 [0 0 1 0 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 1 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 0 0 1 0 0 0 0]
 [0 0 1 0 0 0 0 0]
 [0 1 0 0 0 0 0 0]]
>>> decoded = lb.inverse_transform(encoded)
>>> print(decoded.compute())
[ 0  5 10  7  2  4  1  0  0  4  3  2  1]
>>> client.close()
>>> cluster.close()
Methods
fit(y): Fit label binarizer
fit_transform(y): Fit the label encoder and return transformed labels
inverse_transform(y[, threshold]): Invert a set of encoded labels back to original labels
transform(y): Transform and return encoded labels
- fit(y)[source]#
Fit label binarizer
- Parameters
- yDask.Array of shape [n_samples,] or [n_samples, n_classes], chunked by row
Target values. A 2-d matrix should contain only 0 and 1 and represents multilabel classification.
- Returns
- selfreturns an instance of self.
- fit_transform(y)[source]#
Fit the label encoder and return transformed labels
- Parameters
- yDask.Array of shape [n_samples,] or [n_samples, n_classes]
Target values. A 2-d matrix should contain only 0 and 1 and represents multilabel classification.
- Returns
- arrDask.Array backed by CuPy arrays containing encoded labels
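To make the encoding in the LabelBinarizer example concrete, the following is a minimal single-process sketch in plain Python. It is not the distributed cuML implementation; the function names `binarize` and `invert` are hypothetical, and it assumes that columns follow the sorted unique labels, which is consistent with the example output shown for this class.

```python
def binarize(labels):
    """One-hot encode labels; columns follow the sorted unique classes."""
    classes = sorted(set(labels))
    column = {c: i for i, c in enumerate(classes)}
    encoded = [[1 if column[y] == j else 0 for j in range(len(classes))]
               for y in labels]
    return classes, encoded


def invert(classes, encoded):
    """Map each one-hot row back to the class of its largest entry."""
    return [classes[row.index(max(row))] for row in encoded]


labels = [0, 5, 10, 7, 2, 4, 1, 0, 0, 4, 3, 2, 1]
classes, encoded = binarize(labels)
print(classes)                    # [0, 1, 2, 3, 4, 5, 7, 10]
print(encoded[1])                 # [0, 0, 0, 0, 0, 1, 0, 0]
print(invert(classes, encoded))   # the original labels
```

The second row encodes the label 5, which is the sixth of the eight classes, matching the second row of the matrix printed in the example.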
- class cuml.dask.preprocessing.LabelEncoder.LabelEncoder(*, client=None, verbose=False, **kwargs)[source]#
A cuDF-based implementation of ordinal label encoding
- Parameters
- handle_unknown{‘error’, ‘ignore’}, default=’error’
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform or inverse transform, the resulting encoding will be null.
Examples
Converting a categorical implementation to a numerical one
>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client
>>> import cudf
>>> import dask_cudf
>>> from cuml.dask.preprocessing import LabelEncoder
>>> import pandas as pd
>>> pd.set_option('display.max_colwidth', 2000)
>>> cluster = LocalCUDACluster(threads_per_worker=1)
>>> client = Client(cluster)
>>> df = cudf.DataFrame({'num_col':[10, 20, 30, 30, 30],
...                      'cat_col':['a','b','c','a','a']})
>>> ddf = dask_cudf.from_cudf(df, npartitions=2)
>>> # There are two functionally equivalent ways to do this
>>> le = LabelEncoder()
>>> le.fit(ddf.cat_col)  # le = le.fit(data.category) also works
<cuml.dask.preprocessing.LabelEncoder.LabelEncoder object at 0x...>
>>> encoded = le.transform(ddf.cat_col)
>>> print(encoded.compute())
0    0
1    1
2    2
3    0
4    0
dtype: uint8
>>> # This method is preferred
>>> le = LabelEncoder()
>>> encoded = le.fit_transform(ddf.cat_col)
>>> print(encoded.compute())
0    0
1    1
2    2
3    0
4    0
dtype: uint8
>>> # We can assign this to a new column
>>> ddf = ddf.assign(encoded=encoded.values)
>>> print(ddf.compute())
   num_col cat_col  encoded
0       10       a        0
1       20       b        1
2       30       c        2
3       30       a        0
4       30       a        0
>>> # We can also encode more data
>>> test_data = cudf.Series(['c', 'a'])
>>> encoded = le.transform(dask_cudf.from_cudf(test_data,
...                                            npartitions=2))
>>> print(encoded.compute())
0    2
1    0
dtype: uint8
>>> # After train, ordinal label can be inverse_transform() back to
>>> # string labels
>>> ord_label = cudf.Series([0, 0, 1, 2, 1])
>>> ord_label = le.inverse_transform(
...     dask_cudf.from_cudf(ord_label,npartitions=2))
>>> print(ord_label.compute())
0    a
1    a
2    b
3    c
4    b
dtype: object
>>> client.close()
>>> cluster.close()
Methods
fit(y): Fit a LabelEncoder instance to a set of categories
fit_transform(y[, delayed]): Simultaneously fit and transform an input
inverse_transform(y[, delayed]): Convert the data back to the original representation.
transform(y[, delayed]): Transform an input into its categorical keys.
- fit(y)[source]#
Fit a LabelEncoder instance to a set of categories
- Parameters
- ydask_cudf.Series
Series containing the categories to be encoded. Its elements may or may not be unique
- Returns
- selfLabelEncoder
A fitted instance of itself to allow method chaining
Notes
Number of unique classes will be collected at the client. It’ll consume memory proportional to the number of unique classes.
- fit_transform(y, delayed=True)[source]#
Simultaneously fit and transform an input
This is functionally equivalent to (but faster than) LabelEncoder().fit(y).transform(y)
- inverse_transform(y, delayed=True)[source]#
Convert the data back to the original representation. In case unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category.
- Parameters
- ydask_cudf.Series
The ordinal (integer) labels to convert back to the original categories.
- delayedbool (default = True)
Whether to execute as a delayed task or eager.
- Returns
- X_trdask_cudf.Series
Distributed object containing the inverse transformed array.
- transform(y, delayed=True)[source]#
Transform an input into its categorical keys.
This is intended for use with small inputs relative to the size of the dataset. For fitting and transforming an entire dataset, prefer fit_transform.
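The fit, transform, and inverse_transform behaviour described for LabelEncoder, including handle_unknown, can be sketched on the CPU as follows. This is an illustration only: `TinyLabelEncoder` is a hypothetical class, not the cuML one, and it assumes categories are numbered in sorted order, which matches the mapping a, b, c to 0, 1, 2 in the example for this class.

```python
class TinyLabelEncoder:
    """Hypothetical CPU-only stand-in for illustration; not the cuML class."""

    def __init__(self, handle_unknown="error"):
        self.handle_unknown = handle_unknown

    def fit(self, y):
        # Assumption: categories are numbered in sorted order.
        self.classes_ = sorted(set(y))
        self._code = {c: i for i, c in enumerate(self.classes_)}
        return self  # returning self allows method chaining

    def transform(self, y):
        if self.handle_unknown == "error":
            unknown = [v for v in y if v not in self._code]
            if unknown:
                raise KeyError(f"unknown categories: {unknown}")
        # With 'ignore', unknown categories encode to None (null).
        return [self._code.get(v) for v in y]

    def inverse_transform(self, codes):
        valid = range(len(self.classes_))
        if self.handle_unknown == "error" and any(c not in valid for c in codes):
            raise KeyError("unknown ordinal label")
        return [self.classes_[c] if c in valid else None for c in codes]


le = TinyLabelEncoder().fit(['a', 'b', 'c', 'a', 'a'])
print(le.transform(['c', 'a']))               # [2, 0]
print(le.inverse_transform([0, 0, 1, 2, 1]))  # ['a', 'a', 'b', 'c', 'b']

lenient = TinyLabelEncoder(handle_unknown="ignore").fit(['a', 'b'])
print(lenient.transform(['b', 'z']))          # [1, None]
```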
- class cuml.dask.preprocessing.OneHotEncoder(*, client=None, verbose=False, **kwargs)[source]#
Encode categorical features as a one-hot numeric array. The input to this transformer should be a dask_cuDF.DataFrame or cupy dask.Array, denoting the values taken on by categorical features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter). By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.
- Parameters
- categories‘auto’, cupy.ndarray or cudf.DataFrame, default=’auto’
Categories (unique values) per feature. All categories are expected to fit on one GPU.
‘auto’ : Determine categories automatically from the training data.
DataFrame/ndarray : categories[col] holds the categories expected in the feature col.
- drop‘first’, None or a dict, default=None
Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.
None : retain all features (the default).
‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.
Dict : drop[col] is the category in feature col that should be dropped.
- sparsebool, default=False
This feature is deactivated and raises an exception when set to True, because sparse matrices are not yet fully supported by CuPy, which causes incorrect values when computing one-hot encodings. See cupy/cupy#3223
- dtypenumber type, default=np.float
Desired datatype of transform’s output.
- handle_unknown{‘error’, ‘ignore’}, default=’error’
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
Methods
fit(X): Fit a multi-node multi-gpu OneHotEncoder to X.
inverse_transform(X[, delayed]): Convert the data back to the original representation.
transform(X[, delayed]): Transform X using one-hot encoding.
- fit(X)[source]#
Fit a multi-node multi-gpu OneHotEncoder to X.
- Parameters
- XDask cuDF DataFrame or CuPy backed Dask Array
The data to determine the categories of each feature.
- Returns
- self
- inverse_transform(X, delayed=True)[source]#
Convert the data back to the original representation. In case unknown categories are encountered (all zeros in the one-hot encoding),
None
is used to represent this category.
- Parameters
- XCuPy backed Dask Array, shape [n_samples, n_encoded_features]
The transformed data.
- delayedbool (default = True)
Whether to execute as a delayed task or eager.
- Returns
- X_trDask cuDF DataFrame or CuPy backed Dask Array
Distributed object containing the inverse transformed array.
- transform(X, delayed=True)[source]#
Transform X using one-hot encoding.
- Parameters
- XDask cuDF DataFrame or CuPy backed Dask Array
The data to encode.
- delayedbool (default = True)
Whether to execute as a delayed task or eager.
- Returns
- outDask cuDF DataFrame or CuPy backed Dask Array
Distributed object containing the transformed input.
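The effect of the drop and handle_unknown parameters on a single feature can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the cuML implementation: the function `one_hot` and its `drop_first` flag are hypothetical stand-ins for drop='first', and only one feature column is handled.

```python
def one_hot(values, categories, drop_first=False, handle_unknown="error"):
    """One-hot encode a single feature column (hypothetical helper)."""
    # drop='first' removes the column of the first category.
    kept = categories[1:] if drop_first else categories
    rows = []
    for value in values:
        if value not in categories:
            if handle_unknown == "error":
                raise ValueError(f"unknown category: {value!r}")
            # 'ignore': an unknown category becomes an all-zero row.
            rows.append([0.0] * len(kept))
        else:
            rows.append([1.0 if value == c else 0.0 for c in kept])
    return rows


categories = ['a', 'b', 'c']
print(one_hot(['a', 'c', 'd'], categories, handle_unknown="ignore"))
# [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
print(one_hot(['a', 'b'], categories, drop_first=True))
# [[0.0, 0.0], [1.0, 0.0]]
```

Note that after dropping the first category, that category is itself encoded as an all-zero row, so the remaining columns are no longer perfectly collinear with a constant term.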
Feature Extraction (Single-GPU)#
- class cuml.feature_extraction.text.CountVectorizer(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float32'>, delimiter=' ')[source]#
Convert a collection of text documents to a matrix of token counts
If you do not provide an a-priori dictionary then the number of features will be equal to the vocabulary size found by analyzing the data.
- Parameters
- lowercaseboolean, True by default
Convert all characters to lowercase before tokenizing.
- preprocessorcallable or None (default)
Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
- stop_wordsstring {‘english’}, list, or None (default)
If ‘english’, a built-in stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the input documents. If None, no stop words will be used. max_df can be set to a value to automatically detect and filter stop words based on intra corpus document frequency of terms.
- ngram_rangetuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
- analyzerstring, {‘word’, ‘char’, ‘char_wb’}
Whether the feature should be made of word n-gram or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
- max_dffloat in range [0.0, 1.0] or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
- min_dffloat in range [0.0, 1.0] or int, default=1
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
- max_featuresint or None, default=None
If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.
- vocabularycudf.Series, optional
If not given, a vocabulary is determined from the input documents.
- binaryboolean, default=False
If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
- dtypetype, optional
Type of the matrix returned by fit_transform() or transform().
- delimiterstr, whitespace by default
String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.
- Attributes
- vocabulary_cudf.Series[str]
Array mapping from feature integer indices to feature name.
- stop_words_cudf.Series[str]
Terms that were ignored because they either:
- occurred in too many documents (max_df)
- occurred in too few documents (min_df)
- were cut off by feature selection (max_features).
This is only available if no vocabulary was given.
Methods
fit(raw_documents[, y]): Build a vocabulary of all tokens in the raw documents.
fit_transform(raw_documents[, y]): Build the vocabulary and return document-term matrix.
get_feature_names(): Array mapping from feature integer indices to feature name.
inverse_transform(X): Return terms per document with nonzero entries in X.
transform(raw_documents): Transform documents to document-term matrix.
- fit(raw_documents, y=None)[source]#
Build a vocabulary of all tokens in the raw documents.
- Parameters
- raw_documentscudf.Series or pd.Series
A Series of string documents
- yNone
Ignored.
- Returns
- self
- fit_transform(raw_documents, y=None)[source]#
Build the vocabulary and return document-term matrix.
Equivalent to self.fit(X).transform(X) but preprocesses X only once.
- Parameters
- raw_documentscudf.Series or pd.Series
A Series of string documents
- yNone
Ignored.
- Returns
- Xcupy csr array of shape (n_samples, n_features)
Document-term matrix.
- get_feature_names()[source]#
Array mapping from feature integer indices to feature name.
- Returns
- feature_namesSeries
A list of feature names.
- inverse_transform(X)[source]#
Return terms per document with nonzero entries in X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Document-term matrix.
- Returns
- X_invlist of cudf.Series of shape (n_samples,)
List of Series of terms.
- transform(raw_documents)[source]#
Transform documents to document-term matrix.
Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.
- Parameters
- raw_documentscudf.Series or pd.Series
A Series of string documents
- Returns
- Xcupy csr array of shape (n_samples, n_features)
Document-term matrix.
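How ngram_range, min_df, and max_df interact when CountVectorizer builds a vocabulary can be sketched in plain Python. The sketch below is not cuML's GPU implementation: the helper names are hypothetical, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import Counter


def word_ngrams(text, ngram_range=(1, 1)):
    """Return word n-grams for every n with min_n <= n <= max_n."""
    tokens = text.lower().split()  # simplified tokenizer (assumption)
    min_n, max_n = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def build_vocabulary(docs, ngram_range=(1, 1), min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    # Floats are proportions of documents; integers are absolute counts.
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(word_ngrams(doc, ngram_range)))
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)


docs = ["the cat sat", "the dog sat", "the bird flew"]
print(word_ngrams(docs[0], ngram_range=(1, 2)))
# ['the', 'cat', 'sat', 'the cat', 'cat sat']
print(build_vocabulary(docs, min_df=2, max_df=0.9))
# ['sat']
```

In the second call, "the" appears in all 3 documents, which is strictly higher than 0.9 * 3, so it is treated as a corpus-specific stop word, while terms seen in only one document fall below min_df=2.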
- class cuml.feature_extraction.text.HashingVectorizer(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, dtype=<class 'numpy.float32'>, delimiter=' ')[source]#
Convert a collection of text documents to a matrix of token occurrences
It turns a collection of text documents into a cupyx.scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm=’l1’ or projected on the euclidean unit sphere if norm=’l2’.
This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index mapping.
This strategy has several advantages:
- it uses very little memory and scales to large datasets, as there is no need to store a vocabulary dictionary in memory; this matters even more on GPUs, which are often memory constrained
- it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
- it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.
There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):
- there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
- there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
- no IDF weighting as this would render the transformer stateful.
The hash function employed is the signed 32-bit version of Murmurhash3.
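The hashing trick described above can be sketched in a few lines of plain Python. This is an illustration of the idea only: it swaps in the first four bytes of an MD5 digest as a signed 32-bit hash instead of MurmurHash3, so the feature indices will not match cuML's, and the function name `hashed_counts` is hypothetical.

```python
import hashlib


def hashed_counts(doc, n_features=16, alternate_sign=True):
    """Map a document to a fixed-length count vector without a vocabulary."""
    row = [0] * n_features
    for token in doc.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Stand-in for MurmurHash3: read 4 bytes as a signed 32-bit integer.
        h = int.from_bytes(digest[:4], "little", signed=True)
        index = abs(h) % n_features
        # The sign of the hash decides whether the count is added or subtracted.
        sign = -1 if (alternate_sign and h < 0) else 1
        row[index] += sign
    return row


row = hashed_counts("this is the first document", alternate_sign=False)
print(len(row), sum(row))  # 16 5
```

Because nothing is learned from the data, the same document always maps to the same row, which is why the transformer can be used in streaming or parallel pipelines. Distinct tokens may share an index; a larger n_features makes such collisions less likely.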
- Parameters
- lowercasebool, default=True
Convert all characters to lowercase before tokenizing.
- preprocessorcallable or None (default)
Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
- stop_wordsstring {‘english’}, list, default=None
If ‘english’, a built-in stop word list for English is used. There are several known issues with ‘english’ and you should consider an alternative. If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.
- ngram_rangetuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
- analyzerstring, {‘word’, ‘char’, ‘char_wb’}
Whether the feature should be made of word n-gram or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
- n_featuresint, default=(2 ** 20)
The number of features (columns) in the output matrices. Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners.
- binarybool, default=False.
If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
- norm{‘l1’, ‘l2’}, default=’l2’
Norm used to normalize term vectors. None for no normalization.
- alternate_signbool, default=True
When True, an alternating sign is added to the features as to approximately conserve the inner product in the hashed space even for small n_features. This approach is similar to sparse random projection.
- dtypetype, optional
Type of the matrix returned by fit_transform() or transform().
- delimiterstr, whitespace by default
String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.
Examples
>>> from cuml.feature_extraction.text import HashingVectorizer
>>> import pandas as pd
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = HashingVectorizer(n_features=2**4)
>>> X = vectorizer.fit_transform(pd.Series(corpus))
>>> print(X.shape)
(4, 16)
Methods
fit(X[, y]): This method only checks the input type and the model parameter.
fit_transform(X[, y]): Transform a sequence of documents to a document-term matrix.
partial_fit(X[, y]): Does nothing: this transformer is stateless. This method is just there to mark the fact that this transformer can work in a streaming setup.
transform(raw_documents): Transform documents to document-term matrix.
- fit(X, y=None)[source]#
This method only checks the input type and the model parameter. It does not do anything meaningful as this transformer is stateless
- Parameters
- Xcudf.Series or pd.Series
A Series of string documents
- fit_transform(X, y=None)[source]#
Transform a sequence of documents to a document-term matrix.
- Parameters
- Xiterable over raw text documents, length = n_samples
Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed.
- yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- Returns
- Xsparse CuPy CSR matrix of shape (n_samples, n_features)
Document-term matrix.
- partial_fit(X, y=None)[source]#
Does nothing: this transformer is stateless. This method is just there to mark the fact that this transformer can work in a streaming setup.
- Parameters
- Xcudf.Series
A Series of string documents
- transform(raw_documents)[source]#
Transform documents to document-term matrix.
Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.
- Parameters
- raw_documentscudf.Series or pd.Series
A Series of string documents
- Returns
- Xsparse CuPy CSR matrix of shape (n_samples, n_features)
Document-term matrix.
- class cuml.feature_extraction.text.TfidfVectorizer(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float32'>, delimiter=' ', norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)[source]#
Convert a collection of raw documents to a matrix of TF-IDF features.
Equivalent to CountVectorizer followed by TfidfTransformer.
- Parameters
- lowercaseboolean, True by default
Convert all characters to lowercase before tokenizing.
- preprocessorcallable or None (default)
Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
- stop_wordsstring {‘english’}, list, or None (default)
If ‘english’, a built-in stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the input documents. If None, no stop words will be used. max_df can be set to a value to automatically detect and filter stop words based on intra corpus document frequency of terms.
- ngram_rangetuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
- analyzerstring, {‘word’, ‘char’, ‘char_wb’}, default=’word’
Whether the feature should be made of word n-gram or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
- max_dffloat in range [0.0, 1.0] or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
- min_dffloat in range [0.0, 1.0] or int, default=1
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
- max_featuresint or None, default=None
If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.
- vocabularycudf.Series, optional
If not given, a vocabulary is determined from the input documents.
- binaryboolean, default=False
If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
- dtypetype, optional
Type of the matrix returned by fit_transform() or transform().
- delimiterstr, whitespace by default
String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.
- norm{‘l1’, ‘l2’}, default=’l2’
Each output row will have unit norm, either:
- ‘l2’: Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.
- ‘l1’: Sum of absolute values of vector elements is 1.
- use_idfbool, default=True
Enable inverse-document-frequency reweighting.
- smooth_idfbool, default=True
Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
- sublinear_tfbool, default=False
Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
Notes
The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling.
This class is largely based on scikit-learn 0.23.1’s TfidfVectorizer code, which is provided under the BSD-3 license.
- Attributes
- idf_array of shape (n_features)
The inverse document frequency (IDF) vector; only defined if use_idf is True.
- vocabulary_cudf.Series[str]
Array mapping from feature integer indices to feature name.
- stop_words_cudf.Series[str]
Terms that were ignored because they either:
- occurred in too many documents (max_df)
- occurred in too few documents (min_df)
- were cut off by feature selection (max_features).
This is only available if no vocabulary was given.
Methods
fit(raw_documents): Learn vocabulary and idf from training set.
fit_transform(raw_documents[, y]): Learn vocabulary and idf, return document-term matrix.
get_feature_names(): Array mapping from feature integer indices to feature name.
transform(raw_documents): Transform documents to document-term matrix.
- fit(raw_documents)[source]#
Learn vocabulary and idf from training set.
- Parameters
- raw_documentscudf.Series or pd.Series
A Series of string documents
- Returns
- selfobject
Fitted vectorizer.
- fit_transform(raw_documents, y=None)[source]#
Learn vocabulary and idf, return document-term matrix. This is equivalent to fit followed by transform, but more efficiently implemented.
- Parameters
- raw_documentscudf.Series or pd.Series
A Series of string documents
- yNone
Ignored.
- Returns
- Xcupy csr array of shape (n_samples, n_features)
Tf-idf-weighted document-term matrix.
- get_feature_names()[source]#
Array mapping from feature integer indices to feature name.
- Returns
- feature_namesSeries
A list of feature names.
- transform(raw_documents)[source]#
Transform documents to document-term matrix. Uses the vocabulary and document frequencies (df) learned by fit (or fit_transform).
- Parameters
- raw_documentscudf.Series or pd.Series
A Series of string documents
- Returns
- Xcupy csr array of shape (n_samples, n_features)
Tf-idf-weighted document-term matrix.
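The weighting controlled by use_idf, smooth_idf, sublinear_tf, and norm can be sketched on a small count matrix in plain Python. The IDF formula used here, ln((1 + n) / (1 + df)) + 1 for the smoothed case, is the one used by scikit-learn, on which the notes above say this class is based; treating it as cuML's exact formula is an assumption. The function name `tfidf_rows` is hypothetical.

```python
import math


def tfidf_rows(counts, smooth_idf=True, sublinear_tf=False):
    """Return (idf, l2-normalized tf-idf rows) for a dense count matrix."""
    n_docs, n_terms = len(counts), len(counts[0])
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothing acts as if one extra document contained every term once,
    # which also prevents division by zero for unseen terms.
    extra = 1 if smooth_idf else 0
    idf = [math.log((n_docs + extra) / (df[j] + extra)) + 1.0
           for j in range(n_terms)]
    rows = []
    for row in counts:
        tf = [1.0 + math.log(c) if (sublinear_tf and c > 0) else float(c)
              for c in row]
        weighted = [t * w for t, w in zip(tf, idf)]
        norm = math.sqrt(sum(x * x for x in weighted))
        rows.append([x / norm if norm else 0.0 for x in weighted])
    return idf, rows


counts = [[2, 1, 0],
          [1, 0, 1],
          [1, 0, 0]]
idf, rows = tfidf_rows(counts)
print(idf[0])  # 1.0, because the first term appears in every document
print(round(sum(x * x for x in rows[0]), 6))  # 1.0 after l2 normalization
```

Terms that occur in fewer documents receive a larger IDF weight, and the final l2 normalization means the dot product of two rows is their cosine similarity.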
Feature Extraction (Dask-based Multi-GPU)#
- class cuml.dask.feature_extraction.text.TfidfTransformer(*, client=None, verbose=False, **kwargs)[source]#
Distributed TF-IDF transformer
Examples
>>> import cupy as cp
>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client
>>> from cuml.dask.common import to_sparse_dask_array
>>> from cuml.dask.naive_bayes import MultinomialNB
>>> import dask
>>> from cuml.dask.feature_extraction.text import TfidfTransformer
>>> # Create a local CUDA cluster
>>> cluster = LocalCUDACluster()
>>> client = Client(cluster)
>>> # Load corpus
>>> twenty_train = fetch_20newsgroups(subset='train',
...                                   shuffle=True, random_state=42)
>>> cv = CountVectorizer()
>>> xformed = cv.fit_transform(twenty_train.data).astype(cp.float32)
>>> X = to_sparse_dask_array(xformed, client)
>>> y = dask.array.from_array(twenty_train.target, asarray=False,
...                           fancy=False).astype(cp.int32)
>>> multi_gpu_transformer = TfidfTransformer()
>>> X_transformed = multi_gpu_transformer.fit_transform(X)
>>> X_transformed.compute_chunk_sizes()
dask.array<...>
>>> model = MultinomialNB()
>>> model.fit(X_transformed, y)
<cuml.dask.naive_bayes.naive_bayes.MultinomialNB object at 0x...>
>>> result = model.score(X_transformed, y)
>>> print(result)
array(0.93264981)
>>> client.close()
>>> cluster.close()
Methods
fit(X[, y]): Fit distributed TFIDF Transformer
fit_transform(X[, y]): Fit distributed TFIDFTransformer and then transform the given set of data samples.
transform(X[, y]): Use distributed TFIDFTransformer to transform the given set of data samples.
- fit(X, y=None)[source]#
Fit distributed TFIDF Transformer
- Parameters
- Xdask.Array with blocks containing dense or sparse cupy arrays
- Returns
- cuml.dask.feature_extraction.text.TfidfTransformer instance
Dataset Generation (Single-GPU)#
- random_state#
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.
- cuml.datasets.make_blobs(n_samples=100, n_features=2, centers=None, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None, return_centers=False, order='F', dtype='float32')[source]#
Generate isotropic Gaussian blobs for clustering.
- Parameters
- n_samplesint or array-like, optional (default=100)
If int, it is the total number of points equally divided among clusters. If array-like, each element of the sequence indicates the number of samples per cluster.
- n_featuresint, optional (default=2)
The number of features for each sample.
- centersint or array of shape [n_centers, n_features], optional (default=None)
The number of centers to generate, or the fixed center locations. If n_samples is an int and centers is None, 3 centers are generated. If n_samples is array-like, centers must be either None or an array of length equal to the length of n_samples.
- cluster_stdfloat or sequence of floats, optional (default=1.0)
The standard deviation of the clusters.
- center_boxpair of floats (min, max), optional (default=(-10.0, 10.0))
The bounding box for each cluster center when centers are generated at random.
- shuffleboolean, optional (default=True)
Shuffle the samples.
- random_stateint, RandomState instance, default=None
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.
- return_centersbool, optional (default=False)
If True, then return the centers of each cluster
- order: str, optional (default=’F’)
The order of the generated samples
- dtypestr, optional (default=’float32’)
Dtype of the generated samples
- Returns
- Xdevice array of shape [n_samples, n_features]
The generated samples.
- ydevice array of shape [n_samples]
The integer labels for cluster membership of each sample.
- centersdevice array, shape [n_centers, n_features]
The centers of each cluster. Only returned if return_centers=True.
See also
make_classification
a more intricate variant
Examples
>>> from sklearn.datasets import make_blobs
>>> X, y = make_blobs(n_samples=10, centers=3, n_features=2,
...                   random_state=0)
>>> print(X.shape)
(10, 2)
>>> y
array([0, 0, 1, 0, 2, 2, 2, 1, 1, 0])
>>> X, y = make_blobs(n_samples=[3, 3, 4], centers=None, n_features=2,
...                   random_state=0)
>>> print(X.shape)
(10, 2)
>>> y
array([0, 1, 2, 0, 2, 2, 2, 1, 1, 0])
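The two accepted forms of n_samples can be illustrated with a small pure-Python sketch. The helper name samples_per_center is hypothetical, and the rule for distributing a remainder (leftover points go to the first clusters) is an assumption modelled on scikit-learn's behaviour rather than taken from the cuML source. It is consistent with the label counts shown in the example above.

```python
def samples_per_center(n_samples, n_centers=3):
    """Return how many points each cluster receives (illustrative only)."""
    if isinstance(n_samples, int):
        base, extra = divmod(n_samples, n_centers)
        # Assumption: leftover points are given to the first `extra` clusters.
        return [base + (1 if i < extra else 0) for i in range(n_centers)]
    # Array-like input: each element is already the size of one cluster.
    return list(n_samples)


print(samples_per_center(10, 3))      # [4, 3, 3]
print(samples_per_center([3, 3, 4]))  # [3, 3, 4]
```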
- cuml.datasets.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None, order='F', dtype='float32', _centroids=None, _informative_covariance=None, _redundant_covariance=None, _repeated_indices=None)[source]#
Generate a random n-class classification problem.
This initially creates clusters of points normally distributed (std=1) about vertices of an n_informative-dimensional hypercube with sides of length 2*class_sep and assigns an equal number of clusters to each class. It introduces interdependence between these features and adds various types of further noise to the data.
Without shuffling, X horizontally stacks features in the following order: the primary n_informative features, followed by n_redundant linear combinations of the informative features, followed by n_repeated duplicates, drawn randomly with replacement from the informative and redundant features. The remaining features are filled with random noise. Thus, without shuffling, all useful features are contained in the columns X[:, :n_informative + n_redundant + n_repeated].
- Parameters
- n_samplesint, optional (default=100)
The number of samples.
- n_featuresint, optional (default=20)
The total number of features. These comprise n_informative informative features, n_redundant redundant features, n_repeated duplicated features and n_features-n_informative-n_redundant-n_repeated useless features drawn at random.
- n_informativeint, optional (default=2)
The number of informative features. Each class is composed of a number of gaussian clusters each located around the vertices of a hypercube in a subspace of dimension n_informative. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined within each cluster in order to add covariance. The clusters are then placed on the vertices of the hypercube.
- n_redundantint, optional (default=2)
The number of redundant features. These features are generated as random linear combinations of the informative features.
- n_repeatedint, optional (default=0)
The number of duplicated features, drawn randomly from the informative and the redundant features.
- n_classesint, optional (default=2)
The number of classes (or labels) of the classification problem.
- n_clusters_per_classint, optional (default=2)
The number of clusters per class.
- weightsarray-like of shape (n_classes,) or (n_classes - 1,), (default=None)
The proportions of samples assigned to each class. If None, then classes are balanced. Note that if len(weights) == n_classes - 1, then the last class weight is automatically inferred. More than n_samples samples may be returned if the sum of weights exceeds 1.
- flip_yfloat, optional (default=0.01)
The fraction of samples whose class is assigned randomly. Larger values introduce noise in the labels and make the classification task harder.
- class_sepfloat, optional (default=1.0)
The factor multiplying the hypercube size. Larger values spread out the clusters/classes and make the classification task easier.
- hypercubeboolean, optional (default=True)
If True, the clusters are put on the vertices of a hypercube. If False, the clusters are put on the vertices of a random polytope.
- shiftfloat, array of shape [n_features] or None, optional (default=0.0)
Shift features by the specified value. If None, then features are shifted by a random value drawn in [-class_sep, class_sep].
- scalefloat, array of shape [n_features] or None, optional (default=1.0)
Multiply features by the specified value. If None, then features are scaled by a random value drawn in [1, 100]. Note that scaling happens after shifting.
- shuffleboolean, optional (default=True)
Shuffle the samples and the features.
- random_stateint, RandomState instance or None (default)
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See Glossary.
- order: str, optional (default=’F’)
The order of the generated samples
- dtypestr, optional (default=’float32’)
Dtype of the generated samples
- _centroids: array of centroids of shape (n_clusters, n_informative)
- _informative_covariance: array for covariance between informative features of shape (n_clusters, n_informative, n_informative)
- _redundant_covariance: array for covariance between redundant features of shape (n_informative, n_redundant)
- _repeated_indices: array of indices for the repeated features of shape (n_repeated, )
- Returns
- Xdevice array of shape [n_samples, n_features]
The generated samples.
- ydevice array of shape [n_samples]
The integer labels for class membership of each sample.
Notes
The algorithm is adapted from Guyon [1] and was designed to generate the “Madelon” dataset.
How we optimized for GPUs:
Firstly, we generate X from a standard univariate normal instead of zeros. This saves memory, as we don’t need to generate univariates each time for each feature class (informative, repeated, etc.), while also providing the added speedup of generating a big matrix on the GPU.
We generate order=F by construction. We exploit the fact that X is generated from a univariate normal, and covariance is introduced with matrix multiplications. This means we can generate X as a 1D array and just reshape it to the desired order, which only updates the metadata and eliminates copies.
Lastly, we also shuffle by construction. Centroid indices are permuted for each sample, and then we construct the data for each centroid. This shuffle works for both order=C and order=F and eliminates any need for secondary copies.
References
[1] I. Guyon, “Design of experiments for the NIPS 2003 variable selection benchmark”, 2003.
Examples
>>> from cuml.datasets.classification import make_classification
>>> X, y = make_classification(n_samples=10, n_features=4,
...                            n_informative=2, n_classes=2,
...                            random_state=10)
>>> print(X)
[[-1.7974224 0.24425316 0.39062843 -0.38293394]
 [ 0.6358963 1.4161923 0.06970507 -0.16085647]
 [-0.22802866 -1.1827322 0.3525861 0.276615 ]
 [ 1.7308872 0.43080002 0.05048406 0.29837844]
 [-1.9465544 0.5704457 -0.8997551 -0.27898186]
 [ 1.0575483 -0.9171263 0.09529338 0.01173469]
 [ 0.7917619 -1.0638094 -0.17599393 -0.06420116]
 [-0.6686142 -0.13951421 -0.6074711 0.21645583]
 [-0.88968956 -0.914443 0.1302423 0.02924336]
 [-0.8817671 -0.84549576 0.1845096 0.02556021]]
>>> print(y)
[1 0 1 1 1 1 1 1 1 0]
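The unshuffled column order described for make_classification (informative, then redundant, then repeated, then noise) can be written out as index ranges. The sketch below is plain Python and does not call cuML. The helper name feature_layout is hypothetical and only restates the layout given in the description.

```python
def feature_layout(n_features=20, n_informative=2, n_redundant=2, n_repeated=0):
    """Map each feature group to its (start, stop) column range when shuffle=False."""
    n_useful = n_informative + n_redundant + n_repeated
    if n_useful > n_features:
        raise ValueError("informative + redundant + repeated exceeds n_features")
    # Cumulative boundaries between the four groups of columns.
    bounds = [0, n_informative, n_informative + n_redundant, n_useful, n_features]
    names = ["informative", "redundant", "repeated", "noise"]
    return {name: (bounds[i], bounds[i + 1]) for i, name in enumerate(names)}


layout = feature_layout()
print(layout["noise"])  # (4, 20): columns 4..19 hold random noise by default
```

An empty range such as (4, 4) means the group contributes no columns.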
- cuml.datasets.make_regression(n_samples=100, n_features=2, n_informative=2, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None, dtype='single', handle=None) Union[Tuple[CumlArray, CumlArray], Tuple[CumlArray, CumlArray, CumlArray]] [source]#
Generate a random regression problem.
See https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html
- Parameters
- n_samplesint, optional (default=100)
The number of samples.
- n_featuresint, optional (default=2)
The number of features.
- n_informativeint, optional (default=2)
The number of informative features, i.e., the number of features used to build the linear model used to generate the output.
- n_targetsint, optional (default=1)
The number of regression targets, i.e., the dimension of the y output vector associated with a sample. By default, the output is a scalar.
- biasfloat, optional (default=0.0)
The bias term in the underlying linear model.
- effective_rankint or None, optional (default=None)
- if not None:
The approximate number of singular vectors required to explain most of the input data by linear combinations. Using this kind of singular spectrum in the input allows the generator to reproduce the correlations often observed in practice.
- if None:
The input set is well conditioned, centered and gaussian with unit variance.
- tail_strengthfloat between 0.0 and 1.0, optional (default=0.5)
The relative importance of the fat noisy tail of the singular values profile if effective_rank is not None.
- noisefloat, optional (default=0.0)
The standard deviation of the gaussian noise applied to the output.
- shuffleboolean, optional (default=True)
Shuffle the samples and the features.
- coefboolean, optional (default=False)
If True, the coefficients of the underlying linear model are returned.
- random_stateint, RandomState instance or None (default)
Seed for the random number generator for dataset creation.
- dtype: string or numpy dtype (default: ‘single’)
Type of the data. Possible values: float32, float64, ‘single’, ‘float’ or ‘double’.
- handle: cuml.Handle
If it is None, a new one is created just for this function call
- Returns
- outdevice array of shape [n_samples, n_features]
The input samples.
- valuesdevice array of shape [n_samples, n_targets]
The output values.
- coefdevice array of shape [n_features, n_targets], optional
The coefficient of the underlying linear model. It is returned only if coef is True.
Examples
>>> from cuml.datasets.regression import make_regression
>>> from cuml.linear_model import LinearRegression
>>> # Create regression problem
>>> data, values = make_regression(n_samples=200, n_features=12,
...                                n_informative=7, bias=-4.2,
...                                noise=0.3, random_state=10)
>>> # Perform a linear regression on this problem
>>> lr = LinearRegression(fit_intercept = True, normalize = False,
...                       algorithm = "eig")
>>> reg = lr.fit(data, values)
>>> print(reg.coef_)
[-2.6980877e-02 7.7027252e+01 1.1498465e+01 8.5468025e+00 5.8548538e+01 6.0772545e+01 3.6876743e+01 4.0023815e+01 4.3908358e-03 -2.0275116e-02 3.5066366e-02 -3.4512520e-02]
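The roles of n_informative, bias, and noise can be seen in a toy CPU version of the generative model: targets are a linear combination of the informative features plus a bias and Gaussian noise. This is a sketch under stated assumptions, not the cuML implementation. The function name toy_regression, the coefficient scale of 100, and the lack of shuffling are all choices made for illustration.

```python
import random


def toy_regression(n_samples=5, n_features=3, n_informative=2,
                   bias=0.0, noise=0.0, seed=0):
    """Generate (X, y, coef) from a linear model with Gaussian noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    # Assumption: only the first n_informative coefficients are nonzero.
    coef = [100.0 * rng.random() if j < n_informative else 0.0
            for j in range(n_features)]
    y = [sum(x * c for x, c in zip(row, coef)) + bias + rng.gauss(0.0, noise)
         for row in X]
    return X, y, coef


X, y, coef = toy_regression(bias=-4.2)
print(coef)  # the last coefficient is 0.0 because that feature is uninformative
```

In the cuML example above, the fitted coef_ shows the same pattern: seven large coefficients and five close to zero, matching n_informative=7.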
- cuml.datasets.make_arima(batch_size=1000, n_obs=100, order=(1, 1, 1), seasonal_order=(0, 0, 0, 0), intercept=False, random_state=None, dtype='double', handle=None)[source]#
Generates a dataset of time series by simulating an ARIMA process of a given order.
- Parameters
- batch_size: int
Number of time series to generate
- n_obs: int
Number of observations per series
- orderTuple[int, int, int]
Order (p, d, q) of the simulated ARIMA process
- seasonal_order: Tuple[int, int, int, int]
Seasonal ARIMA order (P, D, Q, s) of the simulated ARIMA process
- intercept: bool or int
Whether to include a constant trend mu in the simulated ARIMA process
- random_state: int, RandomState instance or None (default)
Seed for the random number generator for dataset creation.
- dtype: string or numpy dtype (default: ‘double’)
Type of the data. Possible values: float32, float64, ‘single’, ‘float’ or ‘double’
- handle: cuml.Handle
If it is None, a new one is created just for this function call
- Returns
- out: array-like, shape (n_obs, batch_size)
Array of the requested type containing the generated dataset
Examples
from cuml.datasets import make_arima
y = make_arima(1000, 100, (2,1,2), (0,1,2,12), 0)
Dataset Generation (Dask-based Multi-GPU)#
- cuml.dask.datasets.blobs.make_blobs(n_samples=100, n_features=2, centers=None, cluster_std=1.0, n_parts=None, center_box=(-10, 10), shuffle=True, random_state=None, return_centers=False, verbose=False, order='F', dtype='float32', client=None, workers=None)[source]#
Makes labeled Dask-CuPy arrays containing blobs for a randomly generated set of centroids.
This function calls make_blobs from cuml.datasets on each Dask worker and aggregates the results into a single Dask array.
For more information, see Scikit-learn’s make_blobs.
- Parameters
- n_samplesint
number of rows
- n_featuresint
number of features
- centersint or array of shape [n_centers, n_features], optional (default=None)
The number of centers to generate, or the fixed center locations. If n_samples is an int and centers is None, 3 centers are generated. If n_samples is array-like, centers must be either None or an array of length equal to the length of n_samples.
- cluster_stdfloat (default = 1.0)
standard deviation of points around centroid
- n_partsint (default = None)
number of partitions to generate (this can be greater than the number of workers)
- center_boxtuple (int, int) (default = (-10, 10))
the bounding box which constrains all the centroids
- random_stateint (default = None)
sets random seed (or use None to reinitialize each time)
- return_centersbool, optional (default=False)
If True, then return the centers of each cluster
- verboseint or boolean (default = False)
Logging level.
- shufflebool (default=True)
Shuffles the samples on each worker.
- order: str, optional (default=’F’)
The order of the generated samples
- dtypestr, optional (default=’float32’)
Dtype of the generated samples
- clientdask.distributed.Client (optional)
Dask client to use
- workersoptional, list of strings
Dask addresses of workers to use for computation. If None, all available Dask workers will be used (e.g. workers = list(client.scheduler_info()['workers'].keys())).
- Returns
- Xdask.array backed by CuPy array of shape [n_samples, n_features]
The input samples.
- ydask.array backed by CuPy array of shape [n_samples]
The output values.
- centersdask.array backed by CuPy array of shape [n_centers, n_features], optional
The centers of the underlying blobs. It is returned only if return_centers is True.
Examples
>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client
>>> from cuml.dask.datasets import make_blobs
>>> cluster = LocalCUDACluster(threads_per_worker=1)
>>> client = Client(cluster)
>>> workers = list(client.scheduler_info()['workers'].keys())
>>> X, y = make_blobs(1000, 10, centers=42, cluster_std=0.1,
...                   workers=workers)
>>> client.close()
>>> cluster.close()
- cuml.dask.datasets.classification.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None, order='F', dtype='float32', n_parts=None, client=None)[source]#
Generate a random n-class classification problem.
This initially creates clusters of points normally distributed (std=1) about vertices of an n_informative-dimensional hypercube with sides of length 2 * class_sep and assigns an equal number of clusters to each class. It introduces interdependence between these features and adds various types of further noise to the data.
Without shuffling, X horizontally stacks features in the following order: the primary n_informative features, followed by n_redundant linear combinations of the informative features, followed by n_repeated duplicates, drawn randomly with replacement from the informative and redundant features. The remaining features are filled with random noise. Thus, without shuffling, all useful features are contained in the columns X[:, :n_informative + n_redundant + n_repeated].
- Parameters
- n_samplesint, optional (default=100)
The number of samples.
- n_featuresint, optional (default=20)
The total number of features. These comprise n_informative informative features, n_redundant redundant features, n_repeated duplicated features and n_features-n_informative-n_redundant-n_repeated useless features drawn at random.
- n_informativeint, optional (default=2)
The number of informative features. Each class is composed of a number of gaussian clusters each located around the vertices of a hypercube in a subspace of dimension n_informative. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined within each cluster in order to add covariance. The clusters are then placed on the vertices of the hypercube.
- n_redundantint, optional (default=2)
The number of redundant features. These features are generated as random linear combinations of the informative features.
- n_repeatedint, optional (default=0)
The number of duplicated features, drawn randomly from the informative and the redundant features.
- n_classesint, optional (default=2)
The number of classes (or labels) of the classification problem.
- n_clusters_per_classint, optional (default=2)
The number of clusters per class.
- weightsarray-like of shape (n_classes,) or (n_classes - 1,), (default=None)
The proportions of samples assigned to each class. If None, then classes are balanced. Note that if len(weights) == n_classes - 1, then the last class weight is automatically inferred. More than n_samples samples may be returned if the sum of weights exceeds 1.
- flip_yfloat, optional (default=0.01)
The fraction of samples whose class is assigned randomly. Larger values introduce noise in the labels and make the classification task harder.
- class_sepfloat, optional (default=1.0)
The factor multiplying the hypercube size. Larger values spread out the clusters/classes and make the classification task easier.
- hypercubeboolean, optional (default=True)
If True, the clusters are put on the vertices of a hypercube. If False, the clusters are put on the vertices of a random polytope.
- shiftfloat, array of shape [n_features] or None, optional (default=0.0)
Shift features by the specified value. If None, then features are shifted by a random value drawn in [-class_sep, class_sep].
- scalefloat, array of shape [n_features] or None, optional (default=1.0)
Multiply features by the specified value. If None, then features are scaled by a random value drawn in [1, 100]. Note that scaling happens after shifting.
- shuffleboolean, optional (default=True)
Shuffle the samples and the features.
- random_stateint, RandomState instance or None (default)
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See Glossary.
- order: str, optional (default=’F’)
The order of the generated samples
- dtypestr, optional (default=’float32’)
Dtype of the generated samples
- n_partsint (default = None)
number of partitions to generate (this can be greater than the number of workers)
- Returns
- Xdask.array backed by CuPy array of shape [n_samples, n_features]
The generated samples.
- ydask.array backed by CuPy array of shape [n_samples]
The integer labels for class membership of each sample.
Notes
How we extended the dask MNMG version from the single GPU version:
We generate centroids of shape (n_centroids, n_informative)
We generate an informative covariance of shape (n_centroids, n_informative, n_informative)
We generate a redundant covariance of shape (n_informative, n_redundant)
We generate the indices for the repeated features
We pass along the references to the futures of the above arrays with each part to the single GPU cuml.datasets.classification.make_classification so that each part (and worker) has access to the correct values to generate data from the same covariances
Examples
>>> from dask.distributed import Client
>>> from dask_cuda import LocalCUDACluster
>>> from cuml.dask.datasets.classification import make_classification
>>> cluster = LocalCUDACluster()
>>> client = Client(cluster)
>>> X, y = make_classification(n_samples=10, n_features=4,
...                            random_state=1, n_informative=2,
...                            n_classes=2)
>>> print(X.compute())
[[-1.1273878 1.2844919 -0.32349187 0.1595734 ]
 [ 0.80521786 -0.65946865 -0.40753683 0.15538901]
 [ 1.0404129 -1.481386 1.4241115 1.2664981 ]
 [-0.92821544 -0.6805706 -0.26001272 0.36004275]
 [-1.0392245 -1.1977317 0.16345565 -0.21848428]
 [ 1.2273135 -0.529214 2.4799604 0.44108105]
 [-1.9163864 -0.39505136 -1.9588828 -1.8881643 ]
 [-0.9788184 -0.89851004 -0.08339313 0.1130247 ]
 [-1.0549078 -0.8993015 -0.11921967 0.04821599]
 [-1.8388828 -1.4063598 -0.02838472 -1.0874642 ]]
>>> print(y.compute())
[1 0 0 0 0 1 0 0 0 0]
>>> client.close()
>>> cluster.close()
- cuml.dask.datasets.regression.make_low_rank_matrix(n_samples=100, n_features=100, effective_rank=10, tail_strength=0.5, random_state=None, n_parts=1, n_samples_per_part=None, dtype='float32')[source]#
Generate a mostly low rank matrix with bell-shaped singular values
- Parameters
- n_samplesint, optional (default=100)
The number of samples.
- n_featuresint, optional (default=100)
The number of features.
- effective_rankint, optional (default=10)
The approximate number of singular vectors required to explain most of the data by linear combinations.
- tail_strengthfloat between 0.0 and 1.0, optional (default=0.5)
The relative importance of the fat noisy tail of the singular values profile.
- random_stateint, CuPy RandomState instance, Dask RandomState instance or None (default)
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.
- n_partsint, optional (default=1)
The number of parts of work.
- dtype: str, optional (default=’float32’)
dtype of generated data
- Returns
- XDask-CuPy array of shape [n_samples, n_features]
The matrix.
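To give a feel for what "bell-shaped singular values" with a "fat noisy tail" means, the sketch below evaluates the singular value profile used by scikit-learn's make_low_rank_matrix. Whether the cuML Dask version uses exactly this formula is an assumption. The function name singular_profile is hypothetical.

```python
import math


def singular_profile(n, effective_rank=10, tail_strength=0.5):
    """Singular values s_0..s_{n-1}: a Gaussian bell plus an exponential tail."""
    profile = []
    for i in range(n):
        # Bell-shaped part that dominates the first ~effective_rank values.
        low_rank = (1.0 - tail_strength) * math.exp(-((i / effective_rank) ** 2))
        # Slowly decaying tail whose weight is set by tail_strength.
        tail = tail_strength * math.exp(-0.1 * i / effective_rank)
        profile.append(low_rank + tail)
    return profile


s = singular_profile(100)
print(s[0], s[10], s[99])  # values decay from 1.0 toward the tail
```

With tail_strength=0.0 only the bell remains, so the matrix is close to exactly low rank. With tail_strength=1.0 the spectrum decays slowly and many singular vectors contribute.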
- cuml.dask.datasets.regression.make_regression(n_samples=100, n_features=100, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=False, coef=False, random_state=None, n_parts=1, n_samples_per_part=None, order='F', dtype='float32', client=None, use_full_low_rank=True)[source]#
Generate a random regression problem.
The input set can either be well conditioned (by default) or have a low rank-fat tail singular profile.
The output is generated by applying a (potentially biased) random linear regression model with “n_informative” nonzero regressors to the previously generated input and some gaussian centered noise with some adjustable scale.
- Parameters
- n_samplesint, optional (default=100)
The number of samples.
- n_featuresint, optional (default=100)
The number of features.
- n_informativeint, optional (default=10)
The number of informative features, i.e., the number of features used to build the linear model used to generate the output.
- n_targetsint, optional (default=1)
The number of regression targets, i.e., the dimension of the y output vector associated with a sample. By default, the output is a scalar.
- biasfloat, optional (default=0.0)
The bias term in the underlying linear model.
- effective_rankint or None, optional (default=None)
- if not None:
The approximate number of singular vectors required to explain most of the input data by linear combinations. Using this kind of singular spectrum in the input allows the generator to reproduce the correlations often observed in practice.
- if None:
The input set is well conditioned, centered and gaussian with unit variance.
- tail_strengthfloat between 0.0 and 1.0, optional (default=0.5)
The relative importance of the fat noisy tail of the singular values profile if “effective_rank” is not None.
- noisefloat, optional (default=0.0)
The standard deviation of the gaussian noise applied to the output.
- shuffleboolean, optional (default=False)
Shuffle the samples and the features.
- coefboolean, optional (default=False)
If True, the coefficients of the underlying linear model are returned.
- random_stateint, CuPy RandomState instance, Dask RandomState instance or None (default)
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.
- n_partsint, optional (default=1)
The number of parts of work.
- orderstr, optional (default=’F’)
Row-major or Col-major
- dtype: str, optional (default=’float32’)
dtype of generated data
- use_full_low_rankboolean (default=True)
Whether to use the entire dataset to generate the low rank matrix. If False, it creates a low rank covariance and uses the corresponding covariance to generate a multivariate normal distribution on the remaining chunks
- Returns
- XDask-CuPy array of shape [n_samples, n_features]
The input samples.
- yDask-CuPy array of shape [n_samples] or [n_samples, n_targets]
The output values.
- coefDask-CuPy array of shape [n_features] or [n_features, n_targets], optional
The coefficient of the underlying linear model. It is returned only if coef is True.
Notes
- Known Performance Limitations:
When effective_rank is set and use_full_low_rank is True, we cannot generate order F by construction, and an explicit transpose is performed on each part. This may cause memory to spike (other parameters make order F by construction).
When n_targets > 1 and order = 'F' as above, we have to explicitly transpose the y array. If coef = True, then we also explicitly transpose the ground_truth array.
When shuffle = True and order = F, there are memory spikes to shuffle the F order arrays.
Note
If out-of-memory errors are encountered in any of the above configurations, try increasing the n_parts parameter.
Metrics (regression, classification, and distance)#
- cuml.metrics.regression.mean_absolute_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')[source]#
Mean absolute error regression loss
Be careful when using this metric with float32 inputs as the result can be slightly incorrect because of floating point precision if the input is large enough. float64 will have lower numerical error.
- Parameters
- y_truearray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.
- y_predarray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Estimated target values.
- sample_weightarray-like (device or host) shape = (n_samples,), optional
Sample weights.
- multioutputstring in [‘raw_values’, ‘uniform_average’] or array-like of shape (n_outputs)
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.
‘raw_values’ : Returns a full set of errors in case of multioutput input.
‘uniform_average’ : Errors of all outputs are averaged with uniform weight.
- Returns
- lossfloat or ndarray of floats
If multioutput is ‘raw_values’, then mean absolute error is returned for each output separately. If multioutput is ‘uniform_average’ or an ndarray of weights, then the weighted average of all output errors is returned.
MAE output is non-negative floating point. The best value is 0.0.
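The three multioutput modes are easiest to compare side by side. The following is a pure-Python reference sketch of the arithmetic, not the cuML GPU implementation. The function name mae_sketch is hypothetical, and nested lists stand in for device or host arrays.

```python
def mae_sketch(y_true, y_pred, sample_weight=None, multioutput="uniform_average"):
    """Mean absolute error with per-output aggregation (illustrative only)."""
    # Promote 1-D input to rows with a single output each.
    rows_t = [r if isinstance(r, (list, tuple)) else [r] for r in y_true]
    rows_p = [r if isinstance(r, (list, tuple)) else [r] for r in y_pred]
    w = sample_weight or [1.0] * len(rows_t)
    n_out = len(rows_t[0])
    per_output = [
        sum(wi * abs(t[k] - p[k]) for wi, t, p in zip(w, rows_t, rows_p)) / sum(w)
        for k in range(n_out)
    ]
    if multioutput == "raw_values":
        return per_output
    if multioutput == "uniform_average":
        return sum(per_output) / n_out
    # Otherwise multioutput is treated as a list of per-output weights.
    return sum(m * e for m, e in zip(multioutput, per_output)) / sum(multioutput)


y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
print(mae_sketch(y_true, y_pred, multioutput="raw_values"))  # [0.5, 1.0]
print(mae_sketch(y_true, y_pred))                            # 0.75
```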
- cuml.metrics.regression.mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average', squared=True)[source]#
Mean squared error regression loss
Be careful when using this metric with float32 inputs as the result can be slightly incorrect because of floating point precision if the input is large enough. float64 will have lower numerical error.
- Parameters
- y_truearray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.
- y_predarray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Estimated target values.
- sample_weightarray-like (device or host) shape = (n_samples,), optional
Sample weights.
- multioutputstring in [‘raw_values’, ‘uniform_average’] or array-like of shape (n_outputs) (default=’uniform_average’)
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.
‘raw_values’ : Returns a full set of errors in case of multioutput input.
‘uniform_average’ : Errors of all outputs are averaged with uniform weight.
- squaredboolean value, optional (default = True)
If True returns MSE value, if False returns RMSE value.
- Returns
- lossfloat or ndarray of floats
A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
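The effect of the squared flag can be checked with a minimal single-output sketch in plain Python. The function name mse_sketch is hypothetical and the code only mirrors the definition. It does not reproduce cuML's handling of sample weights or multiple outputs.

```python
import math


def mse_sketch(y_true, y_pred, squared=True):
    """Return MSE when squared=True, otherwise its square root (RMSE)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return mse if squared else math.sqrt(mse)


y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(mse_sketch(y_true, y_pred))                 # 0.375
print(mse_sketch(y_true, y_pred, squared=False))  # square root of 0.375
```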
- cuml.metrics.regression.mean_squared_log_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average', squared=True)[source]#
Mean squared log error regression loss
Be careful when using this metric with float32 inputs as the result can be slightly incorrect because of floating point precision if the input is large enough. float64 will have lower numerical error.
- Parameters
- y_truearray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.
- y_predarray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Estimated target values.
- sample_weightarray-like (device or host) shape = (n_samples,), optional
Sample weights.
- multioutputstring in [‘raw_values’, ‘uniform_average’] or array-like of shape (n_outputs)
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.
‘raw_values’ : Returns a full set of errors in case of multioutput input.
‘uniform_average’ : Errors of all outputs are averaged with uniform weight.
- squaredboolean value, optional (default = True)
If True returns MSLE value, if False returns RMSLE value.
- Returns
- lossfloat or ndarray of floats
A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
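Mean squared log error is commonly defined as the mean squared error of log(1 + y), which is the scikit-learn convention. The sketch below assumes cuML follows the same definition. The function name msle_sketch is hypothetical, and the input check reflects the fact that log(1 + y) is only meaningful here for non-negative targets.

```python
import math


def msle_sketch(y_true, y_pred, squared=True):
    """Mean squared error computed on log1p-transformed values."""
    if any(v < 0 for v in list(y_true) + list(y_pred)):
        raise ValueError("targets must be non-negative")
    msle = sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)
    # squared=False returns the root of the mean squared log error.
    return msle if squared else math.sqrt(msle)


print(msle_sketch([3, 5, 2.5, 7], [2.5, 5, 4, 8]))
```

Because the error is measured on a log scale, it penalises relative differences rather than absolute ones.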
- cuml.metrics.regression.r2_score(y, y_hat, convert_dtype=True, handle=None) float [source]#
Calculates r2 score between y and y_hat
- Parameters
- yarray-like (device or host) shape = (n_samples, 1)
Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
- y_hatarray-like (device or host) shape = (n_samples, 1)
Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
- convert_dtypebool, optional (default = True)
When set to True, the method will, when necessary, convert y_hat to be the same data type as y if they differ. This will increase memory used for the method.
- Returns
- r2_scoredouble
The R² score between y and y_hat
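For reference, the R² score (coefficient of determination) is one minus the ratio of the residual sum of squares to the total sum of squares. The following plain-Python sketch shows that arithmetic. The function name r2_sketch is hypothetical and the code is not the cuML implementation.

```python
def r2_sketch(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # residual sum of squares
    ss_tot = sum((a - mean_y) ** 2 for a in y)            # total sum of squares
    return 1.0 - ss_res / ss_tot


print(r2_sketch([3, -0.5, 2, 7], [2.5, 0.0, 2, 8]))
```

A perfect prediction gives 1.0, and a model that always predicts the mean of y gives 0.0.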
- cuml.metrics.accuracy.accuracy_score(ground_truth, predictions, handle=None, convert_dtype=True)[source]#
Calculates the accuracy score of a classification model.
- Parameters
- handlecuml.Handle
- predictionsNumPy ndarray or Numba device
The labels predicted by the model for the test dataset
- ground_truthNumPy ndarray, Numba device
The ground truth labels of the test dataset
- Returns
- float
The accuracy of the model used for prediction
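Accuracy is the fraction of predictions that exactly match the ground truth. A minimal CPU sketch of that definition follows. The function name accuracy_sketch is hypothetical, and Python lists stand in for the NumPy or Numba arrays the cuML function accepts.

```python
def accuracy_sketch(ground_truth, predictions):
    """Fraction of positions where the predicted label equals the true label."""
    if len(ground_truth) != len(predictions):
        raise ValueError("inputs must have the same length")
    matches = sum(1 for t, p in zip(ground_truth, predictions) if t == p)
    return matches / len(ground_truth)


print(accuracy_sketch([0, 1, 2, 3], [0, 2, 1, 3]))  # 0.5
```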
- cuml.metrics.confusion_matrix(y_true, y_pred, labels=None, sample_weight=None, normalize=None, convert_dtype=False) CumlArray [source]#
Compute confusion matrix to evaluate the accuracy of a classification.
- Parameters
- y_truearray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.
- y_predarray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Estimated target values.
- labelsarray-like (device or host) shape = (n_classes,), optional
List of labels to index the matrix. This may be used to reorder or select a subset of labels. If None is given, those that appear at least once in y_true or y_pred are used in sorted order.
- sample_weightarray-like (device or host) shape = (n_samples,), optional
Sample weights.
- normalizestring in [‘true’, ‘pred’, ‘all’] or None (default=None)
Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population. If None, confusion matrix will not be normalized.
- convert_dtypebool, optional (default=False)
When set to True, the confusion matrix method will automatically convert the predictions, ground truth, and labels arrays to np.int32.
- Returns
- Carray-like (device or host) shape = (n_classes, n_classes)
Confusion matrix.
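The labels and normalize options are easier to follow on a small example. The sketch below builds the matrix with plain Python lists, where rows are true labels and columns are predicted labels. The function name confusion_matrix_sketch is hypothetical, and returning 0 for empty rows or columns under normalization is an assumption made to avoid division by zero.

```python
def confusion_matrix_sketch(y_true, y_pred, labels=None, normalize=None):
    """Count (true, predicted) pairs; optionally normalize rows, columns or all."""
    if labels is None:
        # Default: every label seen in either input, in sorted order.
        labels = sorted(set(y_true) | set(y_pred))
    index = {lab: i for i, lab in enumerate(labels)}
    n = len(labels)
    cm = [[0.0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        if t in index and p in index:
            cm[index[t]][index[p]] += 1
    if normalize == "true":    # each row sums to 1
        cm = [[v / (sum(row) or 1.0) for v in row] for row in cm]
    elif normalize == "pred":  # each column sums to 1
        col = [sum(cm[i][j] for i in range(n)) or 1.0 for j in range(n)]
        cm = [[cm[i][j] / col[j] for j in range(n)] for i in range(n)]
    elif normalize == "all":   # whole matrix sums to 1
        total = sum(map(sum, cm)) or 1.0
        cm = [[v / total for v in row] for row in cm]
    return cm


y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
print(confusion_matrix_sketch(y_true, y_pred))
# [[2.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 2.0]]
```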
- cuml.metrics.kl_divergence(P, Q, handle=None, convert_dtype=True)[source]#
Calculates the “Kullback-Leibler” divergence. The KL divergence tells us how well the probability distribution Q approximates the probability distribution P. It is often also used as a ‘distance metric’ between two probability distributions, although it is not symmetric.
- Parameters
- PDense array of probabilities corresponding to distribution P
shape = (n_samples, 1) Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy.
- QDense array of probabilities corresponding to distribution Q
shape = (n_samples, 1) Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy.
- handlecuml.Handle
- convert_dtypebool, optional (default = True)
When set to True, the method will convert P and Q to the same data type: float32. This will increase the memory used by the method.
- Returns
- float
The KL Divergence value
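To make the asymmetry concrete, here is a small plain-Python sketch of the standard definition, the sum over i of P_i * ln(P_i / Q_i). The helper name `kl_divergence_sketch` is hypothetical, and the code assumes Q_i > 0 wherever P_i > 0; it is an illustration, not cuML's kernel.

```python
import math


def kl_divergence_sketch(p, q):
    """KL(P || Q) with the natural logarithm; terms with p_i == 0 contribute 0."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            # Assumes q_i > 0 here; otherwise the divergence is infinite.
            total += p_i * math.log(p_i / q_i)
    return total


p = [0.5, 0.5]
q = [0.9, 0.1]
forward = kl_divergence_sketch(p, q)
reverse = kl_divergence_sketch(q, p)
print(forward, reverse)  # the two directions give different values
```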
- cuml.metrics.log_loss(y_true, y_pred, eps=1e-15, normalize=True, sample_weight=None) float [source]#
Log loss, aka logistic loss or cross-entropy loss. This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of a logistic model that returns y_pred probabilities for its training data y_true. The log loss is only defined for two or more labels.
- Parameters
- y_truearray-like, shape = (n_samples,)
- y_predarray-like of float,
shape = (n_samples, n_classes) or (n_samples,)
- epsfloat (default=1e-15)
Log loss is undefined for p=0 or p=1, so probabilities are clipped to max(eps, min(1 - eps, p)).
- normalizebool, optional (default=True)
If true, return the mean loss per sample. Otherwise, return the sum of the per-sample losses.
- sample_weightarray-like of shape (n_samples,), default=None
Sample weights.
- Returns
- lossfloat
Notes
The logarithm used is the natural logarithm (base-e).
References
C.M. Bishop (2006). Pattern Recognition and Machine Learning. Springer, p. 209.
Examples
>>> from cuml.metrics import log_loss
>>> import cupy as cp
>>> log_loss(cp.array([1, 0, 0, 1]),
...          cp.array([[.1, .9], [.9, .1], [.8, .2], [.35, .65]]))
0.21616...
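The clipping rule and the `normalize` switch can be checked by hand with a plain-Python sketch. The helper name `log_loss_sketch` is hypothetical; it assumes integer class labels and one row of class probabilities per sample, and it ignores `sample_weight`.

```python
import math


def log_loss_sketch(y_true, y_pred, eps=1e-15, normalize=True):
    """Negative log-likelihood of the true class, with probabilities clipped."""
    losses = []
    for label, probs in zip(y_true, y_pred):
        # Clip to [eps, 1 - eps] so that log() is always defined.
        p = max(eps, min(1 - eps, probs[label]))
        losses.append(-math.log(p))
    total = sum(losses)
    return total / len(losses) if normalize else total


y_true = [1, 0, 0, 1]
y_pred = [[.1, .9], [.9, .1], [.8, .2], [.35, .65]]
mean_loss = log_loss_sketch(y_true, y_pred)
summed_loss = log_loss_sketch(y_true, y_pred, normalize=False)
print(round(mean_loss, 5))  # 0.21616
```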
- cuml.metrics.roc_auc_score(y_true, y_score)[source]#
Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.
Note
this implementation can only be used with binary classification.
- Parameters
- y_truearray-like of shape (n_samples,)
True labels. The binary cases expect labels with shape (n_samples,)
- y_scorearray-like of shape (n_samples,)
Target scores. In the binary cases, these can be either probability estimates or non-thresholded decision values (as returned by decision_function on some classifiers). The binary case expects a shape (n_samples,), and the scores must be the scores of the class with the greater label.
- Returns
- aucfloat
Examples
>>> import numpy as np
>>> from cuml.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> print(roc_auc_score(y_true, y_scores))
0.75
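For binary labels, ROC AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The quadratic-time sketch below uses that equivalence to reproduce the value in the example; `roc_auc_sketch` is a hypothetical helper, not the algorithm cuML uses.

```python
def roc_auc_sketch(y_true, y_score):
    """Pairwise-ranking form of binary ROC AUC (labels must be 0 or 1)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


auc = roc_auc_sketch([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```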
- cuml.metrics.precision_recall_curve(y_true, probs_pred) Tuple[CumlArray, CumlArray, CumlArray] [source]#
Compute precision-recall pairs for different probability thresholds
Note
this implementation is restricted to the binary classification task.
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
The last precision and recall values are 1. and 0. respectively and do not have a corresponding threshold. This ensures that the graph starts on the y axis.
Read more in scikit-learn’s User Guide.
- Parameters
- y_truearray, shape = [n_samples]
True binary labels, {0, 1}.
- probs_predarray, shape = [n_samples]
Estimated probabilities or decision function.
- Returns
- precisionarray, shape = [n_thresholds + 1]
Precision values such that element i is the precision of predictions with score >= thresholds[i] and the last element is 1.
- recallarray, shape = [n_thresholds + 1]
Decreasing recall values such that element i is the recall of predictions with score >= thresholds[i] and the last element is 0.
- thresholdsarray, shape = [n_thresholds <= len(np.unique(probs_pred))]
Increasing thresholds on the decision function used to compute precision and recall.
Examples
>>> import cupy as cp
>>> from cuml.metrics import precision_recall_curve
>>> y_true = cp.array([0, 0, 1, 1])
>>> y_scores = cp.array([0.1, 0.4, 0.35, 0.8])
>>> precision, recall, thresholds = precision_recall_curve(
...     y_true, y_scores)
>>> print(precision)
[0.666... 0.5 1. 1. ]
>>> print(recall)
[1. 0.5 0.5 0. ]
>>> print(thresholds)
[0.35 0.4 0.8 ]
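The shape of the three returned arrays is easier to follow with a brute-force sketch. The hypothetical helper below evaluates precision and recall at each distinct score, stops once full recall is reached (an assumption chosen to match the example output above), and appends the final (1, 0) point that has no threshold.

```python
def precision_recall_curve_sketch(y_true, scores):
    """Brute-force binary precision-recall curve; y_true holds 0/1 labels."""
    total_pos = sum(y_true)
    precision, recall, thresholds = [], [], []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        precision.append(tp / (tp + fp))
        recall.append(tp / total_pos)
        thresholds.append(thr)
        if tp == total_pos:
            break  # full recall reached; lower thresholds are dropped
    # Report thresholds in increasing order, then add the (1, 0) end point.
    precision.reverse()
    recall.reverse()
    thresholds.reverse()
    return precision + [1.0], recall + [0.0], thresholds


precision, recall, thresholds = precision_recall_curve_sketch(
    [0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(recall)      # [1.0, 0.5, 0.5, 0.0]
print(thresholds)  # [0.35, 0.4, 0.8]
```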
- cuml.metrics.pairwise_distances.nan_euclidean_distances(X, Y=None, *, squared=False, missing_values=nan, convert_dtype=True)[source]#
Calculate the euclidean distances in the presence of missing values.
Compute the euclidean distance between each pair of samples in X and Y, where Y=X is assumed if Y=None. When calculating the distance between a pair of samples, this formulation ignores feature coordinates with a missing value in either sample and scales up the weight of the remaining coordinates:
dist(x,y) = sqrt(weight * sq. distance from present coordinates)
where, weight = Total # of coordinates / # of present coordinates
For example, the distance between [3, na, na, 6] and [1, na, 4, 5] is:
\[\sqrt{\frac{4}{2}((3-1)^2 + (6-5)^2)}\]
If all the coordinates are missing or if there are no common present coordinates then NaN is returned for that pair.
- Parameters
- XDense matrix of shape (n_samples_X, n_features)
Acceptable formats: cuDF DataFrame, Pandas DataFrame, NumPy ndarray, cuda array interface compliant array like CuPy.
- YDense matrix of shape (n_samples_Y, n_features), default=None
Acceptable formats: cuDF DataFrame, Pandas DataFrame, NumPy ndarray, cuda array interface compliant array like CuPy.
- squaredbool, default=False
Return squared Euclidean distances.
- missing_valuesnp.nan or int, default=np.nan
Representation of missing value.
- Returns
- distancesndarray of shape (n_samples_X, n_samples_Y)
Returns the distances between the row vectors of X and the row vectors of Y.
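The weighting rule from the description can be reproduced for a single pair of samples in plain Python. The helper name `nan_euclidean_sketch` is hypothetical and only handles NaN as the missing-value marker; it illustrates the formula rather than cuML's GPU implementation.

```python
import math


def nan_euclidean_sketch(x, y, squared=False):
    """Euclidean distance that skips coordinates missing in either sample."""
    present = [(a, b) for a, b in zip(x, y)
               if not (math.isnan(a) or math.isnan(b))]
    if not present:
        return math.nan  # no coordinate is present in both samples
    weight = len(x) / len(present)  # total coordinates / present coordinates
    sq_dist = weight * sum((a - b) ** 2 for a, b in present)
    return sq_dist if squared else math.sqrt(sq_dist)


na = math.nan
d = nan_euclidean_sketch([3, na, na, 6], [1, na, 4, 5])
print(d)  # sqrt(4/2 * ((3-1)^2 + (6-5)^2)) = sqrt(10)
```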
- cuml.metrics.pairwise_distances.pairwise_distances(X, Y=None, metric='euclidean', handle=None, convert_dtype=True, metric_arg=2, **kwds)[source]#
Compute the distance matrix from a vector array X and optional Y.
This method takes either one or two vector arrays, and returns a distance matrix.
If Y is given (default is None), then the returned matrix is the pairwise distance between the arrays from both X and Y.
Valid values for metric are:
- From scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’].
Sparse matrices are supported, see ‘sparse_pairwise_distances’.
- From scipy.spatial.distance: [‘sqeuclidean’]
See the documentation for scipy.spatial.distance for details on this metric. Sparse matrices are supported.
- Parameters
- XDense or sparse matrix (device or host) of shape
(n_samples_x, n_features) Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy, or cupyx.scipy.sparse for sparse input
- Yarray-like (device or host) of shape (n_samples_y, n_features), optional
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
- metric{“cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, “sqeuclidean”}
The metric to use when calculating distance between instances in a feature array.
- convert_dtypebool, optional (default = True)
When set to True, the method will, when necessary, convert Y to be the same data type as X if they differ. This will increase memory used for the method.
- Returns
- Darray [n_samples_x, n_samples_x] or [n_samples_x, n_samples_y]
A distance matrix D such that D_{i, j} is the distance between the ith and jth vectors of the given matrix X, if Y is None. If Y is not None, then D_{i, j} is the distance between the ith array from X and the jth array from Y.
Examples
>>> import cupy as cp
>>> from cuml.metrics import pairwise_distances
>>> X = cp.array([[2.0, 3.0], [3.0, 5.0], [5.0, 8.0]])
>>> Y = cp.array([[1.0, 0.0], [2.0, 1.0]])
>>> # Euclidean Pairwise Distance, Single Input:
>>> pairwise_distances(X, metric='euclidean')
array([[0. , 2.236..., 5.830...],
       [2.236..., 0. , 3.605...],
       [5.830..., 3.605..., 0. ]])
>>> # Cosine Pairwise Distance, Multi-Input:
>>> pairwise_distances(X, Y, metric='cosine')
array([[0.445... , 0.131...],
       [0.485..., 0.156...],
       [0.470..., 0.146...]])
>>> # Manhattan Pairwise Distance, Multi-Input:
>>> pairwise_distances(X, Y, metric='manhattan')
array([[ 4., 2.],
       [ 7., 5.],
       [12., 10.]])
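To see where the numbers in the example come from, the same matrices can be computed on the host with nested loops. All helper names below are hypothetical; this is a reference sketch of three of the listed metrics, not a substitute for the GPU routine.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pairwise_sketch(X, Y=None, metric=euclidean):
    """D[i][j] is the distance between row i of X and row j of Y (or of X)."""
    Y = X if Y is None else Y
    return [[metric(x, y) for y in Y] for x in X]


X = [[2.0, 3.0], [3.0, 5.0], [5.0, 8.0]]
Y = [[1.0, 0.0], [2.0, 1.0]]
d_euclidean = pairwise_sketch(X)
d_manhattan = pairwise_sketch(X, Y, metric=manhattan)
d_cosine = pairwise_sketch(X, Y, metric=cosine)
print(d_manhattan)  # [[4.0, 2.0], [7.0, 5.0], [12.0, 10.0]]
```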
- cuml.metrics.pairwise_distances.sparse_pairwise_distances(X, Y=None, metric='euclidean', handle=None, convert_dtype=True, metric_arg=2, **kwds)[source]#
Compute the distance matrix from a vector array X and optional Y.
This method takes either one or two sparse vector arrays, and returns a dense distance matrix.
If Y is given (default is None), then the returned matrix is the pairwise distance between the arrays from both X and Y.
Valid values for metric are:
- From scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’].
- From scipy.spatial.distance: [‘sqeuclidean’, ‘canberra’, ‘minkowski’, ‘jaccard’, ‘chebyshev’, ‘dice’]
See the documentation for scipy.spatial.distance for details on these metrics.
[‘inner_product’, ‘hellinger’]
- Parameters
- Xarray-like (device or host) of shape (n_samples_x, n_features)
Acceptable formats: SciPy or Cupy sparse array
- Yarray-like (device or host) of shape (n_samples_y, n_features), optional
Acceptable formats: SciPy or Cupy sparse array
- metric{“cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, “sqeuclidean”, “canberra”, “lp”, “inner_product”, “minkowski”, “jaccard”, “hellinger”, “chebyshev”, “linf”, “dice”}
The metric to use when calculating distance between instances in a feature array.
- convert_dtypebool, optional (default = True)
When set to True, the method will, when necessary, convert Y to be the same data type as X if they differ. This will increase memory used for the method.
- metric_argfloat, optional (default = 2)
Additional metric-specific argument. For Minkowski it’s the p-norm to apply.
- Returns
- Darray [n_samples_x, n_samples_x] or [n_samples_x, n_samples_y]
A dense distance matrix D such that D_{i, j} is the distance between the ith and jth vectors of the given matrix X, if Y is None. If Y is not None, then D_{i, j} is the distance between the ith array from X and the jth array from Y.
Examples
>>> import cupyx
>>> from cuml.metrics import sparse_pairwise_distances
>>> X = cupyx.scipy.sparse.random(2, 3, density=0.5, random_state=9)
>>> Y = cupyx.scipy.sparse.random(1, 3, density=0.5, random_state=9)
>>> X.todense()
array([[0.8098..., 0.537..., 0. ],
       [0. , 0.856..., 0. ]])
>>> Y.todense()
array([[0. , 0. , 0.993...]])
>>> # Cosine Pairwise Distance, Single Input:
>>> sparse_pairwise_distances(X, metric='cosine')
array([[0. , 0.447...],
       [0.447..., 0. ]])
>>> # Squared euclidean Pairwise Distance, Multi-Input:
>>> sparse_pairwise_distances(X, Y, metric='sqeuclidean')
array([[1.931...],
       [1.720...]])
>>> # Canberra Pairwise Distance, Multi-Input:
>>> sparse_pairwise_distances(X, Y, metric='canberra')
array([[3.],
       [2.]])
- cuml.metrics.pairwise_kernels.pairwise_kernels(X, Y=None, metric='linear', *, filter_params=False, convert_dtype=True, **kwds)[source]#
Compute the kernel between arrays X and optional array Y.
This method takes either a vector array or a kernel matrix, and returns a kernel matrix. If the input is a vector array, the kernels are computed. If the input is a kernel matrix, it is returned instead.
This method provides a safe way to take a kernel matrix as input, while preserving compatibility with many other algorithms that take a vector array.
If Y is given (default is None), then the returned matrix is the pairwise kernel between the arrays from both X and Y.
Valid values for metric are: [‘additive_chi2’, ‘chi2’, ‘linear’, ‘poly’, ‘polynomial’, ‘rbf’, ‘laplacian’, ‘sigmoid’, ‘cosine’]
- Parameters
- XDense matrix (device or host) of shape (n_samples_X, n_samples_X) or (n_samples_X, n_features)
Array of pairwise kernels between samples, or a feature array. The shape of the array should be (n_samples_X, n_samples_X) if metric == “precomputed” and (n_samples_X, n_features) otherwise. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
- YDense matrix (device or host) of shape (n_samples_Y, n_features), default=None
A second feature array only if X has shape (n_samples_X, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
- metricstr or callable (numba device function), default=”linear”
The metric to use when calculating kernel between instances in a feature array. If metric is “precomputed”, X is assumed to be a kernel matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two rows from X as input and return the corresponding kernel value as a single number.
- filter_paramsbool, default=False
Whether to filter invalid parameters or not.
- convert_dtypebool, optional (default = True)
When set to True, the method will, when necessary, convert Y to be the same data type as X if they differ. This will increase memory used for the method.
- **kwdsoptional keyword parameters
Any further parameters are passed directly to the kernel function.
- Returns
- Kndarray of shape (n_samples_X, n_samples_X) or (n_samples_X, n_samples_Y)
A kernel matrix K such that K_{i, j} is the kernel between the ith and jth vectors of the given matrix X, if Y is None. If Y is not None, then K_{i, j} is the kernel between the ith array from X and the jth array from Y.
Notes
If metric is ‘precomputed’, Y is ignored and X is returned.
Examples
>>> import cupy as cp
>>> from cuml.metrics import pairwise_kernels
>>> from numba import cuda
>>> import math
>>> X = cp.array([[2, 3], [3, 5], [5, 8]])
>>> Y = cp.array([[1, 0], [2, 1]])
>>> pairwise_kernels(X, Y, metric='linear')
array([[ 2, 7],
       [ 3, 11],
       [ 5, 18]])
>>> @cuda.jit(device=True)
... def custom_rbf_kernel(x, y, gamma=None):
...     if gamma is None:
...         gamma = 1.0 / len(x)
...     sum = 0.0
...     for i in range(len(x)):
...         sum += (x[i] - y[i]) ** 2
...     return math.exp(-gamma * sum)
>>> pairwise_kernels(X, Y, metric=custom_rbf_kernel)
array([[6.73794700e-03, 1.35335283e-01],
       [5.04347663e-07, 2.03468369e-04],
       [4.24835426e-18, 2.54366565e-13]])
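The contract for a callable metric (two rows in, one number out) can be mirrored on the host. The sketch below uses hypothetical helper names and plain lists; it reproduces the values from the example above but says nothing about how the device function is compiled or launched.

```python
import math


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=None):
    if gamma is None:
        gamma = 1.0 / len(x)  # same default as the custom kernel above
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def pairwise_kernels_sketch(X, Y=None, metric=linear_kernel, **kwds):
    """K[i][j] = metric(X[i], Y[j]); extra keywords go to the kernel."""
    Y = X if Y is None else Y
    return [[metric(x, y, **kwds) for y in Y] for x in X]


X = [[2, 3], [3, 5], [5, 8]]
Y = [[1, 0], [2, 1]]
k_linear = pairwise_kernels_sketch(X, Y)
k_rbf = pairwise_kernels_sketch(X, Y, metric=rbf_kernel)
print(k_linear)  # [[2, 7], [3, 11], [5, 18]]
```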
Metrics (clustering and manifold learning)#
- cuml.metrics.trustworthiness.trustworthiness(X, X_embedded, handle=None, n_neighbors=5, metric='euclidean', convert_dtype=True, batch_size=512) float [source]#
Expresses to what extent the local structure is retained in embedding. The score is defined in the range [0, 1].
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
- X_embeddedarray-like (device or host) shape= (n_samples, n_features)
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
- n_neighborsint, optional (default=5)
Number of neighbors considered
- metricstr in [‘euclidean’] (default=’euclidean’)
Metric used to compute the trustworthiness. For the moment only ‘euclidean’ is supported.
- convert_dtypebool, optional (default=True)
When set to True, the trustworthiness method will automatically convert the inputs to np.float32.
- batch_sizeint (default=512)
The number of samples to use for each batch.
- Returns
- trustworthiness scoredouble
Trustworthiness of the low-dimensional embedding
- cuml.metrics.cluster.adjusted_rand_index.adjusted_rand_score(labels_true, labels_pred, handle=None, convert_dtype=True) float [source]#
Adjusted_rand_score is a clustering similarity metric based on the Rand index and is corrected for chance.
- Parameters
- labels_trueGround truth labels to be used as a reference
- labels_predArray of predicted labels used to evaluate the model
- handlecuml.Handle
- Returns
- float
The adjusted rand index value between -1.0 and 1.0
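The chance correction can be illustrated with the usual pair-counting form of the adjusted Rand index. This is a plain-Python sketch under the assumption of at least two samples; `adjusted_rand_sketch` is a hypothetical name and the code is not cuML's implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_sketch(labels_true, labels_pred):
    """Pair-counting adjusted Rand index (requires at least two samples)."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (sum_cells - expected) / (max_index - expected)


same_partition = adjusted_rand_sketch([0, 0, 1, 1], [1, 1, 0, 0])
split_cluster = adjusted_rand_sketch([0, 0, 1, 1], [0, 0, 1, 2])
print(same_partition)  # 1.0
```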
- cuml.metrics.cluster.entropy.cython_entropy(clustering, base=None, handle=None) float [source]#
Computes the entropy of a distribution for given probability values.
- Parameters
- clusteringarray-like (device or host) shape = (n_samples,)
Clustering of labels. Probabilities are computed based on occurrences of labels. For instance, to represent a fair coin (2 equally possible outcomes), the clustering could be [0,1]. For a biased coin with 2/3 probability for tail, the clustering could be [0, 0, 1].
- base: float, optional
The logarithmic base to use, defaults to e (natural logarithm).
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- Returns
- Sfloat
The calculated entropy.
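The coin examples above can be verified with a few lines of plain Python. The helper name `entropy_sketch` is hypothetical; it derives probabilities from label counts as described and applies the change-of-base rule when `base` is given.

```python
import math
from collections import Counter


def entropy_sketch(clustering, base=None):
    """Shannon entropy of the label frequencies in `clustering`."""
    n = len(clustering)
    h = -sum((c / n) * math.log(c / n) for c in Counter(clustering).values())
    return h if base is None else h / math.log(base)


fair_nats = entropy_sketch([0, 1])          # fair coin, natural log
fair_bits = entropy_sketch([0, 1], base=2)  # the same coin measured in bits
biased = entropy_sketch([0, 0, 1])          # 2/3 versus 1/3
print(fair_bits)  # 1.0
```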
- cuml.metrics.cluster.homogeneity_score.cython_homogeneity_score(labels_true, labels_pred, handle=None) float [source]#
Computes the homogeneity metric of a cluster labeling given a ground truth.
A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class.
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.
This metric is not symmetric: switching labels_true with labels_pred will return the completeness_score which will be different in general.
The labels in labels_pred and labels_true are assumed to be drawn from a contiguous set (Ex: drawn from {2, 3, 4}, but not from {2, 4}). If your set of labels looks like {2, 4}, convert them to something like {0, 1}.
- Parameters
- labels_predarray-like (device or host) shape = (n_samples,)
The labels predicted by the model for the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
- labels_truearray-like (device or host) shape = (n_samples,)
The ground truth labels (ints) of the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- Returns
- float
The homogeneity of the predicted labeling given the ground truth. Score between 0.0 and 1.0. 1.0 stands for perfectly homogeneous labeling.
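Homogeneity is commonly defined as 1 - H(C|K) / H(C), where C are the true classes and K the predicted clusters. The sketch below assumes that definition and uses a hypothetical helper name; swapping the arguments illustrates the asymmetry noted above.

```python
import math
from collections import Counter


def homogeneity_sketch(labels_true, labels_pred):
    """1 - H(C|K) / H(C) with natural logarithms."""
    n = len(labels_true)
    class_counts = Counter(labels_true)
    cluster_counts = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    h_c = -sum((c / n) * math.log(c / n) for c in class_counts.values())
    if h_c == 0.0:
        return 1.0  # a single class is trivially homogeneous
    h_c_given_k = -sum((n_ck / n) * math.log(n_ck / cluster_counts[k])
                       for (_, k), n_ck in joint.items())
    return 1.0 - h_c_given_k / h_c


pure_clusters = homogeneity_sketch([0, 0, 1, 1], [0, 0, 1, 2])
swapped = homogeneity_sketch([0, 0, 1, 2], [0, 0, 1, 1])
print(pure_clusters)  # 1.0, every cluster contains a single class
```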
- cuml.metrics.cluster.silhouette_score.cython_silhouette_samples(X, labels, metric='euclidean', chunksize=None, convert_dtype=True, handle=None)[source]#
Calculate the silhouette coefficient for each sample in the provided data.
Given a set of cluster labels for every sample in the provided data, compute the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The silhouette coefficient for a sample is then (b - a) / max(a, b).
- Parameters
- Xarray-like, shape = (n_samples, n_features)
The feature vectors for all samples.
- labelsarray-like, shape = (n_samples,)
The assigned cluster labels for each sample.
- metricstring
A string representation of the distance metric to use for evaluating the silhouette score. Available options are “cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, and “sqeuclidean”.
- chunksizeinteger (default = None)
An integer, 1 <= chunksize <= n_samples to tile the pairwise distance matrix computations, so as to reduce the quadratic memory usage of having the entire pairwise distance matrix in GPU memory. If None, chunksize will automatically be set to 40000, which through experiments has proved to be a safe number for the computation to run on a GPU with 16 GB VRAM.
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- cuml.metrics.cluster.silhouette_score.cython_silhouette_score(X, labels, metric='euclidean', chunksize=None, convert_dtype=True, handle=None)[source]#
Calculate the mean silhouette coefficient for the provided data.
Given a set of cluster labels for every sample in the provided data, compute the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The silhouette coefficient for a sample is then (b - a) / max(a, b).
- Parameters
- Xarray-like, shape = (n_samples, n_features)
The feature vectors for all samples.
- labelsarray-like, shape = (n_samples,)
The assigned cluster labels for each sample.
- metricstring
A string representation of the distance metric to use for evaluating the silhouette score. Available options are “cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, and “sqeuclidean”.
- chunksizeinteger (default = None)
An integer, 1 <= chunksize <= n_samples to tile the pairwise distance matrix computations, so as to reduce the quadratic memory usage of having the entire pairwise distance matrix in GPU memory. If None, chunksize will automatically be set to 40000, which through experiments has proved to be a safe number for the computation to run on a GPU with 16 GB VRAM.
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
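The per-sample coefficient (b - a) / max(a, b) and its mean can be sketched with a brute-force loop over all pairs, which also shows why memory and time grow quadratically without chunking. Helper names are hypothetical; the sketch assumes the Euclidean metric, at least two clusters, and scores samples in singleton clusters as 0 (a convention borrowed from scikit-learn).

```python
import math


def silhouette_samples_sketch(X, labels):
    """Brute-force silhouette coefficient per sample (Euclidean metric)."""
    scores = []
    for i in range(len(X)):
        sums, counts = {}, {}
        for j in range(len(X)):
            if i == j:
                continue
            d = math.dist(X[i], X[j])
            sums[labels[j]] = sums.get(labels[j], 0.0) + d
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:
            scores.append(0.0)  # singleton cluster
            continue
        a = sums[own] / counts[own]  # mean intra-cluster distance
        b = min(sums[k] / counts[k] for k in sums if k != own)  # nearest other
        scores.append((b - a) / max(a, b))
    return scores


X = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
labels = [0, 0, 1, 1]
samples = silhouette_samples_sketch(X, labels)
mean_score = sum(samples) / len(samples)
print(mean_score)  # close to 1 for two well-separated clusters
```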
- cuml.metrics.cluster.completeness_score.cython_completeness_score(labels_true, labels_pred, handle=None) float [source]#
Completeness metric of a cluster labeling given a ground truth.
A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.
This metric is not symmetric: switching labels_true with labels_pred will return the homogeneity_score which will be different in general.
The labels in labels_pred and labels_true are assumed to be drawn from a contiguous set (Ex: drawn from {2, 3, 4}, but not from {2, 4}). If your set of labels looks like {2, 4}, convert them to something like {0, 1}.
- Parameters
- labels_predarray-like (device or host) shape = (n_samples,)
The labels predicted by the model for the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
- labels_truearray-like (device or host) shape = (n_samples,)
The ground truth labels (ints) of the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- Returns
- float
The completeness of the predicted labeling given the ground truth. Score between 0.0 and 1.0. 1.0 stands for perfectly complete labeling.
- cuml.metrics.cluster.mutual_info_score.cython_mutual_info_score(labels_true, labels_pred, handle=None) float [source]#
Computes the Mutual Information between two clusterings.
The Mutual Information is a measure of the similarity between two labels of the same data.
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.
This metric is furthermore symmetric: switching labels_true with labels_pred will return the same score value. This can be useful to measure the agreement of two independent label assignment strategies on the same dataset when the real ground truth is not known.
The labels in labels_pred and labels_true are assumed to be drawn from a contiguous set (Ex: drawn from {2, 3, 4}, but not from {2, 4}). If your set of labels looks like {2, 4}, convert them to something like {0, 1}.
- Parameters
- handlecuml.Handle
- labels_predarray-like (device or host) shape = (n_samples,)
A clustering of the data (ints) into disjoint subsets. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
- labels_truearray-like (device or host) shape = (n_samples,)
A clustering of the data (ints) into disjoint subsets. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
- Returns
- float
Mutual information, a non-negative value
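The permutation invariance and symmetry properties can be checked against the textbook contingency-table formula, MI = sum over (i, j) of (n_ij / n) * ln(n * n_ij / (a_i * b_j)). The helper name below is hypothetical and the code is a host-side illustration only.

```python
import math
from collections import Counter


def mutual_info_sketch(labels_true, labels_pred):
    """Mutual information (natural log) from the contingency table."""
    n = len(labels_true)
    a = Counter(labels_true)
    b = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    return sum((n_ij / n) * math.log(n * n_ij / (a[i] * b[j]))
               for (i, j), n_ij in joint.items())


relabeled = mutual_info_sketch([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = mutual_info_sketch([0, 0, 1, 1], [0, 1, 0, 1])
print(relabeled)  # ln(2): renaming the clusters does not change the score
```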
V-measure metric of a cluster labeling given a ground truth.
The V-measure is the harmonic mean between homogeneity and completeness:
v = (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.
This metric is furthermore symmetric: switching labels_true with labels_pred will return the same score value. This can be useful to measure the agreement of two independent label assignment strategies on the same dataset when the real ground truth is not known.
Parameters#
- labels_predarray-like (device or host) shape = (n_samples,)
The labels predicted by the model for the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
- labels_truearray-like (device or host) shape = (n_samples,)
The ground truth labels (ints) of the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
- betafloat, default=1.0
Ratio of weight attributed to homogeneity vs completeness. If beta is greater than 1, completeness is weighted more strongly in the calculation. If beta is less than 1, homogeneity is weighted more strongly.
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
Returns#
- v_measure_valuefloat
Score between 0.0 and 1.0. 1.0 stands for perfectly complete labeling.
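The effect of beta is easiest to see by evaluating the formula on fixed homogeneity and completeness values. The helper below is a hypothetical sketch that takes the two scores as plain floats; it does not compute them from labels.

```python
def v_measure_from_parts(homogeneity, completeness, beta=1.0):
    """Weighted harmonic mean of homogeneity and completeness."""
    denominator = beta * homogeneity + completeness
    if denominator == 0.0:
        return 0.0
    return (1 + beta) * homogeneity * completeness / denominator


h, c = 1.0, 2 / 3
balanced = v_measure_from_parts(h, c)                     # beta = 1
favor_completeness = v_measure_from_parts(h, c, beta=2.0)
favor_homogeneity = v_measure_from_parts(h, c, beta=0.5)
print(balanced)  # approximately 0.8
```

With beta greater than 1 the result moves toward the completeness value, and with beta less than 1 it moves toward the homogeneity value.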
Benchmarking#
- class cuml.benchmark.algorithms.AlgorithmPair(cpu_class, cuml_class, shared_args, cuml_args={}, cpu_args={}, name=None, accepts_labels=True, cpu_data_prep_hook=None, cuml_data_prep_hook=None, accuracy_function=None, bench_func=<function fit>, setup_cpu_func=None, setup_cuml_func=None)[source]#
Wraps a cuML algorithm and (optionally) a cpu-based algorithm (typically scikit-learn, but does not need to be as long as it offers fit and predict or transform methods). Provides mechanisms to run each version with default arguments. If no CPU-based version of the algorithm is available, pass None for the cpu_class when instantiating.
- Parameters
- cpu_classclass
Class for CPU version of algorithm. Set to None if not available.
- cuml_classclass
Class for cuML algorithm
- shared_argsdict
Arguments passed to both implementations’ initializers
- cuml_argsdict
Arguments only passed to cuml’s initializer
- cpu_argsdict
Arguments only passed to sklearn’s initializer
- accepts_labelsboolean
If True, the fit method expects both X and y inputs. Otherwise, it expects only an X input.
- data_prep_hookfunction (data -> data)
Optional function to run on input data before passing to fit
- accuracy_functionfunction (y_test, y_pred)
Function that returns a scalar representing accuracy
- bench_funccustom function to perform fit/predict/transform calls.
Methods
run_cpu(data[, bench_args]): Runs the cpu-based algorithm's fit method on specified data
run_cuml(data[, bench_args]): Runs the cuml-based algorithm's fit method on specified data
setup_cpu
setup_cuml
- cuml.benchmark.algorithms.algorithm_by_name(name)[source]#
Returns the algorithm pair with the name ‘name’ (case-insensitive)
Wrappers to run ML benchmarks
- class cuml.benchmark.runners.AccuracyComparisonRunner(bench_rows, bench_dims, dataset_name='blobs', input_type='numpy', test_fraction=0.1, n_reps=1)[source]#
Wrapper to run an algorithm with multiple dataset sizes and compute accuracy and speedup of cuml relative to sklearn baseline.
- class cuml.benchmark.runners.BenchmarkTimer(reps=1)[source]#
Provides a context manager that runs a code block reps times and records results to the instance variable timings. Use like:
timer = BenchmarkTimer(reps=5)
for _ in timer.benchmark_runs():
    ... do something ...
print(np.min(timer.timings))
Methods
benchmark_runs
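The loop-and-record pattern described for BenchmarkTimer can be sketched with a generator that times whatever runs between successive iterations. `RepeatTimer` below is a hypothetical stand-in written for illustration; it is not the cuML class and may differ from it in detail.

```python
import time


class RepeatTimer:
    """Hypothetical stand-in for a reps-based benchmark timer."""

    def __init__(self, reps=1):
        self.reps = reps
        self.timings = []

    def benchmark_runs(self):
        for rep in range(self.reps):
            start = time.perf_counter()
            yield rep  # the body of the caller's for loop runs here
            self.timings.append(time.perf_counter() - start)


timer = RepeatTimer(reps=3)
for _ in timer.benchmark_runs():
    sum(range(10_000))  # the code being timed
print(min(timer.timings))
```

Taking the minimum over repetitions is a common way to reduce the influence of one-off warm-up costs.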
- class cuml.benchmark.runners.SpeedupComparisonRunner(bench_rows, bench_dims, dataset_name='blobs', input_type='numpy', n_reps=1)[source]#
Wrapper to run an algorithm with multiple dataset sizes and compute speedup of cuml relative to sklearn baseline.
Methods
run
- cuml.benchmark.runners.run_variations(algos, dataset_name, bench_rows, bench_dims, param_override_list=[{}], cuml_param_override_list=[{}], cpu_param_override_list=[{}], dataset_param_override_list=[{}], dtype=<class 'numpy.float32'>, input_type='numpy', test_fraction=0.1, run_cpu=True, device_list=('gpu',), raise_on_error=False, n_reps=1)[source]#
Runs each algo in algos once per bench_rows X bench_dims X param_override_list X cuml_param_override_list combination and returns a dataframe containing timing and accuracy data.
- Parameters
- algosstr or list
Name of algorithms to run and evaluate
- dataset_namestr
Name of dataset to use
- bench_rowslist of int
Dataset row counts to test
- bench_dimslist of int
Dataset column counts to test
- param_override_listlist of dict
Dicts containing parameters to pass to __init__. Each dict specifies parameters to override in one run of the algorithm.
- cuml_param_override_listlist of dict
Dicts containing parameters to pass to __init__ of the cuml algo only.
- cpu_param_override_listlist of dict
Dicts containing parameters to pass to __init__ of the cpu algo only.
- dataset_param_override_listlist of dict
Dicts containing parameters to pass to dataset generator function
- dtype: [np.float32|np.float64]
Specifies the dataset precision to be used for benchmarking.
- test_fractionfloat
The fraction of data to use for testing.
- run_cpuboolean
If True, run the cpu-based algorithm for comparison
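The number of runs per algorithm follows from the Cartesian product of the list arguments. The sketch below only enumerates that product with `itertools.product`; the row counts, column counts, and the `n_clusters` override are made-up values, and the dictionary keys are hypothetical rather than the columns of the returned dataframe.

```python
from itertools import product

# Hypothetical benchmark grid, for illustration only.
bench_rows = [1_000, 10_000]
bench_dims = [16, 64]
param_override_list = [{}, {"n_clusters": 8}]
cuml_param_override_list = [{}]

runs = [
    {"n_rows": rows, "n_cols": dims, "shared": shared, "cuml_only": cuml_only}
    for rows, dims, shared, cuml_only in product(
        bench_rows, bench_dims, param_override_list, cuml_param_override_list)
]
print(len(runs))  # 2 * 2 * 2 * 1 = 8 combinations per algorithm
```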
Data generators for cuML benchmarks
The main entry point for consumers is gen_data, which wraps the underlying data generators.
Notes when writing new generators:
- Each generator is a function that accepts:
n_samples (set to 0 for ‘default’)
n_features (set to 0 for ‘default’)
random_state
(and optional generator-specific parameters)
The function should return a 2-tuple (X, y), where X is a Pandas dataframe and y is a Pandas series. If the generator does not produce labels, it can return (X, None)
A set of helper functions (convert_*) can convert these to alternative formats. Future revisions may support generating cudf dataframes or GPU arrays directly instead.
- cuml.benchmark.datagen.gen_data(dataset_name, dataset_format, n_samples=0, n_features=0, test_fraction=0.0, datasets_root_dir='.', dtype=<class 'numpy.float32'>, **kwargs)[source]#
Returns a tuple of data from the specified generator.
- Parameters
- dataset_namestr
Dataset to use. Can be a synthetic generator (blobs or regression) or a specified dataset (higgs currently, others coming soon)
- dataset_formatstr
Type of data to return. (One of cudf, numpy, pandas, gpuarray)
- n_samplesint
Total number of samples to load, including training and testing samples
- test_fractionfloat
Fraction of the dataset to partition randomly into the test set. If this is 0.0, no test set will be created.
- Returns
- (train_features, train_labels, test_features, test_labels) tuple
containing matrices or dataframes of the requested format. test_features and test_labels may be None if no splitting was done.
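The shape of the returned tuple and the effect of test_fraction can be illustrated with a NumPy-only sketch. The function split_train_test is a hypothetical stand-in, not part of cuml.benchmark.datagen, and the rounding rule for the test-set size is an assumption.

```python
import numpy as np


def split_train_test(X, y, test_fraction=0.0, random_state=0):
    """Randomly partition rows into train and test sets.

    Returns (train_features, train_labels, test_features, test_labels);
    the test entries are None when test_fraction is 0.0.
    """
    if test_fraction == 0.0:
        return X, y, None, None
    rng = np.random.default_rng(random_state)
    order = rng.permutation(X.shape[0])
    # Assumption: the test-set size is rounded to the nearest integer.
    n_test = int(round(test_fraction * X.shape[0]))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]


X = np.arange(20, dtype=np.float32).reshape(10, 2)
y = np.arange(10, dtype=np.float32)

X_train, y_train, X_test, y_test = split_train_test(X, y, test_fraction=0.2)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```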
Regression and Classification#
Linear Regression#
- class cuml.LinearRegression(*, algorithm='eig', fit_intercept=True, copy_X=None, normalize=False, handle=None, verbose=False, output_type=None)#
LinearRegression is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.
cuML’s LinearRegression expects either a cuDF DataFrame or a NumPy matrix and provides several algorithms to fit a linear model, including SVD and Eig. SVD is more stable, but Eig (default) is much faster.
- Parameters
- algorithm{‘svd’, ‘eig’, ‘qr’, ‘svd-qr’, ‘svd-jacobi’}, (default = ‘eig’)
Choose an algorithm:
‘svd’ - alias for svd-jacobi;
‘eig’ - use an eigendecomposition of the covariance matrix;
‘qr’ - use QR decomposition algorithm and solve Rx = Q^T y;
‘svd-qr’ - compute SVD decomposition using QR algorithm
‘svd-jacobi’ - compute SVD decomposition using Jacobi iterations.
Among these algorithms, only ‘svd-jacobi’ supports the case when the number of features is larger than the sample size; this algorithm is force-selected automatically in such a case.
For the broad range of inputs, ‘eig’ and ‘qr’ are usually the fastest, followed by ‘svd-jacobi’ and then ‘svd-qr’. In theory, SVD-based algorithms are more stable.
- fit_interceptboolean (default = True)
If True, LinearRegression tries to correct for the global mean of y. If False, the model expects that you have centered the data.
- copy_Xbool, default=True
If True, it is guaranteed that a copy of X is created, leaving the original X unchanged. However, if set to False, X may be modified directly, which would reduce the memory usage of the estimator.
- normalizeboolean (default = False)
This parameter is ignored when fit_intercept is set to False. If True, the predictors in X will be normalized by dividing by the column-wise standard deviation. If False, no scaling will be done. Note: this is in contrast to sklearn’s deprecated normalize flag, which divides by the column-wise L2 norm; but this is the same as if using sklearn’s StandardScaler.
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
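The mathematical idea behind the ‘eig’, ‘qr’ and SVD-based options of the algorithm parameter can be sketched on the CPU with NumPy. This is only an illustration, not cuML's GPU implementation: the helper names are hypothetical, the data is the small dataset from this class's example, and the intercept is handled by centering, which is an assumption about the approach.

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]], dtype=np.float64)
y = np.array([6.0, 8.0, 9.0, 11.0])

# Centering X and y accounts for the intercept (fit_intercept=True).
x_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - x_mean, y - y_mean


def solve_eig(A, b):
    # Eigendecomposition of the symmetric matrix A^T A.
    evals, evecs = np.linalg.eigh(A.T @ A)
    return evecs @ ((evecs.T @ (A.T @ b)) / evals)


def solve_qr(A, b):
    # A = QR, then solve the triangular system R x = Q^T b.
    Q, R = np.linalg.qr(A)
    return np.linalg.solve(R, Q.T @ b)


def solve_svd(A, b):
    # A = U diag(s) V^T, so x = V diag(1/s) U^T b.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt.T @ ((U.T @ b) / s)


for name, solver in [("eig", solve_eig), ("qr", solve_qr), ("svd", solve_svd)]:
    coef = solver(Xc, yc)
    intercept = y_mean - x_mean @ coef
    # Every solver should give coefficients close to [1, 2] and intercept close to 3.
    print(name, np.round(coef, 6), round(float(intercept), 6))
```

The eigendecomposition route works on A^T A, which squares the condition number of the problem; that is one common explanation for why SVD-based methods are regarded as more stable.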
Notes
LinearRegression suffers from multicollinearity (when columns are correlated with each other) and from variance explosions caused by outliers. Consider using Ridge Regression to fix the multicollinearity problem, and consider first using DBSCAN to remove the outliers, or statistical analysis to filter possible outliers.
Applications of LinearRegression
LinearRegression is used in regression tasks where one wants to predict, say, sales or house prices. It is also used in extrapolation or time series tasks, dynamic systems modelling and many other machine learning tasks. This model should be tried first if the machine learning problem is a regression task (predicting a continuous variable).
For additional information, see Scikit-learn’s OLS documentation.
For an additional example see the OLS notebook.
Note
Starting from version 23.08, the new ‘copy_X’ parameter defaults to ‘True’, ensuring a copy of X is created after passing it to fit(), preventing any changes to the input, but with increased memory usage. This represents a change in behavior from previous versions. With copy_X=False a copy might still be created if necessary.
Examples
>>> import cupy as cp
>>> import cudf
>>> # Both import methods supported
>>> from cuml import LinearRegression
>>> from cuml.linear_model import LinearRegression
>>> lr = LinearRegression(fit_intercept = True, normalize = False,
...                       algorithm = "eig")
>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([1,1,2,2], dtype=cp.float32)
>>> X['col2'] = cp.array([1,2,2,3], dtype=cp.float32)
>>> y = cudf.Series(cp.array([6.0, 8.0, 9.0, 11.0], dtype=cp.float32))
>>> reg = lr.fit(X,y)
>>> print(reg.coef_)
0   1.0
1   2.0
dtype: float32
>>> print(reg.intercept_)
3.0...
>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = cp.array([3,2], dtype=cp.float32)
>>> X_new['col2'] = cp.array([5,5], dtype=cp.float32)
>>> preds = lr.predict(X_new)
>>> print(preds)
0   15.999...
1   14.999...
dtype: float32
- Attributes
- coef_array, shape (n_features)
The estimated coefficients for the linear regression model.
- intercept_array
The independent term. If fit_intercept is False, will be 0.
Methods
fit(X, y[, convert_dtype, sample_weight]) - Fit the model with X and y.
get_param_names() - Returns a list of hyperparameter names owned by this class.
get_attr_names
- fit(X, y, convert_dtype=True, sample_weight=None) LinearRegression [source]#
Fit the model with X and y.
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.
- sample_weightarray-like (device or host) shape = (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
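The role of sample_weight can be illustrated with ordinary weighted least squares in NumPy. This is a sketch of the standard technique, not cuML's code path: the helper weighted_lstsq is hypothetical, and it is an assumption that the estimator's weighting follows this textbook form. The example checks that giving an observation a weight of 2 has the same effect as repeating that observation.

```python
import numpy as np


def weighted_lstsq(X, y, sample_weight=None):
    """Least squares with an intercept column and optional row weights."""
    A = np.column_stack([X, np.ones(X.shape[0])])
    if sample_weight is None:
        sample_weight = np.ones(X.shape[0])
    root_w = np.sqrt(np.asarray(sample_weight, dtype=np.float64))
    # Scaling rows by sqrt(w) minimizes sum_i w_i * (y_i - a_i @ beta)**2.
    beta, *_ = np.linalg.lstsq(A * root_w[:, None], y * root_w, rcond=None)
    return beta[:-1], beta[-1]


X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]], dtype=np.float64)
y = np.array([6.0, 8.0, 9.0, 12.0])

coef_w, intercept_w = weighted_lstsq(X, y, sample_weight=[2, 1, 1, 1])

# Doubling a weight matches repeating that observation.
X_dup = np.vstack([X[:1], X])
y_dup = np.concatenate([y[:1], y])
coef_d, intercept_d = weighted_lstsq(X_dup, y_dup)

print(np.allclose(coef_w, coef_d), np.isclose(intercept_w, intercept_d))
```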
Logistic Regression#
- class cuml.LogisticRegression(*, penalty='l2', tol=0.0001, C=1.0, fit_intercept=True, class_weight=None, max_iter=1000, linesearch_max_iter=50, verbose=False, l1_ratio=None, solver='qn', handle=None, output_type=None)#
LogisticRegression is a linear model that is used to model probability of occurrence of certain events, for example probability of success or fail of an event.
cuML’s LogisticRegression can take array-like objects, either in host as NumPy arrays or in device (as Numba or __cuda_array_interface__ compliant), in addition to cuDF objects. It provides both single-class (using sigmoid loss) and multiple-class (using softmax loss) variants, depending on the input variables.
Only one solver option is currently available: Quasi-Newton (QN) algorithms. Even though it is presented as a single option, this solver resolves to two different algorithms underneath:
Orthant-Wise Limited Memory Quasi-Newton (OWL-QN) if there is l1 regularization
Limited Memory BFGS (L-BFGS) otherwise.
Note that, just like in Scikit-learn, the bias will not be regularized.
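The choice between the two algorithms follows the rule given in the penalty parameter description. The function below restates that rule as plain Python; resolve_qn_algorithm is a hypothetical helper written for illustration, not a cuML function, and treating l1_ratio=None like a ratio of zero is an assumption.

```python
def resolve_qn_algorithm(penalty="l2", l1_ratio=None):
    """Return the algorithm the 'qn' solver is documented to resolve to."""
    if penalty in ("none", "l2"):
        return "L-BFGS"
    if penalty == "l1":
        return "OWL-QN"
    if penalty == "elasticnet":
        # OWL-QN is only needed when some L1 regularization is present.
        # Assumption: a missing l1_ratio is treated like l1_ratio == 0.
        if l1_ratio is not None and l1_ratio > 0:
            return "OWL-QN"
        return "L-BFGS"
    raise ValueError(f"unsupported penalty: {penalty!r}")


for penalty, ratio in [("none", None), ("l2", None), ("l1", None),
                       ("elasticnet", 0.0), ("elasticnet", 0.3)]:
    print(penalty, ratio, "->", resolve_qn_algorithm(penalty, ratio))
```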
- Parameters
- penalty‘none’, ‘l1’, ‘l2’, ‘elasticnet’ (default = ‘l2’)
Used to specify the norm used in the penalization. If ‘none’ or ‘l2’ are selected, then L-BFGS solver will be used. If ‘l1’ is selected, solver OWL-QN will be used. If ‘elasticnet’ is selected, OWL-QN will be used if l1_ratio > 0, otherwise L-BFGS will be used.
- tolfloat (default = 1e-4)
Tolerance for stopping criteria. The exact stopping conditions depend on the chosen solver. Check the solver’s documentation for more details:
- Cfloat (default = 1.0)
Inverse of regularization strength; must be a positive float.
- fit_interceptboolean (default = True)
If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.
- class_weightdict or ‘balanced’, default=None
By default all classes have a weight of one. However, a dictionary can be provided with weights associated with classes in the form {class_label: weight}. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
- max_iterint (default = 1000)
Maximum number of iterations taken for the solvers to converge.
- linesearch_max_iterint (default = 50)
Max number of linesearch iterations per outer iteration used in the lbfgs and owl QN solvers.
- verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- l1_ratiofloat or None, optional (default=None)
The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1
- solver‘qn’ (default=’qn’)
Algorithm to use in the optimization problem. Currently only qn is supported, which automatically selects either L-BFGS or OWL-QN depending on the conditions of the l1 regularization described above.
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
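The “balanced” formula from the class_weight description, n_samples / (n_classes * np.bincount(y)), can be evaluated directly with NumPy. The labels below are made up for illustration, and the final step only mirrors the documented statement that class weights are multiplied with sample_weight.

```python
import numpy as np

y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])  # integer class labels
n_samples = y.shape[0]
n_classes = np.unique(y).shape[0]

# Rare classes receive larger weights than frequent ones.
balanced = n_samples / (n_classes * np.bincount(y))
class_weight = {label: float(w) for label, w in enumerate(balanced)}
print(class_weight)

# Per-sample weights are the class weight of each label times sample_weight.
sample_weight = np.ones(n_samples)
sample_weight[0] = 3.0
effective = sample_weight * balanced[y]
print(effective)
```

With these weights each class contributes the same total weight (before sample_weight is applied), which is the sense in which the mode is balanced.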
Notes
cuML’s LogisticRegression uses a different solver than the equivalent Scikit-learn estimator, except when there is no penalty and solver=lbfgs is used in Scikit-learn. This can cause (small) differences in the coefficients and predictions of the model, similar to using different solvers in Scikit-learn.
For additional information, see Scikit-learn’s LogisticRegression.
Examples
>>> import cudf
>>> import numpy as np
>>> # Both import methods supported
>>> # from cuml import LogisticRegression
>>> from cuml.linear_model import LogisticRegression
>>> X = cudf.DataFrame()
>>> X['col1'] = np.array([1,1,2,2], dtype = np.float32)
>>> X['col2'] = np.array([1,2,2,3], dtype = np.float32)
>>> y = cudf.Series(np.array([0.0, 0.0, 1.0, 1.0], dtype=np.float32))
>>> reg = LogisticRegression()
>>> reg.fit(X,y)
LogisticRegression()
>>> print(reg.coef_)
         0         1
0  0.69861  0.570058
>>> print(reg.intercept_)
0   -2.188...
dtype: float32
>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = np.array([1,5], dtype = np.float32)
>>> X_new['col2'] = np.array([2,5], dtype = np.float32)
>>> preds = reg.predict(X_new)
>>> print(preds)
0    0.0
1    1.0
dtype: float32
- Attributes
- coef_: dev array, dim (n_classes, n_features) or (n_classes, n_features+1)
The estimated coefficients for the logistic regression model.
- intercept_: device array (n_classes, 1)
The independent term. If fit_intercept is False, will be 0.
Methods
decision_function(X[, convert_dtype]) - Gives confidence score for X
fit(X, y[, sample_weight, convert_dtype]) - Fit the model with X and y.
get_param_names() - Returns a list of hyperparameter names owned by this class.
predict(X[, convert_dtype]) - Predicts the y for X.
predict_log_proba(X[, convert_dtype]) - Predicts the log class probabilities for each class in X
predict_proba(X[, convert_dtype]) - Predicts the class probabilities for each class in X
set_params(**params) - Accepts a dict of params and updates the corresponding ones owned by this class.
get_attr_names
- decision_function(X, convert_dtype=True) CumlArray [source]#
Gives confidence score for X
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the decision_function method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
- Returns
- scorecuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)
Confidence score
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
- fit(X, y, sample_weight=None, convert_dtype=True) LogisticRegression [source]#
Fit the model with X and y.
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- sample_weightarray-like (device or host) shape = (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.
- get_param_names()[source]#
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.
- predict(X, convert_dtype=True) CumlArray [source]#
Predicts the y for X.
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
- Returns
- predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Predicted values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
- predict_log_proba(X, convert_dtype=True) CumlArray [source]#
Predicts the log class probabilities for each class in X
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the predict_log_proba method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
- Returns
- predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)
Logarithm of predicted class probabilities
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
- predict_proba(X, convert_dtype=True) CumlArray [source]#
Predicts the class probabilities for each class in X
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the predict_proba method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
- Returns
- predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)
Predicted class probabilities
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
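The relationship between per-class scores, class probabilities, log-probabilities and predicted labels in a softmax model can be sketched with NumPy. The score matrix is made up, and it is an assumption that the estimator's outputs are related in exactly this way for every configuration; the sketch only shows the standard softmax construction mentioned in the class description.

```python
import numpy as np


def softmax(scores):
    # Subtracting the row maximum keeps the exponentials numerically stable.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Hypothetical confidence scores, shape (n_samples, n_classes).
scores = np.array([[2.0, 0.5, -1.0],
                   [-0.3, 0.1, 1.2]])

proba = softmax(scores)          # analogous to class probabilities
log_proba = np.log(proba)        # analogous to log class probabilities
labels = proba.argmax(axis=1)    # most probable class per sample

print(np.round(proba, 3))
print(labels)  # [0 2]
```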
- set_params(**params)[source]#
Accepts a dict of params and updates the corresponding ones owned by this class. If the child class has appropriately overridden the get_param_names method and does not need anything other than what is there in this method, then it doesn’t have to override this method.
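The pattern described for get_param_names and set_params can be sketched in plain Python. BaseSketch and RidgeSketch are hypothetical classes written for illustration, not cuML's actual base class, and silently ignoring unknown names in set_params is an assumption of this sketch.

```python
class BaseSketch:
    """Hypothetical base class; not cuML's actual Base implementation."""

    def __init__(self, *, verbose=False, output_type=None):
        self.verbose = verbose
        self.output_type = output_type

    def get_param_names(self):
        return ["verbose", "output_type"]

    def get_params(self):
        return {name: getattr(self, name) for name in self.get_param_names()}

    def set_params(self, **params):
        # Only names reported by get_param_names() are updated.
        owned = self.get_param_names()
        for name, value in params.items():
            if name in owned:
                setattr(self, name, value)
        return self


class RidgeSketch(BaseSketch):
    def __init__(self, *, alpha=1.0, **kwargs):
        super().__init__(**kwargs)
        self.alpha = alpha

    def get_param_names(self):
        # Append the parameters this subclass owns to the inherited ones.
        return super().get_param_names() + ["alpha"]


model = RidgeSketch(alpha=0.5).set_params(alpha=2.0, unknown=1)
print(model.get_params())
```

Because the subclass extends get_param_names, the inherited get_params and set_params pick up alpha without being overridden.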
Ridge Regression#
- class cuml.Ridge(*, alpha=1.0, solver='eig', fit_intercept=True, normalize=False, handle=None, output_type=None, verbose=False)#
Ridge extends LinearRegression by providing L2 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors, and improves the conditioning of the problem.
cuML’s Ridge can take array-like objects, either in host as NumPy arrays or in device (as Numba or __cuda_array_interface__ compliant), in addition to cuDF objects. It provides 3 algorithms: SVD, Eig and CD to fit a linear model. In general SVD uses significantly more memory and is slower than Eig. If using CUDA 10.1, the memory difference is even bigger than in the other supported CUDA versions. However, SVD is more stable than Eig (default). CD uses Coordinate Descent and can be faster when data is large.
- Parameters
- alphafloat (default = 1.0)
Regularization strength - must be a positive float. Larger values specify stronger regularization. Array input will be supported later.
- solver{‘eig’, ‘svd’, ‘cd’} (default = ‘eig’)
Eig uses an eigendecomposition of the covariance matrix, and is much faster. SVD is slower, but guaranteed to be stable. CD or Coordinate Descent is very fast and is suitable for large problems.
- fit_interceptboolean (default = True)
If True, Ridge tries to correct for the global mean of y. If False, the model expects that you have centered the data.
- normalizeboolean (default = False)
If True, the predictors in X will be normalized by dividing by the column-wise standard deviation. If False, no scaling will be done. Note: this is in contrast to sklearn’s deprecated normalize flag, which divides by the column-wise L2 norm; but this is the same as if using sklearn’s StandardScaler.
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
- verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
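The difference between the two scalings mentioned under normalize can be shown with NumPy. The sketch assumes that columns are centered first and that the standard deviation is the population one (ddof=0), as StandardScaler uses; neither detail is spelled out in the parameter description.

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]], dtype=np.float64)
Xc = X - X.mean(axis=0)

# Scaling by the column-wise standard deviation (population std, ddof=0).
by_std = Xc / Xc.std(axis=0)

# Scaling by the column-wise L2 norm instead.
by_l2 = Xc / np.linalg.norm(Xc, axis=0)

print(by_std.std(axis=0))             # each column has unit standard deviation
print(np.linalg.norm(by_l2, axis=0))  # each column has unit L2 norm

# For centered columns the two scalings differ by a factor of sqrt(n_samples).
print(np.allclose(by_std, by_l2 * np.sqrt(X.shape[0])))
```

Because the two scalings differ by a constant factor, a regularized model fitted on one is not equivalent to the same model fitted on the other unless the regularization strength is adjusted.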
Notes
Ridge provides L2 regularization. This means that the coefficients can shrink to become very small, but not zero. This can cause issues of interpretability on the coefficients. Consider using Lasso, or thresholding small coefficients to zero.
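The shrinking behavior described in this note can be seen in a NumPy sketch of the closed-form ridge solution, computed through an eigendecomposition. It assumes the usual objective ||y - Xw||^2 + alpha * ||w||^2 with the intercept handled by centering; ridge_fit is a hypothetical helper, not cuML's solver.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Solve (Xc^T Xc + alpha * I) w = Xc^T yc on centered data."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Adding alpha to each eigenvalue of Xc^T Xc applies the L2 penalty.
    evals, evecs = np.linalg.eigh(Xc.T @ Xc)
    coef = evecs @ ((evecs.T @ (Xc.T @ yc)) / (evals + alpha))
    return coef, y_mean - x_mean @ coef


X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]], dtype=np.float64)
y = np.array([6.0, 8.0, 9.0, 11.0])

for alpha in (1e-5, 1.0, 100.0):
    coef, intercept = ridge_fit(X, y, alpha)
    print(alpha, np.round(coef, 4), round(float(intercept), 4))

# Coefficients shrink as alpha grows but do not become exactly zero;
# thresholding is a separate, explicit step.
coef_large, _ = ridge_fit(X, y, 100.0)
thresholded = np.where(np.abs(coef_large) < 0.03, 0.0, coef_large)
print(thresholded)
```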
Applications of Ridge
Ridge Regression is used in the same way as LinearRegression, but does not suffer from multicollinearity issues. Ridge is used in insurance premium prediction, stock market analysis and much more.
For additional docs, see Scikit-learn’s Ridge Regression.
Examples
>>> import cupy as cp
>>> import cudf
>>> # Both import methods supported
>>> from cuml import Ridge
>>> from cuml.linear_model import Ridge
>>> alpha = 1e-5
>>> ridge = Ridge(alpha=alpha, fit_intercept=True, normalize=False,
...               solver="eig")
>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([1,1,2,2], dtype = cp.float32)
>>> X['col2'] = cp.array([1,2,2,3], dtype = cp.float32)
>>> y = cudf.Series(cp.array([6.0, 8.0, 9.0, 11.0], dtype=cp.float32))
>>> result_ridge = ridge.fit(X, y)
>>> print(result_ridge.coef_)
0 1.000...
1 1.999...
>>> print(result_ridge.intercept_)
3.0...
>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = cp.array([3,2], dtype=cp.float32)
>>> X_new['col2'] = cp.array([5,5], dtype=cp.float32)
>>> preds = result_ridge.predict(X_new)
>>> print(preds)
0 15.999...
1 14.999...
- Attributes
- coef_array, shape (n_features)
The estimated coefficients for the linear regression model.
- intercept_array
The independent term. If fit_intercept is False, will be 0.
Methods
fit(X, y[, convert_dtype, sample_weight]) - Fit the model with X and y.
get_param_names() - Returns a list of hyperparameter names owned by this class.
set_params(**params) - Accepts a dict of params and updates the corresponding ones owned by this class.
get_attr_names
- fit(X, y, convert_dtype=True, sample_weight=None) Ridge [source]#
Fit the model with X and y.
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.
- sample_weightarray-like (device or host) shape = (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- get_param_names()[source]#
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.
- set_params(**params)[source]#
Accepts a dict of params and updates the corresponding ones owned by this class. If the child class has appropriately overridden the get_param_names method and does not need anything other than what is there in this method, then it doesn’t have to override this method.
Lasso Regression#
- class cuml.Lasso(*, alpha=1.0, fit_intercept=True, normalize=False, max_iter=1000, tol=0.001, solver='cd', selection='cyclic', handle=None, output_type=None, verbose=False)[source]#
Lasso extends LinearRegression by providing L1 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can zero some of the coefficients for feature selection and improves the conditioning of the problem.
cuML’s Lasso can take array-like objects, either in host as NumPy arrays or in device (as Numba or __cuda_array_interface__ compliant), in addition to cuDF objects. It uses coordinate descent to fit a linear model.
This estimator supports cuML’s experimental device selection capabilities. It can be configured to run on either the CPU or the GPU. To learn more, please see CPU / GPU Device Selection (Experimental).
- Parameters
- alphafloat (default = 1.0)
Constant that multiplies the L1 term. alpha = 0 is equivalent to an ordinary least square, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised. Given this, you should use the LinearRegression object.
- fit_interceptboolean (default = True)
If True, Lasso tries to correct for the global mean of y. If False, the model expects that you have centered the data.
- normalizeboolean (default = False)
If True, the predictors in X will be normalized by dividing by the column-wise standard deviation. If False, no scaling will be done. Note: this is in contrast to sklearn’s deprecated normalize flag, which divides by the column-wise L2 norm; but this is the same as if using sklearn’s StandardScaler.
- max_iterint (default = 1000)
The maximum number of iterations
- tolfloat (default = 1e-3)
The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
- solver{‘cd’, ‘qn’} (default=’cd’)
Choose an algorithm:
‘cd’ - coordinate descent
‘qn’ - quasi-newton
You may find the alternative ‘qn’ algorithm is faster when the number of features is sufficiently large, but the sample size is small.
- selection{‘cyclic’, ‘random’} (default=’cyclic’)
If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
- verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
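The ‘cd’ solver with ‘cyclic’ selection can be illustrated by a small NumPy coordinate descent loop built on soft-thresholding. This is a teaching sketch, not cuML's implementation: it assumes the Scikit-learn-style objective (1 / (2 * n_samples)) * ||y - Xw - b||^2 + alpha * ||w||_1, the helper names are hypothetical, and it stops on the size of the coefficient updates only, without the dual gap check mentioned under tol.

```python
import numpy as np


def soft_threshold(value, threshold):
    # Shrink towards zero by `threshold`, clipping at zero.
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_cd(X, y, alpha, max_iter=1000, tol=1e-10):
    n_samples, n_features = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean        # centering handles the intercept
    coef = np.zeros(n_features)
    col_sq = (Xc ** 2).sum(axis=0) / n_samples
    for _ in range(max_iter):
        max_update = 0.0
        for j in range(n_features):        # 'cyclic': features in order
            # Residual with feature j's current contribution removed.
            residual = yc - Xc @ coef + Xc[:, j] * coef[j]
            rho = Xc[:, j] @ residual / n_samples
            new_coef = soft_threshold(rho, alpha) / col_sq[j]
            max_update = max(max_update, abs(new_coef - coef[j]))
            coef[j] = new_coef
        if max_update < tol:
            break
    return coef, y_mean - x_mean @ coef


X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]], dtype=np.float64)
y = np.array([6.0, 8.0, 9.0, 11.0])

coef, intercept = lasso_cd(X, y, alpha=0.1)
print(np.round(coef, 6), round(float(intercept), 6))
```

Under the stated objective the coefficients for this data converge to roughly [0.6, 2.0], compared with [1, 2] for the unpenalized fit, so the L1 term pulls the first coefficient towards zero.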
Notes
For additional docs, see Scikit-learn’s Lasso.
Examples
>>> import numpy as np
>>> import cudf
>>> from cuml.linear_model import Lasso
>>> ls = Lasso(alpha = 0.1, solver='qn')
>>> X = cudf.DataFrame()
>>> X['col1'] = np.array([0, 1, 2], dtype = np.float32)
>>> X['col2'] = np.array([0, 1, 2], dtype = np.float32)
>>> y = cudf.Series( np.array([0.0, 1.0, 2.0], dtype = np.float32) )
>>> result_lasso = ls.fit(X, y)
>>> print(result_lasso.coef_)
0   0.425
1   0.425
dtype: float32
>>> print(result_lasso.intercept_)
0.150000...
>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = np.array([3,2], dtype = np.float32)
>>> X_new['col2'] = np.array([5,5], dtype = np.float32)
>>> preds = result_lasso.predict(X_new)
>>> print(preds)
0   3.549997
1   3.124997
dtype: float32
- Attributes
- coef_array, shape (n_features)
The estimated coefficients for the linear regression model.
- intercept_array
The independent term. If fit_intercept is False, will be 0.
Methods
get_param_names() - Returns a list of hyperparameter names owned by this class.
ElasticNet Regression#
- class cuml.ElasticNet(*, alpha=1.0, l1_ratio=0.5, fit_intercept=True, normalize=False, max_iter=1000, tol=0.001, solver='cd', selection='cyclic', handle=None, output_type=None, verbose=False)#
ElasticNet extends LinearRegression with combined L1 and L2 regularizations on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors, force some coefficients to be small, and improves the conditioning of the problem.
cuML’s ElasticNet takes an array-like object or cuDF DataFrame and uses coordinate descent to fit a linear model.
- Parameters
- alphafloat (default = 1.0)
Constant that multiplies the penalty terms. alpha = 0 is equivalent to an ordinary least square, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the ElasticNet object is not advised. Given this, you should use the LinearRegression object.
- l1_ratiofloat (default = 0.5)
The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
- fit_interceptboolean (default = True)
If True, ElasticNet tries to correct for the global mean of y. If False, the model expects that you have centered the data.
- normalizeboolean (default = False)
If True, the predictors in X will be normalized by dividing by the column-wise standard deviation. If False, no scaling will be done. Note: this is in contrast to sklearn’s deprecated normalize flag, which divides by the column-wise L2 norm; but this is the same as if using sklearn’s StandardScaler.
- max_iterint (default = 1000)
The maximum number of iterations
- tolfloat (default = 1e-3)
The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
- solver{‘cd’, ‘qn’} (default=’cd’)
Choose an algorithm:
‘cd’ - coordinate descent
‘qn’ - quasi-newton
You may find the alternative ‘qn’ algorithm is faster when the number of features is sufficiently large, but the sample size is small.
- selection{‘cyclic’, ‘random’} (default=’cyclic’)
If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (
cuml.global_settings.output_type
) will be used. See Output Data Type Configuration for more info.
- verboseint or boolean, default=False
Sets logging level. It must be one of
cuml.common.logger.level_*
. See Verbosity Levels for more info.
Notes
For additional docs, see scikit-learn’s ElasticNet.
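The normalize option described in the parameters above divides each predictor column by its standard deviation. A minimal pure-Python sketch of that scaling follows. The helper name `scale_columns` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator cuML applies internally.

```python
import statistics


def scale_columns(rows):
    """Divide every column by its standard deviation (no centering shown).

    Assumption: the population standard deviation is used. A constant
    column has zero deviation and would raise ZeroDivisionError here.
    """
    columns = list(zip(*rows))
    stds = [statistics.pstdev(col) for col in columns]
    return [[value / std for value, std in zip(row, stds)] for row in rows]


X = [[0.0, 10.0], [2.0, 30.0], [4.0, 50.0]]
X_scaled = scale_columns(X)
```

After scaling, each column has unit standard deviation, so the regularization penalty treats predictors measured on different scales more evenly.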
Examples
>>> import cupy as cp
>>> import cudf
>>> from cuml.linear_model import ElasticNet
>>> enet = ElasticNet(alpha = 0.1, l1_ratio=0.5, solver='qn')
>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([0, 1, 2], dtype = cp.float32)
>>> X['col2'] = cp.array([0, 1, 2], dtype = cp.float32)
>>> y = cudf.Series(cp.array([0.0, 1.0, 2.0], dtype = cp.float32) )
>>> result_enet = enet.fit(X, y)
>>> print(result_enet.coef_)
0    0.445...
1    0.445...
dtype: float32
>>> print(result_enet.intercept_)
0.108433...
>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = cp.array([3,2], dtype = cp.float32)
>>> X_new['col2'] = cp.array([5,5], dtype = cp.float32)
>>> preds = result_enet.predict(X_new)
>>> print(preds)
0    3.674...
1    3.228...
dtype: float32
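To make the roles of alpha and l1_ratio concrete, the sketch below evaluates the regularization term for a given coefficient vector. It assumes the scikit-learn convention for the penalty, alpha * (l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||_2^2); the exact scaling used inside cuML is not stated here, and the function name `elastic_net_penalty` is hypothetical.

```python
def elastic_net_penalty(coef, alpha=1.0, l1_ratio=0.5):
    """Regularization term under the scikit-learn convention (an assumption)."""
    if not 0.0 <= l1_ratio <= 1.0:
        raise ValueError("l1_ratio must lie in [0, 1]")
    l1_norm = sum(abs(w) for w in coef)
    squared_l2_norm = sum(w * w for w in coef)
    return alpha * (l1_ratio * l1_norm + 0.5 * (1.0 - l1_ratio) * squared_l2_norm)


coef = [1.0, -2.0]
pure_l2 = elastic_net_penalty(coef, alpha=1.0, l1_ratio=0.0)  # L2 term only
pure_l1 = elastic_net_penalty(coef, alpha=1.0, l1_ratio=1.0)  # L1 term only
mixed = elastic_net_penalty(coef, alpha=1.0, l1_ratio=0.5)    # blend of both
```

Setting l1_ratio to 0 or 1 recovers the pure L2 or pure L1 penalty, which matches the description of the mixing parameter above.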
- Attributes
- coef_array, shape (n_features)
The estimated coefficients for the linear regression model.
- intercept_array
The independent term. If
fit_intercept
is False, will be 0.
Methods
- fit(X, y[, convert_dtype, sample_weight]): Fit the model with X and y.
- get_param_names(): Returns a list of hyperparameter names owned by this class.
- set_params(**params): Accepts a dict of params and updates the corresponding ones owned by this class.
- get_attr_names
- fit(X, y, convert_dtype=True, sample_weight=None) ElasticNet [source]#
Fit the model with X and y.
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.
- sample_weightarray-like (device or host) shape = (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
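As an illustration of what per-observation weights mean for a squared-error fit, the sketch below computes a weighted mean squared error in plain Python. The function name `weighted_mse` and the normalization by the total weight are assumptions made for the example; how cuML applies sample_weight internally is not described here.

```python
def weighted_mse(y_true, y_pred, sample_weight=None):
    """Weighted mean squared error; None means every observation counts equally."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    total_weight = sum(sample_weight)
    weighted_sum = sum(
        w * (t - p) ** 2 for w, t, p in zip(sample_weight, y_true, y_pred)
    )
    return weighted_sum / total_weight


y_true = [1.0, 2.0]
y_pred = [0.0, 2.0]
equal = weighted_mse(y_true, y_pred)                       # both rows count equally
ignore_first = weighted_mse(y_true, y_pred, [0.0, 1.0])    # first row ignored
emphasize_first = weighted_mse(y_true, y_pred, [3.0, 1.0]) # first row dominates
```

A zero weight removes an observation from the loss, while a larger weight makes its residual count proportionally more.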
- get_param_names()[source]#
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.
- set_params(**params)[source]#
Accepts a dict of params and updates the corresponding ones owned by this class. If the child class has appropriately overridden the get_param_names method and does not need anything beyond what this method already does, it does not have to override this method.
Mini Batch SGD Classifier#
- class cuml.MBSGDClassifier(*, loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, epochs=1000, tol=0.001, shuffle=True, learning_rate='constant', eta0=0.001, power_t=0.5, batch_size=32, n_iter_no_change=5, handle=None, verbose=False, output_type=None)#
Linear models (linear SVM, logistic regression, or linear regression) fitted by minimizing a regularized empirical loss with mini-batch SGD. The MBSGD Classifier implementation is experimental and it uses a different algorithm than sklearn’s SGDClassifier. In order to improve the results obtained from cuML’s MBSGDClassifier:
Reduce the batch size
Increase the eta0
Increase the number of iterations
Since cuML analyzes the data in batches, using a small eta0 might not let the model learn as much as scikit-learn does. Furthermore, decreasing the batch size might increase the time required to fit the model.
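The trade-off between batch size and fitting time can be seen by counting gradient updates. The sketch below is a back-of-the-envelope calculation, not cuML code: the helper `updates_per_fit` is hypothetical, and it assumes that a final partial batch still triggers an update and that no early stopping occurs.

```python
import math


def updates_per_fit(n_samples, batch_size, epochs):
    """Number of gradient updates if every epoch runs to completion."""
    # Assumption: a trailing partial batch is processed as its own update.
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch * epochs


default_batches = updates_per_fit(n_samples=1000, batch_size=32, epochs=10)
small_batches = updates_per_fit(n_samples=1000, batch_size=8, epochs=10)
```

With the smaller batch size the model receives roughly four times as many updates per epoch, which tends to help convergence but also means more work per pass over the data.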
- Parameters
- loss{‘hinge’, ‘log’, ‘squared_loss’} (default = ‘hinge’)
‘hinge’ uses linear SVM
‘log’ uses logistic regression
‘squared_loss’ uses linear regression
- penalty{‘none’, ‘l1’, ‘l2’, ‘elasticnet’} (default = ‘l2’)
‘none’ does not perform any regularization
‘l1’ performs L1 norm (Lasso) which minimizes the sum of the abs value of coefficients
‘l2’ performs L2 norm (Ridge) which minimizes the sum of the square of the coefficients
‘elasticnet’ performs Elastic Net regularization which is a weighted average of L1 and L2 norms
- alphafloat (default = 0.0001)
The constant value which decides the degree of regularization
- l1_ratiofloat (default=0.15)
The l1_ratio is used only when penalty = elasticnet. The value for l1_ratio should be 0 <= l1_ratio <= 1. When l1_ratio = 0 then the penalty = 'l2' and if l1_ratio = 1 then penalty = 'l1'.
- batch_sizeint (default = 32)
It sets the number of samples that will be included in each batch.
- fit_interceptboolean (default = True)
If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.
- epochsint (default = 1000)
The number of times the model should iterate through the entire dataset during training (default = 1000)
- tolfloat (default = 1e-3)
The training process will stop if current_loss > previous_loss - tol
- shuffleboolean (default = True)
True, shuffles the training data after each epoch
False, does not shuffle the training data after each epoch
- eta0float (default = 0.001)
Initial learning rate
- power_tfloat (default = 0.5)
The exponent used for calculating the invscaling learning rate
- learning_rate{‘optimal’, ‘constant’, ‘invscaling’, ‘adaptive’} (default = ‘constant’)
optimal option will be supported in a future version
constant keeps the learning rate constant
adaptive changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs. The old learning rate is generally divided by 5
- n_iter_no_changeint (default = 5)
the number of epochs to train without any improvement in the model
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- verboseint or boolean, default=False
Sets logging level. It must be one of
cuml.common.logger.level_*
. See Verbosity Levels for more info.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (
cuml.global_settings.output_type
) will be used. See Output Data Type Configuration for more info.
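The learning_rate, eta0, power_t and n_iter_no_change parameters interact as a schedule. The sketch below illustrates two of the options in plain Python. The invscaling formula eta0 / t**power_t follows the convention documented for scikit-learn's SGD estimators and is an assumption about cuML; the division by 5 for 'adaptive' is taken from the parameter description above. Both function names are hypothetical.

```python
def invscaling_rate(eta0, power_t, step):
    """Learning rate at update `step` (counted from 1), scikit-learn convention."""
    return eta0 / (step ** power_t)


def adaptive_rate(current_rate, epochs_without_improvement, n_iter_no_change=5):
    """Divide the rate by 5 once training has stalled for n_iter_no_change epochs."""
    if epochs_without_improvement >= n_iter_no_change:
        return current_rate / 5.0
    return current_rate


rate_at_step_4 = invscaling_rate(eta0=0.001, power_t=0.5, step=4)
still_improving = adaptive_rate(0.05, epochs_without_improvement=2)
after_stall = adaptive_rate(0.05, epochs_without_improvement=5)
```

With learning_rate='constant', the default, the rate simply stays at eta0 for every update.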
Notes
For additional docs, see scikit-learn’s SGDClassifier.
Examples
>>> import cupy as cp
>>> import cudf
>>> from cuml.linear_model import MBSGDClassifier
>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([1,1,2,2], dtype = cp.float32)
>>> X['col2'] = cp.array([1,2,2,3], dtype = cp.float32)
>>> y = cudf.Series(cp.array([1, 1, 2, 2], dtype=cp.float32))
>>> pred_data = cudf.DataFrame()
>>> pred_data['col1'] = cp.asarray([3, 2], dtype=cp.float32)
>>> pred_data['col2'] = cp.asarray([5, 5], dtype=cp.float32)
>>> cu_mbsgd_classifier = MBSGDClassifier(learning_rate='constant',
...                                       eta0=0.05, epochs=2000,
...                                       fit_intercept=True,
...                                       batch_size=1, tol=0.0,
...                                       penalty='l2',
...                                       loss='squared_loss',
...                                       alpha=0.5)
>>> cu_mbsgd_classifier.fit(X, y)
MBSGDClassifier()
>>> print("cuML intercept : ", cu_mbsgd_classifier.intercept_)
cuML intercept :  0.725...
>>> print("cuML coef : ", cu_mbsgd_classifier.coef_)
cuML coef :  0    0.273...
1    0.182...
dtype: float32
>>> cu_pred = cu_mbsgd_classifier.predict(pred_data)
>>> print("cuML predictions : ", cu_pred)
cuML predictions :  0    1.0
1    1.0
dtype: float32
Methods
- fit(X, y[, convert_dtype]): Fit the model with X and y.
- get_param_names(): Returns a list of hyperparameter names owned by this class.
- predict(X[, convert_dtype]): Predicts the y for X.
- set_params(**params): Accepts a dict of params and updates the corresponding ones owned by this class.
- fit(X, y, convert_dtype=True) MBSGDClassifier [source]#
Fit the model with X and y.
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.
- get_param_names()[source]#
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.
- predict(X, convert_dtype=True) CumlArray [source]#
Predicts the y for X.
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
- Returns
- predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Predicted values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
- set_params(**params)[source]#
Accepts a dict of params and updates the corresponding ones owned by this class. If the child class has appropriately overridden the get_param_names method and does not need anything beyond what this method already does, it does not have to override this method.
Mini Batch SGD Regressor#
- class cuml.MBSGDRegressor(*, loss='squared_loss', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, epochs=1000, tol=0.001, shuffle=True, learning_rate='constant', eta0=0.001, power_t=0.5, batch_size=32, n_iter_no_change=5, handle=None, verbose=False, output_type=None)#
Linear regression model fitted by minimizing a regularized empirical loss with mini-batch SGD. The MBSGD Regressor implementation is experimental and it uses a different algorithm than sklearn’s SGDRegressor. In order to improve the results obtained from cuML’s MBSGD Regressor:
Reduce the batch size
Increase the eta0
Increase the number of iterations
Since cuML analyzes the data in batches, using a small eta0 might not let the model learn as much as scikit-learn does. Furthermore, decreasing the batch size might increase the time required to fit the model.
- Parameters
- loss‘squared_loss’ (default = ‘squared_loss’)
‘squared_loss’ uses linear regression
- penalty‘none’, ‘l1’, ‘l2’, ‘elasticnet’ (default = ‘l2’)
‘none’ does not perform any regularization
‘l1’ performs L1 norm (Lasso) which minimizes the sum of the abs value of coefficients
‘l2’ performs L2 norm (Ridge) which minimizes the sum of the square of the coefficients
‘elasticnet’ performs Elastic Net regularization which is a weighted average of L1 and L2 norms
- alphafloat (default = 0.0001)
The constant value which decides the degree of regularization
- fit_interceptboolean (default = True)
If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.
- l1_ratiofloat (default=0.15)
The l1_ratio is used only when penalty = elasticnet. The value for l1_ratio should be 0 <= l1_ratio <= 1. When l1_ratio = 0 then the penalty = 'l2' and if l1_ratio = 1 then penalty = 'l1'.
- batch_sizeint (default = 32)
It sets the number of samples that will be included in each batch.
- epochsint (default = 1000)
The number of times the model should iterate through the entire dataset during training (default = 1000)
- tolfloat (default = 1e-3)
The training process will stop if current_loss > previous_loss - tol
- shuffleboolean (default = True)
True, shuffles the training data after each epoch
False, does not shuffle the training data after each epoch
- eta0float (default = 0.001)
Initial learning rate
- power_tfloat (default = 0.5)
The exponent used for calculating the invscaling learning rate
- learning_rate{‘optimal’, ‘constant’, ‘invscaling’, ‘adaptive’} (default = ‘constant’)
optimal option will be supported in a future version
constant keeps the learning rate constant
adaptive changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs. The old learning rate is generally divided by 5
- n_iter_no_changeint (default = 5)
the number of epochs to train without any improvement in the model
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- verboseint or boolean, default=False
Sets logging level. It must be one of
cuml.common.logger.level_*
. See Verbosity Levels for more info.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (
cuml.global_settings.output_type
) will be used. See Output Data Type Configuration for more info.
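The tol and n_iter_no_change parameters together describe an early-stopping rule. The sketch below is one plausible reading of that rule in plain Python: an epoch counts as stalled when current_loss > previous_loss - tol, and training stops after n_iter_no_change consecutive stalled epochs. Comparing against the immediately preceding epoch (rather than the best loss so far) and the way the two parameters combine are assumptions, and the function name `epochs_run` is hypothetical.

```python
def epochs_run(losses, tol=1e-3, n_iter_no_change=5):
    """Return how many epochs run before the stopping rule fires."""
    stalled = 0
    previous = float("inf")
    for epoch, current in enumerate(losses, start=1):
        # An epoch is "stalled" when the loss did not drop by at least tol.
        if current > previous - tol:
            stalled += 1
        else:
            stalled = 0
        if stalled >= n_iter_no_change:
            return epoch
        previous = current
    return len(losses)


plateau = [1.0, 0.5, 0.4999, 0.4998, 0.4997]
stopped_at = epochs_run(plateau, tol=1e-3, n_iter_no_change=2)
ran_all = epochs_run([1.0, 0.8, 0.6], tol=1e-3, n_iter_no_change=2)
```

Setting tol=0.0, as in the example for this class, means only a loss that actually increases counts as a stalled epoch under this reading.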
Notes
For additional docs, see scikit-learn’s SGDRegressor.
Examples
>>> import cupy as cp
>>> import cudf
>>> from cuml.linear_model import MBSGDRegressor as cumlMBSGDRegressor
>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([1,1,2,2], dtype = cp.float32)
>>> X['col2'] = cp.array([1,2,2,3], dtype = cp.float32)
>>> y = cudf.Series(cp.array([1, 1, 2, 2], dtype=cp.float32))
>>> pred_data = cudf.DataFrame()
>>> pred_data['col1'] = cp.asarray([3, 2], dtype=cp.float32)
>>> pred_data['col2'] = cp.asarray([5, 5], dtype=cp.float32)
>>> cu_mbsgd_regressor = cumlMBSGDRegressor(learning_rate='constant',
...                                         eta0=0.05, epochs=2000,
...                                         fit_intercept=True,
...                                         batch_size=1, tol=0.0,
...                                         penalty='l2',
...                                         loss='squared_loss',
...                                         alpha=0.5)
>>> cu_mbsgd_regressor.fit(X, y)
MBSGDRegressor()
>>> print("cuML intercept : ", cu_mbsgd_regressor.intercept_)
cuML intercept :  0.725...
>>> print("cuML coef : ", cu_mbsgd_regressor.coef_)
cuML coef :  0    0.273...
1    0.182...
dtype: float32
>>> cu_pred = cu_mbsgd_regressor.predict(pred_data)
>>> print("cuML predictions : ", cu_pred)
cuML predictions :  0    2.456...
1    2.183...
dtype: float32
Methods
- fit(X, y[, convert_dtype]): Fit the model with X and y.
- get_param_names(): Returns a list of hyperparameter names owned by this class.
- predict(X[, convert_dtype]): Predicts the y for X.
- set_params(**params): Accepts a dict of params and updates the corresponding ones owned by this class.
- fit(X, y, convert_dtype=True) MBSGDRegressor [source]#
Fit the model with X and y.
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.
- get_param_names()[source]#
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.
- predict(X, convert_dtype=True) CumlArray [source]#
Predicts the y for X.
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
- Returns
- predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Predicted values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
- set_params(**params)[source]#
Accepts a dict of params and updates the corresponding ones owned by this class. If the child class has appropriately overridden the get_param_names method and does not need anything beyond what this method already does, it does not have to override this method.
Multiclass Classification#
- class cuml.multiclass.MulticlassClassifier(estimator, *, handle=None, verbose=False, output_type=None, strategy='ovr')[source]#
Wrapper around scikit-learn multiclass classifiers that allows choosing between different multiclass strategies.
The input can be any kind of cuML compatible array, and the output type follows cuML’s output type configuration rules.
Before passing the data to scikit-learn, it is converted to a host (NumPy) array. Under the hood the data is partitioned for binary classification, and it is transferred back to the device by the cuML estimator. These copies back and forth between the device and the host have some overhead. For more details see issue rapidsai/cuml#2876.
- Parameters
- estimatorcuML estimator
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- verboseint or boolean, default=False
Sets logging level. It must be one of
cuml.common.logger.level_*
. See Verbosity Levels for more info.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (
cuml.global_settings.output_type
) will be used. See Output Data Type Configuration for more info.
- strategy: string {‘ovr’, ‘ovo’}, default=‘ovr’
Multiclass classification strategy: ‘ovr’: one vs. rest or ‘ovo’: one vs. one
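The two strategies differ in how the multiclass problem is split into binary problems. The pure-Python sketch below lists those binary sub-problems for a set of class labels; it illustrates the general one-vs-rest and one-vs-one schemes rather than cuML's internal partitioning, and the helper name `binary_problems` is hypothetical.

```python
from itertools import combinations


def binary_problems(classes, strategy="ovr"):
    """List the binary sub-problems implied by a multiclass strategy."""
    if strategy == "ovr":
        # One classifier per class: that class against all the others.
        return [(label, "rest") for label in classes]
    if strategy == "ovo":
        # One classifier per unordered pair of classes.
        return list(combinations(classes, 2))
    raise ValueError("strategy must be 'ovr' or 'ovo'")


classes = [0, 1, 2, 3]
ovr = binary_problems(classes, "ovr")
ovo = binary_problems(classes, "ovo")
```

For n classes, one-vs-rest trains n binary classifiers on the full data, while one-vs-one trains n * (n - 1) / 2 classifiers, each on the samples of only two classes.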
Examples
>>> from cuml.linear_model import LogisticRegression
>>> from cuml.multiclass import MulticlassClassifier
>>> from cuml.datasets.classification import make_classification
>>> X, y = make_classification(n_samples=10, n_features=6,
...                            n_informative=4, n_classes=3,
...                            random_state=137)
>>> cls = MulticlassClassifier(LogisticRegression(), strategy='ovo')
>>> cls.fit(X,y)
MulticlassClassifier()
>>> cls.predict(X)
array([1, 1, 1, 1, 1, 1, 2, 1, 1, 2])
- Attributes
- classes_float, shape (n_classes_)
Array of class labels.
- n_classes_int
Number of classes.
Methods
- decision_function(X): Calculate the decision function.
- fit(X, y): Fit a multiclass classifier.
- get_param_names(): Returns a list of hyperparameter names owned by this class.
- predict(X): Predict using multi class classifier.
- decision_function(X) CumlArray [source]#
Calculate the decision function.
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- Returns
- resultscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Decision function values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
- fit(X, y) MulticlassClassifier [source]#
Fit a multiclass classifier.
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix of any dtype. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- get_param_names()[source]#
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.
- predict(X) CumlArray [source]#
Predict using multi class classifier.
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- Returns
- predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Predicted values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
- class cuml.multiclass.OneVsOneClassifier(estimator, *args, handle=None, verbose=False, output_type=None)[source]#
Wrapper around scikit-learn’s class with the same name. The input can be any kind of cuML compatible array, and the output type follows cuML’s output type configuration rules.
Before passing the data to scikit-learn, it is converted to a host (NumPy) array. Under the hood the data is partitioned for binary classification, and it is transferred back to the device by the cuML estimator. These copies back and forth between the device and the host have some overhead. For more details see issue rapidsai/cuml#2876.
For documentation see scikit-learn’s OneVsOneClassifier.
- Parameters
- estimatorcuML estimator
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- verboseint or boolean, default=False
Sets logging level. It must be one of
cuml.common.logger.level_*
. See Verbosity Levels for more info.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (
cuml.global_settings.output_type
) will be used. See Output Data Type Configuration for more info.
Examples
>>> from cuml.linear_model import LogisticRegression
>>> from cuml.multiclass import OneVsOneClassifier
>>> from cuml.datasets.classification import make_classification
>>> X, y = make_classification(n_samples=10, n_features=6,
...                            n_informative=4, n_classes=3,
...                            random_state=137)
>>> cls = OneVsOneClassifier(LogisticRegression())
>>> cls.fit(X,y)
OneVsOneClassifier()
>>> cls.predict(X)
array([1, 1, 1, 1, 1, 1, 2, 1, 1, 2])
Methods
- get_param_names(): Returns a list of hyperparameter names owned by this class.
- class cuml.multiclass.OneVsRestClassifier(estimator, *args, handle=None, verbose=False, output_type=None)[source]#
Wrapper around scikit-learn’s class with the same name. The input can be any kind of cuML compatible array, and the output type follows cuML’s output type configuration rules.
Before passing the data to scikit-learn, it is converted to a host (NumPy) array. Under the hood the data is partitioned for binary classification, and it is transferred back to the device by the cuML estimator. These copies back and forth between the device and the host have some overhead. For more details see issue rapidsai/cuml#2876.
For documentation see scikit-learn’s OneVsRestClassifier.
- Parameters
- estimatorcuML estimator
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- verboseint or boolean, default=False
Sets logging level. It must be one of
cuml.common.logger.level_*
. See Verbosity Levels for more info.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (
cuml.global_settings.output_type
) will be used. See Output Data Type Configuration for more info.
Examples
>>> from cuml.linear_model import LogisticRegression
>>> from cuml.multiclass import OneVsRestClassifier
>>> from cuml.datasets.classification import make_classification
>>> X, y = make_classification(n_samples=10, n_features=6,
...                            n_informative=4, n_classes=3,
...                            random_state=137)
>>> cls = OneVsRestClassifier(LogisticRegression())
>>> cls.fit(X,y)
OneVsRestClassifier()
>>> cls.predict(X)
array([1, 1, 1, 1, 1, 1, 2, 1, 1, 2])
Methods
- get_param_names(): Returns a list of hyperparameter names owned by this class.
Naive Bayes#
- class cuml.naive_bayes.MultinomialNB(*, alpha=1.0, fit_prior=True, class_prior=None, output_type=None, handle=None, verbose=False)[source]#
Naive Bayes classifier for multinomial models
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification).
The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
- Parameters
- alphafloat (default=1.0)
Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
- fit_priorboolean (default=True)
Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
- class_priorarray-like, size (n_classes) (default=None)
Prior probabilities of the classes. If specified, the priors are not adjusted according to the data.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (
cuml.global_settings.output_type
) will be used. See Output Data Type Configuration for more info.
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- verboseint or boolean, default=False
Sets logging level. It must be one of
cuml.common.logger.level_*
. See Verbosity Levels for more info.
Examples
Load the 20 newsgroups dataset from Scikit-learn and train a Naive Bayes classifier.
>>> import cupy as cp
>>> import cupyx
>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from cuml.naive_bayes import MultinomialNB
>>> # Load corpus
>>> twenty_train = fetch_20newsgroups(subset='train', shuffle=True,
...                                   random_state=42)
>>> # Turn documents into term frequency vectors
>>> count_vect = CountVectorizer()
>>> features = count_vect.fit_transform(twenty_train.data)
>>> # Put feature vectors and labels on the GPU
>>> X = cupyx.scipy.sparse.csr_matrix(features.tocsr(),
...                                   dtype=cp.float32)
>>> y = cp.asarray(twenty_train.target, dtype=cp.int32)
>>> # Train model
>>> model = MultinomialNB()
>>> model.fit(X, y)
MultinomialNB()
>>> # Compute accuracy on training set
>>> model.score(X, y)
0.9245...
- Attributes
- class_count_ndarray of shape (n_classes)
Number of samples encountered for each class during fitting.
- class_log_prior_ndarray of shape (n_classes)
Log probability of each class (smoothed).
- classes_ndarray of shape (n_classes,)
Class labels known to the classifier
- feature_count_ndarray of shape (n_classes, n_features)
Number of samples encountered for each (class, feature) during fitting.
- feature_log_prob_ndarray of shape (n_classes, n_features)
Empirical log probability of features given a class, P(x_i|y).
- n_features_int
Number of features of each sample.
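The relationship between feature_count_, alpha and feature_log_prob_ can be sketched in plain Python. This is a minimal illustration that assumes the usual additive-smoothing estimate (count + alpha) / (class total + alpha * n_features), as used by scikit-learn's MultinomialNB. The helper name smoothed_feature_log_prob is hypothetical and is not part of cuML.

```python
import math


def smoothed_feature_log_prob(feature_count, alpha=1.0):
    """Return log P(x_i | y) per class from raw (class, feature) counts.

    Assumption: additive (Laplace/Lidstone) smoothing of the form
    (count + alpha) / (class_total + alpha * n_features).
    """
    log_prob = []
    for row in feature_count:
        n_features = len(row)
        denom = sum(row) + alpha * n_features
        log_prob.append([math.log((c + alpha) / denom) for c in row])
    return log_prob


# Two classes, three features (toy counts, not real data).
counts = [[3, 0, 1], [0, 2, 2]]
log_prob = smoothed_feature_log_prob(counts, alpha=1.0)
```

With alpha=1.0 a feature that never occurred in a class still receives a small non-zero probability, so the log probability stays finite.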
- class cuml.naive_bayes.BernoulliNB(*, alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None, output_type=None, handle=None, verbose=False)[source]#
Naive Bayes classifier for multivariate Bernoulli models. Like MultinomialNB, this classifier is suitable for discrete data. The difference is that while MultinomialNB works with occurrence counts, BernoulliNB is designed for binary/boolean features.
- Parameters
- alphafloat, default=1.0
Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
- binarizefloat or None, default=0.0
Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.
- fit_priorbool, default=True
Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
- class_priorarray-like of shape (n_classes,), default=None
Prior probabilities of the classes. If specified, the priors are not adjusted according to the data.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
References
C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to Information Retrieval. Cambridge University Press, pp. 234-265. https://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html
A. McCallum and K. Nigam (1998). A comparison of event models for naive Bayes text classification. Proc. AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.
V. Metsis, I. Androutsopoulos and G. Paliouras (2006). Spam filtering with naive Bayes – Which naive Bayes? 3rd Conf. on Email and Anti-Spam (CEAS).
Examples
>>> import cupy as cp
>>> rng = cp.random.RandomState(1)
>>> X = rng.randint(5, size=(6, 100), dtype=cp.int32)
>>> Y = cp.array([1, 2, 3, 4, 4, 5])
>>> from cuml.naive_bayes import BernoulliNB
>>> clf = BernoulliNB()
>>> clf.fit(X, Y)
BernoulliNB()
>>> print(clf.predict(X[2:3]))
[3]
- Attributes
- class_count_ndarray of shape (n_classes)
Number of samples encountered for each class during fitting.
- class_log_prior_ndarray of shape (n_classes)
Log probability of each class (smoothed).
- classes_ndarray of shape (n_classes,)
Class labels known to the classifier
- feature_count_ndarray of shape (n_classes, n_features)
Number of samples encountered for each (class, feature) during fitting.
- feature_log_prob_ndarray of shape (n_classes, n_features)
Empirical log probability of features given a class, P(x_i|y).
- n_features_int
Number of features of each sample.
Methods
get_param_names(): Returns a list of hyperparameter names owned by this class.
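The effect of the binarize threshold, and how binary features enter the Bernoulli likelihood, can be sketched without a GPU. The snippet assumes that values strictly greater than the threshold map to 1 and all others to 0 (the scikit-learn convention). Both helper names are hypothetical and only illustrate the idea; they are not cuML functions.

```python
import math


def binarize(X, threshold=0.0):
    """Map each feature to 1 if it is strictly greater than threshold, else 0."""
    return [[1 if v > threshold else 0 for v in row] for row in X]


def bernoulli_log_likelihood(x_bin, feature_prob):
    """log P(x | y) for one binary sample given per-feature P(x_i = 1 | y)."""
    return sum(
        math.log(p) if x == 1 else math.log(1.0 - p)
        for x, p in zip(x_bin, feature_prob)
    )


X = [[0, 3, 1], [2, 0, 0]]          # occurrence counts
X_bin = binarize(X, threshold=0.0)  # presence/absence indicators
ll = bernoulli_log_likelihood(X_bin[0], [0.5, 0.5, 0.5])
```

Unlike the multinomial model, absent features (zeros) also contribute to the likelihood through the log(1 - p) term.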
- class cuml.naive_bayes.ComplementNB(*, alpha=1.0, fit_prior=True, class_prior=None, norm=False, output_type=None, handle=None, verbose=False)[source]#
The Complement Naive Bayes classifier described in Rennie et al. (2003). The Complement Naive Bayes classifier was designed to correct the “severe assumptions” made by the standard Multinomial Naive Bayes classifier. It is particularly suited for imbalanced data sets.
- Parameters
- alphafloat, default=1.0
Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
- fit_priorbool, default=True
Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
- class_priorarray-like of shape (n_classes,), default=None
Prior probabilities of the classes. If specified, the priors are not adjusted according to the data.
- normbool, default=False
Whether or not a second normalization of the weights is performed. The default behavior mirrors the implementation found in Mahout and Weka, which do not follow the full algorithm described in Table 9 of the paper.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
References
Rennie, J. D., Shih, L., Teevan, J., & Karger, D. R. (2003). Tackling the poor assumptions of naive bayes text classifiers. In ICML (Vol. 3, pp. 616-623). https://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
Examples
>>> import cupy as cp
>>> rng = cp.random.RandomState(1)
>>> X = rng.randint(5, size=(6, 100), dtype=cp.int32)
>>> Y = cp.array([1, 2, 3, 4, 4, 5])
>>> from cuml.naive_bayes import ComplementNB
>>> clf = ComplementNB()
>>> clf.fit(X, Y)
ComplementNB()
>>> print(clf.predict(X[2:3]))
[3]
- Attributes
- class_count_ndarray of shape (n_classes)
Number of samples encountered for each class during fitting.
- class_log_prior_ndarray of shape (n_classes)
Log probability of each class (smoothed).
- classes_ndarray of shape (n_classes,)
Class labels known to the classifier
- feature_count_ndarray of shape (n_classes, n_features)
Number of samples encountered for each (class, feature) during fitting.
- feature_log_prob_ndarray of shape (n_classes, n_features)
Empirical log probability of features given a class, P(x_i|y).
- n_features_int
Number of features of each sample.
Methods
get_param_names(): Returns a list of hyperparameter names owned by this class.
- class cuml.naive_bayes.GaussianNB(*, priors=None, var_smoothing=1e-09, output_type=None, handle=None, verbose=False)[source]#
Gaussian Naive Bayes (GaussianNB). Can perform online updates to model parameters via partial_fit(). For details on the algorithm used to update feature means and variances online, see Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque.
- Parameters
- priorsarray-like of shape (n_classes,)
Prior probabilities of the classes. If specified, the priors are not adjusted according to the data.
- var_smoothingfloat, default=1e-9
Portion of the largest variance of all features that is added to variances for calculation stability.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
Examples
>>> import cupy as cp
>>> X = cp.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1],
...               [3, 2]], cp.float32)
>>> Y = cp.array([1, 1, 1, 2, 2, 2], cp.float32)
>>> from cuml.naive_bayes import GaussianNB
>>> clf = GaussianNB()
>>> clf.fit(X, Y)
GaussianNB()
>>> print(clf.predict(cp.array([[-0.8, -1]], cp.float32)))
[1]
>>> clf_pf = GaussianNB()
>>> clf_pf.partial_fit(X, Y, cp.unique(Y))
GaussianNB()
>>> print(clf_pf.predict(cp.array([[-0.8, -1]], cp.float32)))
[1]
Methods
fit(X, y[, sample_weight]): Fit Gaussian Naive Bayes classifier according to X, y
get_param_names(): Returns a list of hyperparameter names owned by this class.
partial_fit(X, y[, classes, sample_weight]): Incremental fit on a batch of samples.
- fit(X, y, sample_weight=None) GaussianNB [source]#
Fit Gaussian Naive Bayes classifier according to X, y
- Parameters
- X{array-like, cupy sparse matrix} of shape (n_samples, n_features)
Training vectors, where n_samples is the number of samples and n_features is the number of features.
- yarray-like of shape (n_samples)
Target values.
- sample_weightarray-like of shape (n_samples)
Weights applied to individual samples (1. for unweighted). Currently sample weight is ignored.
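The role of var_smoothing can be illustrated with a small standard-library sketch. It assumes the scikit-learn convention that the added term equals var_smoothing times the largest per-feature variance of the training data. The helper name variance_floor and the toy per-class variances are hypothetical, for illustration only.

```python
import statistics


def variance_floor(X, var_smoothing=1e-9):
    """Return the value added to every variance for numerical stability.

    Assumption: epsilon = var_smoothing * (largest per-feature variance of X).
    """
    columns = list(zip(*X))
    return var_smoothing * max(statistics.pvariance(col) for col in columns)


X = [[-1.0, -1.0], [-2.0, -1.0], [-3.0, -2.0],
     [1.0, 1.0], [2.0, 1.0], [3.0, 2.0]]
epsilon = variance_floor(X)

# A feature that is constant within a class has variance 0; adding epsilon
# keeps the Gaussian density well defined.
class_var = [0.0, 0.25]
stabilized = [v + epsilon for v in class_var]
```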
- get_param_names()[source]#
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in turn owns. This is to simplify the implementation of get_params and set_params methods.
- partial_fit(X, y, classes=None, sample_weight=None) GaussianNB [source]#
Incremental fit on a batch of samples. This method is expected to be called several times consecutively on different chunks of a dataset so as to implement out-of-core or online learning. This is especially useful when the whole dataset is too big to fit in memory at once. This method has some performance overhead, so it is better to call partial_fit on chunks of data that are as large as possible (while still fitting in the memory budget) to amortize the overhead.
- Parameters
- X{array-like, cupy sparse matrix} of shape (n_samples, n_features)
Training vectors, where n_samples is the number of samples and n_features is the number of features. A sparse matrix in COO format is preferred, other formats will go through a conversion to COO.
- yarray-like of shape (n_samples)
Target values.
- classesarray-like of shape (n_classes)
List of all the classes that can possibly appear in the y vector. Must be provided at the first call to partial_fit, can be omitted in subsequent calls.
- sample_weightarray-like of shape (n_samples)
Weights applied to individual samples (1. for unweighted). Currently sample weight is ignored.
- Returns
- selfobject
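The online update behind partial_fit can be sketched for a single feature. The function below merges a running (count, mean, variance) with a new batch using the pairwise combination formula from the Chan, Golub, and LeVeque report cited in the class description. The name merge_mean_var is hypothetical, and this pure-Python version only illustrates the arithmetic, not cuML's GPU implementation.

```python
import statistics


def merge_mean_var(n_a, mean_a, var_a, batch):
    """Fold a new batch into running (count, mean, population variance)."""
    n_b = len(batch)
    mean_b = statistics.fmean(batch)
    var_b = statistics.pvariance(batch, mu=mean_b)
    n = n_a + n_b
    delta = mean_b - mean_a
    # Combined mean is the count-weighted average of both means.
    mean = mean_a + delta * n_b / n
    # Combine the sums of squared deviations, plus a correction term
    # that accounts for the shift between the two means.
    m2 = var_a * n_a + var_b * n_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2 / n


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
state = (0, 0.0, 0.0)  # nothing seen yet
for start in range(0, len(data), 3):
    state = merge_mean_var(*state, data[start:start + 3])
```

After all chunks are processed, state holds the same mean and variance that a single pass over the full data would give, which is why chunked training can match a full fit.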
- class cuml.naive_bayes.CategoricalNB(*, alpha=1.0, fit_prior=True, class_prior=None, output_type=None, handle=None, verbose=False)[source]#
Naive Bayes classifier for categorical features.
The categorical Naive Bayes classifier is suitable for classification with discrete features that are categorically distributed. The categories of each feature are drawn from a categorical distribution.
- Parameters
- alphafloat, default=1.0
Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
- fit_priorbool, default=True
Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
- class_priorarray-like of shape (n_classes,), default=None
Prior probabilities of the classes. If specified, the priors are not adjusted according to the data.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
Examples
>>> import cupy as cp
>>> rng = cp.random.RandomState(1)
>>> X = rng.randint(5, size=(6, 100), dtype=cp.int32)
>>> y = cp.array([1, 2, 3, 4, 5, 6])
>>> from cuml.naive_bayes import CategoricalNB
>>> clf = CategoricalNB()
>>> clf.fit(X, y)
CategoricalNB()
>>> print(clf.predict(X[2:3]))
[3]
- Attributes
- category_count_ndarray of shape (n_features, n_classes, n_categories)
With n_categories being the highest category of all the features. This array provides the number of samples encountered for each feature, class and category of the specific feature.
- class_count_ndarray of shape (n_classes,)
Number of samples encountered for each class during fitting.
- class_log_prior_ndarray of shape (n_classes,)
Smoothed empirical log probability for each class.
- classes_ndarray of shape (n_classes,)
Class labels known to the classifier
- feature_log_prob_ndarray of shape (n_features, n_classes, n_categories)
With n_categories being the highest category of all the features. Each array of shape (n_classes, n_categories) provides the empirical log probability of categories given the respective feature and class, P(x_i|y). This attribute is not available when the model has been trained with sparse data.
- n_features_int
Number of features of each sample.
Methods
fit(X, y[, sample_weight]): Fit Naive Bayes classifier according to X, y
partial_fit(X, y[, classes, sample_weight]): Incremental fit on a batch of samples.
- fit(X, y, sample_weight=None) CategoricalNB [source]#
Fit Naive Bayes classifier according to X, y
- Parameters
- Xarray-like of shape (n_samples, n_features)
Training vectors, where n_samples is the number of samples and n_features is the number of features. Here, each feature of X is assumed to be from a different categorical distribution. It is further assumed that all categories of each feature are represented by the numbers 0, …, n - 1, where n refers to the total number of categories for the given feature. This can, for instance, be achieved with the help of OrdinalEncoder.
- yarray-like of shape (n_samples,)
Target values.
- sample_weightarray-like of shape (n_samples), default=None
Weights applied to individual samples (1. for unweighted). Currently sample weight is ignored.
- Returns
- selfobject
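The requirement that each feature's categories be encoded as 0, …, n - 1 can be illustrated with a tiny ordinal encoder. This is a standard-library sketch of what an OrdinalEncoder does for one column. The helper name ordinal_encode is hypothetical, and sorting the categories is an assumption made here so that the codes are deterministic.

```python
def ordinal_encode(column):
    """Map the distinct values of one feature column to codes 0..n-1."""
    categories = sorted(set(column))
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in column], categories


codes, categories = ordinal_encode(["red", "green", "blue", "green"])
# codes is now a list of small integers suitable as one feature column
```

Each column is encoded independently, so the same integer can mean different categories in different features.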
- partial_fit(X, y, classes=None, sample_weight=None) CategoricalNB [source]#
Incremental fit on a batch of samples. This method is expected to be called several times consecutively on different chunks of a dataset so as to implement out-of-core or online learning. This is especially useful when the whole dataset is too big to fit in memory at once. This method has some performance overhead, so it is better to call partial_fit on chunks of data that are as large as possible (while still fitting in the memory budget) to amortize the overhead.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Training vectors, where n_samples is the number of samples and n_features is the number of features. Here, each feature of X is assumed to be from a different categorical distribution. It is further assumed that all categories of each feature are represented by the numbers 0, …, n - 1, where n refers to the total number of categories for the given feature. This can, for instance, be achieved with the help of OrdinalEncoder.
- yarray-like of shape (n_samples)
Target values.
- classesarray-like of shape (n_classes), default=None
List of all the classes that can possibly appear in the y vector. Must be provided at the first call to partial_fit, can be omitted in subsequent calls.
- sample_weightarray-like of shape (n_samples), default=None
Weights applied to individual samples (1. for unweighted). Currently sample weight is ignored.
- Returns
- selfobject
Stochastic Gradient Descent#
- class cuml.SGD(*, loss='squared_loss', penalty='none', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, epochs=1000, tol=0.001, shuffle=True, learning_rate='constant', eta0=0.001, power_t=0.5, batch_size=32, n_iter_no_change=5, handle=None, output_type=None, verbose=False)#
Stochastic Gradient Descent is a very common machine learning algorithm where one optimizes some cost function via gradient steps. This makes SGD very attractive for large problems when the exact solution is hard or even impossible to find.
cuML’s SGD algorithm accepts a numpy matrix or a cuDF DataFrame as the input dataset. The SGD algorithm currently works with linear regression, ridge regression and SVM models.
- Parameters
- loss‘hinge’, ‘log’, ‘squared_loss’ (default = ‘squared_loss’)
‘hinge’ uses linear SVM
‘log’ uses logistic regression
‘squared_loss’ uses linear regression
- penalty‘none’, ‘l1’, ‘l2’, ‘elasticnet’ (default = ‘none’)
‘none’ does not perform any regularization
‘l1’ performs L1 norm (Lasso) regularization, which minimizes the sum of the absolute values of the coefficients
‘l2’ performs L2 norm (Ridge) regularization, which minimizes the sum of the squares of the coefficients
‘elasticnet’ performs Elastic Net regularization, which is a weighted average of the L1 and L2 norms
- alphafloat (default = 0.0001)
The constant value which decides the degree of regularization
- fit_interceptboolean (default = True)
If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.
- epochsint (default = 1000)
The number of times the model should iterate through the entire dataset during training.
- tolfloat (default = 1e-3)
The training process will stop if current_loss > previous_loss - tol
- shuffleboolean (default = True)
True: shuffles the training data after each epoch
False: does not shuffle the training data after each epoch
- eta0float (default = 0.001)
Initial learning rate
- power_tfloat (default = 0.5)
The exponent used for calculating the invscaling learning rate
- batch_sizeint (default=32)
The number of samples to use for each batch.
- learning_rate‘optimal’, ‘constant’, ‘invscaling’, ‘adaptive’ (default = ‘constant’)
‘optimal’ will be supported in the next version
‘constant’ keeps the learning rate constant
‘adaptive’ changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs. The old learning rate is generally divided by 5
- n_iter_no_changeint (default = 5)
The number of epochs to train without any improvement in the model
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
- verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
Examples
>>> import numpy as np
>>> import cudf
>>> from cuml.solvers import SGD as cumlSGD
>>> X = cudf.DataFrame()
>>> X['col1'] = np.array([1,1,2,2], dtype=np.float32)
>>> X['col2'] = np.array([1,2,2,3], dtype=np.float32)
>>> y = cudf.Series(np.array([1, 1, 2, 2], dtype=np.float32))
>>> pred_data = cudf.DataFrame()
>>> pred_data['col1'] = np.asarray([3, 2], dtype=np.float32)
>>> pred_data['col2'] = np.asarray([5, 5], dtype=np.float32)
>>> cu_sgd = cumlSGD(learning_rate='constant', eta0=0.005, epochs=2000,
...                  fit_intercept=True, batch_size=2,
...                  tol=0.0, penalty='none', loss='squared_loss')
>>> cu_sgd.fit(X, y)
SGD()
>>> cu_pred = cu_sgd.predict(pred_data).to_numpy()
>>> print(" cuML intercept : ", cu_sgd.intercept_)
 cuML intercept :  0.00418...
>>> print(" cuML coef : ", cu_sgd.coef_)
 cuML coef :  0    0.9841...
1    0.0097...
dtype: float32
>>> print("cuML predictions : ", cu_pred)
cuML predictions :  [3.0055... 2.0214...]
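The learning_rate options described in the parameters can be sketched as plain functions. This is an illustration under stated assumptions: ‘invscaling’ is taken to mean eta0 / t**power_t (the scikit-learn convention), and ‘adaptive’ divides the rate by 5 after n_iter_no_change epochs in which the loss fails to improve by at least tol. The function names are hypothetical and do not mirror cuML internals.

```python
def invscaling_rate(eta0, t, power_t=0.5):
    """Learning rate at step t >= 1, assuming eta = eta0 / t**power_t."""
    return eta0 / (t ** power_t)


def adaptive_schedule(epoch_losses, eta0=0.001, n_iter_no_change=5, tol=1e-3):
    """Return the learning rate used in each epoch.

    An epoch counts as 'no improvement' when loss > best_loss - tol.
    """
    eta = eta0
    best = float("inf")
    stalled = 0
    rates = []
    for loss in epoch_losses:
        rates.append(eta)
        if loss > best - tol:
            stalled += 1
        else:
            stalled = 0
        best = min(best, loss)
        if stalled >= n_iter_no_change:
            eta /= 5.0   # shrink the step once progress stalls
            stalled = 0
    return rates


rate_at_step_4 = invscaling_rate(0.001, 4, power_t=0.5)
rates = adaptive_schedule([1.0, 0.5, 0.5, 0.5, 0.5], eta0=0.1,
                          n_iter_no_change=2)
```

A constant schedule simply returns eta0 at every step, which is what the example above uses with learning_rate='constant'.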
- Attributes
- classes_
- coef_
Methods
fit(X, y[, convert_dtype]): Fit the model with X and y.
get_param_names(): Returns a list of hyperparameter names owned by this class.
predict(X[, convert_dtype]): Predicts the y for X.
predictClass(X[, convert_dtype]): Predicts the y for X.
- fit(X, y, convert_dtype=True) SGD [source]#
Fit the model with X and y.
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.
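The penalty options that enter the training objective can be written out for a coefficient vector. The sketch below is only an illustration: the exact scaling constants used by cuML are not stated in this reference, so the forms alpha * sum|w|, alpha * sum(w^2), and an l1_ratio-weighted mix of the two are assumptions. The function name penalty_term is hypothetical.

```python
def penalty_term(coef, kind="none", alpha=0.0001, l1_ratio=0.15):
    """Regularization value added to the loss for a given penalty kind."""
    l1 = sum(abs(w) for w in coef)   # L1 norm (Lasso)
    l2 = sum(w * w for w in coef)    # squared L2 norm (Ridge)
    if kind == "none":
        return 0.0
    if kind == "l1":
        return alpha * l1
    if kind == "l2":
        return alpha * l2
    if kind == "elasticnet":
        # Weighted average of both terms, controlled by l1_ratio.
        return alpha * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)
    raise ValueError(f"unknown penalty: {kind!r}")


coef = [1.0, -2.0]
l1_value = penalty_term(coef, "l1", alpha=0.5)
l2_value = penalty_term(coef, "l2", alpha=0.5)
en_value = penalty_term(coef, "elasticnet", alpha=0.5, l1_ratio=0.5)
```

Setting l1_ratio to 1.0 reduces the elastic net term to the L1 penalty, and 0.0 reduces it to the L2 penalty.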
- get_param_names()[source]#
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in turn owns. This is to simplify the implementation of get_params and set_params methods.
- predict(X, convert_dtype=True) CumlArray [source]#
Predicts the y for X.
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
- Returns
- predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Predicted values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
- predictClass(X, convert_dtype=True) CumlArray [source]#
Predicts the y for X.
- Parameters
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the predictClass method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
- Returns
- predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Predicted values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
Random Forest#
- class cuml.ensemble.RandomForestClassifier(*, split_criterion=0, handle=None, verbose=False, output_type=None, **kwargs)#
Implements a Random Forest classifier model which fits multiple decision tree classifiers in an ensemble.
Note
The underlying algorithm for tree node splits differs from that used in scikit-learn. By default, the cuML Random Forest uses a quantile-based algorithm to determine splits, rather than an exact count. You can tune the size of the quantiles with the n_bins parameter.
Note
You can export cuML Random Forest models and run predictions with them on machines without an NVIDIA GPU. See https://docs.rapids.ai/api/cuml/nightly/pickling_cuml_models.html for more details.
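The idea of quantile-based splits can be illustrated with a toy sketch. Instead of testing every distinct feature value as a threshold, only n_bins - 1 quantile boundaries are considered. The function below is a hypothetical simplification written for this illustration; it is not cuML's actual binning code, and the index rule it uses is an assumption.

```python
def quantile_thresholds(values, n_bins):
    """Return candidate split thresholds at evenly spaced quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    thresholds = []
    for b in range(1, n_bins):
        idx = min(n - 1, (b * n) // n_bins)
        thresholds.append(ordered[idx])
    # Repeated values can produce duplicate boundaries; keep each once.
    return sorted(set(thresholds))


feature = list(range(100))
candidates = quantile_thresholds(feature, n_bins=4)
```

With more bins the candidate thresholds track the data more closely, which is why increasing n_bins may improve accuracy on skewed features at the cost of more work per split.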
- Parameters
- n_estimatorsint (default = 100)
Number of trees in the forest. (Default changed to 100 in cuML 0.11)
- split_criterionint or string (default = 0 ('gini'))
The criterion used to split nodes.
0 or 'gini' for gini impurity
1 or 'entropy' for information gain (entropy)
2 or 'mse' for mean squared error
4 or 'poisson' for poisson half deviance
5 or 'gamma' for gamma half deviance
6 or 'inverse_gaussian' for inverse gaussian deviance
Only 0/'gini' and 1/'entropy' are valid for classification.
- bootstrapboolean (default = True)
Control bootstrapping.
If True, each tree in the forest is built on a bootstrapped sample with replacement.
If False, the whole dataset is used to build each tree.
- max_samplesfloat (default = 1.0)
Ratio of dataset rows used while fitting each tree.
- max_depthint (default = 16)
Maximum tree depth. Must be greater than 0. Unlimited depth (i.e., until leaves are pure) is not supported.
Note
This default differs from scikit-learn’s random forest, which defaults to unlimited depth.
- max_leavesint (default = -1)
Maximum leaf nodes per tree. Soft constraint. Unlimited if -1.
- max_featuresint, float, or string (default = ‘sqrt’)
Ratio of number of features (columns) to consider per node split.
If type int, then max_features is the absolute count of features to be used.
If type float, then max_features is used as a fraction.
If 'sqrt', then max_features=1/sqrt(n_features).
If 'log2', then max_features=log2(n_features)/n_features.
Changed in version 24.06: The default of max_features changed from "auto" to "sqrt".
- n_binsint (default = 128)
Maximum number of bins used by the split algorithm per feature. For large problems, particularly those with highly-skewed input data, increasing the number of bins may improve accuracy.
- n_streamsint (default = 4)
Number of parallel streams used for forest building.
- min_samples_leafint or float (default = 1)
The minimum number of samples (rows) in each leaf node.
If type int, then min_samples_leaf represents the minimum number.
If type float, then min_samples_leaf represents a fraction and ceil(min_samples_leaf * n_rows) is the minimum number of samples for each leaf node.
- min_samples_splitint or float (default = 2)
The minimum number of samples required to split an internal node.
If type int, then min_samples_split represents the minimum number.
If type float, then min_samples_split represents a fraction and max(2, ceil(min_samples_split * n_rows)) is the minimum number of samples for each split.
- min_impurity_decreasefloat (default = 0.0)
Minimum decrease in impurity required for node to be split.
- max_batch_sizeint (default = 4096)
Maximum number of nodes that can be processed in a given batch.
- random_stateint (default = None)
Seed for the random number generator. Unseeded by default. Does not currently fully guarantee the exact same results.
- handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
- verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
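The int, float, and string forms of max_features, min_samples_leaf, and min_samples_split follow the formulas given in the parameter descriptions. The sketch below turns those formulas into small helpers so the resolved values are easy to check by hand. The function names are hypothetical and are not part of the cuML API.

```python
import math


def max_features_fraction(max_features, n_features):
    """Resolve max_features to the fraction of columns tried per split."""
    if isinstance(max_features, str):
        if max_features == "sqrt":
            return 1.0 / math.sqrt(n_features)
        if max_features == "log2":
            return math.log2(n_features) / n_features
        raise ValueError(f"unsupported max_features: {max_features!r}")
    if isinstance(max_features, int):
        return max_features / n_features   # absolute count of features
    return float(max_features)             # already a fraction


def min_samples_leaf_count(min_samples_leaf, n_rows):
    """Resolve min_samples_leaf to an absolute number of rows."""
    if isinstance(min_samples_leaf, int):
        return min_samples_leaf
    return math.ceil(min_samples_leaf * n_rows)


def min_samples_split_count(min_samples_split, n_rows):
    """Resolve min_samples_split to an absolute number of rows."""
    if isinstance(min_samples_split, int):
        return min_samples_split
    return max(2, math.ceil(min_samples_split * n_rows))


sqrt_fraction = max_features_fraction("sqrt", 16)
log2_fraction = max_features_fraction("log2", 16)
leaf_rows = min_samples_leaf_count(0.01, 250)
split_rows = min_samples_split_count(0.001, 100)
```

For 16 features, both 'sqrt' and 'log2' resolve to one quarter of the columns, i.e. four features per split.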
Notes
Known Limitations
This is an early release of the cuML Random Forest code. It contains a few known limitations:
GPU-based inference is only supported with 32-bit (float32) datatypes. Alternatives are to use CPU-based inference for 64-bit (float64) datatypes, or let the default automatic datatype conversion occur during GPU inference.
While training the model for multi-class classification problems, using deep trees or max_features=1.0 provides better performance.
For additional docs, see scikit-learn’s RandomForestClassifier.
Examples
>>> import cupy as cp
>>> from cuml.ensemble import RandomForestClassifier as cuRFC
>>> X = cp.random.normal(size=(10,4)).astype(cp.float32)
>>> y = cp.asarray([0,1]*5, dtype=cp.int32)
>>> cuml_model = cuRFC(max_features=1.0,
...                    n_bins=8,
...                    n_estimators=40)
>>> cuml_model.fit(X,y)
RandomForestClassifier()
>>> cuml_predict = cuml_model.predict(X)
>>> print("Predicted labels : ", cuml_predict)
Predicted labels :  [0. 1. 0. 1. 0. 1. 0. 1. 0. 1.]
Methods
convert_to_fil_model([output_class, ...]): Create a Forest Inference (FIL) model from the trained cuML Random Forest model.
convert_to_treelite_model(): Converts the cuML RF model to a Treelite model
fit(X, y[, convert_dtype]): Perform Random Forest Classification on the input data
get_detailed_text(): Obtain the detailed information for the random forest model, as text
get_json(): Export the Random Forest model as a JSON string
get_summary_text(): Obtain the text summary of the random forest model
predict(X[, predict_model, threshold, algo, ...]): Predicts the labels for X.
predict_proba(X[, algo, convert_dtype, ...]): Predicts class probabilities for X.
score(X, y[, threshold, algo, ...]): Calculates the accuracy metric score of the model for X.
- convert_to_fil_model(output_class=True, threshold=0.5, algo='auto', fil_sparse_format='auto')[source]#
Create a Forest Inference (FIL) model from the trained cuML Random Forest model.
- Parameters
- output_classboolean (default = True)
This is optional and required only while performing the predict operation on the GPU. If true, return a 1 or 0 depending on whether the raw prediction exceeds the threshold. If False, just return the raw prediction.
- algostring (default = ‘auto’)
This is optional and required only while performing the predict operation on the GPU.
'naive' - simple inference using shared memory
'tree_reorg' - similar to naive but trees rearranged to be more coalescing-friendly
'batch_tree_reorg' - similar to tree_reorg but predicting multiple rows per thread block
'auto' - choose the algorithm automatically. Currently 'batch_tree_reorg' is used for dense storage and 'naive' for sparse storage
- thresholdfloat (default = 0.5)
Threshold used for classification. Optional and required only while performing the predict operation on the GPU. It is applied if output_class == True, else it is ignored
- fil_sparse_formatboolean or string (default = auto)
This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.
'auto' - choose the storage type automatically (currently True is chosen by auto)
False - create a dense forest
True - create a sparse forest, requires algo=’naive’ or algo=’auto’
- Returns
- fil_model
A Forest Inference model which can be used to perform inferencing on the random forest model.
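The interaction between output_class and threshold described above amounts to a simple post-processing step on the raw predictions. The following sketch shows that step in isolation. The helper to_class_labels is hypothetical and stands in for what happens inside the inference library; it is not a cuML function.

```python
def to_class_labels(raw_predictions, output_class=True, threshold=0.5):
    """Apply the output_class/threshold rule to raw binary predictions."""
    if not output_class:
        # threshold is ignored and raw predictions are returned unchanged
        return list(raw_predictions)
    return [1 if p > threshold else 0 for p in raw_predictions]


raw = [0.2, 0.5, 0.9]
labels = to_class_labels(raw, output_class=True, threshold=0.5)
scores = to_class_labels(raw, output_class=False)
```

A raw prediction equal to the threshold does not exceed it, so it maps to class 0 in this sketch.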
- convert_to_treelite_model()[source]#
Converts the cuML RF model to a Treelite model
- Returns
- tl_to_fil_model
Treelite version of this model
- fit(X, y, convert_dtype=True)[source]#
Perform Random Forest Classification on the input data
- Parameters
- X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix. If the datatype is anything other than float or double, the data is converted to float, which increases memory utilization. Set convert_dtype to False to disable the conversion; the method then raises an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- y : array-like (device or host), shape = (n_samples, 1)
Dense matrix of type np.int32. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtype : bool, optional (default = True)
When set to True, the fit method will automatically convert X to np.float32 and, when necessary, convert y to dtype int32. This increases the memory used by the method.
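Because the conversion performed under convert_dtype=True creates extra copies, one option is to cast the data once before calling fit. The following is a minimal NumPy-only sketch of that preparation step; the toy arrays are made up for illustration, and no cuML call is executed here.

```python
import numpy as np

# Toy data in dtypes that fit() would otherwise have to convert.
X = np.array([[1, 4, 4], [2, 2, 2], [5, 1, 1], [6, 0, 1]], dtype=np.int64)
y = np.array([0, 0, 1, 1], dtype=np.int64)

# Cast once, up front, to the dtypes named in the parameter descriptions.
X32 = X.astype(np.float32)  # features as 32-bit floats
y32 = y.astype(np.int32)    # labels as 32-bit integers

# astype returns new arrays, so the originals and the copies coexist in memory.
print(X.dtype, "->", X32.dtype, "|", X.nbytes, "->", X32.nbytes, "bytes")
print(y.dtype, "->", y32.dtype, "|", y.nbytes, "->", y32.nbytes, "bytes")
```

Arrays prepared this way could then be passed as fit(X32, y32, convert_dtype=False), which, per the description above, raises an error instead of silently converting if a dtype is still unsupported.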
- predict(X, predict_model='GPU', threshold=0.5, algo='auto', convert_dtype=True, fil_sparse_format='auto') → CumlArray [source]#
Predicts the labels for X.
- Parameters
- X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix. If the datatype is anything other than float or double, the data is converted to float, which increases memory utilization. Set convert_dtype to False to disable the conversion; the method then raises an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- predict_model : string (default = 'GPU')
'GPU' to predict using the GPU, 'CPU' otherwise.
- algo : string (default = 'auto')
This is optional and required only while performing the predict