API Reference#

Module Configuration#

Output Data Type Configuration#

cuml.internals.memory_utils.set_global_output_type(output_type)[source]#
Sets the global output type for cuML’s single-GPU estimators. All estimators use it unless they override it at initialization with their own output_type parameter. It can also be overridden temporarily with the context manager using_output_type().

Parameters:

output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’} (default = ‘input’)
Desired output type of results and attributes of the estimators.

'input' will mean that the parameters and methods will mirror the format of the data sent to the estimators/methods as much as possible. Specifically:

Input type	Output type
cuDF DataFrame or Series	cuDF DataFrame or Series
NumPy arrays	NumPy arrays
Pandas DataFrame or Series	NumPy arrays
Numba device arrays	Numba device arrays
CuPy arrays	CuPy arrays
Other __cuda_array_interface__ objs	CuPy arrays

'cudf' will return cuDF Series for single dimensional results and DataFrames for the rest.

'cupy' will return CuPy arrays.

'numpy' will return NumPy arrays.

Notes

'cupy' and 'numba' options (as well as 'input' when using Numba and CuPy ndarrays for input) have the least overhead. 'cudf' adds the memory consumption and processing time needed to build the Series and DataFrames. 'numpy' has the largest overhead because the data must be transferred to CPU memory.

Examples
>>> import cuml
>>> import cupy as cp
>>> ary = [[1.0, 4.0, 4.0], [2.0, 2.0, 2.0], [5.0, 1.0, 1.0]]
>>> ary = cp.asarray(ary)
>>> prev_output_type = cuml.global_settings.output_type
>>> cuml.set_global_output_type('cudf')
>>> dbscan_float = cuml.DBSCAN(eps=1.0, min_samples=1)
>>> dbscan_float.fit(ary)
DBSCAN()
>>>
>>> # cuML output type
>>> dbscan_float.labels_
0    0
1    1
2    2
dtype: int32
>>> type(dbscan_float.labels_)
<class 'cudf.core.series.Series'>
>>> cuml.set_global_output_type(prev_output_type)
cuml.internals.memory_utils.using_output_type(output_type)[source]#
Context manager that sets cuML’s global output type inside a with statement. The output type is reset to its prior value once the with code block has executed.

Parameters:

output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’} (default = ‘input’)
Desired output type of results and attributes of the estimators.

'input' will mean that the parameters and methods will mirror the format of the data sent to the estimators/methods as much as possible. Specifically:

Input type	Output type
cuDF DataFrame or Series	cuDF DataFrame or Series
NumPy arrays	NumPy arrays
Pandas DataFrame or Series	NumPy arrays
Numba device arrays	Numba device arrays
CuPy arrays	CuPy arrays
Other __cuda_array_interface__ objs	CuPy arrays

'cudf' will return cuDF Series for single dimensional results and DataFrames for the rest.

'cupy' will return CuPy arrays.

'numpy' will return NumPy arrays.

Examples
>>> import cuml
>>> import cupy as cp
>>> ary = [[1.0, 4.0, 4.0], [2.0, 2.0, 2.0], [5.0, 1.0, 1.0]]
>>> ary = cp.asarray(ary)
>>> with cuml.using_output_type('cudf'):
...     dbscan_float = cuml.DBSCAN(eps=1.0, min_samples=1)
...     dbscan_float.fit(ary)
...
...     print("cuML output inside 'with' context")
...     print(dbscan_float.labels_)
...     print(type(dbscan_float.labels_))
...
DBSCAN()
cuML output inside 'with' context
0    0
1    1
2    2
dtype: int32
<class 'cudf.core.series.Series'>
>>> # use cuml again outside the context manager
>>> dbscan_float2 = cuml.DBSCAN(eps=1.0, min_samples=1)
>>> dbscan_float2.fit(ary)
DBSCAN()
>>> # cuML default output
>>> dbscan_float2.labels_
array([0, 1, 2], dtype=int32)
>>> isinstance(dbscan_float2.labels_, cp.ndarray)
True
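
The save-and-restore behaviour described for using_output_type can be pictured with a small pure-Python sketch. This is not cuML's implementation: the `_settings` dictionary and the `using_setting` helper are hypothetical names, used only to illustrate how a context manager can put the prior value back when the with block exits, even if it exits with an exception.

```python
from contextlib import contextmanager

# Hypothetical stand-in for a module-level settings object (not cuML's).
_settings = {"output_type": "input"}


@contextmanager
def using_setting(output_type):
    """Temporarily override the global output type inside a with block."""
    previous = _settings["output_type"]
    _settings["output_type"] = output_type
    try:
        yield
    finally:
        # Restore the prior value even if the block raised an exception.
        _settings["output_type"] = previous


with using_setting("cudf"):
    inside = _settings["output_type"]   # overridden value
after = _settings["output_type"]        # restored value
```

Restoring in a `finally` clause is the design choice that keeps the global setting from leaking out of the block when an estimator call fails.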

Verbosity Levels#

cuML follows a verbosity model similar to Scikit-learn’s: The verbose parameter can be a boolean, or a numeric value, and higher numeric values mean more verbosity. The exact values can be set directly, or through the cuml.common.logger module, and they are:

Numeric value	cuml.common.logger value	Verbosity level
0	cuml.common.logger.level_enum.off	Disables all log messages
1	cuml.common.logger.level_enum.critical	Enables only critical messages
2	cuml.common.logger.level_enum.error	Enables all messages up to and including errors.
3	cuml.common.logger.level_enum.warn	Enables all messages up to and including warnings.
4 or False	cuml.common.logger.level_enum.info	Enables all messages up to and including information messages.
5 or True	cuml.common.logger.level_enum.debug	Enables all messages up to and including debug messages.
6	cuml.common.logger.level_enum.trace	Enables all messages up to and including trace messages.
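
The mapping in the table above can be written down as a short lookup. The sketch below is only an illustration of the table, assuming the boolean rows mean False maps to 4 and True maps to 5; `LEVEL_NAMES` and `resolve_verbosity` are hypothetical names and are not part of cuml.common.logger.

```python
# Level names in the order of their numeric values 0..6, as listed in the table.
LEVEL_NAMES = ("off", "critical", "error", "warn", "info", "debug", "trace")


def resolve_verbosity(verbose):
    """Map a bool or int `verbose` value to (numeric value, level name)."""
    # bool must be checked first because bool is a subclass of int in Python.
    if isinstance(verbose, bool):
        value = 5 if verbose else 4
    else:
        value = int(verbose)
    if not 0 <= value < len(LEVEL_NAMES):
        raise ValueError(
            f"verbose must be a bool or an int in [0, 6], got {verbose!r}"
        )
    return value, LEVEL_NAMES[value]
```

For example, `resolve_verbosity(False)` gives the information level and `resolve_verbosity(6)` gives the trace level.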

Preprocessing, Metrics, and Utilities#

Model Selection and Data Splitting#

cuml.model_selection.train_test_split(X, y=None, test_size: float | int | None = None, train_size: float | int | None = None, shuffle: bool = True, random_state: int | RandomState | RandomState | None = None, stratify=None)[source]#
Partitions device data into four collated objects, mimicking Scikit-learn’s train_test_split.

Parameters:

X : cudf.DataFrame or cuda_array_interface compliant device array
Data to split, with shape (n_samples, n_features).

y : str, cudf.Series or cuda_array_interface compliant device array
Set of labels for the data: either a series of shape (n_samples) or the string label of a column in X (if X is a cuDF DataFrame) containing the labels.

train_size : float or int, optional
If a float, the proportion [0, 1] of the data to assign to the training set. If an int, the number of instances to assign to the training set. Defaults to 0.8.

shuffle : bool, optional
Whether to shuffle the inputs before splitting.

random_state : int, CuPy RandomState or NumPy RandomState, optional
If shuffle is true, seeds the generator. Unseeded by default.

stratify : cudf.Series or cuda_array_interface compliant device array, optional
When passed, the input is split using this as the column to stratify on. Default=None.

Returns:

X_train, X_test, y_train, y_test : cudf.DataFrame or array-like objects
Partitioned dataframes if X and y were cuDF objects. If y was provided as a column name, the column was dropped from X. Partitioned numba device arrays if X and y were Numba device arrays. Partitioned CuPy arrays for any other input.

Examples
>>> import cudf
>>> from cuml.model_selection import train_test_split
>>> # Generate some sample data
>>> df = cudf.DataFrame({'x': range(10),
...                      'y': [0, 1] * 5})
>>> print(f'Original data: {df.shape[0]} elements')
Original data: 10 elements
>>> # Suppose we want an 80/20 split
>>> X_train, X_test, y_train, y_test = train_test_split(df, 'y',
...                                                     train_size=0.8)
>>> print(f'X_train: {X_train.shape[0]} elements')
X_train: 8 elements
>>> print(f'X_test: {X_test.shape[0]} elements')
X_test: 2 elements
>>> print(f'y_train: {y_train.shape[0]} elements')
y_train: 8 elements
>>> print(f'y_test: {y_test.shape[0]} elements')
y_test: 2 elements

>>> # Alternatively, if our labels are stored separately
>>> labels = df['y']
>>> df = df.drop(['y'], axis=1)
>>> # we can also do
>>> X_train, X_test, y_train, y_test = train_test_split(df, labels,
...                                                     train_size=0.8)
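
To make the roles of train_size, shuffle and random_state concrete, here is a CPU-only NumPy sketch of a shuffled split. It is an illustration under stated assumptions, not cuML's implementation: `split_sketch` is a hypothetical helper, it ignores stratify, and the way it rounds a fractional train_size may differ from what the library does.

```python
import numpy as np


def split_sketch(X, y, train_size=0.8, shuffle=True, random_state=None):
    """Split X and y into train and test parts along the first axis."""
    n_samples = X.shape[0]
    # A float is read as a proportion, an int as an absolute count.
    if isinstance(train_size, float):
        n_train = int(n_samples * train_size)
    else:
        n_train = int(train_size)
    indices = np.arange(n_samples)
    if shuffle:
        rng = np.random.default_rng(random_state)
        indices = rng.permutation(n_samples)
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    # The same index arrays are applied to X and y so rows stay paired.
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


X = np.arange(10).reshape(-1, 1)
y = np.array([0, 1] * 5)
X_train, X_test, y_train, y_test = split_sketch(
    X, y, train_size=0.8, random_state=0
)
```

With 10 rows and train_size=0.8 this yields 8 training rows and 2 test rows, the same sizes as in the example above.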

Feature and Label Encoding (Single-GPU)#

class cuml.preprocessing.LabelEncoder.LabelEncoder(*, handle_unknown='error', handle=None, verbose=False, output_type=None)[source]#
An nvcategory based implementation of ordinal label encoding

Parameters:

handle_unknown{‘error’, ‘ignore’}, default=’error’
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform or inverse transform, the resulting encoding will be null.

handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

Methods

fit(y)

Fit a LabelEncoder instance to a set of categories.

fit_transform(y)

Simultaneously fit and transform an input

inverse_transform(y)

Revert ordinal label to original label

transform(y)

Transform an input into its categorical keys.

Examples

Converting a categorical implementation to a numerical one
>>> from cudf import DataFrame, Series
>>> from cuml.preprocessing import LabelEncoder
>>> data = DataFrame({'category': ['a', 'b', 'c', 'd']})
>>> # There are two functionally equivalent ways to do this
>>> le = LabelEncoder()
>>> le.fit(data.category)  # le = le.fit(data.category) also works
LabelEncoder()
>>> encoded = le.transform(data.category)
>>> print(encoded)
0    0
1    1
2    2
3    3
dtype: uint8
>>> # This method is preferred
>>> le = LabelEncoder()
>>> encoded = le.fit_transform(data.category)
>>> print(encoded)
0    0
1    1
2    2
3    3
dtype: uint8
>>> # We can assign this to a new column
>>> data = data.assign(encoded=encoded)
>>> print(data.head())
   category  encoded
0         a        0
1         b        1
2         c        2
3         d        3
>>> # We can also encode more data
>>> test_data = Series(['c', 'a'])
>>> encoded = le.transform(test_data)
>>> print(encoded)
0    2
1    0
dtype: uint8
>>> # After training, ordinal labels can be converted back to
>>> # string labels with inverse_transform()
>>> ord_label = Series([0, 0, 1, 2, 1])
>>> str_label = le.inverse_transform(ord_label)
>>> print(str_label)
0    a
1    a
2    b
3    c
4    b
dtype: object
fit(y)[source]#

Fit a LabelEncoder instance to a set of categories.

Parameters:

ycudf.Series, pandas.Series, cupy.ndarray or numpy.ndarray
The target values to encode.

Returns:

selfLabelEncoder

fit_transform(y)[source]#

Simultaneously fit and transform an input

This is functionally equivalent to (but faster than) LabelEncoder().fit(y).transform(y)

inverse_transform(y: Series)[source]#

Revert ordinal label to original label

Parameters:

y : cudf.Series, pandas.Series, cupy.ndarray or numpy.ndarray, dtype=int32
Ordinal labels to be reverted.

Returns:

reverted : the same type as y
Reverted labels.

transform(y)[source]#

Transform an input into its categorical keys.

This is intended for use with small inputs relative to the size of the dataset. For fitting and transforming an entire dataset, prefer fit_transform.

Parameters:

ycudf.Series, pandas.Series, cupy.ndarray or numpy.ndarray
Input keys to be transformed. Its values should match the categories given to fit

Returns:

encodedcudf.Series
The ordinally encoded input series

Raises:

KeyError
if a category appears that was not seen in fit
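
The fit, transform and inverse_transform steps described above amount to building a sorted list of categories and looking values up in it. The following NumPy sketch shows that idea on the CPU, assuming the handle_unknown='error' behaviour; the function names are hypothetical and this is not how cuML implements LabelEncoder on the GPU.

```python
import numpy as np


def fit_classes(y):
    """Return the sorted unique categories seen in y."""
    return np.unique(y)


def transform_labels(classes, y):
    """Map each value in y to its index in the sorted classes array."""
    y = np.asarray(y)
    codes = np.searchsorted(classes, y)
    # searchsorted returns an insertion point, so confirm each value matches.
    codes = np.clip(codes, 0, len(classes) - 1)
    unseen = classes[codes] != y
    if unseen.any():
        raise KeyError(f"unseen categories: {np.unique(y[unseen]).tolist()}")
    return codes


def inverse_transform_labels(classes, codes):
    """Map ordinal codes back to the original categories."""
    return classes[np.asarray(codes)]


classes = fit_classes(np.array(["a", "b", "c", "d"]))
encoded = transform_labels(classes, ["c", "a"])
decoded = inverse_transform_labels(classes, [0, 0, 1, 2, 1])
```

Here `encoded` holds the codes 2 and 0, and `decoded` holds a, a, b, c, b, matching the values shown in the class example.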
class cuml.preprocessing.LabelBinarizer(*, neg_label=0, pos_label=1, sparse_output=False, handle=None, verbose=False, output_type=None)[source]#
A multi-class dummy encoder for labels.

Parameters:

neg_labelinteger (default=0)
label to be used as the negative binary label

pos_labelinteger (default=1)
label to be used as the positive binary label

sparse_outputbool (default=False)
whether to return sparse arrays for transformed output

handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

Attributes:

classes_

Methods

fit(y)

Fit label binarizer

fit_transform(y)

Fit label binarizer and transform multi-class labels to their dummy-encoded representation.

inverse_transform(y, *[, threshold])

Transform binary labels back to original multi-class labels

transform(y)

Transform multi-class labels to their dummy-encoded representation labels.

Examples

Create an array with labels and dummy encode them
>>> import cupy as cp
>>> import cupyx
>>> from cuml.preprocessing import LabelBinarizer

>>> labels = cp.asarray([0, 5, 10, 7, 2, 4, 1, 0, 0, 4, 3, 2, 1],
...                     dtype=cp.int32)

>>> lb = LabelBinarizer()
>>> encoded = lb.fit_transform(labels)
>>> print(str(encoded))
[[1 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 1 0]
[0 0 1 0 0 0 0 0]
[0 0 0 0 1 0 0 0]
[0 1 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0]
[0 0 0 1 0 0 0 0]
[0 0 1 0 0 0 0 0]
[0 1 0 0 0 0 0 0]]
>>> decoded = lb.inverse_transform(encoded)
>>> print(str(decoded))
[ 0  5 10  7  2  4  1  0  0  4  3  2  1]
fit(y) → LabelBinarizer[source]#

Fit label binarizer

Parameters:

y : array of shape [n_samples,] or [n_samples, n_classes]
Target values. A 2-d matrix should contain only 0 and 1 and represents multilabel classification.

Returns:

selfreturns an instance of self.

fit_transform(y) → SparseCumlArray[source]#

Fit label binarizer and transform multi-class labels to their dummy-encoded representation.

Parameters:

yarray of shape [n_samples,] or [n_samples, n_classes]

Returns:

arrarray with encoded labels

inverse_transform(y, *, threshold=None) → CumlArray[source]#

Transform binary labels back to original multi-class labels

Parameters:

yarray of shape [n_samples, n_classes]

threshold : float
This value is currently ignored.

Returns:

arrarray with original labels

transform(y) → SparseCumlArray[source]#

Transform multi-class labels to their dummy-encoded representation labels.

Parameters:

yarray of shape [n_samples,] or [n_samples, n_classes]

Returns:

arrarray with encoded labels
cuml.preprocessing.label_binarize(y, classes, neg_label=0, pos_label=1, sparse_output=False) → SparseCumlArray[source]#

A stateless helper function to dummy encode multi-class labels.

Parameters:

y : array-like of size [n_samples,] or [n_samples, n_classes]

classes : the set of unique classes in the input

neg_label : integer
The negative value for transformed output.

pos_label : integer
The positive value for transformed output.

sparse_output : bool
Whether to return a sparse array.
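
As a rough picture of what dummy encoding with neg_label and pos_label produces, the NumPy sketch below encodes a 1-d label array and decodes it again. It is a dense, CPU-only illustration with hypothetical helper names (`binarize_sketch`, `debinarize_sketch`); it does not cover the 2-d multilabel input or sparse output, and it is not cuML's implementation.

```python
import numpy as np


def binarize_sketch(y, classes, neg_label=0, pos_label=1):
    """Dummy-encode y: one column per class, pos_label where y matches."""
    y = np.asarray(y)
    classes = np.asarray(classes)
    # Broadcasting compares every sample with every class,
    # giving a boolean array of shape (n_samples, n_classes).
    hits = y[:, None] == classes[None, :]
    return np.where(hits, pos_label, neg_label)


def debinarize_sketch(encoded, classes):
    """Recover the labels by picking the class of the largest entry per row."""
    return np.asarray(classes)[np.argmax(encoded, axis=1)]


labels = np.array([0, 5, 10, 7, 2, 4, 1, 0, 0, 4, 3, 2, 1], dtype=np.int32)
classes = np.unique(labels)
encoded = binarize_sketch(labels, classes)
decoded = debinarize_sketch(encoded, classes)
```

The labels are the ones used in the LabelBinarizer example: there are eight distinct classes, so `encoded` has eight columns, and `decoded` equals the original labels.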

class cuml.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse_output=True, dtype=<class 'numpy.float32'>, handle_unknown='error', handle=None, verbose=False, output_type=None)[source]#

Encode categorical features as a one-hot numeric array. The input to this estimator should be a cudf.DataFrame or a cupy.ndarray, denoting the unique values taken on by categorical (discrete) features. The features are encoded using a one-hot (also known as ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse_output parameter).

By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.

Note

A one-hot encoding of y labels should use a LabelBinarizer instead.

Parameters:

categories : ‘auto’, a cupy.ndarray or a cudf.DataFrame, default=’auto’

Categories (unique values) per feature:

‘auto’ : Determine categories automatically from the training data.

DataFrame/ndarray : categories[col] holds the categories expected in the feature col.

drop‘first’, None, a dict or a list, default=None
Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.

None : retain all features (the default).

‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.

dict/list : drop[col] is the category in feature col that should be dropped.

sparse_outputbool, default=True
This feature is not fully supported by cupy yet, causing incorrect values when computing one hot encodings. See cupy/cupy#3223

Added in version 24.06: sparse was renamed to sparse_output

dtype : number type, default=np.float32
Desired datatype of transform’s output.

handle_unknown{‘error’, ‘ignore’}, default=’error’
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

Attributes:

drop_idx_array of shape (n_features,)
drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature. None if all the transformed features will be retained.

Methods

fit(X[, y])

Fit OneHotEncoder to X.

fit_transform(X[, y])

Fit OneHotEncoder to X, then transform X.

get_feature_names([input_features])

Return feature names for output features.

inverse_transform(X)

Convert the data back to the original representation.

transform(X)

Transform X using one-hot encoding.

fit(X, y=None)[source]#

Fit OneHotEncoder to X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yNone
Ignored. This parameter exists for compatibility only.

fit_transform(X, y=None)[source]#

Fit OneHotEncoder to X, then transform X. Equivalent to fit(X).transform(X).

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yNone
Ignored. This parameter exists for compatibility only.

Returns:

X_outsparse matrix if sparse_output=True else a 2-d array
Transformed input.

get_feature_names(input_features=None)[source]#

Return feature names for output features.

Parameters:

input_featureslist of str of shape (n_features,)
String names for input features if available. By default, “x0”, “x1”, … “xn_features” is used.

Returns:

output_feature_namesndarray of shape (n_output_features,)
Array of feature names.

inverse_transform(X)[source]#

Convert the data back to the original representation. In case unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category.

The return type is the same as the type of the input used by the first call to fit on this estimator instance.

Parameters:

Xarray-like or sparse matrix, shape [n_samples, n_encoded_features]
The transformed data.

Returns:

X_trcudf.DataFrame or cupy.ndarray
Inverse transformed array.

transform(X)[source]#

Transform X using one-hot encoding.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

Returns:

X_outsparse matrix if sparse_output=True else a 2-d array
Transformed input.
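
The one-hot scheme and the drop=‘first’ option described for OneHotEncoder can be illustrated on a single feature with NumPy. This is a dense, CPU-only sketch under simplifying assumptions (one 1-d column, categories derived automatically); `one_hot_sketch` is a hypothetical helper and not the cuML implementation.

```python
import numpy as np


def one_hot_sketch(column, drop_first=False, dtype=np.float32):
    """One-hot encode a single 1-d feature; return (categories, matrix)."""
    column = np.asarray(column)
    categories = np.unique(column)  # sorted unique values
    # One binary column per category.
    encoded = (column[:, None] == categories[None, :]).astype(dtype)
    if drop_first:
        # Dropping one column removes the exact collinearity
        # among the indicator columns (each row sums to 1 otherwise).
        encoded = encoded[:, 1:]
    return categories, encoded


cats, full = one_hot_sketch(["b", "a", "c", "a"])
_, reduced = one_hot_sketch(["b", "a", "c", "a"], drop_first=True)
```

In `full` every row sums to one, which is the collinearity the drop parameter is meant to address; in `reduced` the first category is represented by an all-zero row.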
class cuml.preprocessing.TargetEncoder.TargetEncoder(n_folds=4, smooth=0, seed=42, split_method='interleaved', output_type='auto', stat='mean')[source]#
A cuDF-based implementation of target encoding [1], which replaces one or more categorical variables, ‘Xs’, with the average of the corresponding values of the target variable, ‘Y’. The input data is grouped by the columns Xs, and the aggregated mean value of Y in each group is calculated to replace each value of Xs. Several optimizations are applied to prevent label leakage and parallelize the execution.

Parameters:

n_foldsint (default=4)
Default number of folds for fitting training data. To prevent label leakage in fit, we split data into n_folds and encode one fold using the target variables of the remaining folds.

smoothint or float (default=0)
Count of samples to smooth the encoding. 0 means no smoothing.

seedint (default=42)
Random seed

split_method : {‘random’, ‘continuous’, ‘interleaved’, ‘customize’}, (default=’interleaved’)
Method to split the training data into n_folds. ‘random’: random split. ‘continuous’: consecutive samples are grouped into one fold. ‘interleaved’: samples are assigned to each fold in a round-robin way. ‘customize’: custom splitting, by providing a fold_ids array to the fit() or fit_transform() functions.

output_type{‘cupy’, ‘numpy’, ‘auto’}, default = ‘auto’
The data type of output. If ‘auto’, it matches input data.

stat{‘mean’,’var’,’median’}, default = ‘mean’
The statistic used in encoding, mean, variance or median of the target.

Methods

fit(x, y[, fold_ids])

Fit a TargetEncoder instance to a set of categories

fit_transform(x, y[, fold_ids])

Simultaneously fit and transform an input

get_params([deep])

Returns a dict of all params owned by this class.

transform(x)

Transform an input into its categorical keys.

References

[1]
https://maxhalford.github.io/blog/target-encoding/

Examples

Converting a categorical implementation to a numerical one
>>> from cudf import DataFrame, Series
>>> from cuml.preprocessing import TargetEncoder
>>> train = DataFrame({'category': ['a', 'b', 'b', 'a'],
...                    'label': [1, 0, 1, 1]})
>>> test = DataFrame({'category': ['a', 'c', 'b', 'a']})
>>> encoder = TargetEncoder()
>>> train_encoded = encoder.fit_transform(train.category, train.label)
>>> test_encoded = encoder.transform(test.category)
>>> print(train_encoded)
[1. 1. 0. 1.]
>>> print(test_encoded)
[1.   0.75 0.5  1.  ]
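
The numbers in the example above can be reproduced by hand with a small NumPy sketch of out-of-fold mean encoding. The sketch assumes interleaved folds, no smoothing, and a fallback to the global target mean whenever a category has no usable rows; the function names are hypothetical, and the loop-based code only illustrates the idea rather than cuML's GPU implementation.

```python
import numpy as np


def target_encode_fit_transform(x, y, n_folds=4):
    """Out-of-fold mean encoding of training data with interleaved folds."""
    x, y = np.asarray(x), np.asarray(y, dtype=float)
    fold_ids = np.arange(len(x)) % n_folds  # round-robin fold assignment
    global_mean = y.mean()
    encoded = np.empty(len(x))
    for i in range(len(x)):
        # Use only rows of the same category that lie in other folds,
        # so a row never sees its own target value.
        mask = (x == x[i]) & (fold_ids != fold_ids[i])
        encoded[i] = y[mask].mean() if mask.any() else global_mean
    return encoded


def target_encode_transform(x_train, y_train, x_test):
    """Encode test data with per-category means over all training data."""
    x_train = np.asarray(x_train)
    y_train = np.asarray(y_train, dtype=float)
    global_mean = y_train.mean()
    out = []
    for value in np.asarray(x_test):
        mask = x_train == value
        # Categories never seen in training fall back to the global mean.
        out.append(y_train[mask].mean() if mask.any() else global_mean)
    return np.array(out)


train_x, train_y = ["a", "b", "b", "a"], [1, 0, 1, 1]
train_encoded = target_encode_fit_transform(train_x, train_y)
test_encoded = target_encode_transform(train_x, train_y, ["a", "c", "b", "a"])
```

Each training row is encoded from the other folds only, which is how the fold split prevents label leakage; the unseen test category ‘c’ receives the global mean 0.75.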
fit(x, y, fold_ids=None)[source]#

Fit a TargetEncoder instance to a set of categories

Parameters:

x : cudf.Series or cudf.DataFrame or cupy.ndarray
Categories to be encoded. Its elements may or may not be unique.

ycudf.Series or cupy.ndarray
Series containing the target variable.

fold_idscudf.Series or cupy.ndarray
Series containing the indices of the customized folds. Its values should be integers in range [0, N-1] to split data into N folds. If None, fold_ids is generated based on split_method.

Returns:

selfTargetEncoder
A fitted instance of itself to allow method chaining

fit_transform(x, y, fold_ids=None)[source]#

Simultaneously fit and transform an input

This is functionally equivalent to (but faster than) TargetEncoder().fit(x, y).transform(x)

Parameters:

x : cudf.Series or cudf.DataFrame or cupy.ndarray
Categories to be encoded. Its elements may or may not be unique.

ycudf.Series or cupy.ndarray
Series containing the target variable.

fold_idscudf.Series or cupy.ndarray
Series containing the indices of the customized folds. Its values should be integers in range [0, N-1] to split data into N folds. If None, fold_ids is generated based on split_method.

Returns:

encodedcupy.ndarray
The ordinally encoded input series

get_params(deep=False)[source]#

Returns a dict of all params owned by this class.

transform(x)[source]#

Transform an input into its categorical keys.

This is intended for test data. For fitting and transforming the training data, prefer fit_transform.

Parameters:

x : cudf.Series
Input keys to be transformed. Its values do not have to match the categories given to fit.

Returns:

encodedcupy.ndarray
The ordinally encoded input series

Feature Scaling and Normalization (Single-GPU)#

class cuml.preprocessing.MaxAbsScaler(*args, **kwargs)[source]#

Scale each feature by its maximum absolute value.

This estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.

This scaler can also be applied to sparse CSR or CSC matrices.

Parameters:

copyboolean, optional, default is True: Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

Attributes:

scale_ndarray, shape (n_features,): Per feature relative scaling of the data.
max_abs_ndarray, shape (n_features,): Per feature maximum absolute value.
n_samples_seen_int: The number of samples processed by the estimator. Will be reset on new calls to fit, but increments across partial_fit calls.

Methods

`fit`(X[, y])	Compute the maximum absolute value to be used for later scaling.
`inverse_transform`(X)	Scale back the data to the original representation
`partial_fit`(X[, y])	Online computation of max absolute value of X for later scaling.
`transform`(X)	Scale the data

See also

maxabs_scale: Equivalent function without the estimator API.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

Examples

>>> from cuml.preprocessing import MaxAbsScaler
>>> import cupy as cp
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X = cp.array(X)
>>> transformer = MaxAbsScaler().fit(X)
>>> transformer
MaxAbsScaler()
>>> transformer.transform(X)
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
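
The arithmetic behind this example is a per-feature division by the maximum absolute value. The NumPy sketch below reproduces it on the CPU; the guard for all-zero features is an assumption of the sketch, and the variable names are illustrative only.

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Per-feature maximum absolute value (what the docs call max_abs_).
max_abs = np.abs(X).max(axis=0)
# Guard against division by zero for all-zero features
# (an assumption of this sketch).
scale = np.where(max_abs == 0.0, 1.0, max_abs)

X_scaled = X / scale            # forward transform
X_restored = X_scaled * scale   # inverse transform
```

Because the data is only divided and never shifted, zeros stay zeros, which is why the scaler preserves sparsity.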

fit(X, y=None) → MaxAbsScaler[source]#

Compute the maximum absolute value to be used for later scaling.

Parameters:

X{array-like, sparse matrix}, shape [n_samples, n_features]: The data used to compute the per-feature maximum absolute value used for later scaling along the features axis.

inverse_transform(X) → SparseCumlArray[source]#

Scale back the data to the original representation

Parameters:

X{array-like, sparse matrix}: The data that should be transformed back.

partial_fit(X, y=None) → MaxAbsScaler[source]#

Online computation of max absolute value of X for later scaling.

All of X is processed as a single batch. This is intended for cases when fit() is not feasible because n_samples is very large or because X is read from a continuous stream.

Parameters:

X{array-like, sparse matrix}, shape [n_samples, n_features]: The data used to compute the per-feature maximum absolute value used for later scaling along the features axis.
yNone: Ignored.

Returns:

selfobject: Transformer instance.

transform(X) → SparseCumlArray[source]#

Scale the data

Parameters:

X{array-like, sparse matrix}: The data that should be scaled.

class cuml.preprocessing.MinMaxScaler(*args, **kwargs)[source]#

Transform features by scaling each feature to a given range.

This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.

The transformation is given by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

where min, max = feature_range.

This transformation is often used as an alternative to zero mean, unit variance scaling.

Parameters:

feature_rangetuple (min, max), default=(0, 1): Desired range of transformed data.
copybool, default=True: Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

Attributes:

min_ndarray of shape (n_features,): Per feature adjustment for minimum. Equivalent to min - X.min(axis=0) * self.scale_
scale_ndarray of shape (n_features,): Per feature relative scaling of the data. Equivalent to (max - min) / (X.max(axis=0) - X.min(axis=0))
data_min_ndarray of shape (n_features,): Per feature minimum seen in the data
data_max_ndarray of shape (n_features,): Per feature maximum seen in the data
data_range_ndarray of shape (n_features,): Per feature range (data_max_ - data_min_) seen in the data
n_samples_seen_int: The number of samples processed by the estimator. It will be reset on new calls to fit, but increments across partial_fit calls.

Methods

`fit`(X[, y])	Compute the minimum and maximum to be used for later scaling.
`inverse_transform`(X)	Undo the scaling of X according to feature_range.
`partial_fit`(X[, y])	Online computation of min and max on X for later scaling.
`transform`(X)	Scale features of X according to feature_range.

See also

minmax_scale: Equivalent function without the estimator API.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

Examples

>>> from cuml.preprocessing import MinMaxScaler
>>> import cupy as cp
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> data = cp.array(data)
>>> scaler = MinMaxScaler()
>>> print(scaler.fit(data))
MinMaxScaler()
>>> print(scaler.data_max_)
[ 1. 18.]
>>> print(scaler.transform(data))
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
>>> print(scaler.transform(cp.array([[2, 2]])))
[[1.5 0. ]]
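
The two-line formula given for MinMaxScaler collapses into a single affine map, X * scale_ + min_, using the attribute definitions listed above. The NumPy sketch below checks this against the example data on the CPU; the function names are hypothetical, and constant features (where the range is zero) are not handled.

```python
import numpy as np

data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)
lo, hi = 0.0, 1.0                                  # feature_range

data_min, data_max = data.min(axis=0), data.max(axis=0)
scale = (hi - lo) / (data_max - data_min)          # scale_
offset = lo - data_min * scale                     # min_


def minmax_transform(X):
    """Apply X * scale_ + min_, the one-step form of the formula."""
    return X * scale + offset


def minmax_inverse(X_scaled):
    """Undo the scaling."""
    return (X_scaled - offset) / scale


scaled = minmax_transform(data)
outside = minmax_transform(np.array([[2.0, 2.0]]))
```

The value in `outside` shows that points beyond the range seen during fitting are mapped outside feature_range, as in the last line of the example (1.5 for the first feature).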

fit(X, y=None) → MinMaxScaler[source]#

Compute the minimum and maximum to be used for later scaling.

Parameters:

Xarray-like of shape (n_samples, n_features): The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.
yNone: Ignored.

Returns:

selfobject: Fitted scaler.

inverse_transform(X) → CumlArray[source]#

Undo the scaling of X according to feature_range.

Parameters:

Xarray-like of shape (n_samples, n_features): Input data that will be transformed. It cannot be sparse.

Returns:

Xtarray-like of shape (n_samples, n_features): Transformed data.

partial_fit(X, y=None) → MinMaxScaler[source]#

Online computation of min and max on X for later scaling.

All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to very large number of n_samples or because X is read from a continuous stream.
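
The online update can be sketched in pure Python as a running per-feature minimum and maximum. This is an illustration of the idea only; the class name `RunningMinMax` is hypothetical and does not correspond to any cuML object.

```python
class RunningMinMax:
    """Track per-feature min and max across batches (illustrative only)."""

    def __init__(self):
        self.data_min = None
        self.data_max = None
        self.n_samples_seen = 0

    def partial_fit(self, rows):
        cols = list(zip(*rows))
        batch_min = [min(c) for c in cols]
        batch_max = [max(c) for c in cols]
        if self.data_min is None:
            # First batch: adopt its statistics directly.
            self.data_min, self.data_max = batch_min, batch_max
        else:
            # Later batches: merge with the statistics seen so far.
            self.data_min = [min(a, b) for a, b in zip(self.data_min, batch_min)]
            self.data_max = [max(a, b) for a, b in zip(self.data_max, batch_max)]
        self.n_samples_seen += len(rows)
        return self


tracker = RunningMinMax()
tracker.partial_fit([[-1, 2], [-0.5, 6]]).partial_fit([[0, 10], [1, 18]])
```

After both batches the tracked statistics equal those of a single pass over all four rows, and the sample counter has accumulated across calls.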

Parameters:

Xarray-like of shape (n_samples, n_features): The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.
yNone: Ignored.

Returns:

selfobject: Transformer instance.

transform(X) → CumlArray[source]#

Scale features of X according to feature_range.

Parameters:

Xarray-like of shape (n_samples, n_features): Input data that will be transformed.

Returns:

Xtarray-like of shape (n_samples, n_features): Transformed data.

class cuml.preprocessing.Normalizer(*args, **kwargs)[source]#

Normalize samples individually to unit norm.

Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1, l2 or inf) equals one.

This transformer is able to work with both dense numpy arrays and sparse matrices.

Scaling inputs to unit norms is a common operation in text classification and clustering. For instance, the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors and is the base similarity metric for the Vector Space Model commonly used by the Information Retrieval community.

Parameters:

norm‘l1’, ‘l2’, or ‘max’, optional (‘l2’ by default): The norm to use to normalize each non zero sample. If norm=’max’ is used, values will be rescaled by the maximum of the absolute values.
copyboolean, optional, default True: Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

Methods

`fit`(X[, y])	Do nothing and return the estimator unchanged
`transform`(X[, copy])	Scale each non zero row of X to unit norm

See also

normalize: Equivalent function without the estimator API.

Notes

This estimator is stateless (besides constructor parameters). The fit method does nothing but is useful when used in a pipeline.

Examples

>>> from cuml.preprocessing import Normalizer
>>> import cupy as cp
>>> X = [[4, 1, 2, 2],
...      [1, 3, 9, 3],
...      [5, 7, 5, 1]]
>>> X = cp.array(X)
>>> transformer = Normalizer().fit(X)  # fit does nothing.
>>> transformer
Normalizer()
>>> transformer.transform(X)
array([[0.8, 0.2, 0.4, 0.4],
       [0.1, 0.3, 0.9, 0.3],
       [0.5, 0.7, 0.5, 0.1]])
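
The row-wise rescaling can be sketched in pure Python for the three supported norms. This is a hedged illustration of the arithmetic, not cuML's implementation; `normalize_rows` is a hypothetical helper, and it leaves all-zero rows unchanged.

```python
import math


def normalize_rows(rows, norm="l2"):
    """Rescale each non-zero row to unit norm; all-zero rows are left as is."""
    out = []
    for row in rows:
        if norm == "l1":
            length = sum(abs(v) for v in row)
        elif norm == "l2":
            length = math.sqrt(sum(v * v for v in row))
        elif norm == "max":
            length = max(abs(v) for v in row)
        else:
            raise ValueError(f"unsupported norm: {norm!r}")
        out.append([v / length for v in row] if length else list(row))
    return out


X = [[4, 1, 2, 2], [1, 3, 9, 3], [5, 7, 5, 1]]
X_l2 = normalize_rows(X)  # l2 norms of the rows are 5, 10 and 10
```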

fit(X, y=None) → Normalizer[source]#

Do nothing and return the estimator unchanged

This method is just there to implement the usual API and hence work in pipelines.

Parameters:

X{array-like, CSR matrix}

transform(X, copy=None) → SparseCumlArray[source]#

Scale each non zero row of X to unit norm

Parameters:

X{array-like, CSR matrix}, shape [n_samples, n_features]: The data to normalize, row by row.
copybool, optional (default: None): Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

class cuml.preprocessing.RobustScaler(*args, **kwargs)[source]#

Scale features using statistics that are robust to outliers.

This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform method.

Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.

Parameters:

with_centeringboolean, default=True: If True, center the data before scaling. This will cause transform to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
with_scalingboolean, default=True: If True, scale the data to interquartile range.
quantile_rangetuple (q_min, q_max), 0.0 < q_min < q_max < 100.0: Default: (25.0, 75.0) = (1st quartile, 3rd quartile) = IQR. Quantile range used to calculate scale_.
copyboolean, optional, default=True: Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

Attributes:

center_array of floats: The median value for each feature in the training set.
scale_array of floats: The (scaled) interquartile range for each feature in the training set.

Methods

`fit`(X[, y])	Compute the median and quantiles to be used for scaling.
`inverse_transform`(X)	Scale back the data to the original representation
`transform`(X)	Center and scale the data.

See also

robust_scale: Equivalent function without the estimator API.
cuml.decomposition.PCA: Further removes the linear correlation across features with whiten=True.

Examples

>>> from cuml.preprocessing import RobustScaler
>>> import cupy as cp
>>> X = [[ 1., -2.,  2.],
...      [ -2.,  1.,  3.],
...      [ 4.,  1., -2.]]
>>> X = cp.array(X)
>>> transformer = RobustScaler().fit(X)
>>> transformer
RobustScaler()
>>> transformer.transform(X)
array([[ 0. , -2. ,  0. ],
       [-1. ,  0. ,  0.4],
       [ 1. ,  0. , -1.6]])
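
The values above can be reproduced with the standard library. The sketch below assumes the default quantile range (25.0, 75.0) and linear interpolation between order statistics, which matches the example output; the helper names `robust_fit` and `robust_transform` are hypothetical.

```python
import statistics


def robust_fit(rows):
    """Return per-feature (center, scale) = (median, q75 - q25)."""
    center, scale = [], []
    for col in zip(*rows):
        # method="inclusive" interpolates linearly between order statistics.
        q25, _, q75 = statistics.quantiles(col, n=4, method="inclusive")
        center.append(statistics.median(col))
        scale.append(q75 - q25)
    return center, scale


def robust_transform(rows, center, scale):
    """Subtract the median and divide by the interquartile range."""
    return [[(x - c) / s for x, c, s in zip(row, center, scale)] for row in rows]


X = [[1.0, -2.0, 2.0], [-2.0, 1.0, 3.0], [4.0, 1.0, -2.0]]
center, scale = robust_fit(X)
X_scaled = robust_transform(X, center, scale)
```

Here `center` plays the role of center_ and `scale` the role of scale_ for this small dataset.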

fit(X, y=None) → RobustScaler[source]#

Compute the median and quantiles to be used for scaling.

Parameters:

X{array-like, CSC matrix}, shape [n_samples, n_features]: The data used to compute the median and quantiles used for later scaling along the features axis.

inverse_transform(X) → SparseCumlArray[source]#

Scale back the data to the original representation

Parameters:

X{array-like, sparse matrix}: The data used to scale along the specified axis.

transform(X) → SparseCumlArray[source]#

Center and scale the data.

Parameters:

X{array-like, sparse matrix}: The data used to scale along the specified axis.

class cuml.preprocessing.StandardScaler(*args, **kwargs)[source]#

Standardize features by removing the mean and scaling to unit variance

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform().

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data.

Parameters:

copyboolean, optional, default True: Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
with_meanboolean, True by default: If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
with_stdboolean, True by default: If True, scale the data to unit variance (or equivalently, unit standard deviation).

Attributes:

scale_ndarray or None, shape (n_features,): Per feature relative scaling of the data. This is calculated using sqrt(var_). Equal to None when with_std=False.
mean_ndarray or None, shape (n_features,): The mean value for each feature in the training set. Equal to None when with_mean=False.
var_ndarray or None, shape (n_features,): The variance for each feature in the training set. Used to compute scale_. Equal to None when with_std=False.
n_samples_seen_int or array, shape (n_features,): The number of samples processed by the estimator for each feature. If there are no missing samples, n_samples_seen_ will be an integer, otherwise it will be an array. Will be reset on new calls to fit, but increments across partial_fit calls.

Methods

`fit`(X[, y])	Compute the mean and std to be used for later scaling.
`inverse_transform`(X[, copy])	Scale back the data to the original representation
`partial_fit`(X[, y])	Online computation of mean and std on X for later scaling.
`transform`(X[, copy])	Perform standardization by centering and scaling

See also

scale: Equivalent function without the estimator API.
cuml.decomposition.PCA: Further removes the linear correlation across features with ‘whiten=True’.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.

Examples

>>> from cuml.preprocessing import StandardScaler
>>> import cupy as cp
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> data = cp.array(data)
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler()
>>> print(scaler.mean_)
[0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform(cp.array([[2, 2]])))
[[3. 3.]]
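
The standard score z = (x - u) / s used in this example can be sketched with the standard library. The sketch uses the biased standard deviation (ddof=0) described in the Notes; the helper names `standard_fit` and `standard_transform` are hypothetical and zero-variance features are not handled.

```python
import statistics


def standard_fit(rows):
    """Return per-feature (mean, std) using the biased estimator (ddof=0)."""
    cols = list(zip(*rows))
    mean = [statistics.fmean(c) for c in cols]
    std = [statistics.pstdev(c) for c in cols]  # population std, ddof=0
    return mean, std


def standard_transform(rows, mean, std):
    """Apply z = (x - u) / s to every feature."""
    return [[(x - u) / s for x, u, s in zip(row, mean, std)] for row in rows]


data = [[0, 0], [0, 0], [1, 1], [1, 1]]
mean, std = standard_fit(data)
z = standard_transform(data, mean, std)
z_new = standard_transform([[2, 2]], mean, std)
```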

fit(X, y=None) → StandardScaler[source]#

Compute the mean and std to be used for later scaling.

Parameters:

X{array-like, sparse matrix}, shape [n_samples, n_features]: The data used to compute the mean and standard deviation used for later scaling along the features axis.
yNone: Ignored

inverse_transform(X, copy=None) → SparseCumlArray[source]#

Scale back the data to the original representation

Parameters:

X{array-like, sparse matrix}, shape [n_samples, n_features]: The data used to scale along the features axis.
copybool, optional (default: None): Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

Returns:

X_tr{array-like, sparse matrix}, shape [n_samples, n_features]: Transformed array.

partial_fit(X, y=None) → StandardScaler[source]#

Online computation of mean and std on X for later scaling.

All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to very large number of n_samples or because X is read from a continuous stream.

The algorithm for incremental mean and std is given in Equation 1.5a,b in Chan, Tony F., Gene H. Golub, and Randall J. LeVeque. “Algorithms for computing the sample variance: Analysis and recommendations.” The American Statistician 37.3 (1983): 242-247.
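
A batch-merging update in the spirit of that reference can be sketched as follows. The sketch tracks the count, the mean, and the sum of squared deviations, which is algebraically equivalent to the sum-based pairwise form; it is an illustration under that assumption rather than cuML's code, and `batch_stats` and `combine_stats` are hypothetical names.

```python
def batch_stats(values):
    """Compute (count, mean, sum of squared deviations) for one batch."""
    n = len(values)
    mean = sum(values) / n
    return n, mean, sum((v - mean) ** 2 for v in values)


def combine_stats(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    """Merge the statistics of two batches without revisiting the data."""
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


first, second = [0.0, 0.0, 1.0], [1.0, 4.0]
n, mean, m2 = combine_stats(*batch_stats(first), *batch_stats(second))
variance = m2 / n  # biased estimator (ddof=0)
```

The merged result agrees with a single pass over all five values, which is what allows statistics to accumulate across partial_fit calls.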

Parameters:

X{array-like, sparse matrix}, shape [n_samples, n_features]: The data used to compute the mean and standard deviation used for later scaling along the features axis.
yNone: Ignored.

Returns:

selfobject: Transformer instance.

transform(X, copy=None) → SparseCumlArray[source]#

Perform standardization by centering and scaling

Parameters:

X{array-like, sparse matrix}, shape [n_samples, n_features]: The data used to scale along the features axis.
copybool, optional (default: None): Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

cuml.preprocessing.maxabs_scale(X, *, axis=0, copy=True)[source]#

Scale each feature to the [-1, 1] range without breaking the sparsity.

This estimator scales each feature individually such that the maximal absolute value of each feature in the training set will be 1.0.

This scaler can also be applied to sparse CSR or CSC matrices.

Parameters:

X{array-like, sparse matrix}, shape (n_samples, n_features): The data.
axisint (0 by default): axis used to scale along. If 0, independently scale each feature, otherwise (if 1) scale each sample.
copyboolean, optional, default is True: Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

See also

MaxAbsScaler: Performs scaling to the [-1, 1] range using the Transformer API

Notes

NaNs are treated as missing values: disregarded to compute the statistics, and maintained during the data transformation.

cuml.preprocessing.minmax_scale(X, feature_range=(0, 1), *, axis=0, copy=True)[source]#

Transform features by scaling each feature to a given range.

This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.

The transformation is given by (when axis=0):

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

where min, max = feature_range.

The transformation is calculated as (when axis=0):

X_scaled = scale * X + min - X.min(axis=0) * scale
where scale = (max - min) / (X.max(axis=0) - X.min(axis=0))

This transformation is often used as an alternative to zero mean, unit variance scaling.
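
The two formulations above give the same result. The pure-Python sketch below checks this on one feature column with feature_range=(-1, 1); the function names are hypothetical and the column must not be constant.

```python
def scale_two_step(column, feature_range):
    """X_std = (X - min) / (max - min); X_scaled = X_std * (hi - lo) + lo."""
    lo, hi = feature_range
    c_min, c_max = min(column), max(column)
    x_std = [(x - c_min) / (c_max - c_min) for x in column]
    return [v * (hi - lo) + lo for v in x_std]


def scale_one_step(column, feature_range):
    """X_scaled = scale * X + lo - min * scale, with scale = (hi - lo) / range."""
    lo, hi = feature_range
    c_min, c_max = min(column), max(column)
    scale = (hi - lo) / (c_max - c_min)
    return [scale * x + lo - c_min * scale for x in column]


column = [2.0, 6.0, 10.0, 18.0]
two_step = scale_two_step(column, (-1.0, 1.0))
one_step = scale_one_step(column, (-1.0, 1.0))
```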

Parameters:

Xarray-like of shape (n_samples, n_features): The data.
feature_rangetuple (min, max), default=(0, 1): Desired range of transformed data.
axisint, default=0: Axis used to scale along. If 0, independently scale each feature, otherwise (if 1) scale each sample.
copybool, default=True: Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

See also

MinMaxScaler: Performs scaling to a given range using the Transformer API

cuml.preprocessing.normalize(X, norm='l2', *, axis=1, copy=True, return_norm=False)[source]#

Scale input vectors individually to unit norm (vector length).

Parameters:

X{array-like, sparse matrix}, shape [n_samples, n_features]: The data to normalize, element by element. Provide a CSC matrix to normalize on axis 0, and a CSR matrix to normalize on axis 1.
norm‘l1’, ‘l2’, or ‘max’, optional (‘l2’ by default): The norm to use to normalize each non zero sample (or each non-zero feature if axis is 0).
axis0 or 1, optional (1 by default): axis used to normalize the data along. If 1, independently normalize each sample, otherwise (if 0) normalize each feature.
copyboolean, optional, default True: Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
return_normboolean, default False: whether to return the computed norms

Returns:

X{array-like, sparse matrix}, shape [n_samples, n_features]: Normalized input X.
normsarray, shape [n_samples] if axis=1 else [n_features]: An array of norms along given axis for X. When X is sparse, a NotImplementedError will be raised for norm ‘l1’ or ‘l2’.

See also

Normalizer: Performs normalization using the Transformer API

cuml.preprocessing.robust_scale(X, *, axis=0, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)[source]#

Standardize a dataset along any axis

Center to the median and component wise scale according to the interquartile range.

Parameters:

X{array-like, sparse matrix}: The data to center and scale.
axisint (0 by default): axis used to compute the medians and IQR along. If 0, independently scale each feature, otherwise (if 1) scale each sample.
with_centeringboolean, True by default: If True, center the data before scaling.
with_scalingboolean, True by default: If True, scale the data to interquartile range.
quantile_rangetuple (q_min, q_max), 0.0 < q_min < q_max < 100.0: Default: (25.0, 75.0) = (1st quartile, 3rd quartile) = IQR. Quantile range used to calculate scale_.
copyboolean, optional, default is True: Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

See also

RobustScaler: Performs centering and scaling using the Transformer API

Notes

This implementation will refuse to center sparse matrices since it would make them non-sparse and would potentially crash the program with memory exhaustion problems.

Instead, the caller is expected to either explicitly set with_centering=False (in that case, only variance scaling will be performed on the features of the CSR matrix) or to densify the matrix if they expect the materialized dense array to fit in memory.

To avoid memory copy the caller should pass a CSR matrix.

cuml.preprocessing.scale(X, *, axis=0, with_mean=True, with_std=True, copy=True)[source]#

Standardize a dataset along any axis

Center to the mean and component wise scale to unit variance.

Parameters:

X{array-like, sparse matrix}: The data to center and scale.
axisint (0 by default): axis used to compute the means and standard deviations along. If 0, independently standardize each feature, otherwise (if 1) standardize each sample.
with_meanboolean, True by default: If True, center the data before scaling.
with_stdboolean, True by default: If True, scale the data to unit variance (or equivalently, unit standard deviation).
copyboolean, optional, default True: Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

See also

StandardScaler: Performs scaling to unit variance using the Transformer API

Notes

This implementation will refuse to center sparse matrices since it would make them non-sparse and would potentially crash the program with memory exhaustion problems.

Instead, the caller is expected to either explicitly set with_mean=False (in that case, only variance scaling will be performed on the features of the sparse matrix) or to densify the matrix if they expect the materialized dense array to fit in memory.

For optimal processing the caller should pass a CSC matrix.

NaNs are treated as missing values: disregarded to compute the statistics, and maintained during the data transformation.

We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.

Other preprocessing methods (Single-GPU)#

class cuml.preprocessing.Binarizer(*args, **kwargs)[source]#

Binarize data (set feature values to 0 or 1) according to a threshold

Values greater than the threshold map to 1, while values less than or equal to the threshold map to 0. With the default threshold of 0, only positive values map to 1.

Binarization is a common operation on text count data where the analyst can decide to only consider the presence or absence of a feature rather than a quantified number of occurrences for instance.

It can also be used as a pre-processing step for estimators that consider boolean random variables (e.g. modelled using the Bernoulli distribution in a Bayesian setting).

Parameters:

thresholdfloat, optional (0.0 by default): Feature values below or equal to this are replaced by 0, above it by 1. Threshold may not be less than 0 for operations on sparse matrices.
copyboolean, optional, default True: Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

Methods

`fit`(X[, y])	Do nothing and return the estimator unchanged
`transform`(X[, copy])	Binarize each element of X

See also

binarize: Equivalent function without the estimator API.

Notes

If the input is a sparse matrix, only the non-zero values are subject to update by the Binarizer class.

This estimator is stateless (besides constructor parameters). The fit method does nothing but is useful when used in a pipeline.

Examples

>>> from cuml.preprocessing import Binarizer
>>> import cupy as cp
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X = cp.array(X)
>>> transformer = Binarizer().fit(X)  # fit does nothing.
>>> transformer
Binarizer()
>>> transformer.transform(X)
array([[1., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])
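
The thresholding rule is simple enough to sketch in pure Python. The helper `binarize_rows` is hypothetical and only illustrates the strict "greater than" comparison on dense data.

```python
def binarize_rows(rows, threshold=0.0):
    """Map values strictly greater than threshold to 1.0, the rest to 0.0."""
    return [[1.0 if v > threshold else 0.0 for v in row] for row in rows]


X = [[1.0, -1.0, 2.0], [2.0, 0.0, 0.0], [0.0, 1.0, -1.0]]
X_bin = binarize_rows(X)                       # default threshold of 0
X_bin_high = binarize_rows(X, threshold=1.0)   # values equal to 1 map to 0
```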

fit(X, y=None) → Binarizer[source]#

Do nothing and return the estimator unchanged

This method is just there to implement the usual API and hence work in pipelines.

Parameters:

X{array-like, sparse matrix}

transform(X, copy=None) → SparseCumlArray[source]#

Binarize each element of X

Parameters:

X{array-like, sparse matrix}, shape [n_samples, n_features]: The data to binarize, element by element.
copybool: Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

class cuml.preprocessing.FunctionTransformer(*args, **kwargs)[source]#

Constructs a transformer from an arbitrary callable.

A FunctionTransformer forwards its X (and optionally y) arguments to a user-defined function or function object and returns the result of this function. This is useful for stateless transformations such as taking the log of frequencies, doing custom scaling, etc.

Note: If a lambda is used as the function, then the resulting transformer will not be pickleable.

Parameters:

funccallable, default=None: The callable to use for the transformation. This will be passed the same arguments as transform, with args and kwargs forwarded. If func is None, then func will be the identity function.
inverse_funccallable, default=None: The callable to use for the inverse transformation. This will be passed the same arguments as inverse_transform, with args and kwargs forwarded. If inverse_func is None, then inverse_func will be the identity function.
accept_sparsebool, default=False: Indicate that func accepts a sparse matrix as input. Otherwise, if accept_sparse is false, sparse matrix inputs will cause an exception to be raised.
check_inversebool, default=True: Whether to check that func followed by inverse_func leads to the original inputs. It can be used for a sanity check, raising a warning when the condition is not fulfilled.
kw_argsdict, default=None: Dictionary of additional keyword arguments to pass to func.
inv_kw_argsdict, default=None: Dictionary of additional keyword arguments to pass to inverse_func.

Methods

`fit`(X[, y])	Fit transformer by checking X.
`inverse_transform`(X)	Transform X using the inverse function.
`transform`(X)	Transform X using the forward function.

Examples

>>> import cupy as cp
>>> from cuml.preprocessing import FunctionTransformer
>>> transformer = FunctionTransformer(func=cp.log1p)
>>> X = cp.array([[0, 1], [2, 3]])
>>> transformer.transform(X)
array([[0.       , 0.6931...],
       [1.0986..., 1.3862...]])
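
The forwarding behaviour, including the identity default and a func/inverse_func round trip, can be sketched with a small stand-in class. `SimpleFunctionTransformer` is hypothetical: it applies scalar functions element by element to nested lists, whereas the real class passes the whole array to func.

```python
import math


class SimpleFunctionTransformer:
    """Hypothetical stand-in that forwards values to func / inverse_func."""

    def __init__(self, func=None, inverse_func=None):
        # None means "identity", mirroring the documented default.
        self.func = func or (lambda x: x)
        self.inverse_func = inverse_func or (lambda x: x)

    def transform(self, rows):
        return [[self.func(v) for v in row] for row in rows]

    def inverse_transform(self, rows):
        return [[self.inverse_func(v) for v in row] for row in rows]


transformer = SimpleFunctionTransformer(func=math.log1p, inverse_func=math.expm1)
X = [[0, 1], [2, 3]]
Xt = transformer.transform(X)
X_back = transformer.inverse_transform(Xt)  # recovers X up to rounding
```

Comparing `X_back` with `X` is the kind of round-trip check that check_inverse describes.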

fit(X, y=None) → FunctionTransformer[source]#

Fit transformer by checking X.

Parameters:

X{array-like, sparse matrix}, shape (n_samples, n_features): Input array.

Returns:

self

inverse_transform(X) → SparseCumlArray[source]#

Transform X using the inverse function.

Parameters:

X{array-like, sparse matrix}, shape (n_samples, n_features): Input array.

Returns:

X_out{array-like, sparse matrix}, shape (n_samples, n_features): Transformed input.

transform(X) → SparseCumlArray[source]#

Transform X using the forward function.

Parameters:

X{array-like, sparse matrix}, shape (n_samples, n_features): Input array.

Returns:

X_out{array-like, sparse matrix}, shape (n_samples, n_features): Transformed input.

class cuml.preprocessing.KBinsDiscretizer(*args, **kwargs)[source]#

Bin continuous data into intervals.

Parameters:

n_binsint or array-like, shape (n_features,) (default=5)

The number of bins to produce. Raises ValueError if n_bins < 2.

encode{‘onehot’, ‘onehot-dense’, ‘ordinal’}, (default=’onehot’)

Method used to encode the transformed result.

onehot: Encode the transformed result with one-hot encoding and return a sparse matrix. Ignored features are always stacked to the right.
onehot-dense: Encode the transformed result with one-hot encoding and return a dense array. Ignored features are always stacked to the right.
ordinal: Return the bin identifier encoded as an integer value.

strategy{‘uniform’, ‘quantile’, ‘kmeans’}, (default=’quantile’)

Strategy used to define the widths of the bins.

uniform: All bins in each feature have identical widths.
quantile: All bins in each feature have the same number of points.
kmeans: Values in each bin have the same nearest center of a 1D k-means cluster.

Attributes:

n_bins_int array, shape (n_features,): Number of bins per feature. Bins whose widths are too small (i.e., <= 1e-8) are removed with a warning.
bin_edges_array of arrays, shape (n_features, ): The edges of each bin. Contains arrays of varying shapes (n_bins_, ). Ignored features will have empty arrays.

Methods

`fit`(X[, y])	Fit the estimator.
`inverse_transform`(Xt)	Transform discretized data back to original feature space.
`transform`(X)	Discretize the data.

See also

cuml.preprocessing.Binarizer: Class used to bin values as 0 or 1 based on a parameter threshold.

Notes

In bin edges for feature i, the first and last values are used only for inverse_transform. During transform, bin edges are extended to:

np.concatenate([[-np.inf], bin_edges_[i][1:-1], [np.inf]])

You can combine KBinsDiscretizer with cuml.compose.ColumnTransformer if you only want to preprocess part of the features.

KBinsDiscretizer might produce constant features (e.g., when encode = 'onehot' and certain bins do not contain any data). These features can be removed with feature selection algorithms (e.g., sklearn.feature_selection.VarianceThreshold).

Examples

>>> from cuml.preprocessing import KBinsDiscretizer
>>> import cupy as cp
>>> X = [[-2, 1, -4,   -1],
...      [-1, 2, -3, -0.5],
...      [ 0, 3, -2,  0.5],
...      [ 1, 4, -1,    2]]
>>> X = cp.array(X)
>>> est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
>>> est.fit(X)
KBinsDiscretizer(...)
>>> Xt = est.transform(X)
>>> Xt
array([[0, 0, 0, 0],
       [1, 1, 1, 0],
       [2, 2, 2, 1],
       [2, 2, 2, 2]], dtype=int32)

Sometimes it may be useful to convert the data back into the original feature space. The inverse_transform function converts the binned data into the original feature space. Each value will be equal to the mean of the two bin edges.

>>> est.bin_edges_[0]
array([-2., -1.,  0.,  1.])
>>> est.inverse_transform(Xt)
array([[-1.5,  1.5, -3.5, -0.5],
       [-0.5,  2.5, -2.5, -0.5],
       [ 0.5,  3.5, -1.5,  0.5],
       [ 0.5,  3.5, -1.5,  1.5]])
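
For the 'uniform' strategy with ordinal encoding, the binning and its inverse can be sketched in pure Python for a single feature. This is an illustration that reproduces the last column of the example, not cuML's implementation; the helper names are hypothetical.

```python
from bisect import bisect_right


def uniform_edges(column, n_bins):
    """Return n_bins + 1 equally spaced edges spanning the column's range."""
    lo, hi = min(column), max(column)
    return [lo + (hi - lo) * k / n_bins for k in range(n_bins + 1)]


def to_bin(value, edges):
    """Ordinal bin index; only the inner edges are used during transform."""
    return bisect_right(edges[1:-1], value)


def bin_center(index, edges):
    """Midpoint of the two edges of a bin, as used when inverting."""
    return (edges[index] + edges[index + 1]) / 2


column = [-1, -0.5, 0.5, 2]  # last feature of X in the example
edges = uniform_edges(column, n_bins=3)
bins = [to_bin(v, edges) for v in column]
centers = [bin_center(i, edges) for i in bins]
```

Because only the inner edges are consulted in `to_bin`, values below the first edge or above the last one still land in the first or last bin.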

fit(X, y=None) → KBinsDiscretizer[source]#

Fit the estimator.

Parameters:

Xnumeric array-like, shape (n_samples, n_features): Data to be discretized.
yNone: Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

Returns:

self

inverse_transform(Xt) → SparseCumlArray[source]#

Transform discretized data back to original feature space.

Note that this function does not regenerate the original data due to discretization rounding.

Parameters:

Xtnumeric array-like, shape (n_samples, n_features): Transformed data in the binned space.

Returns:

Xinvnumeric array-like: Data in the original feature space.

transform(X) → SparseCumlArray[source]#

Discretize the data.

Parameters:

Xnumeric array-like, shape (n_samples, n_features): Data to be discretized.

Returns:

Xtnumeric array-like or sparse matrix: Data in the binned space.

class cuml.preprocessing.KernelCenterer(*args, **kwargs)[source]#

Center a kernel matrix

Let K(x, z) be a kernel defined by phi(x)^T phi(z), where phi is a function mapping x to a Hilbert space. KernelCenterer centers (i.e., normalizes to have zero mean) the data without explicitly computing phi(x). It is equivalent to centering phi(x) with cuml.preprocessing.StandardScaler(with_std=False).

Attributes:

K_fit_rows_array, shape (n_samples,): Average of each column of kernel matrix
K_fit_all_float: Average of kernel matrix

Methods

`fit`(K[, y])	Fit KernelCenterer
`transform`(K[, copy])	Center kernel matrix.

Examples

>>> import cupy as cp
>>> from cuml.preprocessing import KernelCenterer
>>> from cuml.metrics import pairwise_kernels
>>> X = cp.array([[ 1., -2.,  2.],
...               [ -2.,  1.,  3.],
...               [ 4.,  1., -2.]])
>>> K = pairwise_kernels(X, metric='linear')
>>> K
array([[  9.,   2.,  -2.],
       [  2.,  14., -13.],
       [ -2., -13.,  21.]])
>>> transformer = KernelCenterer().fit(K)
>>> transformer
KernelCenterer()
>>> transformer.transform(K)
array([[  5.,   0.,  -5.],
       [  0.,  14., -14.],
       [ -5., -14.,  19.]])
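
The centered matrix in this example follows from subtracting row and column means and adding back the overall mean. The pure-Python sketch below covers only the case where the same square matrix is used for fitting and transforming; `center_kernel` is a hypothetical helper.

```python
def center_kernel(K):
    """Return K - row_means - col_means + overall_mean for a square K."""
    n = len(K)
    col_means = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    row_means = [sum(row) / n for row in K]
    all_mean = sum(col_means) / n
    return [
        [K[i][j] - row_means[i] - col_means[j] + all_mean for j in range(n)]
        for i in range(n)
    ]


K = [[9.0, 2.0, -2.0], [2.0, 14.0, -13.0], [-2.0, -13.0, 21.0]]
K_centered = center_kernel(K)
```

Every row and column of `K_centered` sums to zero, which is the kernel-space analogue of zero-mean features.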

fit(K, y=None) → KernelCenterer[source]#

Fit KernelCenterer

Parameters:

Knumpy array of shape [n_samples, n_samples]: Kernel matrix.

Returns:

selfreturns an instance of self.

transform(K, copy=True) → CumlArray[source]#

Center kernel matrix.

Parameters:

Knumpy array of shape [n_samples1, n_samples2]: Kernel matrix.
copyboolean, optional, default True: Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

Returns:

K_newnumpy array of shape [n_samples1, n_samples2]

class cuml.preprocessing.MissingIndicator(*args, **kwargs)[source]#

Binary indicators for missing values.

Note that this component typically should not be used in a vanilla Pipeline consisting of transformers and a classifier, but rather could be added using a FeatureUnion or ColumnTransformer.

Parameters:

missing_valuesnumber, string, np.nan (default) or None

The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.

featuresstr, default=None

Whether the imputer mask should represent all or a subset of features.

If “missing-only” (default), the imputer mask will only represent features containing missing values during fit time.
If “all”, the imputer mask will represent all features.

sparseboolean or “auto”, default=None

Whether the imputer mask format should be sparse or dense.

If “auto” (default), the imputer mask will be of same type as input.
If True, the imputer mask will be a sparse matrix.
If False, the imputer mask will be a numpy array.

error_on_newboolean, default=None

If True (default), transform will raise an error when there are features with missing values in transform that have no missing values in fit. This is applicable only when features="missing-only".

Attributes:

features_ndarray, shape (n_missing_features,) or (n_features,): The feature indices which will be returned when calling transform. They are computed during fit. For features='all', it is equal to range(n_features).

Methods

`fit`(X[, y])	Fit the transformer on X.
`fit_transform`(X[, y])	Generate missing values indicator for X.
`transform`(X)	Generate missing values indicator for X.

Examples

>>> import numpy as np
>>> from cuml.preprocessing import MissingIndicator
>>> X1 = np.array([[np.nan, 1, 3],
...                [4, 0, np.nan],
...                [8, 1, 0]])
>>> X2 = np.array([[5, 1, np.nan],
...                [np.nan, 2, 3],
...                [2, 4, 0]])
>>> indicator = MissingIndicator()
>>> indicator.fit(X1)
MissingIndicator()
>>> X2_tr = indicator.transform(X2)
>>> X2_tr
array([[False,  True],
       [ True, False],
       [False, False]])
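
The "missing-only" behaviour in this example, where the mask only covers features that had missing values at fit time, can be sketched in pure Python. The helpers below are hypothetical and treat NaN as the missing value.

```python
import math


def fit_missing_features(rows):
    """Indices of features that contain at least one NaN in the fitted data."""
    return [
        j for j, col in enumerate(zip(*rows))
        if any(math.isnan(v) for v in col)
    ]


def missing_mask(rows, features):
    """Boolean mask restricted to the features selected at fit time."""
    return [[math.isnan(row[j]) for j in features] for row in rows]


nan = float("nan")
X1 = [[nan, 1, 3], [4, 0, nan], [8, 1, 0]]
X2 = [[5, 1, nan], [nan, 2, 3], [2, 4, 0]]
features = fit_missing_features(X1)  # plays the role of features_
mask = missing_mask(X2, features)
```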

fit(X, y=None) → MissingIndicator[source]#

Fit the transformer on X.

Parameters:

X{array-like, sparse matrix}, shape (n_samples, n_features): Input data, where n_samples is the number of samples and n_features is the number of features.

Returns:

selfobject: Returns self.

fit_transform(X, y=None) → SparseCumlArray[source]#

Generate missing values indicator for X.

Parameters:

X{array-like, sparse matrix}, shape (n_samples, n_features): The input data to complete.

Returns:

Xt{ndarray or sparse matrix}, shape (n_samples, n_features) or (n_samples, n_features_with_missing): The missing indicator for input data. The data type of Xt will be boolean.

transform(X) → SparseCumlArray[source]#

Generate missing values indicator for X.

Parameters:

X{array-like, sparse matrix}, shape (n_samples, n_features): The input data to complete.

Returns:

Xt{ndarray or sparse matrix}, shape (n_samples, n_features) or (n_samples, n_features_with_missing): The missing indicator for input data. The data type of Xt will be boolean.

class cuml.preprocessing.PolynomialFeatures(*args, **kwargs)[source]#

Generate polynomial and interaction features.

Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

Parameters:

degreeinteger: The degree of the polynomial features. Default = 2.
interaction_onlyboolean, default = False: If True, only interaction features are produced: features that are products of at most degree distinct input features (so not x[1] ** 2, x[0] * x[2] ** 3, etc.).
include_biasboolean: If True (default), then include a bias column, the feature in which all polynomial powers are zero (i.e. a column of ones - acts as an intercept term in a linear model).
orderstr in {‘C’, ‘F’}, default ‘C’: Order of output array in the dense case. ‘F’ order is faster to compute, but may slow down subsequent estimators.

Attributes:

powers_array, shape (n_output_features, n_input_features): powers_[i, j] is the exponent of the jth input in the ith output.
n_input_features_int: The total number of input features.
n_output_features_int: The total number of polynomial output features. The number of output features is computed by iterating over all suitably sized combinations of input features.
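A minimal CPU sketch of how exponent rows in the style of powers_ can be enumerated from combinations of input features. It uses only the standard library and is an illustration, not cuML's implementation; polynomial_powers and expand are hypothetical helper names.

```python
from itertools import combinations, combinations_with_replacement

def polynomial_powers(n_features, degree=2, interaction_only=False,
                      include_bias=True):
    """Enumerate exponent rows: row[j] is the power of input feature j."""
    comb = combinations if interaction_only else combinations_with_replacement
    start = 0 if include_bias else 1
    powers = []
    for d in range(start, degree + 1):
        for idx in comb(range(n_features), d):
            row = [0] * n_features
            for j in idx:
                row[j] += 1
            powers.append(row)
    return powers

def expand(sample, powers):
    """Evaluate every monomial described by the exponent rows on one sample."""
    out = []
    for row in powers:
        term = 1.0
        for x, p in zip(sample, row):
            term *= x ** p
        out.append(term)
    return out

powers = polynomial_powers(2, degree=2)
features = expand([2, 3], powers)  # [1, a, b, a^2, ab, b^2] for a=2, b=3
interactions = expand([2, 3], polynomial_powers(2, interaction_only=True))
```

Switching from combinations_with_replacement to combinations is what drops the squared terms when interaction_only=True.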

Methods

`fit`(X[, y])	Compute number of output features.
`get_feature_names`([input_features])	Return feature names for output features
`transform`(X)	Transform data to polynomial features

Notes

Be aware that the number of features in the output array scales polynomially in the number of features of the input array, and exponentially in the degree. High degrees can cause overfitting.
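To get a feel for this growth: with include_bias=True and interaction_only=False, the number of monomials of total degree at most d in n variables is the binomial coefficient C(n + d, d). The short sketch below evaluates that count for a few sizes; n_output_features is a hypothetical helper, not part of cuML.

```python
from math import comb

def n_output_features(n_features, degree, include_bias=True):
    """Count all monomials of total degree <= degree (interaction_only=False)."""
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1

counts = {(n, d): n_output_features(n, d)
          for n in (2, 10, 100) for d in (2, 3)}
```

For 100 input features, moving from degree 2 to degree 3 raises the output width from 5151 to 176851 columns.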

Examples

>>> import numpy as np
>>> from cuml.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])
>>> poly = PolynomialFeatures(interaction_only=True)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.],
       [ 1.,  2.,  3.,  6.],
       [ 1.,  4.,  5., 20.]])

fit(X, y=None) → PolynomialFeatures[source]#

Compute number of output features.

Parameters:

Xarray-like, shape (n_samples, n_features): The data.

Returns:

selfinstance

get_feature_names(input_features=None)[source]#

Return feature names for output features

Parameters:

input_featureslist of string, length n_features, optional: String names for input features if available. By default, “x0”, “x1”, …, “xn_features” are used.

Returns:

output_feature_nameslist of string, length n_output_features

transform(X) → SparseCumlArray[source]#

Transform data to polynomial features

Parameters:

X{array-like, sparse matrix}, shape [n_samples, n_features]

The data to transform, row by row.

Prefer CSR over CSC for sparse input (for speed), but CSC is required if the degree is 4 or higher. If the degree is less than 4 and the input format is CSC, it will be converted to CSR, have its polynomial features generated, then converted back to CSC.

If the degree is 2 or 3, the method described in “Leveraging Sparsity to Speed Up Polynomial Feature Expansions of CSR Matrices Using K-Simplex Numbers” by Andrew Nystrom and John Hughes is used, which is much faster than the method used on CSC input. For this reason, a CSC input will be converted to CSR, and the output will be converted back to CSC prior to being returned, hence the preference of CSR.

Returns:

XP{array-like, sparse matrix}, shape [n_samples, NP]: The matrix of features, where NP is the number of polynomial features generated from the combination of inputs.

class cuml.preprocessing.PowerTransformer(*args, **kwargs)[source]#

Apply a power transform featurewise to make data more Gaussian-like.

Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.

Currently, PowerTransformer supports the Box-Cox transform and the Yeo-Johnson transform. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.

Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive and negative data.

By default, zero-mean, unit-variance normalization is applied to the transformed data.

Parameters:

methodstr, (default=’yeo-johnson’)

The power transform method. Available methods are:

‘yeo-johnson’ [1], works with positive and negative values
‘box-cox’ [2], only works with strictly positive values

standardizeboolean, default=True

Set to True to apply zero-mean, unit-variance normalization to the transformed output.

copyboolean, optional, default=True

Set to False to perform inplace computation during transformation.

Attributes:

lambdas_array of float, shape (n_features,): The parameters of the power transformation for the selected features.

Methods

`fit`(X[, y])	Estimate the optimal parameter lambda for each feature.
`fit_transform`(X[, y])	Fit to data, then transform it.
`inverse_transform`(X)	Apply the inverse power transformation using the fitted lambdas.
`transform`(X)	Apply the power transform to each feature using the fitted lambdas.

See also

power_transform: Equivalent function without the estimator API.
QuantileTransformer: Maps data to a standard normal distribution with the parameter output_distribution='normal'.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

References

[1]

I.K. Yeo and R.A. Johnson, “A new family of power transformations to improve normality or symmetry.” Biometrika, 87(4), pp.954-959, (2000).

[2]

G.E.P. Box and D.R. Cox, “An Analysis of Transformations”, Journal of the Royal Statistical Society B, 26, 211-252 (1964).

Examples

>>> import cupy as cp
>>> from cuml.preprocessing import PowerTransformer
>>> pt = PowerTransformer()
>>> data = cp.array([[1, 2], [3, 2], [4, 5]])
>>> print(pt.fit(data))
PowerTransformer()
>>> print(pt.lambdas_)
[ 1.386... -3.100...]
>>> print(pt.transform(data))
[[-1.316... -0.707...]
 [ 0.209... -0.707...]
 [ 1.106...  1.414...]]

fit(X, y=None) → PowerTransformer[source]#

Estimate the optimal parameter lambda for each feature.

The optimal lambda parameter for minimizing skewness is estimated on each feature independently using maximum likelihood.

Parameters:

Xarray-like, shape (n_samples, n_features): The data used to estimate the optimal transformation parameters.
yIgnored

Returns:

selfobject

fit_transform(X, y=None) → CumlArray[source]#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X{array-like, sparse matrix, dataframe} of shape (n_samples, n_features)
yndarray of shape (n_samples,), default=None: Target values.
**fit_paramsdict: Additional fit parameters.

Returns:

X_newndarray of shape (n_samples, n_features_new): Transformed array.

inverse_transform(X) → CumlArray[source]#

Apply the inverse power transformation using the fitted lambdas.

The inverse of the Box-Cox transformation is given by:

if lambda_ == 0:
    X = exp(X_trans)
else:
    X = (X_trans * lambda_ + 1) ** (1 / lambda_)

The inverse of the Yeo-Johnson transformation is given by:

if X >= 0 and lambda_ == 0:
    X = exp(X_trans) - 1
elif X >= 0 and lambda_ != 0:
    X = (X_trans * lambda_ + 1) ** (1 / lambda_) - 1
elif X < 0 and lambda_ != 2:
    X = 1 - (-(2 - lambda_) * X_trans + 1) ** (1 / (2 - lambda_))
elif X < 0 and lambda_ == 2:
    X = 1 - exp(-X_trans)
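The Yeo-Johnson branches above can be checked with a scalar round trip. The sketch below is CPU-only and standard-library-only. The forward formulas are the standard Yeo-Johnson definition from reference [1], which this section does not quote, and the function names are hypothetical. Because the forward transform preserves sign, the inverse can branch on the sign of X_trans instead of X.

```python
import math

def yeo_johnson(x, lmbda):
    """Forward Yeo-Johnson transform of a single value."""
    if x >= 0:
        if lmbda == 0:
            return math.log1p(x)
        return ((x + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log1p(-x)
    return -((1 - x) ** (2 - lmbda) - 1) / (2 - lmbda)

def yeo_johnson_inverse(x_trans, lmbda):
    """Inverse transform, following the four documented branches."""
    if x_trans >= 0:
        if lmbda == 0:
            return math.exp(x_trans) - 1
        return (x_trans * lmbda + 1) ** (1 / lmbda) - 1
    if lmbda == 2:
        return 1 - math.exp(-x_trans)
    return 1 - (-(2 - lmbda) * x_trans + 1) ** (1 / (2 - lmbda))

values = [-2.5, -0.3, 0.0, 0.7, 4.0]
lambdas = [0.0, 0.5, 1.386, 2.0]
# Largest absolute round-trip error over all value/lambda pairs.
max_error = max(
    abs(yeo_johnson_inverse(yeo_johnson(x, lam), lam) - x)
    for x in values for lam in lambdas
)
```

The sketch omits the optional zero-mean, unit-variance standardization, which would have to be undone before applying the inverse.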

Parameters:

Xarray-like, shape (n_samples, n_features): The transformed data.

Returns:

Xarray-like, shape (n_samples, n_features): The original data

transform(X) → CumlArray[source]#

Apply the power transform to each feature using the fitted lambdas.

Parameters:

Xarray-like, shape (n_samples, n_features): The data to be transformed using a power transformation.

Returns:

X_transarray-like, shape (n_samples, n_features): The transformed data.

class cuml.preprocessing.QuantileTransformer(*args, **kwargs)[source]#

Transform features using quantiles information.

This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.

The transformation is applied on each feature independently. First an estimate of the cumulative distribution function of a feature is used to map the original values to a uniform distribution. The obtained values are then mapped to the desired output distribution using the associated quantile function. Features values of new/unseen data that fall below or above the fitted range will be mapped to the bounds of the output distribution. Note that this transform is non-linear. It may distort linear correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.
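The two-step mapping described above can be sketched for a single feature with the standard library. This is a simplified illustration, not cuML's implementation. It assumes at least two quantiles and does not treat repeated values specially. The function names and the eps clipping constant are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist

def fit_quantiles(column, n_quantiles):
    """Return (references, quantiles) for one feature; needs n_quantiles >= 2."""
    data = sorted(column)
    n_quantiles = min(n_quantiles, len(data))
    references = [i / (n_quantiles - 1) for i in range(n_quantiles)]
    quantiles = []
    for q in references:
        pos = q * (len(data) - 1)          # fractional index into sorted data
        lo = int(pos)
        hi = min(lo + 1, len(data) - 1)
        quantiles.append(data[lo] + (pos - lo) * (data[hi] - data[lo]))
    return references, quantiles

def to_uniform(x, references, quantiles):
    """Map x through the piecewise-linear CDF estimate, clipping to [0, 1]."""
    if x <= quantiles[0]:
        return 0.0
    if x >= quantiles[-1]:
        return 1.0
    i = bisect_right(quantiles, x) - 1     # quantiles[i] <= x < quantiles[i + 1]
    frac = (x - quantiles[i]) / (quantiles[i + 1] - quantiles[i])
    return references[i] + frac * (references[i + 1] - references[i])

def to_normal(u, eps=1e-7):
    """Apply the normal quantile function, keeping u away from 0 and 1."""
    return NormalDist().inv_cdf(min(max(u, eps), 1 - eps))

references, quantiles = fit_quantiles([5, 1, 4, 2, 3], n_quantiles=5)
inside = to_uniform(1.5, references, quantiles)
below = to_uniform(-10, references, quantiles)   # below the fitted range
above = to_uniform(99, references, quantiles)    # above the fitted range
```

Values outside the fitted range land on the bounds 0 and 1, which is the clipping behavior described for new or unseen data.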

Parameters:

n_quantilesint, optional (default=1000 or n_samples): Number of quantiles to be computed. It corresponds to the number of landmarks used to discretize the cumulative distribution function. If n_quantiles is larger than the number of samples, n_quantiles is set to the number of samples as a larger number of quantiles does not give a better approximation of the cumulative distribution function estimator.
output_distributionstr, optional (default=’uniform’): Marginal distribution for the transformed data. The choices are ‘uniform’ (default) or ‘normal’.
ignore_implicit_zerosbool, optional (default=False): Only applies to sparse matrices. If True, the sparse entries of the matrix are discarded to compute the quantile statistics. If False, these entries are treated as zeros.
subsampleint, optional (default=1e5): Maximum number of samples used to estimate the quantiles for computational efficiency. Note that the subsampling procedure may differ for value-identical sparse and dense matrices.
random_stateint, RandomState instance or None, optional (default=None): Determines random number generation for subsampling and smoothing noise. Please see subsample for more details. Pass an int for reproducible results across multiple function calls. See Glossary
copyboolean, optional, (default=True): Set to False to perform inplace transformation and avoid a copy (if the input is already a numpy array).

Attributes:

n_quantiles_integer: The actual number of quantiles used to discretize the cumulative distribution function.
quantiles_ndarray, shape (n_quantiles, n_features): The values corresponding to the quantiles of reference.
references_ndarray, shape (n_quantiles,): Quantiles of references.

Methods

`fit`(X[, y])	Compute the quantiles used for transforming.
`inverse_transform`(X)	Back-projection to the original space.
`transform`(X)	Feature-wise transformation of the data.

See also

quantile_transform: Equivalent function without the estimator API.
PowerTransformer: Perform mapping to a normal distribution using a power transform.
StandardScaler: Perform standardization that is faster, but less robust to outliers.
RobustScaler: Perform robust standardization that removes the influence of outliers but does not put outliers and inliers on the same scale.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

Examples

>>> import cupy as cp
>>> from cuml.preprocessing import QuantileTransformer
>>> rng = cp.random.RandomState(0)
>>> X = cp.sort(rng.normal(loc=0.5, scale=0.25, size=(25, 1)), axis=0)
>>> qt = QuantileTransformer(n_quantiles=10, random_state=0)
>>> qt.fit_transform(X)
array([...])

fit(X, y=None) → QuantileTransformer[source]#

Compute the quantiles used for transforming.

Parameters:

Xndarray or sparse matrix, shape (n_samples, n_features): The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.

Returns:

selfobject

inverse_transform(X) → SparseCumlArray[source]#

Back-projection to the original space.

Parameters:

Xndarray or sparse matrix, shape (n_samples, n_features): The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.

Returns:

Xtndarray or sparse matrix, shape (n_samples, n_features): The projected data.

transform(X) → SparseCumlArray[source]#

Feature-wise transformation of the data.

Parameters:

Xndarray or sparse matrix, shape (n_samples, n_features): The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.

Returns:

Xtndarray or sparse matrix, shape (n_samples, n_features): The projected data.

class cuml.preprocessing.SimpleImputer(*args, **kwargs)[source]#

Imputation transformer for completing missing values.

Parameters:

missing_valuesnumber, string, np.nan (default) or None

The placeholder for the missing values. All occurrences of missing_values will be imputed.

strategystring, default=’mean’

The imputation strategy.

If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
If “most_frequent”, then replace missing values using the most frequent value along each column. Can be used with strings or numeric data.
If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.

fill_valuestring or numerical value, default=None

When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.

verboseinteger, default=0

Controls the verbosity of the imputer.

copyboolean, default=True

If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:

If X is not an array of floating values;
If X is encoded as a CSR matrix;
If add_indicator=True.

add_indicatorboolean, default=False

If True, a MissingIndicator transform will stack onto output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.

Attributes:

statistics_array of shape (n_features,): The imputation fill value for each feature. Computing statistics can result in np.nan values. During transform(), features corresponding to np.nan statistics will be discarded.
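A minimal CPU sketch of the "mean" strategy, showing how per-column statistics are computed from the observed entries and how a column whose statistic is NaN is dropped during transform. It uses plain Python lists and is not cuML's implementation; fit_mean and impute are hypothetical names.

```python
import math

def fit_mean(X):
    """Per-column mean over the non-missing entries (like statistics_)."""
    stats = []
    for j in range(len(X[0])):
        observed = [row[j] for row in X if not math.isnan(row[j])]
        stats.append(sum(observed) / len(observed) if observed else math.nan)
    return stats

def impute(X, statistics):
    """Fill NaN entries with the column statistic; drop columns without one."""
    keep = [j for j, s in enumerate(statistics) if not math.isnan(s)]
    return [[statistics[j] if math.isnan(row[j]) else row[j] for j in keep]
            for row in X]

nan = math.nan
statistics = fit_mean([[7, 2, 3], [4, nan, 6], [10, 5, 9]])
filled = impute([[nan, 2, 3], [4, nan, 6], [10, nan, 9]], statistics)

# A column that was entirely missing at fit time has no usable statistic.
empty_stats = fit_mean([[nan, 1.0]])
dropped = impute([[nan, 2.0]], empty_stats)
```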

Methods

`fit`(X[, y])	Fit the imputer on X.
`transform`(X)	Impute all missing values in X.

See also

IterativeImputer: Multivariate imputation of missing values.

Notes

Columns which only contained missing values at fit() are discarded upon transform() if strategy is not “constant”.

Examples

>>> import cupy as cp
>>> from cuml.preprocessing import SimpleImputer
>>> imp_mean = SimpleImputer(missing_values=cp.nan, strategy='mean')
>>> imp_mean.fit(cp.asarray([[7, 2, 3], [4, cp.nan, 6], [10, 5, 9]]))
SimpleImputer()
>>> X = [[cp.nan, 2, 3], [4, cp.nan, 6], [10, cp.nan, 9]]
>>> print(imp_mean.transform(cp.asarray(X)))
[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]

fit(X, y=None) → SimpleImputer[source]#

Fit the imputer on X.

Parameters:

X{array-like, sparse matrix}, shape (n_samples, n_features): Input data, where n_samples is the number of samples and n_features is the number of features.

Returns:

selfSimpleImputer

transform(X) → SparseCumlArray[source]#

Impute all missing values in X.

Parameters:

X{array-like, sparse matrix}, shape (n_samples, n_features): The input data to complete.

cuml.preprocessing.add_dummy_feature(X, value=1.0)[source]#

Augment dataset with an additional dummy feature.

This is useful for fitting an intercept term with implementations which cannot otherwise fit it directly.

Parameters:

X{array-like, sparse matrix}, shape [n_samples, n_features]: Data.
valuefloat: Value to use for the dummy feature.

Returns:

X{array, sparse matrix}, shape [n_samples, n_features + 1]: Same data with dummy feature added as first column.

Examples

>>> from cuml.preprocessing import add_dummy_feature
>>> import cupy as cp
>>> add_dummy_feature(cp.array([[0, 1], [1, 0]]))
array([[1., 0., 1.],
       [1., 1., 0.]])

cuml.preprocessing.binarize(X, *, threshold=0.0, copy=True)[source]#

Boolean thresholding of array-like or sparse matrix

Parameters:

X{array-like, sparse matrix}, shape [n_samples, n_features]: The data to binarize, element by element.
thresholdfloat, optional (0.0 by default): Feature values below or equal to this are replaced by 0, above it by 1. Threshold may not be less than 0 for operations on sparse matrices.
copyboolean, optional, default True: Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.
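The thresholding rule can be sketched for dense data with plain Python lists. This is a CPU illustration of the documented rule, not a call into cuML; binarize_dense is a hypothetical name.

```python
def binarize_dense(X, threshold=0.0):
    """Values strictly above threshold become 1, all others become 0."""
    return [[1.0 if v > threshold else 0.0 for v in row] for row in X]

result = binarize_dense([[1.0, -1.0, 2.0],
                         [2.0, 0.0, 0.0],
                         [0.0, 1.0, -1.0]])

# With a negative threshold, zeros map to 1.
zeros_become_one = binarize_dense([[0.0]], threshold=-0.5)
```

The second call suggests why a negative threshold is rejected for sparse input: every implicit zero would turn into a stored 1, so the result would no longer be sparse.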

See also

Binarizer: Performs binarization using the Transformer API

class cuml.compose.ColumnTransformer(*args, **kwargs)[source]#

Applies transformers to columns of an array or dataframe.

This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

Parameters:

transformerslist of tuples

List of (name, transformer, columns) tuples specifying the transformer objects to be applied to subsets of the data:

namestr
Like in Pipeline and FeatureUnion, this allows the transformer and its parameters to be set using set_params and searched in grid search.
transformer{‘drop’, ‘passthrough’} or estimator
Estimator must support fit and transform. Special-cased strings ‘drop’ and ‘passthrough’ are accepted as well, to indicate to drop the columns or to pass them through untransformed, respectively.
columnsstr, array-like of str, int, array-like of int, array-like of bool, slice or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.

remainder{‘drop’, ‘passthrough’} or estimator, default=’drop’

By default, only the specified columns in transformers are transformed and combined in the output, and the non-specified columns are dropped (the default, 'drop'). By specifying remainder='passthrough', all remaining columns that were not specified in transformers will be automatically passed through. This subset of columns is concatenated with the output of the transformers. By setting remainder to be an estimator, the remaining non-specified columns will use the remainder estimator. The estimator must support fit and transform. Note that using this feature requires that the DataFrame columns input at fit and transform have identical order.

sparse_thresholdfloat, default=0.3

If the output of the different transformers contains sparse matrices, these will be stacked as a sparse matrix if the overall density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all dense data, the stacked result will be dense, and this keyword will be ignored.
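One plausible reading of this rule is sketched below. It assumes that density is the number of stored entries divided by the total number of entries across all transformer outputs, with dense outputs counted as fully stored. That assumption follows the wording above and has not been checked against cuML's source; stack_as_sparse and the tuple layout are hypothetical.

```python
def stack_as_sparse(blocks, sparse_threshold=0.3):
    """Decide whether the stacked output would be sparse.

    blocks: list of (n_stored, n_rows, n_cols, is_sparse) summaries,
    one per transformer output.
    """
    if not any(is_sparse for *_, is_sparse in blocks):
        return False  # all-dense outputs: the keyword is ignored
    stored = sum(b[0] for b in blocks)
    total = sum(b[1] * b[2] for b in blocks)
    return stored / total < sparse_threshold

# A 10x10 sparse block with 10 stored values plus a dense 10x1 block.
mixed = [(10, 10, 10, True), (10, 10, 1, False)]
sparse_result = stack_as_sparse(mixed)             # density 20 / 110
always_dense = stack_as_sparse(mixed, sparse_threshold=0)
all_dense = stack_as_sparse([(10, 10, 1, False)])
```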

n_jobsint, default=None

Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

transformer_weightsdict, default=None

Multiplicative weights for features per transformer. The output of the transformer is multiplied by these weights. Keys are transformer names, values the weights.

verbosebool, default=False

If True, the time elapsed while fitting each transformer will be printed as it is completed.

Attributes:

transformers_list: The collection of fitted transformers as tuples of (name, fitted_transformer, column). fitted_transformer can be an estimator, ‘drop’, or ‘passthrough’. In case there were no columns selected, this will be the unfitted transformer. If there are remaining columns, the final element is a tuple of the form: (‘remainder’, transformer, remaining_columns) corresponding to the remainder parameter. If there are remaining columns, then len(transformers_)==len(transformers)+1, otherwise len(transformers_)==len(transformers).
named_transformers_Bunch: Access the fitted transformer by name.
sparse_output_bool: Boolean flag indicating whether the output of transform is a sparse matrix or a dense numpy array, which depends on the output of the individual transformers and the sparse_threshold keyword.

See also

make_column_transformer: Convenience function for combining the outputs of multiple transformer objects applied to column subsets of the original feature space.
make_column_selector: Convenience function for selecting columns based on data type or on column names matched by a regex pattern.

Notes

The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
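The column ordering described above can be sketched with plain Python lists: each transformer receives its own column subset, outputs are concatenated in list order, and passthrough columns are appended on the right. This is a CPU illustration with hypothetical helpers (l1_normalize, column_transform), not cuML's implementation.

```python
def l1_normalize(rows):
    """Scale each row so its absolute values sum to 1 (zero rows unchanged)."""
    out = []
    for row in rows:
        norm = sum(abs(v) for v in row)
        out.append([v / norm if norm else v for v in row])
    return out

def column_transform(X, transformers, remainder="drop"):
    """Apply (name, func, columns) triples and concatenate in list order."""
    used, blocks = set(), []
    for _name, func, cols in transformers:
        used.update(cols)
        blocks.append(func([[row[j] for j in cols] for row in X]))
    if remainder == "passthrough":
        rest = [j for j in range(len(X[0])) if j not in used]
        blocks.append([[row[j] for j in rest] for row in X])
    # Horizontally stack the per-transformer blocks row by row.
    return [sum((block[i] for block in blocks), []) for i in range(len(X))]

X = [[0.0, 1.0, 2.0, 2.0],
     [1.0, 1.0, 0.0, 1.0]]
both = column_transform(X, [("norm1", l1_normalize, [0, 1]),
                            ("norm2", l1_normalize, [2, 3])])
with_rest = column_transform(X, [("norm", l1_normalize, [2, 3])],
                             remainder="passthrough")
```

In with_rest, the untransformed columns 0 and 1 appear after the normalized columns 2 and 3, even though they come first in the input.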

Examples

>>> import cupy as cp
>>> from cuml.compose import ColumnTransformer
>>> from cuml.preprocessing import Normalizer
>>> ct = ColumnTransformer(
...     [("norm1", Normalizer(norm='l1'), [0, 1]),
...      ("norm2", Normalizer(norm='l1'), slice(2, 4))])
>>> X = cp.array([[0., 1., 2., 2.],
...               [1., 1., 0., 1.]])
>>> # Normalizer scales each row of X to unit norm. A separate scaling
>>> # is applied for the two first and two last elements of each
>>> # row independently.
>>> ct.fit_transform(X)
array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])

fit(X, y=None) → ColumnTransformer[source]#

Fit all transformers using X.

Parameters:

X{array-like, dataframe} of shape (n_samples, n_features): Input data, of which specified subsets are used to fit the transformers.
yarray-like of shape (n_samples,…), default=None: Targets for supervised learning.

Returns:

selfColumnTransformer: This estimator

fit_transform(X, y=None) → SparseCumlArray[source]#

Fit all transformers, transform the data and concatenate results.

Parameters:

X{array-like, dataframe} of shape (n_samples, n_features): Input data, of which specified subsets are used to fit the transformers.
yarray-like of shape (n_samples,), default=None: Targets for supervised learning.

Returns:

X_t{array-like, sparse matrix} of shape (n_samples, sum_n_components): hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.

get_feature_names()[source]#

Get feature names from all transformers.

Returns:

feature_nameslist of strings: Names of the features produced by transform.

get_params(deep=True)[source]#

Get parameters for this estimator.

Returns the parameters given in the constructor as well as the estimators contained within the transformers of the ColumnTransformer.

Parameters:

deepbool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

paramsdict: Parameter names mapped to their values.

property named_transformers_#

Access the fitted transformer by name.

Read-only attribute to access any transformer by given name. Keys are transformer names and values are the fitted transformer objects.

set_params(**kwargs)[source]#

Set the parameters of this estimator.

Valid parameter keys can be listed with get_params(). Note that you can directly set the parameters of the estimators contained in transformers of ColumnTransformer.

Returns:

self

transform(X) → SparseCumlArray[source]#

Transform X separately by each transformer, concatenate results.

Parameters:

X{array-like, dataframe} of shape (n_samples, n_features): The data to be transformed by subset.

Returns:

X_t{array-like, sparse matrix} of shape (n_samples, sum_n_components): hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.

class cuml.compose.make_column_selector(pattern=None, *, dtype_include=None, dtype_exclude=None)[source]#

Create a callable to select columns to be used with ColumnTransformer.

make_column_selector() can select columns based on data type or on column names matched by a regex. When using multiple selection criteria, all criteria must match for a column to be selected.

Parameters:

patternstr, default=None: Columns whose names contain this regex pattern will be included. If None, columns will not be selected based on pattern.
dtype_includecolumn dtype or list of column dtypes, default=None: A selection of dtypes to include. For more details, see pandas.DataFrame.select_dtypes().
dtype_excludecolumn dtype or list of column dtypes, default=None: A selection of dtypes to exclude. For more details, see pandas.DataFrame.select_dtypes().

Returns:

selectorcallable: Callable for column selection to be used by a ColumnTransformer.
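The "all criteria must match" rule can be sketched as follows. The sketch works on a plain mapping of column names to dtype labels. The real selector operates on a DataFrame and resolves dtypes as described for select_dtypes, so the string comparison here is a simplifying assumption, and select_columns is a hypothetical name.

```python
import re

def select_columns(dtypes, pattern=None, dtype_include=None,
                   dtype_exclude=None):
    """dtypes: mapping of column name -> dtype label (plain strings here)."""
    selected = []
    for name, dtype in dtypes.items():
        if pattern is not None and not re.search(pattern, name):
            continue  # name does not contain the regex pattern
        if dtype_include is not None and dtype not in dtype_include:
            continue
        if dtype_exclude is not None and dtype in dtype_exclude:
            continue
        selected.append(name)
    return selected

dtypes = {"city": "object", "rating": "int64",
          "rating_count": "int64", "city_code": "int64"}
numeric = select_columns(dtypes, dtype_include={"int64"})
rating_only = select_columns(dtypes, pattern="^rating",
                             dtype_include={"int64"})
city_numeric = select_columns(dtypes, pattern="city",
                              dtype_exclude={"object"})
```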

See also

ColumnTransformer: Class that allows combining the outputs of multiple transformer objects used on column subsets of the data into a single feature space.

Examples

>>> from cuml.preprocessing import StandardScaler, OneHotEncoder
>>> from cuml.compose import make_column_transformer
>>> from cuml.compose import make_column_selector
>>> import cupy as cp
>>> import cudf
>>> X = cudf.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw'],
...                    'rating': [5, 3, 4, 5]})
>>> ct = make_column_transformer(
...       (StandardScaler(),
...        make_column_selector(dtype_include=cp.number)),  # rating
...       (OneHotEncoder(),
...        make_column_selector(dtype_include=object)))  # city
>>> ct.fit_transform(X)
array([[ 0.90453403,  1.        ,  0.        ,  0.        ],
       [-1.50755672,  1.        ,  0.        ,  0.        ],
       [-0.30151134,  0.        ,  1.        ,  0.        ],
       [ 0.90453403,  0.        ,  0.        ,  1.        ]])

cuml.compose.make_column_transformer(*transformers, remainder='drop', sparse_threshold=0.3, n_jobs=None, verbose=False)[source]#

Construct a ColumnTransformer from the given transformers.

This is a shorthand for the ColumnTransformer constructor; it does not require, and does not permit, naming the transformers. Instead, they will be given names automatically based on their types. It also does not allow weighting with transformer_weights.

Parameters:

*transformerstuples

Tuples of the form (transformer, columns) specifying the transformer objects to be applied to subsets of the data:

transformer{‘drop’, ‘passthrough’} or estimator
Estimator must support fit and transform. Special-cased strings ‘drop’ and ‘passthrough’ are accepted as well, to indicate to drop the columns or to pass them through untransformed, respectively.
columnsstr, array-like of str, int, array-like of int, slice, array-like of bool or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.

remainder{‘drop’, ‘passthrough’} or estimator, default=’drop’

sparse_thresholdfloat, default=0.3

If the transformed output consists of a mix of sparse and dense data, it will be stacked as a sparse matrix if the density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all sparse or all dense data, the stacked result will be sparse or dense, respectively, and this keyword will be ignored.

n_jobsint, default=None

Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

verbosebool, default=False

If True, the time elapsed while fitting each transformer will be printed as it is completed.

Returns:

ctColumnTransformer

See also

ColumnTransformer: Class that allows combining the outputs of multiple transformer objects used on column subsets of the data into a single feature space.

Examples

>>> from cuml.preprocessing import StandardScaler, OneHotEncoder
>>> from cuml.compose import make_column_transformer
>>> make_column_transformer(
...     (StandardScaler(), ['numerical_column']),
...     (OneHotEncoder(), ['categorical_column']))
ColumnTransformer(transformers=[('standardscaler', StandardScaler(...),
                                 ['numerical_column']),
                                ('onehotencoder', OneHotEncoder(...),
                                 ['categorical_column'])])

Text Preprocessing (Single-GPU)#

class cuml.preprocessing.text.stem.PorterStemmer(mode='NLTK_EXTENSIONS')[source]#
A word stemmer based on the Porter stemming algorithm.

Porter, M. “An algorithm for suffix stripping.” Program 14.3 (1980): 130-137.

See http://www.tartarus.org/~martin/PorterStemmer/ for the homepage of the algorithm.

Martin Porter has endorsed several modifications to the Porter algorithm since writing his original paper, and those extensions are included in the implementations on his website. Additionally, others have proposed further improvements to the algorithm, including NLTK contributors. Currently, only the following mode is supported: PorterStemmer.NLTK_EXTENSIONS

Implementation that includes further improvements devised by NLTK contributors or taken from other modified implementations found on the web.

Parameters:

mode: Mode of stemming. Only “NLTK_EXTENSIONS” is currently supported (default=“NLTK_EXTENSIONS”).

Methods

stem(word_str_ser)

Stem Words using Porter stemmer

Examples
>>> import cudf
>>> from cuml.preprocessing.text.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> word_str_ser =  cudf.Series(['revival','singing','adjustable'])
>>> print(stemmer.stem(word_str_ser))
0     reviv
1      sing
2    adjust
dtype: object
stem(word_str_ser)[source]#

Stem Words using Porter stemmer

Parameters:

word_str_sercudf.Series
A string series of words to stem

Returns:

stemmed_sercudf.Series
Stemmed words strings series
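
To give a feel for the kind of rule the Porter algorithm applies, the sketch below implements only step 1a (plural suffix stripping) from the original 1980 paper, in plain Python on single strings. It is illustrative only: porter_step_1a is a hypothetical helper, not part of cuML, and the NLTK_EXTENSIONS mode applies modified and additional rules, so PorterStemmer output can differ.

```python
def porter_step_1a(word):
    """Apply step 1a of the original Porter algorithm to a single word."""
    # Rules are tried in order; the first matching suffix wins.
    rules = (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", ""))
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


words = ["caresses", "ponies", "caress", "cats"]
print([porter_step_1a(w) for w in words])
# ['caress', 'poni', 'caress', 'cat']
```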

Feature and Label Encoding (Dask-based Multi-GPU)#

class cuml.dask.preprocessing.LabelBinarizer(*, client=None, **kwargs)[source]#
A distributed version of LabelBinarizer for one-hot encoding a collection of labels.

Methods

fit(y)

Fit label binarizer

fit_transform(y)

Fit the label encoder and return transformed labels

inverse_transform(y[, threshold])

Invert a set of encoded labels back to original labels

transform(y)

Transform and return encoded labels

Examples

Create an array with labels and dummy encode them
>>> import cupy as cp
>>> import cupyx
>>> from cuml.dask.preprocessing import LabelBinarizer

>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client
>>> import dask

>>> cluster = LocalCUDACluster()
>>> client = Client(cluster)

>>> labels = cp.asarray([0, 5, 10, 7, 2, 4, 1, 0, 0, 4, 3, 2, 1],
...                     dtype=cp.int32)
>>> labels = dask.array.from_array(labels)

>>> lb = LabelBinarizer()
>>> encoded = lb.fit_transform(labels)
>>> print(encoded.compute())
[[1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 1 0]
 [0 0 1 0 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 1 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 0 0 1 0 0 0 0]
 [0 0 1 0 0 0 0 0]
 [0 1 0 0 0 0 0 0]]
>>> decoded = lb.inverse_transform(encoded)
>>> print(decoded.compute())
[ 0  5 10  7  2  4  1  0  0  4  3  2  1]
>>> client.close()
>>> cluster.close()
fit(y)[source]#

Fit label binarizer

Parameters:

yDask.Array of shape [n_samples,] or [n_samples, n_classes]
Target values, chunked by row. A 2-D matrix should contain only 0 and 1 and represents multilabel classification.

Returns:

selfreturns an instance of self.

fit_transform(y)[source]#

Fit the label encoder and return transformed labels

Parameters:

yDask.Array of shape [n_samples,] or [n_samples, n_classes]
Target values. A 2-D matrix should contain only 0 and 1 and represents multilabel classification.

Returns:

arrDask.Array backed by CuPy arrays containing encoded labels

inverse_transform(y, threshold=None)[source]#

Invert a set of encoded labels back to original labels

Parameters:

yDask.Array of shape [n_samples, n_classes]
Encoded labels.

thresholdfloat
This value is currently ignored.

Returns:

arrDask.Array backed by CuPy arrays containing original labels

transform(y)[source]#

Transform and return encoded labels

Parameters:

yDask.Array of shape [n_samples,] or [n_samples, n_classes]

Returns:

arrDask.Array backed by CuPy arrays containing encoded labels
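
The encoding shown in the LabelBinarizer example can be reproduced on the CPU with a few lines of plain Python. This is a minimal sketch of the idea only, assuming one column per sorted unique label; binarize and invert are hypothetical helpers, not cuML functions, and the distributed implementation works on Dask arrays backed by CuPy.

```python
def binarize(labels):
    """Return (classes, rows) with one 0/1 column per sorted unique label."""
    classes = sorted(set(labels))
    column = {label: j for j, label in enumerate(classes)}
    rows = [[0] * len(classes) for _ in labels]
    for i, label in enumerate(labels):
        rows[i][column[label]] = 1
    return classes, rows


def invert(classes, rows):
    """Map each row back to the label of its largest entry."""
    return [classes[row.index(max(row))] for row in rows]


labels = [0, 5, 10, 7, 2, 4, 1, 0, 0, 4, 3, 2, 1]
classes, encoded = binarize(labels)
print(classes)     # [0, 1, 2, 3, 4, 5, 7, 10]
print(encoded[1])  # [0, 0, 0, 0, 0, 1, 0, 0]
print(invert(classes, encoded) == labels)  # True
```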
class cuml.dask.preprocessing.LabelEncoder.LabelEncoder(*, client=None, verbose=False, **kwargs)[source]#
A cuDF-based implementation of ordinal label encoding

Parameters:

handle_unknown{‘error’, ‘ignore’}, default=’error’
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform or inverse transform, the resulting encoding will be null.

Methods

fit(y)

Fit a LabelEncoder instance to a set of categories

fit_transform(y[, delayed])

Simultaneously fit and transform an input

inverse_transform(y[, delayed])

Convert the data back to the original representation.

transform(y[, delayed])

Transform an input into its categorical keys.

Examples

Converting a categorical implementation to a numerical one
>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client
>>> import cudf
>>> import dask_cudf
>>> from cuml.dask.preprocessing import LabelEncoder

>>> import pandas as pd
>>> pd.set_option('display.max_colwidth', 2000)

>>> cluster = LocalCUDACluster(threads_per_worker=1)
>>> client = Client(cluster)
>>> df = cudf.DataFrame({'num_col':[10, 20, 30, 30, 30],
...                    'cat_col':['a','b','c','a','a']})
>>> ddf = dask_cudf.from_cudf(df, npartitions=2)

>>> # There are two functionally equivalent ways to do this
>>> le = LabelEncoder()
>>> le.fit(ddf.cat_col)  # le = le.fit(ddf.cat_col) also works
<cuml.dask.preprocessing.LabelEncoder.LabelEncoder object at 0x...>
>>> encoded = le.transform(ddf.cat_col)
>>> print(encoded.compute())
0    0
1    1
2    2
3    0
4    0
dtype: uint8

>>> # This method is preferred
>>> le = LabelEncoder()
>>> encoded = le.fit_transform(ddf.cat_col)
>>> print(encoded.compute())
0    0
1    1
2    2
3    0
4    0
dtype: uint8

>>> # We can assign this to a new column
>>> ddf = ddf.assign(encoded=encoded.values)
>>> print(ddf.compute())
   num_col cat_col  encoded
0       10       a        0
1       20       b        1
2       30       c        2
3       30       a        0
4       30       a        0
>>> # We can also encode more data
>>> test_data = cudf.Series(['c', 'a'])
>>> encoded = le.transform(dask_cudf.from_cudf(test_data,
...                                            npartitions=2))
>>> print(encoded.compute())
0    2
1    0
dtype: uint8

>>> # After fitting, ordinal labels can be converted back to string
>>> # labels with inverse_transform()
>>> ord_label = cudf.Series([0, 0, 1, 2, 1])
>>> ord_label = le.inverse_transform(
...    dask_cudf.from_cudf(ord_label,npartitions=2))

>>> print(ord_label.compute())
0    a
1    a
2    b
3    c
4    b
dtype: object
>>> client.close()
>>> cluster.close()
fit(y)[source]#

Fit a LabelEncoder instance to a set of categories

Parameters:

ydask_cudf.Series
Series containing the categories to be encoded. Its elements may or may not be unique

Returns:

selfLabelEncoder
A fitted instance of itself to allow method chaining

Notes

Number of unique classes will be collected at the client. It’ll consume memory proportional to the number of unique classes.

fit_transform(y, delayed=True)[source]#

Simultaneously fit and transform an input

This is functionally equivalent to (but faster than) LabelEncoder().fit(y).transform(y)

inverse_transform(y, delayed=True)[source]#

Convert the data back to the original representation. If unknown categories are encountered, None is used to represent them.

Parameters:

ydask_cudf.Series
The ordinally encoded labels to convert back to their original categories.

delayedbool (default = True)
Whether to execute as a delayed task or eager.

Returns:

X_trdask_cudf.Series
Distributed object containing the inverse transformed array.

transform(y, delayed=True)[source]#

Transform an input into its categorical keys.

This is intended for use with small inputs relative to the size of the dataset. For fitting and transforming an entire dataset, prefer fit_transform.

Parameters:

ydask_cudf.Series
Input keys to be transformed. Its values should match the categories given to fit

Returns:

encodeddask_cudf.Series
The ordinally encoded input series

Raises:

KeyError
if a category appears that was not seen in fit
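
The fit, transform, and inverse_transform contract described above can be sketched in plain Python. This is a hedged illustration rather than the cuML implementation: the helper names are hypothetical, categories are assumed to be numbered in sorted order (consistent with the example output above), and None stands in for the null produced when handle_unknown is 'ignore'.

```python
def fit(values):
    """Collect the sorted unique categories (assumed ordering)."""
    return sorted(set(values))


def transform(classes, values, handle_unknown="error"):
    """Replace each value with the index of its category."""
    index = {c: i for i, c in enumerate(classes)}
    encoded = []
    for v in values:
        if v in index:
            encoded.append(index[v])
        elif handle_unknown == "ignore":
            encoded.append(None)  # stands in for a null entry
        else:
            raise KeyError(f"unseen category: {v!r}")
    return encoded


def inverse_transform(classes, codes):
    """Map ordinal codes back to their categories."""
    return [classes[c] for c in codes]


classes = fit(["a", "b", "c", "a", "a"])
print(transform(classes, ["c", "a"]))                # [2, 0]
print(inverse_transform(classes, [0, 0, 1, 2, 1]))   # ['a', 'a', 'b', 'c', 'b']
print(transform(classes, ["z"], handle_unknown="ignore"))  # [None]
```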
class cuml.dask.preprocessing.OneHotEncoder(*, client=None, verbose=False, **kwargs)[source]#

Encode categorical features as a one-hot numeric array. The input to this transformer should be a dask_cudf.DataFrame or a CuPy-backed dask.Array, denoting the values taken on by categorical features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter). By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.

Parameters:

categories‘auto’, cupy.ndarray or cudf.DataFrame, default=’auto’
Categories (unique values) per feature. All categories are expected to fit on one GPU.

‘auto’ : Determine categories automatically from the training data.

DataFrame/ndarray : categories[col] holds the categories expected in the feature col.

drop‘first’, None or a dict, default=None
Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.

None : retain all features (the default).

‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.

Dict : drop[col] is the category in feature col that should be dropped.

sparsebool, default=False
This feature is deactivated and raises an exception when True, because sparse matrices are not yet fully supported by CuPy, which causes incorrect values when computing one-hot encodings. See cupy/cupy#3223

dtypenumber type, default=np.float
Desired datatype of transform’s output.

handle_unknown{‘error’, ‘ignore’}, default=’error’
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

Methods

fit(X)

Fit a multi-node multi-gpu OneHotEncoder to X.

inverse_transform(X[, delayed])

Convert the data back to the original representation.

transform(X[, delayed])

Transform X using one-hot encoding.

fit(X)[source]#

Fit a multi-node multi-gpu OneHotEncoder to X.

Parameters:

XDask cuDF DataFrame or CuPy backed Dask Array
The data to determine the categories of each feature.

Returns:

self

inverse_transform(X, delayed=True)[source]#

Convert the data back to the original representation. In case unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category.

Parameters:

XCuPy backed Dask Array, shape [n_samples, n_encoded_features]
The transformed data.

delayedbool (default = True)
Whether to execute as a delayed task or eager.

Returns:

X_trDask cuDF DataFrame or CuPy backed Dask Array
Distributed object containing the inverse transformed array.

transform(X, delayed=True)[source]#

Transform X using one-hot encoding.

Parameters:

XDask cuDF DataFrame or CuPy backed Dask Array
The data to encode.

delayedbool (default = True)
Whether to execute as a delayed task or eager.

Returns:

outDask cuDF DataFrame or CuPy backed Dask Array
Distributed object containing the transformed input.
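
The interaction between drop and handle_unknown is easier to see on a single feature. The following plain-Python sketch is an illustration under stated assumptions, not cuML code: one_hot is a hypothetical helper, it handles one feature at a time, and it only covers drop=None and drop='first'.

```python
def one_hot(values, categories, drop=None, handle_unknown="error"):
    """One-hot encode a single feature given its known categories."""
    kept = list(categories[1:]) if drop == "first" else list(categories)
    column = {c: j for j, c in enumerate(kept)}
    rows = []
    for v in values:
        if v not in categories and handle_unknown == "error":
            raise KeyError(f"unknown category: {v!r}")
        row = [0.0] * len(kept)
        if v in column:
            row[column[v]] = 1.0
        # Unknown values (with 'ignore') and the dropped first category
        # both leave the row as all zeros.
        rows.append(row)
    return rows


cats = ["a", "b", "c"]
print(one_hot(["a", "c"], cats))
# [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(one_hot(["a", "c"], cats, drop="first"))
# [[0.0, 0.0], [0.0, 1.0]]
print(one_hot(["z"], cats, handle_unknown="ignore"))
# [[0.0, 0.0, 0.0]]
```

Note that with drop='first' an all-zero row is ambiguous between the dropped category and an ignored unknown value, which is one reason to combine the two options with care.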

Feature Extraction (Single-GPU)#

class cuml.feature_extraction.text.CountVectorizer(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float32'>, delimiter=' ')[source]#

Convert a collection of text documents to a matrix of token counts

If you do not provide an a-priori dictionary then the number of features will be equal to the vocabulary size found by analyzing the data.

Parameters:

lowercaseboolean, True by default
Convert all characters to lowercase before tokenizing.

preprocessorcallable or None (default)
Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.

stop_wordsstring {‘english’}, list, or None (default)
If ‘english’, a built-in stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the input documents. If None, no stop words will be used. max_df can be set to a value to automatically detect and filter stop words based on intra corpus document frequency of terms.

ngram_rangetuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

analyzerstring, {‘word’, ‘char’, ‘char_wb’}
Whether the feature should be made of word n-gram or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

max_dffloat in range [0.0, 1.0] or int, default=1.0
When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if int, absolute counts. This parameter is ignored if vocabulary is not None.

min_dffloat in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if int, absolute counts. This parameter is ignored if vocabulary is not None.

max_featuresint or None, default=None
If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.

vocabularycudf.Series, optional
If not given, a vocabulary is determined from the input documents.

binaryboolean, default=False
If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.

dtypetype, optional
Type of the matrix returned by fit_transform() or transform().

delimiterstr, whitespace by default
String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.

Attributes:

vocabulary_cudf.Series[str]
Array mapping from feature integer indices to feature name.

stop_words_cudf.Series[str]

Terms that were ignored because they either:

occurred in too many documents (max_df)

occurred in too few documents (min_df)

were cut off by feature selection (max_features).

This is only available if no vocabulary was given.

Methods

fit(raw_documents[, y])

Build a vocabulary of all tokens in the raw documents.

fit_transform(raw_documents[, y])

Build the vocabulary and return document-term matrix.

get_feature_names()

Array mapping from feature integer indices to feature name.

inverse_transform(X)

Return terms per document with nonzero entries in X.

transform(raw_documents)

Transform documents to document-term matrix.

fit(raw_documents, y=None)[source]#

Build a vocabulary of all tokens in the raw documents.

Parameters:

raw_documentscudf.Series or pd.Series
A Series of string documents

yNone
Ignored.

Returns:

self

fit_transform(raw_documents, y=None)[source]#

Build the vocabulary and return document-term matrix.

Equivalent to self.fit(X).transform(X) but preprocesses X only once.

Parameters:

raw_documentscudf.Series or pd.Series
A Series of string documents

yNone
Ignored.

Returns:

Xcupy csr array of shape (n_samples, n_features)
Document-term matrix.

get_feature_names()[source]#

Array mapping from feature integer indices to feature name.

Returns:

feature_namesSeries
A list of feature names.

inverse_transform(X)[source]#

Return terms per document with nonzero entries in X.

Parameters:

Xarray-like of shape (n_samples, n_features)
Document-term matrix.

Returns:

X_invlist of cudf.Series of shape (n_samples,)
List of Series of terms.

transform(raw_documents)[source]#

Transform documents to document-term matrix.

Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.

Parameters:

raw_documentscudf.Series or pd.Series
A Series of string documents

Returns:

Xcupy csr array of shape (n_samples, n_features)
Document-term matrix.
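
The roles of ngram_range, min_df, and max_df can be sketched on the CPU with the standard library. This is a rough illustration, not the cuML implementation: the helper names are hypothetical, the tokenizer (lowercase, split on non-alphanumeric characters) is an assumption, and the vocabulary is assumed to be sorted alphabetically.

```python
import re
from collections import Counter


def word_ngrams(text, ngram_range=(1, 1)):
    """Lowercase, split on non-alphanumerics, and build word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    min_n, max_n = ngram_range
    return [
        " ".join(tokens[i : i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(docs, ngram_range=(1, 1), min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for d in docs for t in set(word_ngrams(d, ngram_range)))
    # Floats are proportions of documents; ints are absolute counts.
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(t for t, c in df.items() if low <= c <= high)


def count_matrix(docs, vocabulary, ngram_range=(1, 1)):
    """Dense document-term count matrix as a list of rows."""
    rows = []
    for d in docs:
        counts = Counter(word_ngrams(d, ngram_range))
        rows.append([counts[t] for t in vocabulary])
    return rows


corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
vocab = build_vocabulary(corpus, min_df=2, max_df=0.75)
print(vocab)                        # ['document', 'first']
print(count_matrix(corpus, vocab))  # [[1, 1], [2, 0], [0, 0], [1, 1]]
```

Here 'this', 'is', and 'the' appear in all four documents and are removed by max_df=0.75, while terms that appear in a single document are removed by min_df=2.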
class cuml.feature_extraction.text.HashingVectorizer(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, dtype=<class 'numpy.float32'>, delimiter=' ')[source]#
Convert a collection of text documents to a matrix of token occurrences

It turns a collection of text documents into a cupyx.scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm=’l1’ or projected on the euclidean unit sphere if norm=’l2’.

This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index mapping.

This strategy has several advantages:

it uses very little memory and scales to large datasets, as there is no need to store a vocabulary dictionary in memory; this matters even more on GPUs, which are often memory constrained

it is fast to pickle and un-pickle as it holds no state besides the constructor parameters

it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.

There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):

there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.

there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).

no IDF weighting as this would render the transformer stateful.

The hash function employed is the signed 32-bit version of Murmurhash3.

Parameters:

lowercasebool, default=True
Convert all characters to lowercase before tokenizing.

preprocessorcallable or None (default)
Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.

stop_wordsstring {‘english’}, list, default=None
If ‘english’, a built-in stop word list for English is used. There are several known issues with ‘english’ and you should consider an alternative. If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.

ngram_rangetuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

analyzerstring, {‘word’, ‘char’, ‘char_wb’}
Whether the feature should be made of word n-gram or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

n_featuresint, default=(2 ** 20)
The number of features (columns) in the output matrices. Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners.

binarybool, default=False
If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.

norm{‘l1’, ‘l2’}, default=’l2’
Norm used to normalize term vectors. None for no normalization.

alternate_signbool, default=True
When True, an alternating sign is added to the features so as to approximately conserve the inner product in the hashed space even for small n_features. This approach is similar to sparse random projection.

dtypetype, optional
Type of the matrix returned by fit_transform() or transform().

delimiterstr, whitespace by default
String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.

Methods

fit(X[, y])

This method only checks the input type and the model parameter.

fit_transform(X[, y])

Transform a sequence of documents to a document-term matrix.

partial_fit(X[, y])

Does nothing: this transformer is stateless. This method is just there to mark the fact that this transformer can work in a streaming setup.

transform(raw_documents)

Transform documents to document-term matrix.

See also

CountVectorizer, TfidfVectorizer

Examples
>>> from cuml.feature_extraction.text import HashingVectorizer
>>> import pandas as pd
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = HashingVectorizer(n_features=2**4)
>>> X = vectorizer.fit_transform(pd.Series(corpus))
>>> print(X.shape)
(4, 16)
fit(X, y=None)[source]#

This method only checks the input type and the model parameter. It does not do anything meaningful as this transformer is stateless

Parameters:

Xcudf.Series or pd.Series
A Series of string documents

fit_transform(X, y=None)[source]#

Transform a sequence of documents to a document-term matrix.

Parameters:

Xiterable over raw text documents, length = n_samples
Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed.

yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

Returns:

Xsparse CuPy CSR matrix of shape (n_samples, n_features)
Document-term matrix.

partial_fit(X, y=None)[source]#

Does nothing: this transformer is stateless. This method is just there to mark the fact that this transformer can work in a streaming setup.

Parameters:

Xcudf.Series
A Series of string documents

transform(raw_documents)[source]#

Transform documents to document-term matrix.

Extract token counts out of raw text documents by hashing each token to a feature index. No fitted vocabulary is used, since this transformer is stateless.

Parameters:

raw_documentscudf.Series or pd.Series
A Series of string documents

Returns:

Xsparse CuPy CSR matrix of shape (n_samples, n_features)
Document-term matrix.
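
The hashing trick, alternate_sign, and norm options described for HashingVectorizer can be sketched in plain Python. This is an illustration under loud assumptions, not the cuML implementation: hash_row is a hypothetical helper, the tokenizer is a simple regular expression, and zlib.crc32 (reinterpreted as a signed 32-bit value) stands in for the signed 32-bit MurmurHash3 named above, so the column indices will not match cuML's.

```python
import math
import re
import zlib


def hash_row(text, n_features=16, alternate_sign=True, norm="l2"):
    """Hash the tokens of one document into a dense row of n_features."""
    row = [0.0] * n_features
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        h = zlib.crc32(token.encode("utf-8"))  # stand-in hash (assumption)
        if h >= 2**31:
            h -= 2**32  # reinterpret as a signed 32-bit integer
        # The sign of the hash decides whether the count is added or subtracted.
        sign = -1.0 if (alternate_sign and h < 0) else 1.0
        row[abs(h) % n_features] += sign
    if norm == "l2":
        length = math.sqrt(sum(v * v for v in row))
    elif norm == "l1":
        length = sum(abs(v) for v in row)
    else:
        length = 0.0  # no normalization requested
    if length > 0:
        row = [v / length for v in row]
    return row


row = hash_row("This is the first document.", n_features=16)
print(len(row))  # 16
```

Because the row depends only on the text and the constructor-style arguments, the same document always maps to the same row, which is why no fit step is needed.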
class cuml.feature_extraction.text.TfidfVectorizer(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float32'>, delimiter=' ', norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)[source]#

Convert a collection of raw documents to a matrix of TF-IDF features.

Equivalent to CountVectorizer followed by TfidfTransformer.

Parameters:

lowercaseboolean, True by default
Convert all characters to lowercase before tokenizing.

preprocessorcallable or None (default)
Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.

stop_wordsstring {‘english’}, list, or None (default)
If ‘english’, a built-in stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the input documents. If None, no stop words will be used. max_df can be set to a value to automatically detect and filter stop words based on intra corpus document frequency of terms.

ngram_rangetuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

analyzerstring, {‘word’, ‘char’, ‘char_wb’}, default=’word’
Whether the feature should be made of word n-gram or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

max_dffloat in range [0.0, 1.0] or int, default=1.0
When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if int, absolute counts. This parameter is ignored if vocabulary is not None.

min_dffloat in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if int, absolute counts. This parameter is ignored if vocabulary is not None.

max_featuresint or None, default=None
If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.

vocabularycudf.Series, optional
If not given, a vocabulary is determined from the input documents.

binaryboolean, default=False
If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.

dtypetype, optional
Type of the matrix returned by fit_transform() or transform().

delimiterstr, whitespace by default
String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.

norm{‘l1’, ‘l2’}, default=’l2’

Each output row will have unit norm, either:

‘l2’: Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.

‘l1’: Sum of absolute values of vector elements is 1.

use_idfbool, default=True
Enable inverse-document-frequency reweighting.

smooth_idfbool, default=True
Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.

sublinear_tfbool, default=False
Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

Attributes:

idf_array of shape (n_features)
The inverse document frequency (IDF) vector; only defined if use_idf is True.

vocabulary_cudf.Series[str]
Array mapping from feature integer indices to feature name.

stop_words_cudf.Series[str]

Terms that were ignored because they either:

occurred in too many documents (max_df)

occurred in too few documents (min_df)

were cut off by feature selection (max_features).

This is only available if no vocabulary was given.

Methods

fit(raw_documents)

Learn vocabulary and idf from training set.

fit_transform(raw_documents[, y])

Learn vocabulary and idf, return document-term matrix.

get_feature_names()

Array mapping from feature integer indices to feature name.

transform(raw_documents)

Transform documents to document-term matrix.

Notes

The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling.

This class is largely based on scikit-learn 0.23.1’s TfidfVectorizer code, which is provided under the BSD-3 license.

fit(raw_documents)[source]#

Learn vocabulary and idf from training set.

Parameters:

raw_documentscudf.Series or pd.Series
A Series of string documents

Returns:

selfobject
Fitted vectorizer.

fit_transform(raw_documents, y=None)[source]#

Learn vocabulary and idf, return document-term matrix. This is equivalent to fit followed by transform, but more efficiently implemented.

Parameters:

raw_documentscudf.Series or pd.Series
A Series of string documents

yNone
Ignored.

Returns:

Xcupy csr array of shape (n_samples, n_features)
Tf-idf-weighted document-term matrix.

get_feature_names()[source]#

Array mapping from feature integer indices to feature name.

Returns:

feature_namesSeries
A list of feature names.

transform(raw_documents)[source]#

Transform documents to document-term matrix. Uses the vocabulary and document frequencies (df) learned by fit (or fit_transform).

Parameters:

raw_documentscudf.Series or pd.Series
A Series of string documents

Returns:

Xcupy csr array of shape (n_samples, n_features)
Tf-idf-weighted document-term matrix.
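
The effect of smooth_idf, sublinear_tf, and norm can be worked through by hand. The sketch below uses the IDF formulas from scikit-learn, on which this class is based; whether cuML matches them in every detail is an assumption here, and idf and tfidf_row are hypothetical helpers operating on plain Python lists.

```python
import math


def idf(n_docs, df, smooth_idf=True):
    """Inverse document frequency of a term seen in df of n_docs documents."""
    if smooth_idf:
        # Acts as if one extra document contained every term once.
        return math.log((1 + n_docs) / (1 + df)) + 1.0
    return math.log(n_docs / df) + 1.0


def tfidf_row(counts, idfs, sublinear_tf=False, norm="l2"):
    """Weight one row of term counts by IDF, then normalize it."""
    row = []
    for tf, weight in zip(counts, idfs):
        if sublinear_tf and tf > 0:
            tf = 1.0 + math.log(tf)
        row.append(tf * weight)
    if norm == "l2":
        length = math.sqrt(sum(v * v for v in row))
    elif norm == "l1":
        length = sum(abs(v) for v in row)
    else:
        length = 0.0  # no normalization requested
    if length > 0:
        row = [v / length for v in row]
    return row


# A term present in all 4 documents gets the minimum weight of 1.0.
print(idf(4, 4))  # 1.0
# A rarer term (1 of 4 documents) gets a larger weight, ln(5/2) + 1.
print(idf(4, 1) > idf(4, 4))  # True
print(tfidf_row([2, 0], [1.0, 1.0]))  # [1.0, 0.0]
```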

Feature Extraction (Dask-based Multi-GPU)#

class cuml.dask.feature_extraction.text.TfidfTransformer(*, client=None, verbose=False, **kwargs)[source]#
Distributed TF-IDF transformer

Methods

fit(X[, y])

Fit distributed TFIDF Transformer

fit_transform(X[, y])

Fit distributed TFIDFTransformer and then transform the given set of data samples.

transform(X[, y])

Use distributed TFIDFTransformer to transform the given set of data samples.

Examples
>>> import cupy as cp
>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client
>>> from cuml.dask.common import to_sparse_dask_array
>>> from cuml.dask.naive_bayes import MultinomialNB
>>> import dask
>>> from cuml.dask.feature_extraction.text import TfidfTransformer

>>> # Create a local CUDA cluster
>>> cluster = LocalCUDACluster()
>>> client = Client(cluster)

>>> # Load corpus
>>> twenty_train = fetch_20newsgroups(subset='train',
...                         shuffle=True, random_state=42)
>>> cv = CountVectorizer()
>>> xformed = cv.fit_transform(twenty_train.data).astype(cp.float32)
>>> X = to_sparse_dask_array(xformed, client)

>>> y = dask.array.from_array(twenty_train.target, asarray=False,
...                     fancy=False).astype(cp.int32)

>>> multi_gpu_transformer = TfidfTransformer()
>>> X_transformed = multi_gpu_transformer.fit_transform(X)
>>> X_transformed.compute_chunk_sizes()
dask.array<...>

>>> model = MultinomialNB()
>>> model.fit(X_transformed, y)
<cuml.dask.naive_bayes.naive_bayes.MultinomialNB object at 0x...>
>>> result = model.score(X_transformed, y)
>>> print(result)
array(0.93264981)
>>> client.close()
>>> cluster.close()
fit(X, y=None)[source]#

Fit distributed TFIDF Transformer

Parameters:

Xdask.Array with blocks containing dense or sparse cupy arrays

Returns:

cuml.dask.feature_extraction.text.TfidfTransformer instance

fit_transform(X, y=None)[source]#

Fit distributed TFIDFTransformer and then transform the given set of data samples.

Parameters:

Xdask.Array with blocks containing dense or sparse cupy arrays

Returns:

dask.Array with blocks containing transformed sparse cupy arrays

transform(X, y=None)[source]#

Use distributed TFIDFTransformer to transform the given set of data samples.

Parameters:

Xdask.Array with blocks containing dense or sparse cupy arrays

Returns:

dask.Array with blocks containing transformed sparse cupy arrays

Dataset Generation (Single-GPU)#

random_state#
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.
cuml.datasets.make_blobs(n_samples=100, n_features=2, centers=None, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None, return_centers=False, order='F', dtype='float32')[source]#
Generate isotropic Gaussian blobs for clustering.

Parameters:

n_samplesint or array-like, optional (default=100)
If int, it is the total number of points equally divided among clusters. If array-like, each element of the sequence indicates the number of samples per cluster.

n_featuresint, optional (default=2)
The number of features for each sample.

centersint or array of shape [n_centers, n_features], optional (default=None)
The number of centers to generate, or the fixed center locations. If n_samples is an int and centers is None, 3 centers are generated. If n_samples is array-like, centers must be either None or an array of length equal to the length of n_samples.

cluster_stdfloat or sequence of floats, optional (default=1.0)
The standard deviation of the clusters.

center_boxpair of floats (min, max), optional (default=(-10.0, 10.0))
The bounding box for each cluster center when centers are generated at random.

shuffleboolean, optional (default=True)
Shuffle the samples.

random_stateint, RandomState instance, default=None
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.

return_centersbool, optional (default=False)
If True, then return the centers of each cluster

order: str, optional (default=’F’)
Memory layout of the generated samples: ‘F’ (column-major) or ‘C’ (row-major).

dtypestr, optional (default=’float32’)
Dtype of the generated samples

Returns:

Xdevice array of shape [n_samples, n_features]
The generated samples.

ydevice array of shape [n_samples]
The integer labels for cluster membership of each sample.

centersdevice array, shape [n_centers, n_features]
The centers of each cluster. Only returned if return_centers=True.

See also

make_classification
a more intricate variant

Examples
>>> from sklearn.datasets import make_blobs
>>> X, y = make_blobs(n_samples=10, centers=3, n_features=2,
...                   random_state=0)
>>> print(X.shape)
(10, 2)
>>> y
array([0, 0, 1, 0, 2, 2, 2, 1, 1, 0])
>>> X, y = make_blobs(n_samples=[3, 3, 4], centers=None, n_features=2,
...                   random_state=0)
>>> print(X.shape)
(10, 2)
>>> y
array([0, 1, 2, 0, 2, 2, 2, 1, 1, 0])
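
The two accepted forms of n_samples can be illustrated without a GPU. The sketch below is plain Python, not cuML code: split_samples is a hypothetical helper, and handing leftover points to the first clusters is an assumption borrowed from scikit-learn's make_blobs.

```python
def split_samples(n_samples, n_centers=3):
    """Return the number of points per cluster (hypothetical helper)."""
    if isinstance(n_samples, int):
        # Integer: divide the total as evenly as possible among the clusters.
        base, remainder = divmod(n_samples, n_centers)
        # Assumption: leftover points go to the first `remainder` clusters.
        return [base + 1 if i < remainder else base for i in range(n_centers)]
    # Array-like: each element is already the size of one cluster.
    return [int(n) for n in n_samples]


print(split_samples(10, n_centers=3))  # [4, 3, 3]
print(split_samples([3, 3, 4]))        # [3, 3, 4]
```

The label counts in the first example above (four 0s, three 1s, three 2s) are consistent with this split.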
cuml.datasets.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None, order='F', dtype='float32', _centroids=None, _informative_covariance=None, _redundant_covariance=None, _repeated_indices=None)[source]#
Generate a random n-class classification problem.

This initially creates clusters of points normally distributed (std=1) about vertices of an n_informative-dimensional hypercube with sides of length 2 * class_sep and assigns an equal number of clusters to each class. It introduces interdependence between these features and adds various types of further noise to the data.

Without shuffling, X horizontally stacks features in the following order: the primary n_informative features, followed by n_redundant linear combinations of the informative features, followed by n_repeated duplicates, drawn randomly with replacement from the informative and redundant features. The remaining features are filled with random noise. Thus, without shuffling, all useful features are contained in the columns X[:, :n_informative + n_redundant + n_repeated].

Parameters:

n_samplesint, optional (default=100)
The number of samples.

n_featuresint, optional (default=20)
The total number of features. These comprise n_informative informative features, n_redundant redundant features, n_repeated duplicated features and n_features-n_informative-n_redundant-n_repeated useless features drawn at random.

n_informativeint, optional (default=2)
The number of informative features. Each class is composed of a number of gaussian clusters each located around the vertices of a hypercube in a subspace of dimension n_informative. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined within each cluster in order to add covariance. The clusters are then placed on the vertices of the hypercube.

n_redundantint, optional (default=2)
The number of redundant features. These features are generated as random linear combinations of the informative features.

n_repeatedint, optional (default=0)
The number of duplicated features, drawn randomly from the informative and the redundant features.

n_classesint, optional (default=2)
The number of classes (or labels) of the classification problem.

n_clusters_per_classint, optional (default=2)
The number of clusters per class.

weightsarray-like of shape (n_classes,) or (n_classes - 1,), (default=None)
The proportions of samples assigned to each class. If None, then classes are balanced. Note that if len(weights) == n_classes - 1, then the last class weight is automatically inferred. More than n_samples samples may be returned if the sum of weights exceeds 1.

flip_yfloat, optional (default=0.01)
The fraction of samples whose class is assigned randomly. Larger values introduce noise in the labels and make the classification task harder.

class_sepfloat, optional (default=1.0)
The factor multiplying the hypercube size. Larger values spread out the clusters/classes and make the classification task easier.

hypercubeboolean, optional (default=True)
If True, the clusters are put on the vertices of a hypercube. If False, the clusters are put on the vertices of a random polytope.

shiftfloat, array of shape [n_features] or None, optional (default=0.0)
Shift features by the specified value. If None, then features are shifted by a random value drawn in [-class_sep, class_sep].

scalefloat, array of shape [n_features] or None, optional (default=1.0)
Multiply features by the specified value. If None, then features are scaled by a random value drawn in [1, 100]. Note that scaling happens after shifting.

shuffleboolean, optional (default=True)
Shuffle the samples and the features.

random_stateint, RandomState instance or None (default)
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See Glossary.

order: str, optional (default=’F’)
Memory layout of the generated samples: ‘F’ (column-major) or ‘C’ (row-major).

dtypestr, optional (default=’float32’)
Dtype of the generated samples

_centroids: array of centroids of shape (n_clusters, n_informative)

_informative_covariance: array for covariance between informative features of shape (n_clusters, n_informative, n_informative)

_redundant_covariance: array for covariance between redundant features of shape (n_informative, n_redundant)

_repeated_indices: array of indices for the repeated features of shape (n_repeated, )

Returns:

Xdevice array of shape [n_samples, n_features]
The generated samples.

ydevice array of shape [n_samples]
The integer labels for class membership of each sample.

Notes

The algorithm is adapted from Guyon [1] and was designed to generate the “Madelon” dataset. It is optimized for GPUs in three ways:

First, X is generated from a standard univariate normal distribution instead of being initialized with zeros. This saves memory, because univariates do not have to be generated separately for each feature class (informative, repeated, etc.), and generating one big matrix on the GPU is also faster.

Second, order=F is obtained by construction. Because X is generated from a univariate normal and covariance is introduced with matrix multiplications, X can be generated as a 1D array and reshaped to the desired order, which only updates the metadata and eliminates copies.

Lastly, the shuffle is also done by construction. Centroid indices are permuted for each sample, and the data is then constructed for each centroid. This shuffle works for both order=C and order=F and eliminates any need for secondary copies.

References

[1]
I. Guyon, “Design of experiments for the NIPS 2003 variable selection benchmark”, 2003.

Examples
>>> from cuml.datasets.classification import make_classification

>>> X, y = make_classification(n_samples=10, n_features=4,
...                            n_informative=2, n_classes=2,
...                            random_state=10)

>>> print(X)
[[-1.7974224   0.24425316  0.39062843 -0.38293394]
 [ 0.6358963   1.4161923   0.06970507 -0.16085647]
 [-0.22802866 -1.1827322   0.3525861   0.276615  ]
 [ 1.7308872   0.43080002  0.05048406  0.29837844]
 [-1.9465544   0.5704457  -0.8997551  -0.27898186]
 [ 1.0575483  -0.9171263   0.09529338  0.01173469]
 [ 0.7917619  -1.0638094  -0.17599393 -0.06420116]
 [-0.6686142  -0.13951421 -0.6074711   0.21645583]
 [-0.88968956 -0.914443    0.1302423   0.02924336]
 [-0.8817671  -0.84549576  0.1845096   0.02556021]]

>>> print(y)
[1 0 1 1 1 1 1 1 1 0]
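
The column ordering described for shuffle=False can be made concrete with a small index calculation. This is an illustrative sketch only: feature_layout is a hypothetical helper, not part of cuML, and it simply restates the stacking order given in the description.

```python
def feature_layout(n_features=20, n_informative=2, n_redundant=2, n_repeated=0):
    """Map each feature group to its column range when shuffle=False."""
    n_useful = n_informative + n_redundant + n_repeated
    if n_useful > n_features:
        raise ValueError("informative + redundant + repeated exceeds n_features")
    # Boundaries between consecutive groups, in stacking order.
    bounds = [0, n_informative, n_informative + n_redundant, n_useful, n_features]
    names = ["informative", "redundant", "repeated", "noise"]
    return {name: range(bounds[i], bounds[i + 1]) for i, name in enumerate(names)}


layout = feature_layout(n_features=4, n_informative=2)
for name, cols in layout.items():
    print(name, list(cols))
# informative [0, 1]
# redundant [2, 3]
# repeated []
# noise []
```

With the default n_features=20, the same helper reports columns 4 to 19 as pure noise, which is why slicing X[:, :n_informative + n_redundant + n_repeated] keeps every useful feature.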
cuml.datasets.make_regression(n_samples=100, n_features=2, n_informative=2, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None, dtype='single', handle=None) → Union[Tuple[CumlArray, CumlArray], Tuple[CumlArray, CumlArray, CumlArray]][source]#
Generate a random regression problem.

See https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html

Parameters:

n_samplesint, optional (default=100)
The number of samples.

n_featuresint, optional (default=2)
The number of features.

n_informativeint, optional (default=2)
The number of informative features, i.e., the number of features used to build the linear model used to generate the output.

n_targetsint, optional (default=1)
The number of regression targets, i.e., the dimension of the y output vector associated with a sample. By default, the output is a scalar.

biasfloat, optional (default=0.0)
The bias term in the underlying linear model.

effective_rankint or None, optional (default=None)

if not None:
The approximate number of singular vectors required to explain most of the input data by linear combinations. Using this kind of singular spectrum in the input allows the generator to reproduce the correlations often observed in practice.

if None:
The input set is well conditioned, centered and gaussian with unit variance.

tail_strengthfloat between 0.0 and 1.0, optional (default=0.5)
The relative importance of the fat noisy tail of the singular values profile if effective_rank is not None.

noisefloat, optional (default=0.0)
The standard deviation of the gaussian noise applied to the output.

shuffleboolean, optional (default=True)
Shuffle the samples and the features.

coefboolean, optional (default=False)
If True, the coefficients of the underlying linear model are returned.

random_stateint, RandomState instance or None (default)
Seed for the random number generator for dataset creation.

dtype: string or numpy dtype (default: ‘single’)
Type of the data. Possible values: float32, float64, ‘single’, ‘float’ or ‘double’.

handle: cuml.Handle
If it is None, a new one is created just for this function call

Returns:

outdevice array of shape [n_samples, n_features]
The input samples.

valuesdevice array of shape [n_samples, n_targets]
The output values.

coefdevice array of shape [n_features, n_targets], optional
The coefficient of the underlying linear model. It is returned only if coef is True.

Examples
>>> from cuml.datasets.regression import make_regression
>>> from cuml.linear_model import LinearRegression

>>> # Create regression problem
>>> data, values = make_regression(n_samples=200, n_features=12,
...                                n_informative=7, bias=-4.2,
...                                noise=0.3, random_state=10)

>>> # Perform a linear regression on this problem
>>> lr = LinearRegression(fit_intercept = True, normalize = False,
...                       algorithm = "eig")
>>> reg = lr.fit(data, values)
>>> print(reg.coef_)
[-2.6980877e-02  7.7027252e+01  1.1498465e+01  8.5468025e+00
  5.8548538e+01  6.0772545e+01  3.6876743e+01  4.0023815e+01
  4.3908358e-03 -2.0275116e-02  3.5066366e-02 -3.4512520e-02]
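
The generative model behind the parameters (a linear model with n_informative nonzero coefficients, plus bias and Gaussian noise) can be sketched in plain Python. This is not cuML's implementation: linear_targets is a hypothetical helper, placing the nonzero coefficients first corresponds to shuffle=False, and the 100 * U(0, 1) coefficient scale is an assumption borrowed from scikit-learn.

```python
import random


def linear_targets(X, n_informative, bias=0.0, noise=0.0, seed=0):
    """Sketch of y = X @ coef + bias + noise with a sparse ground-truth coef."""
    rng = random.Random(seed)
    n_features = len(X[0])
    # Only the first n_informative coefficients are nonzero (no shuffling here).
    # Assumption: coefficients are drawn as 100 * U(0, 1), as in scikit-learn.
    coef = [100.0 * rng.random() if j < n_informative else 0.0
            for j in range(n_features)]
    # Each target is the dot product with coef, plus bias, plus Gaussian noise.
    y = [sum(x_j * c_j for x_j, c_j in zip(row, coef)) + bias + rng.gauss(0.0, noise)
         for row in X]
    return y, coef


data_rng = random.Random(42)
X = [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(8)]
y, coef = linear_targets(X, n_informative=2, bias=-4.2)
print(sum(c != 0.0 for c in coef))  # 2
```

A well-fitted linear model recovers this structure, which matches the example output above: the coefficients of the non-informative features are close to zero.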
cuml.datasets.make_arima(batch_size=1000, n_obs=100, order=(1, 1, 1), seasonal_order=(0, 0, 0, 0), intercept=False, random_state=None, dtype='double', handle=None)[source]#
Generates a dataset of time series by simulating an ARIMA process of a given order.

Parameters:

batch_size: int
Number of time series to generate

n_obs: int
Number of observations per series

orderTuple[int, int, int]
Order (p, d, q) of the simulated ARIMA process

seasonal_order: Tuple[int, int, int, int]
Seasonal ARIMA order (P, D, Q, s) of the simulated ARIMA process

intercept: bool or int
Whether to include a constant trend mu in the simulated ARIMA process

random_state: int, RandomState instance or None (default)
Seed for the random number generator for dataset creation.

dtype: string or numpy dtype (default: ‘double’)
Type of the data. Possible values: float32, float64, ‘single’, ‘float’ or ‘double’

handle: cuml.Handle
If it is None, a new one is created just for this function call

Returns:

out: array-like, shape (n_obs, batch_size)
Array of the requested type containing the generated dataset

Examples
>>> from cuml.datasets import make_arima
>>> y = make_arima(1000, 100, (2, 1, 2), (0, 1, 2, 12), 0)
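
For intuition about what simulating an ARIMA process of order (p, d, q) means, here is a minimal plain-Python sketch of a single non-seasonal series. It is not how make_arima is implemented: simulate_arima is a hypothetical helper, the AR and MA coefficients are hand-picked illustrative values, and the seasonal part is omitted.

```python
import random
from itertools import accumulate


def simulate_arima(n_obs, ar=(0.5,), ma=(0.3,), d=1, mu=0.0, seed=0):
    """Simulate one non-seasonal ARIMA(p, d, q) series (illustrative only)."""
    rng = random.Random(seed)
    p, q = len(ar), len(ma)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]  # white-noise shocks
    x = []
    for t in range(n_obs):
        # ARMA recursion: constant + AR terms on past values + MA terms on past shocks.
        value = mu + eps[t]
        value += sum(ar[i] * x[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        value += sum(ma[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        x.append(value)
    # Integrate d times: each cumulative sum undoes one order of differencing.
    for _ in range(d):
        x = list(accumulate(x))
    return x


series = simulate_arima(100)
print(len(series))  # 100
```

Differencing the returned series d times gives back the underlying ARMA(p, q) series, which is the defining property of the d parameter.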

Dataset Generation (Dask-based Multi-GPU)#

cuml.dask.datasets.blobs.make_blobs(n_samples=100, n_features=2, centers=None, cluster_std=1.0, n_parts=None, center_box=(-10, 10), shuffle=True, random_state=None, return_centers=False, verbose=False, order='F', dtype='float32', client=None, workers=None)[source]#
Makes labeled Dask-CuPy arrays containing blobs for a randomly generated set of centroids.

This function calls make_blobs from cuml.datasets on each Dask worker and aggregates the results into a single Dask array.

For more information, see Scikit-learn’s make_blobs.

Parameters:

n_samplesint
number of rows

n_featuresint
number of features

centersint or array of shape [n_centers, n_features], optional (default=None)
The number of centers to generate, or the fixed center locations. If n_samples is an int and centers is None, 3 centers are generated. If n_samples is array-like, centers must be either None or an array of length equal to the length of n_samples.

cluster_stdfloat (default = 1.0)
standard deviation of points around centroid

n_partsint (default = None)
number of partitions to generate (this can be greater than the number of workers)

center_boxtuple (int, int) (default = (-10, 10))
the bounding box which constrains all the centroids

random_stateint (default = None)
sets random seed (or use None to reinitialize each time)

return_centersbool, optional (default=False)
If True, then return the centers of each cluster

verboseint or boolean (default = False)
Logging level.

shufflebool (default=True)
Shuffles the samples on each worker.

order: str, optional (default=’F’)
Memory layout of the generated samples: ‘F’ (column-major) or ‘C’ (row-major).

dtypestr, optional (default=’float32’)
Dtype of the generated samples

clientdask.distributed.Client (optional)
Dask client to use

workersoptional, list of strings
Dask addresses of workers to use for computation, e.g. workers = list(client.scheduler_info()['workers'].keys()). If None, all available Dask workers will be used.

Returns:

Xdask.array backed by CuPy array of shape [n_samples, n_features]
The generated samples.

ydask.array backed by CuPy array of shape [n_samples]
The integer labels for cluster membership of each sample.

centersdask.array backed by CuPy array of shape [n_centers, n_features], optional
The centers of the underlying blobs. It is returned only if return_centers is True.

Examples
>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client
>>> from cuml.dask.datasets import make_blobs

>>> cluster = LocalCUDACluster(threads_per_worker=1)
>>> client = Client(cluster)

>>> workers = list(client.scheduler_info()['workers'].keys())
>>> X, y = make_blobs(1000, 10, centers=42, cluster_std=0.1,
...                   workers=workers)

>>> client.close()
>>> cluster.close()
cuml.dask.datasets.classification.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None, order='F', dtype='float32', n_parts=None, client=None)[source]#
Generate a random n-class classification problem.

This initially creates clusters of points normally distributed (std=1) about vertices of an n_informative-dimensional hypercube with sides of length 2 * class_sep and assigns an equal number of clusters to each class. It introduces interdependence between these features and adds various types of further noise to the data.

Without shuffling, X horizontally stacks features in the following order: the primary n_informative features, followed by n_redundant linear combinations of the informative features, followed by n_repeated duplicates, drawn randomly with replacement from the informative and redundant features. The remaining features are filled with random noise. Thus, without shuffling, all useful features are contained in the columns X[:, :n_informative + n_redundant + n_repeated].

Parameters:

n_samplesint, optional (default=100)
The number of samples.

n_featuresint, optional (default=20)
The total number of features. These comprise n_informative informative features, n_redundant redundant features, n_repeated duplicated features and n_features-n_informative-n_redundant-n_repeated useless features drawn at random.

n_informativeint, optional (default=2)
The number of informative features. Each class is composed of a number of gaussian clusters each located around the vertices of a hypercube in a subspace of dimension n_informative. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined within each cluster in order to add covariance. The clusters are then placed on the vertices of the hypercube.

n_redundantint, optional (default=2)
The number of redundant features. These features are generated as random linear combinations of the informative features.

n_repeatedint, optional (default=0)
The number of duplicated features, drawn randomly from the informative and the redundant features.

n_classesint, optional (default=2)
The number of classes (or labels) of the classification problem.

n_clusters_per_classint, optional (default=2)
The number of clusters per class.

weightsarray-like of shape (n_classes,) or (n_classes - 1,) , (default=None)
The proportions of samples assigned to each class. If None, then classes are balanced. Note that if len(weights) == n_classes - 1, then the last class weight is automatically inferred. More than n_samples samples may be returned if the sum of weights exceeds 1.

flip_yfloat, optional (default=0.01)
The fraction of samples whose class is assigned randomly. Larger values introduce noise in the labels and make the classification task harder.

class_sepfloat, optional (default=1.0)
The factor multiplying the hypercube size. Larger values spread out the clusters/classes and make the classification task easier.

hypercubeboolean, optional (default=True)
If True, the clusters are put on the vertices of a hypercube. If False, the clusters are put on the vertices of a random polytope.

shiftfloat, array of shape [n_features] or None, optional (default=0.0)
Shift features by the specified value. If None, then features are shifted by a random value drawn in [-class_sep, class_sep].

scalefloat, array of shape [n_features] or None, optional (default=1.0)
Multiply features by the specified value. If None, then features are scaled by a random value drawn in [1, 100]. Note that scaling happens after shifting.

shuffleboolean, optional (default=True)
Shuffle the samples and the features.

random_stateint, RandomState instance or None (default)
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See Glossary.

order: str, optional (default=’F’)
Memory layout of the generated samples: ‘F’ (column-major) or ‘C’ (row-major).

dtypestr, optional (default=’float32’)
Dtype of the generated samples

n_partsint (default = None)
number of partitions to generate (this can be greater than the number of workers)

Returns:

Xdask.array backed by CuPy array of shape [n_samples, n_features]
The generated samples.

ydask.array backed by CuPy array of shape [n_samples]
The integer labels for class membership of each sample.

Notes

How we extended the dask MNMG version from the single GPU version:

We generate centroids of shape (n_centroids, n_informative)

We generate an informative covariance of shape (n_centroids, n_informative, n_informative)

We generate a redundant covariance of shape (n_informative, n_redundant)

We generate the indices for the repeated features

We pass along the references to the futures of the above arrays with each part to the single GPU cuml.datasets.classification.make_classification so that each part (and worker) has access to the correct values to generate data from the same covariances

Examples
>>> from dask.distributed import Client
>>> from dask_cuda import LocalCUDACluster
>>> from cuml.dask.datasets.classification import make_classification
>>> cluster = LocalCUDACluster()
>>> client = Client(cluster)
>>> X, y = make_classification(n_samples=10, n_features=4,
...                            random_state=1, n_informative=2,
...                            n_classes=2)
>>> print(X.compute())
[[-1.1273878   1.2844919  -0.32349187  0.1595734 ]
 [ 0.80521786 -0.65946865 -0.40753683  0.15538901]
 [ 1.0404129  -1.481386    1.4241115   1.2664981 ]
 [-0.92821544 -0.6805706  -0.26001272  0.36004275]
 [-1.0392245  -1.1977317   0.16345565 -0.21848428]
 [ 1.2273135  -0.529214    2.4799604   0.44108105]
 [-1.9163864  -0.39505136 -1.9588828  -1.8881643 ]
 [-0.9788184  -0.89851004 -0.08339313  0.1130247 ]
 [-1.0549078  -0.8993015  -0.11921967  0.04821599]
 [-1.8388828  -1.4063598  -0.02838472 -1.0874642 ]]
>>> print(y.compute())
[1 0 0 0 0 1 0 0 0 0]
>>> client.close()
>>> cluster.close()
cuml.dask.datasets.regression.make_low_rank_matrix(n_samples=100, n_features=100, effective_rank=10, tail_strength=0.5, random_state=None, n_parts=1, n_samples_per_part=None, dtype='float32')[source]#

Generate a mostly low rank matrix with bell-shaped singular values

Parameters:

n_samplesint, optional (default=100)
The number of samples.

n_featuresint, optional (default=100)
The number of features.

effective_rankint, optional (default=10)
The approximate number of singular vectors required to explain most of the data by linear combinations.

tail_strengthfloat between 0.0 and 1.0, optional (default=0.5)
The relative importance of the fat noisy tail of the singular values profile.

random_stateint, CuPy RandomState instance, Dask RandomState instance or None (default)
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.

n_partsint, optional (default=1)
The number of parts of work.

dtype: str, optional (default=’float32’)
dtype of generated data

Returns:

XDask-CuPy array of shape [n_samples, n_features]
The matrix.
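
The roles of effective_rank and tail_strength are easier to see from the singular value profile itself. The sketch below uses the profile from scikit-learn's make_low_rank_matrix (a Gaussian-shaped head plus an exponentially decaying tail). That this function uses the same constants is an assumption, and singular_profile is a hypothetical helper.

```python
import math


def singular_profile(n, effective_rank=10, tail_strength=0.5):
    """Singular values: a bell-shaped head plus a slowly decaying fat tail."""
    values = []
    for i in range(n):
        ratio = i / effective_rank
        # Head: dominates the first ~effective_rank singular values.
        low_rank = (1.0 - tail_strength) * math.exp(-(ratio ** 2))
        # Tail: weighted by tail_strength, decays much more slowly.
        tail = tail_strength * math.exp(-0.1 * ratio)
        values.append(low_rank + tail)
    return values


s = singular_profile(100)
print(round(s[0], 3), round(s[10], 3), round(s[50], 3))  # 1.0 0.636 0.303
```

Setting tail_strength=0.0 leaves only the bell-shaped head, so singular values well beyond effective_rank become negligible.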

cuml.dask.datasets.regression.make_regression(n_samples=100, n_features=100, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=False, coef=False, random_state=None, n_parts=1, n_samples_per_part=None, order='F', dtype='float32', client=None, use_full_low_rank=True)[source]#

Generate a random regression problem.

The input set can either be well conditioned (by default) or have a low rank-fat tail singular profile.

The output is generated by applying a (potentially biased) random linear regression model with “n_informative” nonzero regressors to the previously generated input and some gaussian centered noise with some adjustable scale.

Parameters:

n_samplesint, optional (default=100)
The number of samples.

n_featuresint, optional (default=100)
The number of features.

n_informativeint, optional (default=10)
The number of informative features, i.e., the number of features used to build the linear model used to generate the output.

n_targetsint, optional (default=1)
The number of regression targets, i.e., the dimension of the y output vector associated with a sample. By default, the output is a scalar.

biasfloat, optional (default=0.0)
The bias term in the underlying linear model.

effective_rankint or None, optional (default=None)

if not None:
The approximate number of singular vectors required to explain most of the input data by linear combinations. Using this kind of singular spectrum in the input allows the generator to reproduce the correlations often observed in practice.

if None:
The input set is well conditioned, centered and gaussian with unit variance.

tail_strengthfloat between 0.0 and 1.0, optional (default=0.5)
The relative importance of the fat noisy tail of the singular values profile if “effective_rank” is not None.

noisefloat, optional (default=0.0)
The standard deviation of the gaussian noise applied to the output.

shuffleboolean, optional (default=False)
Shuffle the samples and the features.

coefboolean, optional (default=False)
If True, the coefficients of the underlying linear model are returned.

random_stateint, CuPy RandomState instance, Dask RandomState instance or None (default)
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.

n_partsint, optional (default=1)
The number of parts of work.

orderstr, optional (default=’F’)
Memory layout of the generated arrays: ‘C’ (row-major) or ‘F’ (column-major).

dtype: str, optional (default=’float32’)
dtype of generated data

use_full_low_rankboolean (default=True)
Whether to use the entire dataset to generate the low rank matrix. If False, it creates a low rank covariance and uses the corresponding covariance to generate a multivariate normal distribution on the remaining chunks

Returns:

XDask-CuPy array of shape [n_samples, n_features]
The input samples.

yDask-CuPy array of shape [n_samples] or [n_samples, n_targets]
The output values.

coefDask-CuPy array of shape [n_features] or [n_features, n_targets], optional
The coefficient of the underlying linear model. It is returned only if coef is True.

Notes

Known Performance Limitations:

When effective_rank is set and use_full_low_rank is True, we cannot generate order F by construction, and an explicit transpose is performed on each part. This may cause memory to spike (other parameters make order F by construction)

When n_targets > 1 and order = 'F' as above, we have to explicitly transpose the y array. If coef = True, then we also explicitly transpose the ground_truth array

When shuffle = True and order = 'F', shuffling the F-order arrays causes memory spikes

Note

If out-of-memory errors are encountered in any of the above configurations, try increasing the n_parts parameter.

Metrics (regression, classification, and distance)#

cuml.metrics.regression.mean_absolute_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')[source]#

Mean absolute error regression loss

Be careful when using this metric with float32 inputs as the result can be slightly incorrect because of floating point precision if the input is large enough. float64 will have lower numerical error.

Parameters:

y_truearray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.

y_predarray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Estimated target values.

sample_weightarray-like (device or host) shape = (n_samples,), optional
Sample weights.

multioutputstring in [‘raw_values’, ‘uniform_average’] or array-like of shape (n_outputs)
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.

‘raw_values’: Returns a full set of errors in case of multioutput input.

‘uniform_average’: Errors of all outputs are averaged with uniform weight.

Returns:

lossfloat or ndarray of floats
If multioutput is ‘raw_values’, then mean absolute error is returned for each output separately. If multioutput is ‘uniform_average’ or an ndarray of weights, then the weighted average of all output errors is returned.

MAE output is non-negative floating point. The best value is 0.0.
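
The multioutput options are easiest to compare on a tiny example. The following is a plain-Python reference sketch of the definition (unweighted samples, inputs as lists of rows), not cuML's GPU implementation; mae_multioutput is a hypothetical name.

```python
def mae_multioutput(y_true, y_pred, multioutput="uniform_average"):
    """Mean absolute error for 2-D inputs given as lists of rows."""
    n_samples, n_outputs = len(y_true), len(y_true[0])
    # One mean absolute error per output column.
    raw = [sum(abs(y_true[i][k] - y_pred[i][k]) for i in range(n_samples)) / n_samples
           for k in range(n_outputs)]
    if multioutput == "raw_values":
        return raw
    if multioutput == "uniform_average":
        return sum(raw) / n_outputs
    # Otherwise treat multioutput as per-output weights for a weighted average.
    return sum(w * r for w, r in zip(multioutput, raw)) / sum(multioutput)


y_true = [[0.5, 1.0], [-1.0, 1.0], [7.0, -6.0]]
y_pred = [[0.0, 2.0], [-1.0, 2.0], [8.0, -5.0]]
print(mae_multioutput(y_true, y_pred, "raw_values"))  # [0.5, 1.0]
print(mae_multioutput(y_true, y_pred))                # 0.75
```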

cuml.metrics.regression.mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average', squared=True)[source]#

Mean squared error regression loss

Be careful when using this metric with float32 inputs as the result can be slightly incorrect because of floating point precision if the input is large enough. float64 will have lower numerical error.

Parameters:

y_truearray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.

y_predarray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Estimated target values.

sample_weightarray-like (device or host) shape = (n_samples,), optional
Sample weights.

multioutputstring in [‘raw_values’, ‘uniform_average’] or array-like of shape (n_outputs) (default=’uniform_average’)
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.

‘raw_values’: Returns a full set of errors in case of multioutput input.

‘uniform_average’: Errors of all outputs are averaged with uniform weight.

squaredboolean value, optional (default = True)
If True returns MSE value, if False returns RMSE value.

Returns:

lossfloat or ndarray of floats
A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
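
The effect of the squared flag can be checked by hand on a short 1-D example. This is a plain-Python reference sketch of the definition, not cuML's implementation; the function name is hypothetical.

```python
import math


def mean_squared_error_sketch(y_true, y_pred, squared=True):
    """MSE for 1-D inputs, or RMSE (its square root) when squared=False."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return mse if squared else math.sqrt(mse)


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(mean_squared_error_sketch(y_true, y_pred))                 # 0.375
print(mean_squared_error_sketch(y_true, y_pred, squared=False))  # square root of 0.375
```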

cuml.metrics.regression.mean_squared_log_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average', squared=True)[source]#

Mean squared log error regression loss

Be careful when using this metric with float32 inputs as the result can be slightly incorrect because of floating point precision if the input is large enough. float64 will have lower numerical error.

Parameters:

y_truearray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.

y_predarray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Estimated target values.

sample_weightarray-like (device or host) shape = (n_samples,), optional
Sample weights.

multioutputstring in [‘raw_values’, ‘uniform_average’] or array-like of shape (n_outputs)
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.

‘raw_values’: Returns a full set of errors in case of multioutput input.

‘uniform_average’: Errors of all outputs are averaged with uniform weight.

squaredboolean value, optional (default = True)
If True returns MSLE (mean squared log error) value, if False returns RMSLE (root mean squared log error) value.

Returns:

lossfloat or ndarray of floats
A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
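
As a reference for what is being computed, the sketch below applies the usual definition of this loss (squared differences of log(1 + y), following scikit-learn's convention) to 1-D, non-negative inputs. That cuML matches this convention exactly is an assumption, and the function name is hypothetical.

```python
import math


def mean_squared_log_error_sketch(y_true, y_pred, squared=True):
    """Mean squared error between log(1 + y_true) and log(1 + y_pred)."""
    # log1p(v) computes log(1 + v); inputs are assumed to be non-negative.
    msle = sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)
    return msle if squared else math.sqrt(msle)


y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
print(mean_squared_log_error_sketch(y_true, y_pred))
print(mean_squared_log_error_sketch(y_true, y_pred, squared=False))
```

Because the error is measured on a log scale, this loss penalizes relative rather than absolute differences.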

cuml.metrics.regression.median_absolute_error(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average')[source]#

Median absolute error regression loss.

Median absolute error output is non-negative floating point. The best value is 0.0.

Parameters:

y_truearray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.

y_predarray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Estimated target values.

sample_weightarray-like (device or host) shape = (n_samples,), optional
Sample weights.

multioutputstring in [‘raw_values’, ‘uniform_average’] or array-like of shape (n_outputs)
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.

‘raw_values’: Returns a full set of errors in case of multioutput input.

‘uniform_average’: Errors of all outputs are averaged with uniform weight.

Returns:

lossfloat or ndarray of floats
If multioutput is ‘raw_values’, then median absolute error is returned for each output separately. If multioutput is ‘uniform_average’ or an ndarray of weights, then the weighted average of all output errors is returned.
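
A short plain-Python sketch of the unweighted 1-D definition shows why this loss is robust to outliers. It is a reference illustration only, not cuML's implementation, and the function name is hypothetical.

```python
from statistics import median


def median_absolute_error_sketch(y_true, y_pred):
    """Median of the absolute errors for 1-D inputs (no sample weights)."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


y_true = [3.0, -0.5, 2.0, 7.0]
print(median_absolute_error_sketch(y_true, [2.5, 0.0, 2.0, 8.0]))    # 0.5
# A single wildly wrong prediction leaves the median unchanged.
print(median_absolute_error_sketch(y_true, [2.5, 0.0, 2.0, 108.0]))  # 0.5
```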

cuml.metrics.regression.r2_score(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average', force_finite=True)[source]#

\(R^2\) (coefficient of determination) regression score function.

Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). In the general case when the true y is non-constant, a constant model that always predicts the average y disregarding the input features would get a \(R^2\) score of 0.0.

Parameters:

y_truearray-like of shape (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.

y_predarray-like of shape (n_samples,) or (n_samples, n_outputs)
Estimated target values.

sample_weightarray-like of shape (n_samples,)
Sample weights.

multioutput{‘raw_values’, ‘uniform_average’, ‘variance_weighted’} or array-like of shape (n_outputs,)
How to aggregate multioutput scores. One of:

‘uniform_average’: Scores of all outputs are averaged with uniform weight. This is the default.

‘variance_weighted’: Scores of all outputs are averaged, weighted by the variances of each individual output.

‘raw_values’: Full set of scores in case of multioutput input.

array-like: Weights to use when averaging scores of all outputs.

force_finitebool, default=True
Flag indicating if NaN and -Inf scores resulting from constant data should be replaced with real numbers (1.0 if prediction is perfect, 0.0 otherwise). Default is True.

Returns:

zfloat or ndarray of floats
The \(R^2\) score or ndarray of scores if ‘multioutput’ is ‘raw_values’.
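
A minimal single-output sketch of the score and of the force_finite replacement rule described above. The helper r2_reference is hypothetical (not cuML's implementation) and ignores sample weights.

```python
def r2_reference(y_true, y_pred, force_finite=True):
    """Hypothetical helper (not cuML): R^2 = 1 - SS_res / SS_tot for one output."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0.0:
        # Constant y_true: the ratio is 0/0 (NaN) or x/0 (-Inf).
        if force_finite:
            return 1.0 if ss_res == 0.0 else 0.0
        return float("nan") if ss_res == 0.0 else float("-inf")
    return 1.0 - ss_res / ss_tot


score = r2_reference([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
mean_model = r2_reference([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])   # always predicts the mean
constant_bad = r2_reference([1.0, 1.0, 1.0], [1.0, 2.0, 1.0])  # constant data, imperfect
constant_ok = r2_reference([1.0, 1.0, 1.0], [1.0, 1.0, 1.0])   # constant data, perfect
print(score, mean_model, constant_bad, constant_ok)
```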

cuml.metrics.accuracy_score(y_true, y_pred, *, sample_weight=None, normalize=True)[source]#

Accuracy classification score.

Parameters:

y_truearray-like of shape (n_samples,)
Ground truth (correct) labels.

y_predarray-like of shape (n_samples,)
Predicted labels.

sample_weightarray-like of shape (n_samples,)
Sample weights.

normalizebool
If False, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples.

Returns:

scorefloat
The fraction of correctly classified samples, or the number of correctly classified samples if normalize == False.
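
The effect of normalize can be seen in a few lines of plain Python. The helper accuracy_reference is a hypothetical name used only for illustration and does not handle sample weights.

```python
def accuracy_reference(y_true, y_pred, normalize=True):
    """Hypothetical helper (not cuML): fraction or count of exact label matches."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true) if normalize else correct


y_true = [0, 1, 2, 3]
y_pred = [0, 2, 1, 3]
fraction = accuracy_reference(y_true, y_pred)
count = accuracy_reference(y_true, y_pred, normalize=False)
print(fraction, count)
```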

cuml.metrics.confusion_matrix(y_true, y_pred, labels=None, sample_weight=None, normalize=None, convert_dtype=False) → CumlArray[source]#

Compute confusion matrix to evaluate the accuracy of a classification.

Parameters:

y_truearray-like (device or host) shape = (n_samples,)
or (n_samples, n_outputs) Ground truth (correct) target values.

y_predarray-like (device or host) shape = (n_samples,)
or (n_samples, n_outputs) Estimated target values.

labelsarray-like (device or host) shape = (n_classes,), optional
List of labels to index the matrix. This may be used to reorder or select a subset of labels. If None is given, those that appear at least once in y_true or y_pred are used in sorted order.

sample_weightarray-like (device or host) shape = (n_samples,), optional
Sample weights.

normalizestring in [‘true’, ‘pred’, ‘all’] or None (default=None)
Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population. If None, confusion matrix will not be normalized.

convert_dtypebool, optional (default=False)
When set to True, the confusion matrix method will automatically convert the predictions, ground truth, and labels arrays to np.int32.

Returns:

Carray-like (device or host) shape = (n_classes, n_classes)
Confusion matrix.
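
The normalize options can be illustrated with a small pure-Python sketch. The helper confusion_matrix_reference is hypothetical; mapping empty rows or columns to 0.0 when normalizing is an assumption, and sample weights are ignored.

```python
def confusion_matrix_reference(y_true, y_pred, labels=None, normalize=None):
    """Hypothetical helper (not cuML): rows are true labels, columns are predictions."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)
    cm = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        if t in index and p in index:
            cm[index[t]][index[p]] += 1
    if normalize is None:
        return cm

    def safe_div(value, total):
        # Assumption: an empty row or column normalizes to 0.0.
        return value / total if total else 0.0

    if normalize == "true":  # each non-empty row sums to 1
        return [[safe_div(v, sum(row)) for v in row] for row in cm]
    if normalize == "pred":  # each non-empty column sums to 1
        col_sums = [sum(row[j] for row in cm) for j in range(n)]
        return [[safe_div(v, col_sums[j]) for j, v in enumerate(row)] for row in cm]
    if normalize == "all":   # the whole matrix sums to 1
        total = sum(map(sum, cm))
        return [[safe_div(v, total) for v in row] for row in cm]
    raise ValueError("normalize must be 'true', 'pred', 'all' or None")


y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
counts = confusion_matrix_reference(y_true, y_pred)
by_row = confusion_matrix_reference(y_true, y_pred, normalize="true")
overall = confusion_matrix_reference(y_true, y_pred, normalize="all")
print(counts)
```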

cuml.metrics.kl_divergence(P, Q, handle=None, convert_dtype=True)[source]#

Calculates the Kullback-Leibler (KL) divergence. The KL divergence tells us how well the probability distribution Q approximates the probability distribution P. It is often used as a ‘distance metric’ between two probability distributions, although it is not symmetric.

Parameters:

PDense array of probabilities corresponding to distribution P
shape = (n_samples, 1) Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy.

QDense array of probabilities corresponding to distribution Q
shape = (n_samples, 1) Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy.

handlecuml.Handle

convert_dtypebool, optional (default = True)
When set to True, the method will convert P and Q to the same data type: float32. This will increase memory used for the method.

Returns:

float
The KL Divergence value
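
A short sketch of the textbook definition, the sum of p * log(p / q) over all entries, which also makes the asymmetry visible. The helper kl_divergence_reference is hypothetical, and the use of the natural logarithm is an assumption, since the base is not stated here.

```python
import math


def kl_divergence_reference(p, q):
    """Hypothetical helper (not cuML): KL(P || Q) with the natural logarithm."""
    # Terms with p_i == 0 contribute nothing by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]
forward = kl_divergence_reference(p, q)
backward = kl_divergence_reference(q, p)
print(forward, backward)  # the two directions differ
```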
cuml.metrics.log_loss(y_true, y_pred, eps=1e-15, normalize=True, sample_weight=None) → float[source]#
Log loss, aka logistic loss or cross-entropy loss. This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of a logistic model that returns y_pred probabilities for its training data y_true. The log loss is only defined for two or more labels.

Parameters:

y_truearray-like, shape = (n_samples,)

y_predarray-like of float,
shape = (n_samples, n_classes) or (n_samples,)

epsfloat (default=1e-15)
Log loss is undefined for p=0 or p=1, so probabilities are clipped to max(eps, min(1 - eps, p)).

normalizebool, optional (default=True)
If true, return the mean loss per sample. Otherwise, return the sum of the per-sample losses.

sample_weightarray-like of shape (n_samples,), default=None
Sample weights.

Returns:

lossfloat

Notes

The logarithm used is the natural logarithm (base-e).

References

C.M. Bishop (2006). Pattern Recognition and Machine Learning. Springer, p. 209.

Examples
>>> from cuml.metrics import log_loss
>>> import cupy as cp
>>> log_loss(cp.array([1, 0, 0, 1]),
...          cp.array([[.1, .9], [.9, .1], [.8, .2], [.35, .65]]))
0.21616...
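
The clipping rule and the normalize flag can be checked against the example above with a pure-Python sketch. The helper log_loss_reference is hypothetical; it assumes integer class labels starting at 0 and a two-dimensional y_pred, and it ignores sample weights.

```python
import math


def log_loss_reference(y_true, y_pred, eps=1e-15, normalize=True):
    """Hypothetical helper (not cuML): mean (or summed) negative log-likelihood."""
    losses = []
    for label, probs in zip(y_true, y_pred):
        # Clip the probability of the true class to [eps, 1 - eps].
        p = max(eps, min(1 - eps, probs[label]))
        losses.append(-math.log(p))
    total = sum(losses)
    return total / len(losses) if normalize else total


y_true = [1, 0, 0, 1]
y_pred = [[0.1, 0.9], [0.9, 0.1], [0.8, 0.2], [0.35, 0.65]]
mean_loss = log_loss_reference(y_true, y_pred)
summed_loss = log_loss_reference(y_true, y_pred, normalize=False)
print(mean_loss, summed_loss)
```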
cuml.metrics.roc_auc_score(y_true, y_score)[source]#
Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.

Note

This implementation can only be used with binary classification.

Parameters:

y_truearray-like of shape (n_samples,)
True labels. The binary case expects labels with shape (n_samples,).

y_scorearray-like of shape (n_samples,)
Target scores. In the binary cases, these can be either probability estimates or non-thresholded decision values (as returned by decision_function on some classifiers). The binary case expects a shape (n_samples,), and the scores must be the scores of the class with the greater label.

Returns:

aucfloat

Examples
>>> import numpy as np
>>> from cuml.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> print(roc_auc_score(y_true, y_scores))
0.75
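
One way to see where the 0.75 comes from: for binary labels, ROC AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The helper roc_auc_reference below is a hypothetical quadratic-time sketch of that identity, not cuML's algorithm.

```python
def roc_auc_reference(y_true, y_score):
    """Hypothetical helper (not cuML): AUC as the pairwise ranking probability."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


auc = roc_auc_reference([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 3 of the 4 positive/negative pairs are ranked correctly
```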
cuml.metrics.precision_recall_curve(y_true, probs_pred) → Tuple[CumlArray, CumlArray, CumlArray][source]#
Compute precision-recall pairs for different probability thresholds.

Note

This implementation is restricted to the binary classification task.

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples. The last precision and recall values are 1. and 0. respectively and do not have a corresponding threshold. This ensures that the graph starts on the y axis.

Read more in scikit-learn’s User Guide.

Parameters:

y_truearray, shape = [n_samples]
True binary labels, {0, 1}.

probs_predarray, shape = [n_samples]
Estimated probabilities or decision function.

Returns:

precisionarray, shape = [n_thresholds + 1]
Precision values such that element i is the precision of predictions with score >= thresholds[i] and the last element is 1.

recallarray, shape = [n_thresholds + 1]
Decreasing recall values such that element i is the recall of predictions with score >= thresholds[i] and the last element is 0.

thresholdsarray, shape = [n_thresholds <= len(np.unique(probs_pred))]
Increasing thresholds on the decision function used to compute precision and recall.

Examples
>>> import cupy as cp
>>> from cuml.metrics import precision_recall_curve
>>> y_true = cp.array([0, 0, 1, 1])
>>> y_scores = cp.array([0.1, 0.4, 0.35, 0.8])
>>> precision, recall, thresholds = precision_recall_curve(
...     y_true, y_scores)
>>> print(precision)
[0.666... 0.5  1.  1. ]
>>> print(recall)
[1. 0.5 0.5 0. ]
>>> print(thresholds)
[0.35 0.4 0.8 ]
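
The example output can be reproduced by hand with the definitions of precision and recall given above. The helper precision_recall_reference is hypothetical; dropping the thresholds below the last one that still attains full recall is an assumption chosen to match the printed output, not a statement about cuML's internals.

```python
def precision_recall_reference(y_true, scores):
    """Hypothetical helper (not cuML): brute-force precision/recall per threshold."""
    n_pos = sum(y_true)
    thresholds = sorted(set(scores))
    precision, recall = [], []
    for thr in thresholds:
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        precision.append(tp / (tp + fp))
        recall.append(tp / n_pos)
    # Assumption: keep thresholds starting at the last one with full recall.
    start = max(i for i, r in enumerate(recall) if r == 1.0)
    # Append the final (precision=1, recall=0) point, which has no threshold.
    return precision[start:] + [1.0], recall[start:] + [0.0], thresholds[start:]


precision, recall, thresholds = precision_recall_reference(
    [0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(precision, recall, thresholds)
```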
cuml.metrics.pairwise_distances.nan_euclidean_distances(X, Y=None, *, squared=False, missing_values=cp.nan, convert_dtype=True)[source]#

Calculate the euclidean distances in the presence of missing values.

Compute the euclidean distance between each pair of samples in X and Y, where Y=X is assumed if Y=None. When calculating the distance between a pair of samples, this formulation ignores feature coordinates with a missing value in either sample and scales up the weight of the remaining coordinates:

dist(x, y) = sqrt(weight * squared distance from present coordinates)

where weight = (total number of coordinates) / (number of present coordinates)

For example, the distance between [3, na, na, 6] and [1, na, 4, 5] is:

\[\sqrt{\frac{4}{2}((3-1)^2 + (6-5)^2)}\]

If all the coordinates are missing or if there are no common present coordinates then NaN is returned for that pair.

Parameters:

XDense matrix of shape (n_samples_X, n_features)
Acceptable formats: cuDF DataFrame, Pandas DataFrame, NumPy ndarray, cuda array interface compliant array like CuPy.

YDense matrix of shape (n_samples_Y, n_features), default=None
Acceptable formats: cuDF DataFrame, Pandas DataFrame, NumPy ndarray, cuda array interface compliant array like CuPy.

squaredbool, default=False
Return squared Euclidean distances.

missing_valuesnp.nan or int, default=np.nan
Representation of missing value.

Returns:

distancesndarray of shape (n_samples_X, n_samples_Y)
Returns the distances between the row vectors of X and the row vectors of Y.
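
The worked example above, the distance between [3, na, na, 6] and [1, na, 4, 5], can be verified for a single pair of samples with a pure-Python sketch. The helper nan_euclidean_reference is a hypothetical name and treats only NaN as the missing value.

```python
import math


def nan_euclidean_reference(x, y, squared=False):
    """Hypothetical helper (not cuML): NaN-aware distance between two samples."""
    # Keep only coordinates that are present in both samples.
    present = [(a, b) for a, b in zip(x, y)
               if not (math.isnan(a) or math.isnan(b))]
    if not present:
        return float("nan")  # no common present coordinates
    weight = len(x) / len(present)
    sq_dist = weight * sum((a - b) ** 2 for a, b in present)
    return sq_dist if squared else math.sqrt(sq_dist)


nan = float("nan")
d = nan_euclidean_reference([3.0, nan, nan, 6.0], [1.0, nan, 4.0, 5.0])
d_sq = nan_euclidean_reference([3.0, nan, nan, 6.0], [1.0, nan, 4.0, 5.0], squared=True)
print(d, d_sq)  # sqrt(4/2 * (2**2 + 1**2)) and its square
```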
cuml.metrics.pairwise_distances.pairwise_distances(X, Y=None, metric='euclidean', handle=None, convert_dtype=True, metric_arg=2, **kwds)[source]#
Compute the distance matrix from a vector array X and optional Y.

This method takes either one or two vector arrays, and returns a distance matrix.

If Y is given (default is None), then the returned matrix is the pairwise distance between the arrays from both X and Y.

Valid values for metric are:

From scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’].
Sparse matrices are supported; see ‘sparse_pairwise_distances’.

From scipy.spatial.distance: [‘sqeuclidean’]
See the documentation for scipy.spatial.distance for details on this metric. Sparse matrices are supported.

Parameters:

XDense or sparse matrix (device or host) of shape
(n_samples_x, n_features) Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy, or cupyx.scipy.sparse for sparse input

Yarray-like (device or host) of shape (n_samples_y, n_features), optional
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

metric{“cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, “sqeuclidean”}
The metric to use when calculating distance between instances in a feature array.

convert_dtypebool, optional (default = True)
When set to True, the method will, when necessary, convert Y to be the same data type as X if they differ. This will increase memory used for the method.

Returns:

Darray [n_samples_x, n_samples_x] or [n_samples_x, n_samples_y]
A distance matrix D such that D_{i, j} is the distance between the ith and jth vectors of the given matrix X, if Y is None. If Y is not None, then D_{i, j} is the distance between the ith array from X and the jth array from Y.

Examples
>>> import cupy as cp
>>> from cuml.metrics import pairwise_distances
>>> X = cp.array([[2.0, 3.0], [3.0, 5.0], [5.0, 8.0]])
>>> Y = cp.array([[1.0, 0.0], [2.0, 1.0]])
>>> # Euclidean Pairwise Distance, Single Input:
>>> pairwise_distances(X, metric='euclidean')
array([[0.        , 2.236..., 5.830...],
    [2.236..., 0.        , 3.605...],
    [5.830..., 3.605..., 0.        ]])
>>> # Cosine Pairwise Distance, Multi-Input:
>>> pairwise_distances(X, Y, metric='cosine')
array([[0.445... , 0.131...],
    [0.485..., 0.156...],
    [0.470..., 0.146...]])
>>> # Manhattan Pairwise Distance, Multi-Input:
>>> pairwise_distances(X, Y, metric='manhattan')
array([[ 4.,  2.],
    [ 7.,  5.],
    [12., 10.]])
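
To make the meaning of D_{i, j} concrete, here is a CPU-only sketch that rebuilds the example matrices above with nested loops. All helper names are hypothetical; the sketch only mirrors the definitions of the three metrics and says nothing about how cuML computes them on the GPU.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return 1.0 - dot / (math.hypot(*x) * math.hypot(*y))


def pairwise_reference(X, Y=None, metric=euclidean):
    """Hypothetical helper: D[i][j] = metric(X[i], Y[j]); Y defaults to X."""
    Y = X if Y is None else Y
    return [[metric(x, y) for y in Y] for x in X]


X = [[2.0, 3.0], [3.0, 5.0], [5.0, 8.0]]
Y = [[1.0, 0.0], [2.0, 1.0]]
D_euclidean = pairwise_reference(X)
D_cosine = pairwise_reference(X, Y, metric=cosine)
D_manhattan = pairwise_reference(X, Y, metric=manhattan)
print(D_manhattan)
```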
cuml.metrics.pairwise_distances.sparse_pairwise_distances(X, Y=None, metric='euclidean', handle=None, convert_dtype=True, metric_arg=2, **kwds)[source]#
Compute the distance matrix from a vector array X and optional Y.

This method takes either one or two sparse vector arrays, and returns a dense distance matrix.

If Y is given (default is None), then the returned matrix is the pairwise distance between the arrays from both X and Y.

Valid values for metric are:

From scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’].

From scipy.spatial.distance: [‘sqeuclidean’, ‘canberra’, ‘minkowski’, ‘jaccard’, ‘chebyshev’, ‘dice’]
See the documentation for scipy.spatial.distance for details on these metrics.

[‘inner_product’, ‘hellinger’]

Parameters:

Xarray-like (device or host) of shape (n_samples_x, n_features)
Acceptable formats: SciPy or CuPy sparse array

Yarray-like (device or host) of shape (n_samples_y, n_features), optional
Acceptable formats: SciPy or CuPy sparse array

metric{“cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, “sqeuclidean”, “canberra”, “lp”, “inner_product”, “minkowski”, “jaccard”, “hellinger”, “chebyshev”, “linf”, “dice”}
The metric to use when calculating distance between instances in a feature array.

convert_dtypebool, optional (default = True)
When set to True, the method will, when necessary, convert Y to be the same data type as X if they differ. This will increase memory used for the method.

metric_argfloat, optional (default = 2)
Additional metric-specific argument. For Minkowski it’s the p-norm to apply.

Returns:

Darray [n_samples_x, n_samples_x] or [n_samples_x, n_samples_y]
A dense distance matrix D such that D_{i, j} is the distance between the ith and jth vectors of the given matrix X, if Y is None. If Y is not None, then D_{i, j} is the distance between the ith array from X and the jth array from Y.

Examples
>>> import cupyx
>>> from cuml.metrics import sparse_pairwise_distances

>>> X = cupyx.scipy.sparse.random(2, 3, density=0.5, random_state=9)
>>> Y = cupyx.scipy.sparse.random(1, 3, density=0.5, random_state=9)
>>> X.todense()
array([[0.8098..., 0.537..., 0. ],
    [0.        , 0.856..., 0. ]])
>>> Y.todense()
array([[0.        , 0.        , 0.993...]])
>>> # Cosine Pairwise Distance, Single Input:
>>> sparse_pairwise_distances(X, metric='cosine')
array([[0.      , 0.447...],
    [0.447..., 0.        ]])

>>> # Squared euclidean Pairwise Distance, Multi-Input:
>>> sparse_pairwise_distances(X, Y, metric='sqeuclidean')
array([[1.931...],
    [1.720...]])

>>> # Canberra Pairwise Distance, Multi-Input:
>>> sparse_pairwise_distances(X, Y, metric='canberra')
array([[3.],
    [2.]])
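
The Canberra result above follows directly from the sparsity pattern: every coordinate where exactly one of the two vectors is non-zero contributes 1, and coordinates where both are zero contribute 0. The sketch below uses dense Python lists and hypothetical helper names, with values rounded from the example; it also shows how a Minkowski p-norm argument, as passed through metric_arg, changes the distance.

```python
def canberra_reference(x, y):
    """Hypothetical helper: sum of |a - b| / (|a| + |b|), skipping 0/0 terms."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


def minkowski_reference(x, y, p=2):
    """Hypothetical helper: p-norm of the difference (p plays the role of metric_arg)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


X = [[0.8098, 0.537, 0.0], [0.0, 0.856, 0.0]]  # rounded values, for illustration
Y = [[0.0, 0.0, 0.993]]
canberra = [[canberra_reference(x, y) for y in Y] for x in X]
print(canberra)
print(minkowski_reference([0.0, 0.0], [3.0, 4.0], p=1),
      minkowski_reference([0.0, 0.0], [3.0, 4.0], p=2))
```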
cuml.metrics.pairwise_kernels.pairwise_kernels(X, Y=None, metric='linear', *, filter_params=False, convert_dtype=True, **kwds)[source]#
Compute the kernel between arrays X and optional array Y.

This method takes either a vector array or a kernel matrix, and returns a kernel matrix. If the input is a vector array, the kernels are computed. If the input is a kernel matrix, it is returned instead.

This method provides a safe way to take a kernel matrix as input, while preserving compatibility with many other algorithms that take a vector array.

If Y is given (default is None), then the returned matrix is the pairwise kernel between the arrays from both X and Y.

Valid values for metric are: [‘additive_chi2’, ‘chi2’, ‘linear’, ‘poly’, ‘polynomial’, ‘rbf’, ‘laplacian’, ‘sigmoid’, ‘cosine’]

Parameters:

XDense matrix (device or host) of shape (n_samples_X, n_samples_X) or (n_samples_X, n_features)
Array of pairwise kernels between samples, or a feature array. The shape of the array should be (n_samples_X, n_samples_X) if metric == “precomputed” and (n_samples_X, n_features) otherwise. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

YDense matrix (device or host) of shape (n_samples_Y, n_features), default=None
A second feature array only if X has shape (n_samples_X, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

metricstr or callable (numba device function), default=”linear”
The metric to use when calculating kernel between instances in a feature array. If metric is “precomputed”, X is assumed to be a kernel matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two rows from X as input and return the corresponding kernel value as a single number.

filter_paramsbool, default=False
Whether to filter invalid parameters or not.

convert_dtypebool, optional (default = True)
When set to True, the method will, when necessary, convert Y to be the same data type as X if they differ. This will increase memory used for the method.

**kwdsoptional keyword parameters
Any further parameters are passed directly to the kernel function.

Returns:

Kndarray of shape (n_samples_X, n_samples_X) or (n_samples_X, n_samples_Y)
A kernel matrix K such that K_{i, j} is the kernel between the ith and jth vectors of the given matrix X, if Y is None. If Y is not None, then K_{i, j} is the kernel between the ith array from X and the jth array from Y.

Notes

If metric is ‘precomputed’, Y is ignored and X is returned.

Examples
>>> import cupy as cp
>>> from cuml.metrics import pairwise_kernels
>>> from numba import cuda
>>> import math

>>> X = cp.array([[2, 3], [3, 5], [5, 8]])
>>> Y = cp.array([[1, 0], [2, 1]])

>>> pairwise_kernels(X, Y, metric='linear')
array([[ 2,  7],
    [ 3, 11],
    [ 5, 18]])
>>> @cuda.jit(device=True)
... def custom_rbf_kernel(x, y, gamma=None):
...     if gamma is None:
...         gamma = 1.0 / len(x)
...     sum = 0.0
...     for i in range(len(x)):
...         sum += (x[i] - y[i]) ** 2
...     return math.exp(-gamma * sum)

>>> pairwise_kernels(X, Y, metric=custom_rbf_kernel)
array([[6.73794700e-03, 1.35335283e-01],
    [5.04347663e-07, 2.03468369e-04],
    [4.24835426e-18, 2.54366565e-13]])
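
The contract for a callable metric, which is called on each pair of rows and returns a single number, can be mimicked on the CPU as below. The names are hypothetical and the loop is only a reference for the semantics; the real function expects a Numba device function and runs on the GPU.

```python
import math


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=None):
    if gamma is None:
        gamma = 1.0 / len(x)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def pairwise_kernels_reference(X, Y=None, metric=linear_kernel, **kwds):
    """Hypothetical helper: K[i][j] = metric(X[i], Y[j], **kwds)."""
    if metric == "precomputed":
        return X  # X is already a kernel matrix; Y is ignored
    Y = X if Y is None else Y
    return [[metric(x, y, **kwds) for y in Y] for x in X]


X = [[2, 3], [3, 5], [5, 8]]
Y = [[1, 0], [2, 1]]
K_linear = pairwise_kernels_reference(X, Y)
K_rbf = pairwise_kernels_reference(X, Y, metric=rbf_kernel)
K_rbf_wide = pairwise_kernels_reference(X, Y, metric=rbf_kernel, gamma=0.01)
print(K_linear)
```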

Metrics (clustering and manifold learning)#

cuml.metrics.trustworthiness.trustworthiness(X, X_embedded, handle=None, n_neighbors=5, metric='euclidean', convert_dtype=True, batch_size=512) → float[source]#

Expresses to what extent the local structure is retained in the embedding. The score is defined in the range [0, 1].

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features)
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

X_embeddedarray-like (device or host) shape = (n_samples, n_features)
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

n_neighborsint, optional (default=5)
Number of neighbors considered

metricstr in [‘euclidean’] (default=’euclidean’)
Metric used to compute the trustworthiness. For the moment only ‘euclidean’ is supported.

convert_dtypebool, optional (default=True)
When set to True, the trustworthiness method will automatically convert the inputs to np.float32.

batch_sizeint (default=512)
The number of samples to use for each batch.

Returns:

trustworthiness scoredouble
Trustworthiness of the low-dimensional embedding
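
For intuition, the commonly used definition of trustworthiness penalizes points that are among the k nearest neighbors in the embedding but rank beyond k in the original space: T(k) = 1 - 2 / (n k (2n - 3k - 1)) * sum of max(0, rank - k). The brute-force sketch below implements that definition with Euclidean distances; the helper name is hypothetical, and it is an assumption that cuML's batched implementation follows the same formula.

```python
import math


def trustworthiness_reference(X, X_embedded, n_neighbors=5):
    """Hypothetical helper (not cuML): brute-force trustworthiness, needs k < n / 2."""
    n, k = len(X), n_neighbors

    def neighbors_by_distance(points, i):
        others = [j for j in range(n) if j != i]
        return sorted(others, key=lambda j: math.dist(points[i], points[j]))

    penalty = 0
    for i in range(n):
        # Rank (1 = nearest) of every other point in the original space.
        rank = {j: r for r, j in enumerate(neighbors_by_distance(X, i), start=1)}
        # Penalize embedded neighbors that were not among the k nearest originally.
        for j in neighbors_by_distance(X_embedded, i)[:k]:
            penalty += max(0, rank[j] - k)
    return 1.0 - penalty * 2.0 / (n * k * (2 * n - 3 * k - 1))


X = [[0.0], [1.0], [3.0], [6.0], [10.0], [15.0]]
scrambled = [[0.0], [15.0], [1.0], [10.0], [3.0], [6.0]]
t_same = trustworthiness_reference(X, X, n_neighbors=2)
t_scrambled = trustworthiness_reference(X, scrambled, n_neighbors=2)
print(t_same, t_scrambled)
```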

cuml.metrics.cluster.adjusted_rand_index.adjusted_rand_score(labels_true, labels_pred, handle=None, convert_dtype=True) → float[source]#

Adjusted_rand_score is a clustering similarity metric based on the Rand index and is corrected for chance.

Parameters:

labels_trueGround truth labels to be used as a reference

labels_predArray of predicted labels used to evaluate the model

handlecuml.Handle

Returns:

float
The adjusted rand index value between -1.0 and 1.0
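
The chance correction can be sketched from the contingency table of the two labelings using the standard pair-counting formula. The helper adjusted_rand_reference is hypothetical, and returning 1.0 in the degenerate case where the denominator vanishes is an assumption.

```python
from collections import Counter
from math import comb


def adjusted_rand_reference(labels_true, labels_pred):
    """Hypothetical helper (not cuML): pair-counting adjusted Rand index."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)   # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # assumption: trivial labelings count as identical
    return (sum_cells - expected) / (max_index - expected)


permuted = adjusted_rand_reference([0, 0, 1, 1], [1, 1, 0, 0])
partial = adjusted_rand_reference([0, 0, 1, 1], [0, 0, 1, 2])
print(permuted, partial)
```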

cuml.metrics.cluster.entropy.cython_entropy(clustering, base=None, handle=None) → float[source]#

Computes the entropy of a distribution for given probability values.

Parameters:

clusteringarray-like (device or host) shape = (n_samples,)
Clustering of labels. Probabilities are computed based on occurrences of labels. For instance, to represent a fair coin (2 equally possible outcomes), the clustering could be [0,1]. For a biased coin with 2/3 probability for tail, the clustering could be [0, 0, 1].

basefloat, optional
The logarithmic base to use, defaults to e (natural logarithm).

handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

Returns:

Sfloat
The calculated entropy.
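
The coin examples above translate directly into label counts. The sketch below uses a hypothetical helper, entropy_reference, that derives probabilities from label occurrences and applies the optional logarithm base.

```python
import math
from collections import Counter


def entropy_reference(clustering, base=None):
    """Hypothetical helper (not cuML): entropy of the empirical label distribution."""
    n = len(clustering)
    h = -sum((c / n) * math.log(c / n) for c in Counter(clustering).values())
    return h if base is None else h / math.log(base)


fair_coin = entropy_reference([0, 1])            # natural log: ln(2)
fair_coin_bits = entropy_reference([0, 1], base=2)
biased_coin = entropy_reference([0, 0, 1])       # 2/3 versus 1/3
print(fair_coin, fair_coin_bits, biased_coin)
```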

cuml.metrics.cluster.homogeneity_score.cython_homogeneity_score(labels_true, labels_pred, handle=None) → float[source]#

Computes the homogeneity metric of a cluster labeling given a ground truth.

A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class.

This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.

This metric is not symmetric: switching labels_true with labels_pred will return the completeness_score which will be different in general.

The labels in labels_pred and labels_true are assumed to be drawn from a contiguous set (Ex: drawn from {2, 3, 4}, but not from {2, 4}). If your set of labels looks like {2, 4}, convert them to something like {0, 1}.

Parameters:

labels_predarray-like (device or host) shape = (n_samples,)
The labels predicted by the model for the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

labels_truearray-like (device or host) shape = (n_samples,)
The ground truth labels (ints) of the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

Returns:

float
The homogeneity of the predicted labeling given the ground truth. Score between 0.0 and 1.0. 1.0 stands for perfectly homogeneous labeling.
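
Homogeneity is conventionally defined as 1 - H(C | K) / H(C), where C are the true classes and K the predicted clusters. The sketch below implements that conventional definition with hypothetical helper names, and swapping the arguments shows the asymmetry described above; that cuML follows exactly this formula is an assumption.

```python
import math
from collections import Counter


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(a, b):
    """H(a | b) from joint label counts."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((c / n) * math.log(c / b_counts[bv]) for (_, bv), c in joint.items())


def homogeneity_reference(labels_true, labels_pred):
    """Hypothetical helper (not cuML): 1 - H(C | K) / H(C)."""
    h_c = _entropy(labels_true)
    if h_c == 0.0:
        return 1.0  # a single class is trivially homogeneous
    return 1.0 - _conditional_entropy(labels_true, labels_pred) / h_c


# Every predicted cluster is pure, so homogeneity is perfect ...
pure = homogeneity_reference([0, 0, 1, 1], [0, 1, 2, 3])
# ... but swapping the arguments gives a different value (the completeness).
swapped = homogeneity_reference([0, 1, 2, 3], [0, 0, 1, 1])
print(pure, swapped)
```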

cuml.metrics.cluster.silhouette_score.cython_silhouette_samples(X, labels, metric='euclidean', chunksize=None, convert_dtype=True, handle=None)[source]#

Calculate the silhouette coefficient for each sample in the provided data.

Given a set of cluster labels for every sample in the provided data, compute the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The silhouette coefficient for a sample is then (b - a) / max(a, b).

Parameters:

Xarray-like, shape = (n_samples, n_features)
The feature vectors for all samples.

labelsarray-like, shape = (n_samples,)
The assigned cluster labels for each sample.

metricstring
A string representation of the distance metric to use for evaluating the silhouette score. Available options are “cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, and “sqeuclidean”.

chunksizeinteger (default = None)
An integer, 1 <= chunksize <= n_samples to tile the pairwise distance matrix computations, so as to reduce the quadratic memory usage of having the entire pairwise distance matrix in GPU memory. If None, chunksize will automatically be set to 40000, which through experiments has proved to be a safe number for the computation to run on a GPU with 16 GB VRAM.

handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
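
The per-sample coefficient (b - a) / max(a, b) can be computed by brute force as follows. The helper name is hypothetical, Euclidean distance is used, and assigning 0.0 to samples in singleton clusters is an assumption borrowed from common practice.

```python
import math


def silhouette_samples_reference(X, labels):
    """Hypothetical helper (not cuML): brute-force silhouette coefficient per sample."""
    scores = []
    for i, xi in enumerate(X):
        # Collect distances from sample i to all other samples, grouped by cluster.
        dists = {}
        for j, xj in enumerate(X):
            if j != i:
                dists.setdefault(labels[j], []).append(math.dist(xi, xj))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            scores.append(0.0)  # assumption: singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)                             # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in dists.values())    # mean nearest-cluster distance
        scores.append((b - a) / max(a, b))
    return scores


X = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
samples = silhouette_samples_reference(X, labels)
mean_score = sum(samples) / len(samples)
print(samples, mean_score)
```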

cuml.metrics.cluster.silhouette_score.cython_silhouette_score(X, labels, metric='euclidean', chunksize=None, convert_dtype=True, handle=None)[source]#

Calculate the mean silhouette coefficient for the provided data.

Given a set of cluster labels for every sample in the provided data, compute the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The silhouette coefficient for a sample is then (b - a) / max(a, b).

Parameters:

Xarray-like, shape = (n_samples, n_features)
The feature vectors for all samples.

labelsarray-like, shape = (n_samples,)
The assigned cluster labels for each sample.

metricstring
A string representation of the distance metric to use for evaluating the silhouette score. Available options are “cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, and “sqeuclidean”.

chunksizeinteger (default = None)
An integer, 1 <= chunksize <= n_samples to tile the pairwise distance matrix computations, so as to reduce the quadratic memory usage of having the entire pairwise distance matrix in GPU memory. If None, chunksize will automatically be set to 40000, which through experiments has proved to be a safe number for the computation to run on a GPU with 16 GB VRAM.

handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

cuml.metrics.cluster.completeness_score.cython_completeness_score(labels_true, labels_pred, handle=None) → float[source]#

Completeness metric of a cluster labeling given a ground truth.

A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.

This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.

This metric is not symmetric: switching labels_true with labels_pred will return the homogeneity_score which will be different in general.

The labels in labels_pred and labels_true are assumed to be drawn from a contiguous set (Ex: drawn from {2, 3, 4}, but not from {2, 4}). If your set of labels looks like {2, 4}, convert them to something like {0, 1}.

Parameters:

labels_predarray-like (device or host) shape = (n_samples,)
The labels predicted by the model for the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

labels_truearray-like (device or host) shape = (n_samples,)
The ground truth labels (ints) of the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

Returns:

float
The completeness of the predicted labeling given the ground truth. Score between 0.0 and 1.0. 1.0 stands for perfectly complete labeling.
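
Since the labels are assumed to come from a contiguous set, non-contiguous labels such as {2, 4} need to be remapped before calling these metrics. A minimal sketch of such a remapping on a plain Python list, using a hypothetical helper name:

```python
def make_contiguous(labels):
    """Hypothetical helper: map sorted unique labels to 0, 1, 2, ..."""
    mapping = {value: idx for idx, value in enumerate(sorted(set(labels)))}
    return [mapping[value] for value in labels]


remapped = make_contiguous([2, 4, 4, 2])
print(remapped)
```

Because the metric is invariant to permutations of label values, this remapping does not change the score.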

cuml.metrics.cluster.mutual_info_score.cython_mutual_info_score(labels_true, labels_pred, handle=None) → float[source]#

Computes the Mutual Information between two clusterings.

The Mutual Information is a measure of the similarity between two labelings of the same data.

This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.

This metric is furthermore symmetric: switching labels_true with labels_pred will return the same score value. This can be useful to measure the agreement of two independent label assignment strategies on the same dataset when the real ground truth is not known.

The labels in labels_pred and labels_true are assumed to be drawn from a contiguous set (Ex: drawn from {2, 3, 4}, but not from {2, 4}). If your set of labels looks like {2, 4}, convert them to something like {0, 1}.

Parameters:

handlecuml.Handle

labels_predarray-like (device or host) shape = (n_samples,)
A clustering of the data (ints) into disjoint subsets. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

labels_truearray-like (device or host) shape = (n_samples,)
A clustering of the data (ints) into disjoint subsets. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns:

float
Mutual information, a non-negative value
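
The symmetry and non-negativity can be checked with a direct implementation of the usual definition, the sum over label pairs of p(t, p) * log(p(t, p) / (p(t) p(p))). The helper mutual_info_reference is hypothetical and uses the natural logarithm, which is an assumption.

```python
import math
from collections import Counter


def mutual_info_reference(labels_true, labels_pred):
    """Hypothetical helper (not cuML): mutual information from joint label counts."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    return sum(
        (c / n) * math.log(n * c / (true_counts[t] * pred_counts[p]))
        for (t, p), c in joint.items()
    )


identical = mutual_info_reference([0, 0, 1, 1], [1, 1, 0, 0])    # same partition
independent = mutual_info_reference([0, 0, 1, 1], [0, 1, 0, 1])  # unrelated partitions
a, b = [0, 0, 1, 1, 2], [0, 1, 1, 1, 0]
print(identical, independent)
```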

cython_v_measure(labels_true, labels_pred, beta=1.0, handle=None) -> float

V-measure metric of a cluster labeling given a ground truth.

The V-measure is the harmonic mean between homogeneity and completeness:
v = (1 + beta) * homogeneity * completeness
     / (beta * homogeneity + completeness)
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.

This metric is furthermore symmetric: switching labels_true with labels_pred will return the same score value. This can be useful to measure the agreement of two independent label assignment strategies on the same dataset when the real ground truth is not known.

Parameters:

labels_predarray-like (device or host) shape = (n_samples,)
The labels predicted by the model for the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

labels_truearray-like (device or host) shape = (n_samples,)
The ground truth labels (ints) of the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

betafloat, default=1.0
Ratio of weight attributed to homogeneity vs completeness. If beta is greater than 1, completeness is weighted more strongly in the calculation. If beta is less than 1, homogeneity is weighted more strongly.

handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

Returns:

v_measure_valuefloat
Score between 0.0 and 1.0. 1.0 stands for perfectly complete labeling.
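
The role of beta in the formula above can be seen by evaluating it for fixed homogeneity and completeness values. The helper v_measure_from_parts is hypothetical; it takes the two component scores as inputs rather than computing them from labels, and returning 0.0 for a zero denominator is an assumption.

```python
def v_measure_from_parts(homogeneity, completeness, beta=1.0):
    """Hypothetical helper: combine the two scores using the formula above."""
    denominator = beta * homogeneity + completeness
    if denominator == 0.0:
        return 0.0  # assumption: both scores are zero
    return (1 + beta) * homogeneity * completeness / denominator


h, c = 1.0, 0.5
balanced = v_measure_from_parts(h, c)                # harmonic mean of h and c
favor_completeness = v_measure_from_parts(h, c, beta=2.0)
favor_homogeneity = v_measure_from_parts(h, c, beta=0.5)
print(balanced, favor_completeness, favor_homogeneity)
```

With completeness lower than homogeneity in this example, raising beta pulls the score down toward completeness and lowering it pulls the score up toward homogeneity.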

Benchmarking#

class cuml.benchmark.algorithms.AlgorithmPair(cpu_class, cuml_class, shared_args, cuml_args={}, cpu_args={}, name=None, accepts_labels=True, cpu_data_prep_hook=None, cuml_data_prep_hook=None, accuracy_function=None, bench_func=<function fit>, setup_cpu_func=None, setup_cuml_func=None)[source]#

Wraps a cuML algorithm and (optionally) a CPU-based algorithm (typically from scikit-learn, although any implementation works as long as it offers fit and predict or transform methods). Provides mechanisms to run each version with default arguments. If no CPU-based version of the algorithm is available, pass None for cpu_class when instantiating.

Parameters:

cpu_classclass
Class for CPU version of algorithm. Set to None if not available.

cuml_classclass
Class for cuML algorithm

shared_argsdict
Arguments passed to both implementations’s initializer

cuml_argsdict
Arguments only passed to cuml’s initializer

cpu_args dict
Arguments only passed to sklearn’s initializer

accepts_labelsboolean
If True, the fit method expects both X and y inputs. Otherwise, it expects only an X input.

data_prep_hookfunction (data -> data)
Optional function to run on input data before passing to fit

accuracy_functionfunction (y_test, y_pred)
Function that returns a scalar representing accuracy

bench_funcfunction
Custom function to perform fit/predict/transform calls.

Methods

run_cpu(data[, bench_args])

Runs the cpu-based algorithm's fit method on specified data

run_cuml(data[, bench_args])

Runs the cuml-based algorithm's fit method on specified data

setup_cpu

setup_cuml

run_cpu(data, bench_args={}, **override_setup_args)[source]#

Runs the cpu-based algorithm’s fit method on specified data

run_cuml(data, bench_args={}, **override_setup_args)[source]#

Runs the cuml-based algorithm’s fit method on specified data

cuml.benchmark.algorithms.algorithm_by_name(name)[source]#

Returns the algorithm pair with the name ‘name’ (case-insensitive)

cuml.benchmark.algorithms.all_algorithms()[source]#

Returns all defined AlgorithmPair objects

Wrappers to run ML benchmarks

class cuml.benchmark.runners.AccuracyComparisonRunner(bench_rows, bench_dims, dataset_name='blobs', input_type='numpy', test_fraction=0.1, n_reps=1)[source]#

Wrapper to run an algorithm with multiple dataset sizes and compute accuracy and speedup of cuml relative to sklearn baseline.

class cuml.benchmark.runners.BenchmarkTimer(reps=1)[source]#

Provides a timing helper whose benchmark_runs() iterator runs a code block reps times and records the duration of each run in the instance variable timings. Use like:

import numpy as np
timer = BenchmarkTimer(reps=5)
for _ in timer.benchmark_runs():
    ...  # do something
print(np.min(timer.timings))

Methods

benchmark_runs
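The timing pattern described for BenchmarkTimer can be sketched with a plain generator. The class below is a hypothetical stand-in written for illustration (SimpleTimer is not part of cuML), assuming each loop iteration is timed from the point the generator yields until the caller asks for the next run.

```python
import time


class SimpleTimer:
    """Hypothetical stand-in illustrating a generator-based benchmark timer."""

    def __init__(self, reps=1):
        self.reps = reps
        self.timings = []

    def benchmark_runs(self):
        for rep in range(self.reps):
            start = time.perf_counter()
            yield rep  # the caller's loop body executes while we are suspended here
            self.timings.append(time.perf_counter() - start)


timer = SimpleTimer(reps=3)
for _ in timer.benchmark_runs():
    sum(range(10_000))  # workload being timed
fastest = min(timer.timings)
```

Taking the minimum over repetitions, as in the usage snippet above, reduces the influence of one-off warm-up costs.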
class cuml.benchmark.runners.SpeedupComparisonRunner(bench_rows, bench_dims, dataset_name='blobs', input_type='numpy', n_reps=1)[source]#

Wrapper to run an algorithm with multiple dataset sizes and compute speedup of cuml relative to sklearn baseline.

Methods

run

cuml.benchmark.runners.run_variations(algos, dataset_name, bench_rows, bench_dims, param_override_list=[{}], cuml_param_override_list=[{}], cpu_param_override_list=[{}], dataset_param_override_list=[{}], dtype=<class 'numpy.float32'>, input_type='numpy', test_fraction=0.1, run_cpu=True, raise_on_error=False, n_reps=1)[source]#

Runs each algo in algos once per bench_rows X bench_dims X param_override_list X cuml_param_override_list combination and returns a dataframe containing timing and accuracy data.

Parameters:

algosstr or list
Name of algorithms to run and evaluate

dataset_namestr
Name of dataset to use

bench_rowslist of int
Dataset row counts to test

bench_dimslist of int
Dataset column counts to test

param_override_listlist of dict
Dicts containing parameters to pass to __init__. Each dict specifies parameters to override in one run of the algorithm.

cuml_param_override_listlist of dict
Dicts containing parameters to pass to __init__ of the cuml algo only.

cpu_param_override_listlist of dict
Dicts containing parameters to pass to __init__ of the cpu algo only.

dataset_param_override_listlist of dict
Dicts containing parameters to pass to dataset generator function

dtype{np.float32, np.float64}
Specifies the dataset precision to be used for benchmarking.

test_fractionfloat
The fraction of data to use for testing.

run_cpuboolean
If True, run the cpu-based algorithm for comparison
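As a rough illustration of how many runs a call produces, the sketch below enumerates the combinations named in the description with itertools.product. The function enumerate_runs and the dictionary keys are hypothetical; only the four lists mentioned in the description are crossed here, and the real function may iterate over additional override lists.

```python
from itertools import product


def enumerate_runs(bench_rows, bench_dims,
                   param_override_list=({},),
                   cuml_param_override_list=({},)):
    """Yield one run description per combination (illustrative only)."""
    for n_rows, n_dims, shared, cuml_only in product(
        bench_rows, bench_dims, param_override_list, cuml_param_override_list
    ):
        yield {
            "n_samples": n_rows,
            "n_features": n_dims,
            "shared_overrides": shared,
            "cuml_overrides": cuml_only,
        }


# 2 row counts x 2 column counts x 2 shared overrides x 1 cuml override
runs = list(enumerate_runs(
    bench_rows=[1000, 10000],
    bench_dims=[16, 64],
    param_override_list=[{"n_clusters": 4}, {"n_clusters": 8}],
))
```

The run count grows multiplicatively, so short lists are usually enough for a first benchmark sweep.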

Data generators for cuML benchmarks

The main entry point for consumers is gen_data, which wraps the underlying data generators.

Notes when writing new generators:

Each generator is a function that accepts:

n_samples (set to 0 for ‘default’)

n_features (set to 0 for ‘default’)

random_state

(and optional generator-specific parameters)

The function should return a 2-tuple (X, y), where X is a Pandas dataframe and y is a Pandas series. If the generator does not produce labels, it can return (X, None)

A set of helper functions (convert_*) can convert these to alternative formats. Future revisions may support generating cudf dataframes or GPU arrays directly instead.
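A generator following the contract above might look like the sketch below. The function name, the default sizes used when 0 is passed, and the noise parameter are all assumptions made for illustration; how a new generator is registered with gen_data is not shown here.

```python
import numpy as np
import pandas as pd


def gen_data_noisy_sign(n_samples, n_features, random_state=42, noise=0.1):
    """Hypothetical generator: Gaussian features, label = sign of a noisy row sum."""
    n_samples = n_samples or 1000  # 0 means "use the default" (assumed default)
    n_features = n_features or 8   # 0 means "use the default" (assumed default)
    rng = np.random.default_rng(random_state)
    X = rng.normal(size=(n_samples, n_features)).astype(np.float32)
    y = (X.sum(axis=1) + noise * rng.normal(size=n_samples) > 0).astype(np.int32)
    # The contract asks for a Pandas dataframe and a Pandas series.
    return pd.DataFrame(X), pd.Series(y)


X, y = gen_data_noisy_sign(0, 0)
```

A generator without labels would return (X, None) instead of a series.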

cuml.benchmark.datagen.gen_data(dataset_name, dataset_format, n_samples=0, n_features=0, test_fraction=0.0, datasets_root_dir='.', dtype=<class 'numpy.float32'>, **kwargs)[source]#

Returns a tuple of data from the specified generator.

Parameters:

dataset_namestr
Dataset to use. Can be a synthetic generator (blobs or regression) or a specified dataset (higgs currently, others coming soon)

dataset_formatstr
Type of data to return. (One of cudf, numpy, pandas, gpuarray)

n_samplesint
Total number of samples to load, including training and testing samples

test_fractionfloat
Fraction of the dataset to partition randomly into the test set. If this is 0.0, no test set will be created.

Returns:

(train_features, train_labels, test_features, test_labels) tuple

containing matrices or dataframes of the requested format.

test_features and test_labels may be None if no splitting was done.

Regression and Classification#

Linear Regression#

class cuml.LinearRegression(*, algorithm='eig', fit_intercept=True, copy_X=True, normalize=False, handle=None, verbose=False, output_type=None)#

LinearRegression is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.

cuML’s LinearRegression expects either a cuDF DataFrame or a NumPy matrix and provides several algorithms, including SVD and Eig, to fit a linear model (see the algorithm parameter). SVD is more stable, but Eig (default) is much faster.

Parameters:

algorithm{‘svd’, ‘eig’, ‘qr’, ‘svd-qr’, ‘svd-jacobi’}, (default = ‘eig’)

Choose an algorithm:

‘svd’ - alias for svd-jacobi;

‘eig’ - use an eigendecomposition of the covariance matrix;

‘qr’ - use QR decomposition algorithm and solve Rx = Q^T y

‘svd-qr’ - compute SVD decomposition using QR algorithm

‘svd-jacobi’ - compute SVD decomposition using Jacobi iterations.

Among these algorithms, only ‘svd-jacobi’ supports the case when the number of features is larger than the sample size; this algorithm is force-selected automatically in such a case.

For the broad range of inputs, ‘eig’ and ‘qr’ are usually the fastest, followed by ‘svd-jacobi’ and then ‘svd-qr’. In theory, SVD-based algorithms are more stable.

fit_interceptboolean (default = True)

If True, LinearRegression tries to correct for the global mean of y. If False, the model expects that you have centered the data.

copy_Xbool, default=True

If True, it is guaranteed that a copy of X is created, leaving the original X unchanged. However, if set to False, X may be modified directly, which would reduce the memory usage of the estimator.

normalizeboolean (default = False)

This parameter is ignored when fit_intercept is set to False. If True, the predictors in X will be normalized by dividing by the column-wise standard deviation. If False, no scaling will be done. Note: this is in contrast to sklearn’s deprecated normalize flag, which divides by the column-wise L2 norm; but this is the same as if using sklearn’s StandardScaler.

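handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.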

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

Attributes:

coef_array, shape (n_features): The estimated coefficients for the linear regression model.
intercept_array: The independent term. If fit_intercept is False, will be 0.

Methods

fit(self, X, y[, sample_weight, convert_dtype])

Fit the model with X and y.

Notes

LinearRegression suffers from multicollinearity (when columns are correlated with each other) and from variance explosions caused by outliers. Consider using Ridge Regression to address multicollinearity, and consider first running DBSCAN to remove outliers, or applying statistical analysis to filter possible outliers.

Applications of LinearRegression

LinearRegression is used in regression tasks where one wants to predict, say, sales or house prices. It is also used in extrapolation or time series tasks, dynamic systems modelling and many other machine learning tasks. This model should be tried first if the machine learning problem is a regression task (predicting a continuous variable).

For additional information, see scikit-learn’s OLS documentation.

For an additional example see the OLS notebook.
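To give a feel for what the 'eig' option refers to, here is a minimal NumPy sketch that solves the normal equations through an eigendecomposition of X^T X. It is a CPU illustration under the assumption that X has full column rank; the function name is hypothetical and this is not cuML's GPU code path.

```python
import numpy as np


def fit_ols_eig(X, y, fit_intercept=True):
    """Least squares via eigendecomposition of X^T X (illustrative sketch)."""
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if fit_intercept:
        x_mean, y_mean = X.mean(axis=0), y.mean()
        Xc, yc = X - x_mean, y - y_mean
    else:
        x_mean, y_mean = np.zeros(X.shape[1]), 0.0
        Xc, yc = X, y
    # X^T X is symmetric, so eigh returns real eigenvalues and orthonormal eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
    # Solve (X^T X) w = X^T y as w = V diag(1 / lambda) V^T X^T y.
    coef = eigvecs @ ((eigvecs.T @ (Xc.T @ yc)) / eigvals)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


X = [[1, 1], [1, 2], [2, 2], [2, 3]]
y = [6, 8, 9, 11]
coef, intercept = fit_ols_eig(X, y)
```

Dividing by the eigenvalues is what makes this approach sensitive to near-singular inputs, which is consistent with the stability remarks about the SVD-based options.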

Note

Starting from version 23.08, the new ‘copy_X’ parameter defaults to ‘True’, ensuring a copy of X is created after passing it to fit(), preventing any changes to the input, but with increased memory usage. This represents a change in behavior from previous versions. With copy_X=False a copy might still be created if necessary.

Examples

>>> import cupy as cp
>>> import cudf

>>> # Both import methods supported
>>> from cuml import LinearRegression
>>> from cuml.linear_model import LinearRegression
>>> lr = LinearRegression(fit_intercept = True, normalize = False,
...                       algorithm = "eig")
>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([1,1,2,2], dtype=cp.float32)
>>> X['col2'] = cp.array([1,2,2,3], dtype=cp.float32)
>>> y = cudf.Series(cp.array([6.0, 8.0, 9.0, 11.0], dtype=cp.float32))
>>> reg = lr.fit(X,y)
>>> print(reg.coef_)
0   1.0
1   2.0
dtype: float32
>>> print(reg.intercept_)
3.0...

>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = cp.array([3,2], dtype=cp.float32)
>>> X_new['col2'] = cp.array([5,5], dtype=cp.float32)
>>> preds = lr.predict(X_new)
>>> print(preds)
0   15.999...
1   14.999...
dtype: float32

fit(self, X, y, sample_weight=None, *, convert_dtype=True) → 'LinearRegression'[source]#

Fit the model with X and y.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
sample_weightarray-like (device or host) shape = (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

Logistic Regression#

class cuml.LogisticRegression(*, penalty='l2', tol=0.0001, C=1.0, fit_intercept=True, class_weight=None, max_iter=1000, linesearch_max_iter=50, verbose=False, l1_ratio=None, solver='qn', handle=None, output_type=None)[source]#

LogisticRegression is a linear model that is used to model the probability of occurrence of certain events, for example the probability of success or failure of an event.

cuML’s LogisticRegression can take array-like objects, either on the host as NumPy arrays or on the device (as Numba or __cuda_array_interface__ compliant arrays), in addition to cuDF objects. It provides both binary (using sigmoid loss) and multiclass (using softmax loss) variants, depending on the input variables.

Only one solver option is currently available: Quasi-Newton (QN) algorithms. Even though it is presented as a single option, this solver resolves to two different algorithms underneath:

Orthant-Wise Limited Memory Quasi-Newton (OWL-QN) if there is l1 regularization

Limited Memory BFGS (L-BFGS) otherwise.

Note that, just like in Scikit-learn, the bias will not be regularized.
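The choice between the two underlying algorithms, together with the rule stated for the penalty parameter, can be summarized in a small helper. The function select_qn_algorithm is hypothetical and only mirrors the documented rule; it does not call cuML.

```python
def select_qn_algorithm(penalty="l2", l1_ratio=None):
    """Return the name of the quasi-Newton variant implied by the penalty settings."""
    if penalty in (None, "l2"):
        return "L-BFGS"
    if penalty == "l1":
        return "OWL-QN"
    if penalty == "elasticnet":
        # OWL-QN is only needed when an L1 term is actually present.
        return "OWL-QN" if (l1_ratio or 0.0) > 0 else "L-BFGS"
    raise ValueError(f"unsupported penalty: {penalty!r}")


choices = {
    "default": select_qn_algorithm(),
    "l1": select_qn_algorithm("l1"),
    "elasticnet_mixed": select_qn_algorithm("elasticnet", l1_ratio=0.3),
    "elasticnet_pure_l2": select_qn_algorithm("elasticnet", l1_ratio=0.0),
}
```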

Parameters:

penalty{‘l1’, ‘l2’, ‘elasticnet’, None} (default = ‘l2’)

Used to specify the norm used in the penalization. If None or ‘l2’ are selected, then L-BFGS solver will be used. If ‘l1’ is selected, solver OWL-QN will be used. If ‘elasticnet’ is selected, OWL-QN will be used if l1_ratio > 0, otherwise L-BFGS will be used.

tolfloat (default = 1e-4)

Tolerance for stopping criteria. The exact stopping conditions depend on the chosen solver. Check the solver’s documentation for more details:

Quasi-Newton (L-BFGS/OWL-QN)

Cfloat (default = 1.0)

Inverse of regularization strength; must be a positive float.

fit_interceptboolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

class_weightdict or ‘balanced’, default=None

By default all classes have a weight one. However, a dictionary can be provided with weights associated with classes in the form {class_label: weight}. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.

max_iterint (default = 1000)

Maximum number of iterations taken for the solvers to converge.

linesearch_max_iterint (default = 50)

Max number of linesearch iterations per outer iteration used in the lbfgs and owl QN solvers.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

l1_ratiofloat or None, optional (default=None)

The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1

solver‘qn’ (default=’qn’)

Algorithm to use in the optimization problem. Currently only qn is supported, which automatically selects either L-BFGS or OWL-QN depending on the conditions of the l1 regularization described above.

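handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.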

Attributes:

coef_: dev array, dim (n_classes, n_features) or (n_classes, n_features+1): The estimated coefficients for the logistic regression model.
intercept_: device array (n_classes, 1): The independent term. If fit_intercept is False, will be 0.
n_iter_: array, shape (1,): The number of iterations taken for the solvers to converge.

Methods

`decision_function`(X, *[, convert_dtype])	Gives confidence score for X
`fit`(X, y[, sample_weight, convert_dtype])	Fit the model with X and y.
`predict`(X, *[, convert_dtype])	Predicts the y for X.
`predict_log_proba`(X, *[, convert_dtype])	Predicts the log class probabilities for each class in X
`predict_proba`(X, *[, convert_dtype])	Predicts the class probabilities for each class in X
`set_params`(**params)	Accepts a dict of params and updates the corresponding ones owned by this class.

Notes

cuML’s LogisticRegression uses a different solver than the equivalent Scikit-learn estimator, except when there is no penalty and solver=lbfgs is used in Scikit-learn. This can cause small differences in the coefficients and predictions of the model, similar to using different solvers in Scikit-learn.

For additional information, see Scikit-learn’s LogisticRegression.
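The "balanced" formula given for class_weight, n_samples / (n_classes * np.bincount(y)), can be worked through in pure Python. The helper names below are hypothetical, and the sketch counts only the classes that actually appear in y.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weights inversely proportional to class frequency (illustrative sketch)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def effective_sample_weights(y, class_weights, sample_weight=None):
    """Class weights are multiplied with per-sample weights when those are given."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y)
    return [class_weights[label] * w for label, w in zip(y, sample_weight)]


y = [0, 0, 0, 1]
weights = balanced_class_weights(y)  # the rare class 1 gets the larger weight
combined = effective_sample_weights(y, weights, sample_weight=[1.0, 1.0, 2.0, 1.0])
```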

Examples

>>> import cudf
>>> import numpy as np

>>> # Both import methods supported
>>> # from cuml import LogisticRegression
>>> from cuml.linear_model import LogisticRegression

>>> X = cudf.DataFrame()
>>> X['col1'] = np.array([1,1,2,2], dtype = np.float32)
>>> X['col2'] = np.array([1,2,2,3], dtype = np.float32)
>>> y = cudf.Series(np.array([0.0, 0.0, 1.0, 1.0], dtype=np.float32))

>>> reg = LogisticRegression()
>>> reg.fit(X,y)
LogisticRegression()
>>> print(reg.coef_)
         0         1
0  0.69861  0.570058
>>> print(reg.intercept_)
0   -2.188...
dtype: float32

>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = np.array([1,5], dtype = np.float32)
>>> X_new['col2'] = np.array([2,5], dtype = np.float32)

>>> preds = reg.predict(X_new)

>>> print(preds)
0    0.0
1    1.0
dtype: float32

decision_function(X, *, convert_dtype=True) → CumlArray[source]#

Gives confidence score for X

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the decision_function method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

scorecuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)

Confidence score

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

fit(X, y, sample_weight=None, *, convert_dtype=True) → LogisticRegression[source]#

Fit the model with X and y.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
sample_weightarray-like (device or host) shape = (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

predict(X, *, convert_dtype=True) → CumlArray[source]#

Predicts the y for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

predict_log_proba(X, *, convert_dtype=True) → CumlArray[source]#

Predicts the log class probabilities for each class in X

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the predict_log_proba method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)

Logarithm of predicted class probabilities

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

predict_proba(X, *, convert_dtype=True) → CumlArray[source]#

Predicts the class probabilities for each class in X

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the predict_proba method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)

Predicted class probabilities

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
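The class description names sigmoid loss for the two-class case and softmax loss for the multiclass case. The sketch below shows those two standard link functions turning raw scores into probabilities and log-probabilities; treating the output of decision_function as their input is an assumption here, and the functions are plain-Python illustrations rather than cuML internals.

```python
import math


def sigmoid(score):
    """Map one binary decision score to the probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-score))


def softmax(scores):
    """Map per-class scores to probabilities; subtracting the max avoids overflow."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(scores):
    """Log-probabilities computed directly, without taking log of tiny numbers."""
    shift = max(scores)
    log_total = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return [s - log_total for s in scores]


binary_proba = sigmoid(0.0)
multi_proba = softmax([2.0, 1.0, 0.1])
multi_log_proba = log_softmax([2.0, 1.0, 0.1])
```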

set_params(**params)[source]#

Accepts a dict of params and updates the corresponding ones owned by this class. If the child class has appropriately overridden the _get_param_names method and does not need anything other than what is there in this method, then it doesn’t have to override this method.

Ridge Regression#

class cuml.Ridge(*, alpha=1.0, solver='auto', fit_intercept=True, normalize=False, handle=None, output_type=None, verbose=False)#

Ridge extends LinearRegression by providing L2 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors, and improves the conditioning of the problem.

cuML’s Ridge can take array-like objects, either in host as NumPy arrays or in device (as Numba or __cuda_array_interface__ compliant), in addition to cuDF objects. It provides 2 algorithms: SVD and Eig to fit a linear model. In general SVD uses significantly more memory and is slower than Eig. If using CUDA 10.1, the memory difference is even bigger than in the other supported CUDA versions. However, SVD is more stable than Eig (default).

Parameters:

alphafloat (default = 1.0): Regularization strength - must be a positive float. Larger values specify stronger regularization. Array input will be supported later.
solver{‘auto’, ‘eig’, ‘svd’} (default = ‘auto’): Eig uses an eigendecomposition of the covariance matrix, and is much faster. SVD is slower, but guaranteed to be stable.
fit_interceptboolean (default = True): If True, Ridge tries to correct for the global mean of y. If False, the model expects that you have centered the data.
normalizeboolean (default = False): If True, the predictors in X will be normalized by dividing by the column-wise standard deviation. If False, no scaling will be done. Note: this is in contrast to sklearn’s deprecated normalize flag, which divides by the column-wise L2 norm; but this is the same as if using sklearn’s StandardScaler.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Attributes:

coef_array, shape (n_features): The estimated coefficients for the linear regression model.
intercept_array: The independent term. If fit_intercept is False, will be 0.

Methods

fit(self, X, y[, sample_weight, convert_dtype])

Fit the model with X and y.

Notes

Ridge provides L2 regularization. This means that the coefficients can shrink to become very small, but not zero. This can cause issues of interpretability on the coefficients. Consider using Lasso, or thresholding small coefficients to zero.

Applications of Ridge

Ridge Regression is used in the same way as LinearRegression, but does not suffer from multicollinearity issues. Ridge is used in insurance premium prediction, stock market analysis and much more.

For additional docs, see Scikit-learn’s Ridge Regression.
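The effect of the L2 term can be illustrated with the closed-form ridge solution, w = (X^T X + alpha * I)^(-1) X^T y, on centered data. This NumPy sketch assumes scikit-learn's convention of an unscaled squared-error loss plus alpha times the squared L2 norm; the function name is hypothetical and the code is not cuML's solver.

```python
import numpy as np


def fit_ridge_closed_form(X, y, alpha=1.0, fit_intercept=True):
    """Closed-form ridge regression (illustrative CPU sketch)."""
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if fit_intercept:
        x_mean, y_mean = X.mean(axis=0), y.mean()
        Xc, yc = X - x_mean, y - y_mean  # the intercept is not penalized
    else:
        x_mean, y_mean = np.zeros(X.shape[1]), 0.0
        Xc, yc = X, y
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


X = [[1, 1], [1, 2], [2, 2], [2, 3]]
y = [6, 8, 9, 11]
coef_weak, intercept_weak = fit_ridge_closed_form(X, y, alpha=1e-5)
coef_strong, _ = fit_ridge_closed_form(X, y, alpha=10.0)
```

With a very small alpha the result is close to ordinary least squares, while a large alpha shrinks the coefficients toward zero without making them exactly zero.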

Examples

>>> import cupy as cp
>>> import cudf

>>> # Both import methods supported
>>> from cuml import Ridge
>>> from cuml.linear_model import Ridge

>>> alpha = 1e-5
>>> ridge = Ridge(alpha=alpha, fit_intercept=True, normalize=False,
...               solver="eig")

>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([1,1,2,2], dtype = cp.float32)
>>> X['col2'] = cp.array([1,2,2,3], dtype = cp.float32)

>>> y = cudf.Series(cp.array([6.0, 8.0, 9.0, 11.0], dtype=cp.float32))

>>> result_ridge = ridge.fit(X, y)
>>> print(result_ridge.coef_)
0 1.000...
1 1.999...
>>> print(result_ridge.intercept_)
3.0...
>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = cp.array([3,2], dtype=cp.float32)
>>> X_new['col2'] = cp.array([5,5], dtype=cp.float32)
>>> preds = result_ridge.predict(X_new)
>>> print(preds)
0 15.999...
1 14.999...

fit(self, X, y, sample_weight=None, *, convert_dtype=True) → 'Ridge'[source]#

Fit the model with X and y.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
sample_weightarray-like (device or host) shape = (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

Lasso Regression#

class cuml.Lasso(*, alpha=1.0, fit_intercept=True, normalize=False, max_iter=1000, tol=0.001, solver='cd', selection='cyclic', handle=None, output_type=None, verbose=False)[source]#

Lasso extends LinearRegression by providing L1 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can zero some of the coefficients for feature selection and improves the conditioning of the problem.

cuML’s Lasso can take array-like objects, either in host as NumPy arrays or in device (as Numba or __cuda_array_interface__ compliant), in addition to cuDF objects. It uses coordinate descent to fit a linear model.
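A minimal pure-Python sketch of cyclic coordinate descent with soft-thresholding is shown below. It assumes the scikit-learn style objective (1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1 and stops when the largest coefficient update falls below tol, which is a simplification of the dual-gap check described for the tol parameter. The function names are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, alpha=1.0, max_iter=1000, tol=1e-3):
    """Cyclic coordinate descent for the Lasso (illustrative sketch)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    resid = [v - y_mean for v in y]  # residual y - Xw, with w starting at zero
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    w = [0.0] * p
    for _ in range(max_iter):
        max_delta = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(Xc[i][j] * (resid[i] + Xc[i][j] * w[j]) for i in range(n)) / n
            w_new = soft_threshold(rho, alpha) / col_sq[j]
            delta = w_new - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                w[j] = w_new
            max_delta = max(max_delta, abs(delta))
        if max_delta < tol:
            break
    intercept = y_mean - sum(w[j] * x_mean[j] for j in range(p))
    return w, intercept


X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
y = [0.0, 1.0, 2.0]
coef, intercept = lasso_cd(X, y, alpha=0.1)
```

Because the two columns are identical, any split of the same total weight between them gives the same objective value, so different solvers may report different individual coefficients for this data.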

Parameters:

alphafloat (default = 1.0)

Constant that multiplies the L1 term. alpha = 0 is equivalent to ordinary least squares, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised. In that case, use the LinearRegression object instead.

fit_interceptboolean (default = True)

If True, Lasso tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalizeboolean (default = False)

If True, the predictors in X will be normalized by dividing by the column-wise standard deviation. If False, no scaling will be done. Note: this is in contrast to sklearn’s deprecated normalize flag, which divides by the column-wise L2 norm; but this is the same as if using sklearn’s StandardScaler.

max_iterint (default = 1000)

The maximum number of iterations

tolfloat (default = 1e-3)

The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.

solver{‘cd’, ‘qn’} (default=’cd’)

Choose an algorithm:

‘cd’ - coordinate descent

‘qn’ - quasi-newton

You may find the alternative ‘qn’ algorithm is faster when the number of features is sufficiently large, but the sample size is small.

selection{‘cyclic’, ‘random’} (default=’cyclic’)

If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.

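handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.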

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Attributes:

coef_array, shape (n_features): The estimated coefficients for the linear regression model.
intercept_array: The independent term. If fit_intercept is False, will be 0.

Notes

For additional docs, see scikit-learn’s Lasso.

Examples

>>> import numpy as np
>>> import cudf
>>> from cuml.linear_model import Lasso
>>> ls = Lasso(alpha = 0.1, solver='qn')
>>> X = cudf.DataFrame()
>>> X['col1'] = np.array([0, 1, 2], dtype = np.float32)
>>> X['col2'] = np.array([0, 1, 2], dtype = np.float32)
>>> y = cudf.Series( np.array([0.0, 1.0, 2.0], dtype = np.float32) )
>>> result_lasso = ls.fit(X, y)
>>> print(result_lasso.coef_)
0   0.425
1   0.425
dtype: float32
>>> print(result_lasso.intercept_)
0.150000...

>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = np.array([3,2], dtype = np.float32)
>>> X_new['col2'] = np.array([5,5], dtype = np.float32)
>>> preds = result_lasso.predict(X_new)
>>> print(preds)
0   3.549997
1   3.124997
dtype: float32

ElasticNet Regression#

class cuml.ElasticNet(*, alpha=1.0, l1_ratio=0.5, fit_intercept=True, normalize=False, max_iter=1000, tol=0.001, solver='cd', selection='cyclic', handle=None, output_type=None, verbose=False)[source]#

ElasticNet extends LinearRegression with combined L1 and L2 regularizations on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors, force some coefficients to be small, and improves the conditioning of the problem.
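The combined penalty can be written out explicitly. The sketch below assumes scikit-learn's parameterization, alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2; the function name is hypothetical and it only evaluates the penalty, it does not fit a model.

```python
def elastic_net_penalty(coef, alpha=1.0, l1_ratio=0.5):
    """Evaluate the combined L1/L2 penalty for a coefficient vector."""
    l1 = sum(abs(w) for w in coef)
    l2_squared = sum(w * w for w in coef)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared


coef = [3.0, -4.0]
pure_l1 = elastic_net_penalty(coef, alpha=0.1, l1_ratio=1.0)  # Lasso-style penalty
pure_l2 = elastic_net_penalty(coef, alpha=0.1, l1_ratio=0.0)  # Ridge-style penalty
mixed = elastic_net_penalty(coef, alpha=0.1, l1_ratio=0.5)
```

The L1 part is what can drive coefficients exactly to zero, while the L2 part keeps the problem well conditioned when predictors are correlated.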

cuML’s ElasticNet accepts an array-like object or cuDF DataFrame and uses coordinate descent to fit a linear model.

Parameters:

alphafloat (default = 1.0)

l1_ratiofloat (default = 0.5)

The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.

fit_interceptboolean (default = True)

If True, ElasticNet tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalizeboolean (default = False)

max_iterint (default = 1000)

The maximum number of iterations

tolfloat (default = 1e-3)

The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.

solver{‘cd’, ‘qn’} (default=’cd’)

Choose an algorithm:

‘cd’ - coordinate descent

‘qn’ - quasi-newton

You may find the alternative ‘qn’ algorithm is faster when the number of features is sufficiently large, but the sample size is small.

selection{‘cyclic’, ‘random’} (default=’cyclic’)

handlecuml.Handle

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
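
To make the roles of alpha and l1_ratio concrete, the sketch below evaluates the penalty term in plain Python. It assumes the scikit-learn formulation of the ElasticNet objective, alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2. cuML's internal scaling may differ, and elastic_net_penalty is a hypothetical helper, not a cuML function.

```python
def elastic_net_penalty(coef, alpha=1.0, l1_ratio=0.5):
    """Penalty term under the scikit-learn ElasticNet convention (assumed)."""
    if not 0.0 <= l1_ratio <= 1.0:
        raise ValueError("l1_ratio must satisfy 0 <= l1_ratio <= 1")
    l1 = sum(abs(w) for w in coef)     # L1 norm of the coefficients
    l2_sq = sum(w * w for w in coef)   # squared L2 norm of the coefficients
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_sq

coef = [1.0, -2.0]
print(elastic_net_penalty(coef, alpha=1.0, l1_ratio=1.0))  # pure L1: 3.0
print(elastic_net_penalty(coef, alpha=1.0, l1_ratio=0.0))  # pure L2: 2.5
print(elastic_net_penalty(coef, alpha=1.0, l1_ratio=0.5))  # mix: 2.75
```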

Attributes:

coef_array, shape (n_features): The estimated coefficients for the linear regression model.
intercept_array: The independent term. If fit_intercept is False, will be 0.

Methods

`fit`(X, y[, sample_weight, convert_dtype])	Fit the model with X and y.
`set_params`(**params)	Accepts a dict of params and updates the corresponding ones owned by this class.

Notes

For additional docs, see scikit-learn’s ElasticNet.

Examples

>>> import cupy as cp
>>> import cudf
>>> from cuml.linear_model import ElasticNet
>>> enet = ElasticNet(alpha = 0.1, l1_ratio=0.5, solver='qn')
>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([0, 1, 2], dtype = cp.float32)
>>> X['col2'] = cp.array([0, 1, 2], dtype = cp.float32)
>>> y = cudf.Series(cp.array([0.0, 1.0, 2.0], dtype = cp.float32) )
>>> result_enet = enet.fit(X, y)
>>> print(result_enet.coef_)
0    0.445...
1    0.445...
dtype: float32
>>> print(result_enet.intercept_)
0.108433...
>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = cp.array([3,2], dtype = cp.float32)
>>> X_new['col2'] = cp.array([5,5], dtype = cp.float32)
>>> preds = result_enet.predict(X_new)
>>> print(preds)
0    3.674...
1    3.228...
dtype: float32

fit(X, y, sample_weight=None, *, convert_dtype=True) → ElasticNet[source]#

Fit the model with X and y.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
sample_weightarray-like (device or host) shape = (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

set_params(**params)[source]#: Accepts a dict of params and updates the corresponding ones owned by this class. If the child class has appropriately overridden the _get_param_names method and does not need anything beyond what this method already does, then it doesn’t have to override this method.

Mini Batch SGD Classifier#

class cuml.MBSGDClassifier(*, loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, epochs=1000, tol=0.001, shuffle=True, learning_rate='constant', eta0=0.001, power_t=0.5, batch_size=32, n_iter_no_change=5, handle=None, verbose=False, output_type=None)[source]#

Linear models (linear SVM, logistic regression, or linear regression) fitted by minimizing a regularized empirical loss with mini-batch SGD. The MBSGD Classifier implementation is experimental and uses a different algorithm than sklearn’s SGDClassifier. To improve the results obtained from cuML’s MBSGDClassifier:

Reduce the batch size
Increase the eta0
Increase the number of iterations

Because cuML processes the data in batches, a small eta0 might not let the model learn as much as scikit-learn does. Furthermore, decreasing the batch size might increase the time required to fit the model.
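
The batch-size trade-off follows from simple arithmetic: with mini-batch SGD the model is updated once per batch, so a smaller batch_size means more updates per epoch. A minimal sketch in plain Python, assuming the final partial batch is also processed; updates_per_epoch is an illustrative helper, not part of cuML.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Parameter updates in one pass over the data (assumes the last partial batch is kept)."""
    return math.ceil(n_samples / batch_size)

n_samples = 10_000
for batch_size in (32, 8, 1):
    print(batch_size, updates_per_epoch(n_samples, batch_size))
# 32 313
# 8 1250
# 1 10000
```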

Parameters:

loss{‘hinge’, ‘log’, ‘squared_loss’} (default = ‘hinge’)

‘hinge’ uses linear SVM

‘log’ uses logistic regression

‘squared_loss’ uses linear regression

penalty{‘l1’, ‘l2’, ‘elasticnet’, None} (default = ‘l2’)

The penalty (aka regularization term) to apply.

‘l1’: L1 norm (Lasso) regularization
‘l2’: L2 norm (Ridge) regularization (the default)
‘elasticnet’: Elastic Net regularization, a weighted average of L1 and L2
None: no penalty is added

alphafloat (default = 0.0001)

The constant value which decides the degree of regularization

l1_ratiofloat (default=0.15)

The l1_ratio is used only when penalty = 'elasticnet'. Its value must satisfy 0 <= l1_ratio <= 1. With l1_ratio = 0 the penalty is equivalent to 'l2'; with l1_ratio = 1 it is equivalent to 'l1'.

batch_sizeint (default = 32)

It sets the number of samples that will be included in each batch.

fit_interceptboolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

epochsint (default = 1000)

The number of times the model iterates through the entire dataset during training.

tolfloat (default = 1e-3)

The training process will stop if current_loss > previous_loss - tol

shuffleboolean (default = True)

If True, shuffles the training data after each epoch. If False, does not shuffle the training data after each epoch.

eta0float (default = 0.001)

Initial learning rate

power_tfloat (default = 0.5)

The exponent used for calculating the invscaling learning rate

learning_rate{‘optimal’, ‘constant’, ‘invscaling’, ‘adaptive’} (default = ‘constant’)

optimal option will be supported in a future version

constant keeps the learning rate constant

adaptive changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs. The old learning rate is generally divided by 5

n_iter_no_changeint (default = 5)

The number of epochs to train without any improvement in the model.

handlecuml.Handle

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Attributes:

classes_
coef_
dtype
intercept_

Methods

`fit`(X, y, *[, convert_dtype])	Fit the model with X and y.
`predict`(X, *[, convert_dtype])	Predicts the y for X.
`set_params`(**params)	Accepts a dict of params and updates the corresponding ones owned by this class.

Notes

For additional docs, see scikit-learn’s SGDClassifier.

Examples

>>> import cupy as cp
>>> import cudf
>>> from cuml.linear_model import MBSGDClassifier
>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([1,1,2,2], dtype = cp.float32)
>>> X['col2'] = cp.array([1,2,2,3], dtype = cp.float32)
>>> y = cudf.Series(cp.array([1, 1, 2, 2], dtype=cp.float32))
>>> pred_data = cudf.DataFrame()
>>> pred_data['col1'] = cp.asarray([3, 2], dtype=cp.float32)
>>> pred_data['col2'] = cp.asarray([5, 5], dtype=cp.float32)
>>> cu_mbsgd_classifier = MBSGDClassifier(learning_rate='constant',
...                                       eta0=0.05, epochs=2000,
...                                       fit_intercept=True,
...                                       batch_size=1, tol=0.0,
...                                       penalty='l2',
...                                       loss='squared_loss',
...                                       alpha=0.5)
>>> cu_mbsgd_classifier.fit(X, y)
MBSGDClassifier()
>>> print("cuML intercept : ", cu_mbsgd_classifier.intercept_)
cuML intercept :  0.725...
>>> print("cuML coef : ", cu_mbsgd_classifier.coef_)
cuML coef :  0    0.273...
1    0.182...
dtype: float32
>>> cu_pred = cu_mbsgd_classifier.predict(pred_data)
>>> print("cuML predictions : ", cu_pred)
cuML predictions :  0   1.0
1    1.0
dtype: float32

fit(X, y, *, convert_dtype=True) → MBSGDClassifier[source]#

Fit the model with X and y.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

predict(X, *, convert_dtype=True) → CumlArray[source]#

Predicts the y for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

set_params(**params)[source]#: Accepts a dict of params and updates the corresponding ones owned by this class. If the child class has appropriately overridden the _get_param_names method and does not need anything beyond what this method already does, then it doesn’t have to override this method.

Mini Batch SGD Regressor#

class cuml.MBSGDRegressor(*, loss='squared_loss', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, epochs=1000, tol=0.001, shuffle=True, learning_rate='constant', eta0=0.001, power_t=0.5, batch_size=32, n_iter_no_change=5, handle=None, verbose=False, output_type=None)[source]#

Linear regression model fitted by minimizing a regularized empirical loss with mini-batch SGD. The MBSGD Regressor implementation is experimental and uses a different algorithm than sklearn’s SGDRegressor. To improve the results obtained from cuML’s MBSGD Regressor:

Reduce the batch size
Increase the eta0
Increase the number of iterations

Parameters:

loss‘squared_loss’ (default = ‘squared_loss’)

‘squared_loss’ uses linear regression

penalty{‘l1’, ‘l2’, ‘elasticnet’, None} (default = ‘l2’)

The penalty (aka regularization term) to apply.

‘l1’: L1 norm (Lasso) regularization
‘l2’: L2 norm (Ridge) regularization (the default)
‘elasticnet’: Elastic Net regularization, a weighted average of L1 and L2
None: no penalty is added

alphafloat (default = 0.0001)

The constant value which decides the degree of regularization

fit_interceptboolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

l1_ratiofloat (default=0.15)

The l1_ratio is used only when penalty = 'elasticnet'. Its value must satisfy 0 <= l1_ratio <= 1. With l1_ratio = 0 the penalty is equivalent to 'l2'; with l1_ratio = 1 it is equivalent to 'l1'.

batch_sizeint (default = 32)

It sets the number of samples that will be included in each batch.

epochsint (default = 1000)

The number of times the model iterates through the entire dataset during training.

tolfloat (default = 1e-3)

The training process will stop if current_loss > previous_loss - tol

shuffleboolean (default = True)

If True, shuffles the training data after each epoch. If False, does not shuffle the training data after each epoch.

eta0float (default = 0.001)

Initial learning rate

power_tfloat (default = 0.5)

The exponent used for calculating the invscaling learning rate

learning_rate{‘optimal’, ‘constant’, ‘invscaling’, ‘adaptive’} (default = ‘constant’)

optimal option will be supported in a future version

constant keeps the learning rate constant

adaptive changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs. The old learning rate is generally divided by 5

n_iter_no_changeint (default = 5)

The number of epochs to train without any improvement in the model.

handlecuml.Handle

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
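
The learning_rate options can be illustrated with a small plain-Python sketch. The 'invscaling' formula eta0 / t**power_t is the one documented for scikit-learn's SGD estimators and is assumed here rather than taken from cuML's source; the 'adaptive' helper divides by 5 as described in the parameter list. Both function names are hypothetical.

```python
def learning_rate_at(step, schedule="constant", eta0=0.001, power_t=0.5):
    """Learning rate at update `step` (1-based) for the non-adaptive schedules."""
    if schedule == "constant":
        return eta0
    if schedule == "invscaling":
        # Assumed scikit-learn convention: eta = eta0 / step ** power_t
        return eta0 / step ** power_t
    raise ValueError(f"unsupported schedule: {schedule!r}")

def adaptive_decay(eta):
    """'adaptive': shrink the rate once n_iter_no_change epochs pass without improvement."""
    return eta / 5.0

print(learning_rate_at(4, "constant", eta0=0.5))    # 0.5
print(learning_rate_at(4, "invscaling", eta0=0.5))  # 0.25
print(adaptive_decay(0.5))                          # 0.1
```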

Attributes:

coef_
dtype
intercept_

Methods

`fit`(X, y, *[, convert_dtype])	Fit the model with X and y.
`predict`(X, *[, convert_dtype])	Predicts the y for X.
`set_params`(**params)	Accepts a dict of params and updates the corresponding ones owned by this class.

Notes

For additional docs, see scikit-learn’s SGDRegressor.

Examples

>>> import cupy as cp
>>> import cudf
>>> from cuml.linear_model import MBSGDRegressor as cumlMBSGDRegressor
>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([1,1,2,2], dtype = cp.float32)
>>> X['col2'] = cp.array([1,2,2,3], dtype = cp.float32)
>>> y = cudf.Series(cp.array([1, 1, 2, 2], dtype=cp.float32))
>>> pred_data = cudf.DataFrame()
>>> pred_data['col1'] = cp.asarray([3, 2], dtype=cp.float32)
>>> pred_data['col2'] = cp.asarray([5, 5], dtype=cp.float32)
>>> cu_mbsgd_regressor = cumlMBSGDRegressor(learning_rate='constant',
...                                         eta0=0.05, epochs=2000,
...                                         fit_intercept=True,
...                                         batch_size=1, tol=0.0,
...                                         penalty='l2',
...                                         loss='squared_loss',
...                                         alpha=0.5)
>>> cu_mbsgd_regressor.fit(X, y)
MBSGDRegressor()
>>> print("cuML intercept : ", cu_mbsgd_regressor.intercept_)
cuML intercept :  0.725...
>>> print("cuML coef : ", cu_mbsgd_regressor.coef_)
cuML coef :  0    0.273...
1     0.182...
dtype: float32
>>> cu_pred = cu_mbsgd_regressor.predict(pred_data)
>>> print("cuML predictions : ", cu_pred)
cuML predictions :  0    2.456...
1    2.183...
dtype: float32

fit(X, y, *, convert_dtype=True) → MBSGDRegressor[source]#

Fit the model with X and y.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

predict(X, *, convert_dtype=True) → CumlArray[source]#

Predicts the y for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

set_params(**params)[source]#: Accepts a dict of params and updates the corresponding ones owned by this class. If the child class has appropriately overridden the _get_param_names method and does not need anything beyond what this method already does, then it doesn’t have to override this method.

Multiclass Classification#

class cuml.multiclass.MulticlassClassifier(estimator, *, handle=None, verbose=False, output_type=None, strategy='ovr')[source]#

Wrapper around scikit-learn multiclass classifiers that allows choosing between different multiclass strategies.

The input can be any kind of cuML compatible array, and the output type follows cuML’s output type configuration rules.

Before passing the data to scikit-learn, it is converted to a host (NumPy) array. Under the hood the data is partitioned for binary classification, and it is transformed back to the device by the cuML estimator. These copies back and forth between the device and the host have some overhead. For more details see issue rapidsai/cuml#2876.

Parameters:

estimatorcuML estimator
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
strategy: string {‘ovr’, ‘ovo’}, default=’ovr’: Multiclass classification strategy: ‘ovr’: one vs. rest or ‘ovo’: one vs. one
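
The two strategies differ in how many binary problems are trained: one-vs-rest fits one classifier per class, while one-vs-one fits one per pair of classes. The plain-Python sketch below only enumerates those sub-problems; binary_problems is an illustrative helper, not part of cuML or scikit-learn.

```python
from itertools import combinations

def binary_problems(classes, strategy="ovr"):
    """List the binary sub-problems implied by a multiclass strategy."""
    classes = sorted(set(classes))
    if strategy == "ovr":
        # one problem per class: that class against all the others
        return [(c, "rest") for c in classes]
    if strategy == "ovo":
        # one problem per unordered pair of classes
        return list(combinations(classes, 2))
    raise ValueError("strategy must be 'ovr' or 'ovo'")

labels = [0, 1, 2, 3]
print(len(binary_problems(labels, "ovr")))  # 4
print(len(binary_problems(labels, "ovo")))  # 6
```

For n classes this gives n problems for 'ovr' and n * (n - 1) / 2 for 'ovo', which matters here because each sub-problem involves the host and device copies described above.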

Attributes:

classes_float, shape (n_classes_): Array of class labels.
n_classes_int: Number of classes.

Methods

`decision_function`(X)	Calculate the decision function.
`fit`(X, y)	Fit a multiclass classifier.
`predict`(X)	Predict using multi class classifier.

Examples

>>> from cuml.linear_model import LogisticRegression
>>> from cuml.multiclass import MulticlassClassifier
>>> from cuml.datasets.classification import make_classification

>>> X, y = make_classification(n_samples=10, n_features=6,
...                            n_informative=4, n_classes=3,
...                            random_state=137)

>>> cls = MulticlassClassifier(LogisticRegression(), strategy='ovo')
>>> cls.fit(X, y)
MulticlassClassifier(estimator=LogisticRegression())
>>> cls.predict(X)
array([1, 1, 0, 1, 1, 1, 2, 2, 1, 2])

decision_function(X) → CumlArray[source]#

Calculate the decision function.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

Returns:

resultscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Decision function values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

fit(X, y) → MulticlassClassifier[source]#

Fit a multiclass classifier.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix of any dtype. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

predict(X) → CumlArray[source]#

Predict using multi class classifier.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

class cuml.multiclass.OneVsOneClassifier(estimator, *args, handle=None, verbose=False, output_type=None)[source]#

Wrapper around scikit-learn’s class with the same name. The input can be any kind of cuML compatible array, and the output type follows cuML’s output type configuration rules.

For documentation see scikit-learn’s OneVsOneClassifier.

Parameters:

estimatorcuML estimator
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

Examples

>>> from cuml.linear_model import LogisticRegression
>>> from cuml.multiclass import OneVsOneClassifier
>>> from cuml.datasets.classification import make_classification

>>> X, y = make_classification(n_samples=10, n_features=6,
...                            n_informative=4, n_classes=3,
...                            random_state=137)

>>> cls = OneVsOneClassifier(LogisticRegression())
>>> cls.fit(X, y)
OneVsOneClassifier(estimator=LogisticRegression())
>>> cls.predict(X)
array([1, 1, 0, 1, 1, 1, 2, 2, 1, 2])

class cuml.multiclass.OneVsRestClassifier(estimator, *args, handle=None, verbose=False, output_type=None)[source]#

Wrapper around scikit-learn’s class with the same name. The input can be any kind of cuML compatible array, and the output type follows cuML’s output type configuration rules.

For documentation see scikit-learn’s OneVsRestClassifier.

Parameters:

estimatorcuML estimator
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

Examples

>>> from cuml.linear_model import LogisticRegression
>>> from cuml.multiclass import OneVsRestClassifier
>>> from cuml.datasets.classification import make_classification

>>> X, y = make_classification(n_samples=10, n_features=6,
...                            n_informative=4, n_classes=3,
...                            random_state=137)

>>> cls = OneVsRestClassifier(LogisticRegression())
>>> cls.fit(X, y)
OneVsRestClassifier(estimator=LogisticRegression())
>>> cls.predict(X)
array([1, 1, 0, 1, 1, 1, 2, 2, 1, 2])

Naive Bayes#

class cuml.naive_bayes.MultinomialNB(*, alpha=1.0, fit_prior=True, class_prior=None, output_type=None, handle=None, verbose=False)[source]#

Naive Bayes classifier for multinomial models

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification).

The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

Parameters:

alphafloat (default=1.0): Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
fit_priorboolean (default=True): Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
class_priorarray-like, size (n_classes) (default=None): Prior probabilities of the classes. If specified, the priors are not adjusted according to the data.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
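
The effect of alpha can be sketched with the standard smoothed estimate used by multinomial Naive Bayes, P(x_i|y) = (N_yi + alpha) / (N_y + alpha * n_features). The example is plain Python with made-up counts and an illustrative helper name; it is meant to show the formula, not cuML's internals.

```python
import math

def smoothed_log_prob(feature_counts, alpha=1.0):
    """Log P(x_i | y) for one class, given its per-feature counts."""
    n_features = len(feature_counts)
    total = sum(feature_counts) + alpha * n_features
    return [math.log((c + alpha) / total) for c in feature_counts]

counts = [3, 0, 1]  # hypothetical word counts for one class
probs = [math.exp(lp) for lp in smoothed_log_prob(counts, alpha=1.0)]
print([round(p, 4) for p in probs])  # [0.5714, 0.1429, 0.2857]
```

With alpha = 0 the unseen feature would get probability 0 (and math.log would raise in this sketch), which is why some smoothing is normally kept.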

Attributes:

class_count_ndarray of shape (n_classes): Number of samples encountered for each class during fitting.
class_log_prior_ndarray of shape (n_classes): Log probability of each class (smoothed).
classes_ndarray of shape (n_classes,): Class labels known to the classifier
feature_count_ndarray of shape (n_classes, n_features): Number of samples encountered for each (class, feature) during fitting.
feature_log_prob_ndarray of shape (n_classes, n_features): Empirical log probability of features given a class, P(x_i|y).
n_features_int: Number of features of each sample.
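
How fit_prior and class_prior determine class_log_prior_ can be summarised in a few lines. This is a sketch of the behaviour described in the parameter list (explicit priors win, otherwise empirical class frequencies if fit_prior is True, otherwise a uniform prior), written in plain Python. The function class_log_prior is a hypothetical helper, not the cuML code path.

```python
import math

def class_log_prior(class_count, fit_prior=True, class_prior=None):
    """Log prior per class following the rules described for fit_prior/class_prior."""
    n_classes = len(class_count)
    if class_prior is not None:
        # user-supplied priors are used as-is
        return [math.log(p) for p in class_prior]
    if fit_prior:
        # empirical class frequencies from the training data
        total = sum(class_count)
        return [math.log(c / total) for c in class_count]
    # uniform prior
    return [-math.log(n_classes)] * n_classes

counts = [30, 10]
print([round(math.exp(v), 2) for v in class_log_prior(counts)])                   # [0.75, 0.25]
print([round(math.exp(v), 2) for v in class_log_prior(counts, fit_prior=False)])  # [0.5, 0.5]
```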

Examples

Load the 20 newsgroups dataset from Scikit-learn and train a Naive Bayes classifier.

>>> import cupy as cp
>>> import cupyx
>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from cuml.naive_bayes import MultinomialNB

>>> # Load corpus
>>> twenty_train = fetch_20newsgroups(subset='train', shuffle=True,
...                                   random_state=42)

>>> # Turn documents into term frequency vectors

>>> count_vect = CountVectorizer()
>>> features = count_vect.fit_transform(twenty_train.data)

>>> # Put feature vectors and labels on the GPU

>>> X = cupyx.scipy.sparse.csr_matrix(features.tocsr(),
...                                   dtype=cp.float32)
>>> y = cp.asarray(twenty_train.target, dtype=cp.int32)

>>> # Train model

>>> model = MultinomialNB()
>>> model.fit(X, y)
MultinomialNB()

>>> # Compute accuracy on training set

>>> model.score(X, y)
0.9245...

class cuml.naive_bayes.BernoulliNB(*, alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None, output_type=None, handle=None, verbose=False)[source]#

Naive Bayes classifier for multivariate Bernoulli models. Like MultinomialNB, this classifier is suitable for discrete data. The difference is that while MultinomialNB works with occurrence counts, BernoulliNB is designed for binary/boolean features.

Parameters:

alphafloat, default=1.0: Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
binarizefloat or None, default=0.0: Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.
fit_priorbool, default=True: Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
class_priorarray-like of shape (n_classes,), default=None: Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
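
The binarize threshold can be pictured as an element-wise comparison applied to the features before fitting. The sketch below assumes scikit-learn's convention that values strictly greater than the threshold become 1 and all others 0, and that None leaves the input untouched. The function binarize_rows is an illustrative plain-Python helper, not the cuML implementation.

```python
def binarize_rows(rows, threshold=0.0):
    """Map each feature to 0/1 by comparing it with `threshold` (None = no change)."""
    if threshold is None:
        return rows
    return [[1 if v > threshold else 0 for v in row] for row in rows]

X = [[0, 2, 4], [1, 0, 3]]
print(binarize_rows(X))                 # [[0, 1, 1], [1, 0, 1]]
print(binarize_rows(X, threshold=2.0))  # [[0, 0, 1], [0, 0, 1]]
```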

Attributes:

class_count_ndarray of shape (n_classes): Number of samples encountered for each class during fitting.
class_log_prior_ndarray of shape (n_classes): Log probability of each class (smoothed).
classes_ndarray of shape (n_classes,): Class labels known to the classifier
feature_count_ndarray of shape (n_classes, n_features): Number of samples encountered for each (class, feature) during fitting.
feature_log_prob_ndarray of shape (n_classes, n_features): Empirical log probability of features given a class, P(x_i|y).
n_features_int: Number of features of each sample.

References

C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to Information Retrieval. Cambridge University Press, pp. 234-265. https://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html

A. McCallum and K. Nigam (1998). A comparison of event models for naive Bayes text classification. Proc. AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.

V. Metsis, I. Androutsopoulos and G. Paliouras (2006). Spam filtering with naive Bayes – Which naive Bayes? 3rd Conf. on Email and Anti-Spam (CEAS).

Examples

>>> import cupy as cp
>>> rng = cp.random.RandomState(1)
>>> X = rng.randint(5, size=(6, 100), dtype=cp.int32)
>>> Y = cp.array([1, 2, 3, 4, 4, 5])
>>> from cuml.naive_bayes import BernoulliNB
>>> clf = BernoulliNB()
>>> clf.fit(X, Y)
BernoulliNB()
>>> print(clf.predict(X[2:3]))
[3]

class cuml.naive_bayes.ComplementNB(*, alpha=1.0, fit_prior=True, class_prior=None, norm=False, output_type=None, handle=None, verbose=False)[source]#

The Complement Naive Bayes classifier described in Rennie et al. (2003). The Complement Naive Bayes classifier was designed to correct the “severe assumptions” made by the standard Multinomial Naive Bayes classifier. It is particularly suited for imbalanced data sets.

Parameters:

alphafloat, default=1.0: Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
fit_priorbool, default=True: Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
class_priorarray-like of shape (n_classes,), default=None: Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
normbool, default=False: Whether or not a second normalization of the weights is performed. The default behavior mirrors the implementation found in Mahout and Weka, which do not follow the full algorithm described in Table 9 of the paper.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
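
As a rough illustration of what the alpha parameter does, the sketch below applies textbook additive (Laplace/Lidstone) smoothing to raw feature counts in plain Python. The helper name `smoothed_probabilities` is hypothetical, and the formula is the generic one rather than a copy of cuML's internals; Complement Naive Bayes applies the same idea to counts gathered from the complement of each class.

```python
def smoothed_probabilities(counts, alpha=1.0):
    """Textbook additive smoothing: (count + alpha) / (total + alpha * n)."""
    n = len(counts)
    total = sum(counts)
    return [(c + alpha) / (total + alpha * n) for c in counts]


counts = [3, 0, 1]  # the second feature was never observed
with_smoothing = smoothed_probabilities(counts, alpha=1.0)
without_smoothing = smoothed_probabilities(counts, alpha=0.0)
print(with_smoothing)     # every probability is strictly positive
print(without_smoothing)  # alpha=0 leaves the unseen feature at zero
```

With alpha greater than zero, an unseen feature no longer produces a zero probability, which would otherwise give an undefined log probability.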

Attributes:

class_count_ndarray of shape (n_classes): Number of samples encountered for each class during fitting.
class_log_prior_ndarray of shape (n_classes): Log probability of each class (smoothed).
classes_ndarray of shape (n_classes,): Class labels known to the classifier
feature_count_ndarray of shape (n_classes, n_features): Number of samples encountered for each (class, feature) during fitting.
feature_log_prob_ndarray of shape (n_classes, n_features): Empirical log probability of features given a class, P(x_i|y).
n_features_int: Number of features of each sample.

References

Rennie, J. D., Shih, L., Teevan, J., & Karger, D. R. (2003). Tackling the poor assumptions of naive bayes text classifiers. In ICML (Vol. 3, pp. 616-623). https://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf

Examples

>>> import cupy as cp
>>> rng = cp.random.RandomState(1)
>>> X = rng.randint(5, size=(6, 100), dtype=cp.int32)
>>> Y = cp.array([1, 2, 3, 4, 4, 5])
>>> from cuml.naive_bayes import ComplementNB
>>> clf = ComplementNB()
>>> clf.fit(X, Y)
ComplementNB()
>>> print(clf.predict(X[2:3]))
[3]

class cuml.naive_bayes.GaussianNB(*, priors=None, var_smoothing=1e-09, output_type=None, handle=None, verbose=False)[source]#

Gaussian Naive Bayes (GaussianNB). Can perform online updates to model parameters via partial_fit(). For details on the algorithm used to update feature means and variances online, see Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque:

http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf
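
The pairwise update described in that report can be sketched in plain Python as follows. This is a minimal illustration for a single feature, assuming each batch is summarized by its count, mean, and sum of squared deviations; the helper names `batch_stats` and `merge_stats` are hypothetical and do not correspond to cuML functions.

```python
def batch_stats(values):
    """Count, mean and sum of squared deviations of one non-empty batch."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return n, mean, m2


def merge_stats(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    """Combine the summaries of two batches without revisiting the data."""
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


first = batch_stats([1.0, 2.0, 3.0])
second = batch_stats([10.0, 12.0])
n, mean, m2 = merge_stats(*first, *second)
variance = m2 / n  # population variance over all five values
print(n, mean, variance)
```

Merging summaries this way gives the same mean and variance as a single pass over the concatenated data, which is what makes incremental fitting possible.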

Parameters:

priorsarray-like of shape (n_classes,): Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
var_smoothingfloat, default=1e-9: Portion of the largest variance of all features that is added to variances for calculation stability.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
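
To make the var_smoothing description concrete, here is a small plain-Python sketch. It assumes the smoothing term is simply var_smoothing multiplied by the largest per-feature variance, as the parameter description states; the function name `smoothed_variances` is hypothetical.

```python
def smoothed_variances(variances, var_smoothing=1e-9):
    """Add var_smoothing * max(variances) to every per-feature variance."""
    epsilon = var_smoothing * max(variances)
    return [v + epsilon for v in variances]


variances = [4.0, 0.0, 1.0]  # the second feature is constant
print(smoothed_variances(variances))
```

The constant feature ends up with a small positive variance, so the Gaussian likelihood never divides by zero.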

Methods

`fit`(X, y[, sample_weight])	Fit Gaussian Naive Bayes classifier according to X, y
`partial_fit`(X, y[, classes, sample_weight])	Incremental fit on a batch of samples.

Examples

>>> import cupy as cp
>>> X = cp.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1],
...                 [3, 2]], cp.float32)
>>> Y = cp.array([1, 1, 1, 2, 2, 2], cp.float32)
>>> from cuml.naive_bayes import GaussianNB
>>> clf = GaussianNB()
>>> clf.fit(X, Y)
GaussianNB()
>>> print(clf.predict(cp.array([[-0.8, -1]], cp.float32)))
[1]
>>> clf_pf = GaussianNB()
>>> clf_pf.partial_fit(X, Y, cp.unique(Y))
GaussianNB()
>>> print(clf_pf.predict(cp.array([[-0.8, -1]], cp.float32)))
[1]

fit(X, y, sample_weight=None) → GaussianNB[source]#

Fit Gaussian Naive Bayes classifier according to X, y

Parameters:

X{array-like, cupy sparse matrix} of shape (n_samples, n_features): Training vectors, where n_samples is the number of samples and n_features is the number of features.
yarray-like of shape (n_samples,): Target values.
sample_weightarray-like of shape (n_samples): Weights applied to individual samples (1. for unweighted). Currently sample weight is ignored.

partial_fit(X, y, classes=None, sample_weight=None) → GaussianNB[source]#

Incremental fit on a batch of samples. This method is expected to be called several times consecutively on different chunks of a dataset so as to implement out-of-core or online learning. This is especially useful when the whole dataset is too big to fit in memory at once. Because each call carries some performance overhead, it is better to call partial_fit on chunks of data that are as large as possible (while still fitting within the memory budget) to amortize that overhead.

Parameters:

X{array-like, cupy sparse matrix} of shape (n_samples, n_features): Training vectors, where n_samples is the number of samples and n_features is the number of features. A sparse matrix in COO format is preferred, other formats will go through a conversion to COO.
yarray-like of shape (n_samples,): Target values.
classesarray-like of shape (n_classes): List of all the classes that can possibly appear in the y vector. Must be provided at the first call to partial_fit, can be omitted in subsequent calls.
sample_weightarray-like of shape (n_samples): Weights applied to individual samples (1. for unweighted). Currently sample weight is ignored.

Returns:

selfobject

class cuml.naive_bayes.CategoricalNB(*, alpha=1.0, fit_prior=True, class_prior=None, output_type=None, handle=None, verbose=False)[source]#

Naive Bayes classifier for categorical features. The categorical Naive Bayes classifier is suitable for classification with discrete features that are categorically distributed. The categories of each feature are drawn from a categorical distribution.

Parameters:

alphafloat, default=1.0: Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
fit_priorbool, default=True: Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
class_priorarray-like of shape (n_classes,), default=None: Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Attributes:

category_count_ndarray of shape (n_features, n_classes, n_categories): With n_categories being the highest category of all the features. This array provides the number of samples encountered for each feature, class and category of the specific feature.
class_count_ndarray of shape (n_classes,): Number of samples encountered for each class during fitting.
class_log_prior_ndarray of shape (n_classes,): Smoothed empirical log probability for each class.
classes_ndarray of shape (n_classes,): Class labels known to the classifier
feature_log_prob_ndarray of shape (n_features, n_classes, n_categories): With n_categories being the highest category of all the features. Each array of shape (n_classes, n_categories) provides the empirical log probability of categories given the respective feature and class, P(x_i|y). This attribute is not available when the model has been trained with sparse data.
n_features_int: Number of features of each sample.

Methods

`fit`(X, y[, sample_weight])	Fit Naive Bayes classifier according to X, y
`partial_fit`(X, y[, classes, sample_weight])	Incremental fit on a batch of samples.

Examples

>>> import cupy as cp
>>> rng = cp.random.RandomState(1)
>>> X = rng.randint(5, size=(6, 100), dtype=cp.int32)
>>> y = cp.array([1, 2, 3, 4, 5, 6])
>>> from cuml.naive_bayes import CategoricalNB
>>> clf = CategoricalNB()
>>> clf.fit(X, y)
CategoricalNB()
>>> print(clf.predict(X[2:3]))
[3]

fit(X, y, sample_weight=None) → CategoricalNB[source]#

Fit Naive Bayes classifier according to X, y

Parameters:

Xarray-like of shape (n_samples, n_features): Training vectors, where n_samples is the number of samples and n_features is the number of features. Here, each feature of X is assumed to be from a different categorical distribution. It is further assumed that all categories of each feature are represented by the numbers 0, …, n - 1, where n refers to the total number of categories for the given feature. This can, for instance, be achieved with the help of OrdinalEncoder.
yarray-like of shape (n_samples,): Target values.
sample_weightarray-like of shape (n_samples), default=None: Weights applied to individual samples (1. for unweighted). Currently sample weight is ignored.

Returns:

selfobject
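
The integer encoding that fit expects for each feature of X (categories numbered 0, …, n - 1) can be sketched in plain Python. The helper `ordinal_encode` below is a hypothetical stand-in for an ordinal encoder applied to a single column, and it assumes that sorting the distinct values is an acceptable way to assign codes.

```python
def ordinal_encode(column):
    """Map each distinct value of one feature to an integer 0..n-1."""
    categories = sorted(set(column))
    index = {value: code for code, value in enumerate(categories)}
    return [index[value] for value in column], categories


colors = ["red", "green", "red", "blue"]
codes, categories = ordinal_encode(colors)
print(codes)       # [2, 1, 2, 0]
print(categories)  # ['blue', 'green', 'red']
```

Each feature is encoded independently, so the same integer can stand for different categories in different columns.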

partial_fit(X, y, classes=None, sample_weight=None) → CategoricalNB[source]#

Parameters:

Xarray-like of shape (n_samples, n_features): Training vectors, where n_samples is the number of samples and n_features is the number of features. Here, each feature of X is assumed to be from a different categorical distribution. It is further assumed that all categories of each feature are represented by the numbers 0, …, n - 1, where n refers to the total number of categories for the given feature. This can, for instance, be achieved with the help of OrdinalEncoder.
yarray-like of shape (n_samples): Target values.
classesarray-like of shape (n_classes), default=None: List of all the classes that can possibly appear in the y vector. Must be provided at the first call to partial_fit, can be omitted in subsequent calls.
sample_weightarray-like of shape (n_samples), default=None: Weights applied to individual samples (1. for unweighted). Currently sample weight is ignored.

Returns:

selfobject

Stochastic Gradient Descent#

class cuml.SGD(*, loss='squared_loss', penalty=None, alpha=0.0001, l1_ratio=0.15, fit_intercept=True, epochs=1000, tol=0.001, shuffle=True, learning_rate='constant', eta0=0.001, power_t=0.5, batch_size=32, n_iter_no_change=5, handle=None, output_type=None, verbose=False)#

Stochastic Gradient Descent is a very common machine learning algorithm where one optimizes some cost function via gradient steps. This makes SGD very attractive for large problems when the exact solution is hard or even impossible to find.

cuML’s SGD algorithm accepts a numpy matrix or a cuDF DataFrame as the input dataset. The SGD algorithm currently works with linear regression, ridge regression and SVM models.

Parameters:

loss{‘hinge’, ‘log’, ‘squared_loss’} (default = ‘squared_loss’)

‘hinge’: uses linear SVM
‘log’: uses logistic regression
‘squared_loss’: uses linear regression

penalty{‘l1’, ‘l2’, ‘elasticnet’, None} (default = None)

The penalty (aka regularization term) to apply.

‘l1’: L1 norm (Lasso) regularization
‘l2’: L2 norm (Ridge) regularization
‘elasticnet’: Elastic Net regularization, a weighted average of L1 and L2
None: no penalty is added (the default)
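
The four penalty options can be illustrated with a short plain-Python sketch. It assumes one common convention in which the elastic net term is an l1_ratio-weighted average of the L1 norm and the squared L2 norm, scaled by alpha. Constant factors (such as 0.5 on the L2 term) differ between libraries, so this is not necessarily the exact expression cuML evaluates, and `penalty_term` is a hypothetical name.

```python
def penalty_term(weights, penalty=None, alpha=0.0001, l1_ratio=0.15):
    """Regularization term added to the loss for a given weight vector."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    if penalty is None:
        return 0.0
    if penalty == "l1":
        return alpha * l1
    if penalty == "l2":
        return alpha * l2
    if penalty == "elasticnet":
        return alpha * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)
    raise ValueError(f"unknown penalty: {penalty!r}")


weights = [1.0, -2.0]
for name in (None, "l1", "l2", "elasticnet"):
    print(name, penalty_term(weights, penalty=name, alpha=0.1))
```

Under this convention, l1_ratio=1.0 reduces elastic net to the L1 penalty and l1_ratio=0.0 reduces it to the L2 penalty.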

alphafloat (default = 0.0001)

The constant value which decides the degree of regularization

fit_interceptboolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

epochsint (default = 1000)

The number of times the model should iterate through the entire dataset during training.

tolfloat (default = 1e-3)

The training process will stop if current_loss > previous_loss - tol
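
The stopping rule stated above can be written out as a small plain-Python check. The function name `should_stop` and the loss values are made up for illustration, and any interaction with n_iter_no_change is deliberately not modeled here.

```python
def should_stop(previous_loss, current_loss, tol=1e-3):
    """True when the loss failed to improve by more than tol."""
    return current_loss > previous_loss - tol


losses = [1.0, 0.5, 0.3, 0.2995, 0.2994]  # hypothetical per-epoch losses
for epoch in range(1, len(losses)):
    if should_stop(losses[epoch - 1], losses[epoch]):
        print("stopping after epoch", epoch)
        break
```

With tol=0.0, as used in the class example, this check only triggers when the loss actually increases.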

shuffleboolean (default = True)

True: shuffles the training data after each epoch
False: does not shuffle the training data after each epoch

eta0float (default = 0.001)

Initial learning rate

power_tfloat (default = 0.5)

The exponent used for calculating the invscaling learning rate

batch_sizeint (default=32)

The number of samples to use for each batch.

learning_rate{‘optimal’, ‘constant’, ‘invscaling’, ‘adaptive’} (default = ‘constant’)

‘optimal’: to be supported in the next version
‘constant’: keeps the learning rate constant
‘adaptive’: changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs. The old learning rate is generally divided by 5.
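
The learning-rate schedules can be sketched in plain Python. The 'invscaling' formula below, eta0 / t**power_t, follows the convention used by scikit-learn's SGD estimators and is an assumption here, since this reference only says that power_t is the exponent. The 'adaptive' rule follows the description above. Both function names are hypothetical.

```python
def invscaling_rate(eta0, power_t, t):
    """eta0 / t**power_t for a step counter t >= 1 (assumed convention)."""
    return eta0 / (t ** power_t)


def adaptive_rate(current_rate, epochs_without_improvement, n_iter_no_change=5):
    """Divide the rate by 5 once n_iter_no_change epochs pass without improvement."""
    if epochs_without_improvement >= n_iter_no_change:
        return current_rate / 5.0
    return current_rate


for t in (1, 4, 16):
    print(t, invscaling_rate(eta0=0.001, power_t=0.5, t=t))
print(adaptive_rate(0.01, epochs_without_improvement=5))
```

With 'constant', the rate simply stays at eta0 for every step.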

n_iter_no_changeint (default = 5)

The number of epochs to train without any improvement in the model

handlecuml.Handle

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Attributes:

classes_
coef_

Methods

`fit`(self, X, y, *[, convert_dtype])	Fit the model with X and y.
`predict`(self, X, *[, convert_dtype])	Predicts the y for X.
`predictClass`(self, X[, convert_dtype])	Predicts the y for X.

Examples

>>> import numpy as np
>>> import cudf
>>> from cuml.solvers import SGD as cumlSGD
>>> X = cudf.DataFrame()
>>> X['col1'] = np.array([1,1,2,2], dtype=np.float32)
>>> X['col2'] = np.array([1,2,2,3], dtype=np.float32)
>>> y = cudf.Series(np.array([1, 1, 2, 2], dtype=np.float32))
>>> pred_data = cudf.DataFrame()
>>> pred_data['col1'] = np.asarray([3, 2], dtype=np.float32)
>>> pred_data['col2'] = np.asarray([5, 5], dtype=np.float32)
>>> cu_sgd = cumlSGD(learning_rate='constant', eta0=0.005, epochs=2000,
...                  fit_intercept=True, batch_size=2,
...                  tol=0.0, penalty=None, loss='squared_loss')
>>> cu_sgd.fit(X, y)
SGD()
>>> cu_pred = cu_sgd.predict(pred_data).to_numpy()
>>> print(" cuML intercept : ", cu_sgd.intercept_)
cuML intercept :  0.00418...
>>> print(" cuML coef : ", cu_sgd.coef_)
cuML coef :  0      0.9841...
1      0.0097...
dtype: float32
>>> print("cuML predictions : ", cu_pred)
cuML predictions :  [3.0055...  2.0214...]

fit(self, X, y, *, convert_dtype=True) → 'SGD'[source]#

Fit the model with X and y.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

predict(self, X, *, convert_dtype=True) → CumlArray[source]#

Predicts the y for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

predictClass(self, X, convert_dtype=True) → CumlArray[source]#

Predicts the y for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the predictClass method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

Random Forest#

class cuml.ensemble.RandomForestClassifier(*, split_criterion='gini', handle=None, verbose=False, output_type=None, **kwargs)[source]#

Implements a Random Forest classifier model which fits multiple decision tree classifiers in an ensemble.

Note

Note that the underlying algorithm for tree node splits differs from that used in scikit-learn. By default, the cuML Random Forest uses a quantile-based algorithm to determine splits, rather than an exact count. You can tune the size of the quantiles with the n_bins parameter.

Note

You can export cuML Random Forest models and run predictions with them on machines without an NVIDIA GPU. See https://docs.rapids.ai/api/cuml/nightly/pickling_cuml_models.html for more details.

Parameters:

n_estimatorsint (default = 100)

Number of trees in the forest. (Default changed to 100 in cuML 0.11)

split_criterionstr or int (default = 'gini')

The criterion used to split nodes.

'gini' or 0 for gini impurity

'entropy' or 1 for information gain (entropy)
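
For reference, the two impurity measures named above can be computed for a node's labels with a few lines of plain Python. This is the textbook definition, with entropy measured in bits (base-2 logarithm, an assumption about units), not cuML's GPU implementation.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """1 - sum(p_k^2) over the class proportions p_k in the node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """-sum(p_k * log2(p_k)) over the class proportions p_k in the node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


mixed = [0, 0, 1, 1]
print(gini_impurity(mixed))  # 0.5
print(entropy(mixed))        # 1.0
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, so a split is preferred when it lowers the weighted impurity of the children.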

bootstrapboolean (default = True)

Control bootstrapping.

If True, each tree in the forest is built on a bootstrapped sample with replacement.

If False, the whole dataset is used to build each tree.

max_samplesfloat (default = 1.0)

Ratio of dataset rows used while fitting each tree.
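
How bootstrap and max_samples interact can be sketched in plain Python. The sketch assumes that the number of rows drawn is max_samples * n_rows truncated to an integer, which is a guess about rounding, and `sample_rows` is a hypothetical helper rather than part of cuML.

```python
import random


def sample_rows(n_rows, bootstrap=True, max_samples=1.0, seed=None):
    """Return the row indices one tree would be trained on."""
    if not bootstrap:
        return list(range(n_rows))  # whole dataset, no resampling
    rng = random.Random(seed)
    n_draw = max(1, int(max_samples * n_rows))  # rounding is an assumption
    # Sampling with replacement: the same row may be drawn more than once.
    return [rng.randrange(n_rows) for _ in range(n_draw)]


print(sample_rows(10, bootstrap=True, max_samples=0.5, seed=0))
print(sample_rows(5, bootstrap=False))
```

Because rows are drawn with replacement, each tree sees a slightly different dataset, which is what decorrelates the trees in the ensemble.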

max_depthint (default = 16)

Maximum tree depth. Must be greater than 0. Unlimited depth (i.e., until leaves are pure) is not supported.

Note

This default differs from scikit-learn’s random forest, which defaults to unlimited depth.

max_leavesint (default = -1)

Maximum leaf nodes per tree. Soft constraint. Unlimited if -1.

max_features{‘sqrt’, ‘log2’, None}, int or float (default = ‘sqrt’)

The number of features to consider per node split:

If an int then max_features is the absolute count of features to be used.
If a float then max_features is used as a fraction.
If 'sqrt' then max_features=1/sqrt(n_features).
If 'log2' then max_features=log2(n_features)/n_features.
If None then max_features=n_features

Changed in version 24.06: The default of max_features changed from "auto" to "sqrt".
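
The max_features rules listed above can be expressed as a small plain-Python function that returns the fraction of features considered per split. The function name `max_features_fraction` is hypothetical, and how that fraction is rounded to a whole number of features is left out because it is not specified here.

```python
import math


def max_features_fraction(max_features, n_features):
    """Fraction of features considered at each node split."""
    if max_features is None:
        return 1.0
    if max_features == "sqrt":
        return 1.0 / math.sqrt(n_features)
    if max_features == "log2":
        return math.log2(n_features) / n_features
    if isinstance(max_features, int):
        return max_features / n_features  # absolute count
    if isinstance(max_features, float):
        return max_features               # already a fraction
    raise ValueError(f"unsupported max_features: {max_features!r}")


for option in ("sqrt", "log2", 4, 0.5, None):
    print(option, max_features_fraction(option, n_features=16))
```

With 16 features, 'sqrt', 'log2' and the integer 4 all resolve to a fraction of 0.25, i.e. four features per split.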

n_binsint (default = 128)

Maximum number of bins used by the split algorithm per feature. For large problems, particularly those with highly-skewed input data, increasing the number of bins may improve accuracy.

n_streamsint (default = 4)

Number of parallel streams used for forest building.

min_samples_leafint or float (default = 1)

The minimum number of samples (rows) in each leaf node.

If type int, then min_samples_leaf represents the minimum number.

If float, then min_samples_leaf represents a fraction and ceil(min_samples_leaf * n_rows) is the minimum number of samples for each leaf node.

min_samples_splitint or float (default = 2)

The minimum number of samples required to split an internal node.

If type int, then min_samples_split represents the minimum number.

If type float, then min_samples_split represents a fraction and max(2, ceil(min_samples_split * n_rows)) is the minimum number of samples for each split.
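
The int-versus-float handling of min_samples_leaf and min_samples_split can be written out directly from the formulas above. The helper names are hypothetical.

```python
import math


def min_leaf_count(min_samples_leaf, n_rows):
    """Minimum number of rows allowed in a leaf node."""
    if isinstance(min_samples_leaf, int):
        return min_samples_leaf
    return math.ceil(min_samples_leaf * n_rows)


def min_split_count(min_samples_split, n_rows):
    """Minimum number of rows a node needs before it may be split."""
    if isinstance(min_samples_split, int):
        return min_samples_split
    return max(2, math.ceil(min_samples_split * n_rows))


print(min_leaf_count(0.25, n_rows=10))    # ceil(2.5) = 3
print(min_split_count(0.125, n_rows=8))   # max(2, ceil(1.0)) = 2
```

The max(2, ...) guard ensures that a node is never split unless it holds at least two rows, however small the fraction is.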

min_impurity_decreasefloat (default = 0.0)

Minimum decrease in impurity required for node to be split.

max_batch_sizeint (default = 4096)

Maximum number of nodes that can be processed in a given batch.

random_stateint (default = None)

Seed for the random number generator. Unseeded by default.

handlecuml.Handle

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Attributes:

classes_

Methods

`fit`(X, y, *[, convert_dtype])	Perform Random Forest Classification on the input data
`predict`(X, *[, threshold, convert_dtype, ...])	Predicts the labels for X.
`predict_proba`(X, *[, convert_dtype, layout, ...])	Predicts class probabilities for X.
`score`(X, y, *[, threshold, convert_dtype, ...])	Calculates the accuracy metric score of the model for X.

Notes

While training the model for multi class classification problems, using deep trees or max_features=1.0 provides better performance.

For additional docs, see scikit-learn’s RandomForestClassifier.

Examples

>>> import cupy as cp
>>> from cuml.ensemble import RandomForestClassifier as cuRFC

>>> X = cp.random.normal(size=(10,4)).astype(cp.float32)
>>> y = cp.asarray([0,1]*5, dtype=cp.int32)

>>> cuml_model = cuRFC(max_features=1.0,
...                    n_bins=8,
...                    n_estimators=40)
>>> cuml_model.fit(X,y)
RandomForestClassifier()
>>> cuml_predict = cuml_model.predict(X)

>>> print("Predicted labels : ", cuml_predict)
Predicted labels :  [0. 1. 0. 1. 0. 1. 0. 1. 0. 1.]

fit(X, y, *, convert_dtype=True) → RandomForestClassifier[source]#

Perform Random Forest Classification on the input data

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix of type np.int32. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the method will automatically convert the inputs to np.float32.
dtypebool, optional (default = True): When set to True, the fit method will, when necessary, convert y to be of dtype int32. This will increase memory used for the method.

predict(X, *, threshold=0.5, convert_dtype=True, layout='depth_first', default_chunk_size=None, align_bytes=None, predict_model='deprecated') → CumlArray[source]#

Predicts the labels for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
thresholdfloat (default = 0.5): Threshold used for classification.
convert_dtypebool (default = True): When True, automatically convert the input to the data type used to train the model. This may increase memory usage.
layoutstring (default = ‘depth_first’): Forest layout for GPU inference. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.
default_chunk_sizeint, optional (default = None): Controls batch subdivision for parallel processing. Optimal value depends on hardware, model and batch size. If None, determined automatically.
align_bytesint, optional (default = None): If specified, trees will be padded to this byte alignment, which can improve performance. Typical values are 0 or 128 on GPU.
predict_modelstring (default = ‘deprecated’): Deprecated since version 25.10: predict_model is deprecated (and ignored) and will be removed in 25.12. To infer on CPU use model.as_fil to get a FIL instance which may then be used to perform inference on both CPU and GPU.

Returns:

ycuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

predict_proba(X, *, convert_dtype=True, layout='depth_first', default_chunk_size=None, align_bytes=None) → CumlArray[source]#

Predicts class probabilities for X. This function uses the GPU implementation of predict.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool (default = True): When True, automatically convert the input to the data type used to train the model. This may increase memory usage.
layoutstring (default = ‘depth_first’): Specifies the in-memory layout of nodes in FIL forests. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.
default_chunk_sizeint, optional (default = None): Determines how batches are further subdivided for parallel processing. The optimal value depends on hardware, model, and batch size. If None, will be automatically determined.
align_bytesint, optional (default = None): If specified, trees will be padded such that their in-memory size is a multiple of this value. This can improve performance by guaranteeing that memory reads from trees begin on a cache line boundary. Typical values are 0 or 128 on GPU and 0 or 64 on CPU.
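
The padding that align_bytes describes amounts to rounding a size up to the next multiple of the alignment. The plain-Python sketch below assumes that None or 0 means "no padding"; `padded_size` is a hypothetical helper, not a cuML function.

```python
def padded_size(size_bytes, align_bytes=None):
    """Round size_bytes up to the next multiple of align_bytes."""
    if not align_bytes:  # None or 0: leave the size unchanged (assumption)
        return size_bytes
    return -(-size_bytes // align_bytes) * align_bytes  # ceiling division


print(padded_size(200, 128))  # 256
print(padded_size(256, 128))  # 256
print(padded_size(200, 0))    # 200
```

When every tree occupies a multiple of the alignment, each tree starts on an aligned boundary as long as the first one does.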

Returns:

ycuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

score(X, y, *, threshold=0.5, convert_dtype=True, layout='depth_first', default_chunk_size=None, align_bytes=None, predict_model='deprecated')[source]#

Calculates the accuracy metric score of the model for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix of type np.int32. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
thresholdfloat (default = 0.5): Threshold used for classification predictions
convert_dtypebool (default = True): When True, automatically convert the input to the data type used to train the model. This may increase memory usage.
layoutstring (default = ‘depth_first’): Specifies the in-memory layout of nodes in FIL forests. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.
default_chunk_sizeint, optional (default = None): Determines how batches are further subdivided for parallel processing. The optimal value depends on hardware, model, and batch size. If None, will be automatically determined.
align_bytesint, optional (default = None): If specified, trees will be padded such that their in-memory size is a multiple of this value. This can improve performance by guaranteeing that memory reads from trees begin on a cache line boundary. Typical values are 0 or 128 on GPU and 0 or 64 on CPU.
predict_modelstring (default = ‘deprecated’): Deprecated since version 25.10: predict_model is deprecated (and ignored) and will be removed in 25.12. To infer on CPU use model.as_fil to get a FIL instance which may then be used to perform inference on both CPU and GPU.

Returns:

accuracyfloat: Accuracy of the model [0.0 - 1.0]

class cuml.ensemble.RandomForestRegressor(*, split_criterion='mse', max_features=1.0, accuracy_metric='deprecated', handle=None, verbose=False, output_type=None, **kwargs)[source]#

Implements a Random Forest regressor model which fits multiple decision trees in an ensemble.

Note

You can export cuML Random Forest models and run predictions with them on machines without an NVIDIA GPU. See https://docs.rapids.ai/api/cuml/nightly/pickling_cuml_models.html for more details.

Parameters:

n_estimatorsint (default = 100)

Number of trees in the forest. (Default changed to 100 in cuML 0.11)

split_criterionstr or int (default = 'mse')

The criterion used to split nodes.

'mse' or 2 for mean squared error

'poisson' or 4 for poisson half deviance

'gamma' or 5 for gamma half deviance

'inverse_gaussian' or 6 for inverse gaussian deviance

bootstrapboolean (default = True)

Control bootstrapping.

If True, each tree in the forest is built on a bootstrapped sample with replacement.

If False, the whole dataset is used to build each tree.

max_samplesfloat (default = 1.0)

Ratio of dataset rows used while fitting each tree.

max_depthint (default = 16)

Maximum tree depth. Must be greater than 0. Unlimited depth (i.e., until leaves are pure) is not supported.

Note

This default differs from scikit-learn’s random forest, which defaults to unlimited depth.

max_leavesint (default = -1)

Maximum leaf nodes per tree. Soft constraint. Unlimited if -1.

max_features{‘sqrt’, ‘log2’, None}, int or float (default = 1.0)

The number of features to consider per node split:

If an int then max_features is the absolute count of features to be used.
If a float then max_features is used as a fraction of the total number of features.
If 'sqrt' then max_features=1/sqrt(n_features).
If 'log2' then max_features=log2(n_features)/n_features.
If None then max_features=n_features.

Changed in version 24.06: The default of max_features changed from "auto" to 1.0.
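
The rules above can be read as mapping every accepted value onto a fraction of the available features. The sketch below follows that reading; feature_fraction is a hypothetical helper, and how cuML rounds the resulting fraction to a whole number of features is not covered here.

```python
import math


def feature_fraction(max_features, n_features):
    """Fraction of features considered per split (illustrative only)."""
    if max_features is None:
        return 1.0  # all features
    if max_features == "sqrt":
        return 1.0 / math.sqrt(n_features)
    if max_features == "log2":
        return math.log2(n_features) / n_features
    if isinstance(max_features, int):
        return max_features / n_features  # absolute count
    if isinstance(max_features, float):
        return max_features  # already a fraction
    raise ValueError(f"unsupported max_features: {max_features!r}")
```

With n_features=16, both 'sqrt' and 'log2' give a fraction of 0.25, i.e. 4 of the 16 features.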

n_binsint (default = 128)

Maximum number of bins used by the split algorithm per feature. For large problems, particularly those with highly-skewed input data, increasing the number of bins may improve accuracy.

n_streamsint (default = 4)

Number of parallel streams used for forest building.

min_samples_leafint or float (default = 1)

The minimum number of samples (rows) in each leaf node.

If type int, then min_samples_leaf represents the minimum number.

If float, then min_samples_leaf represents a fraction and ceil(min_samples_leaf * n_rows) is the minimum number of samples for each leaf node.

min_samples_splitint or float (default = 2)

The minimum number of samples required to split an internal node.

If type int, then min_samples_split represents the minimum number.

If type float, then min_samples_split represents a fraction and max(2, ceil(min_samples_split * n_rows)) is the minimum number of samples for each split.
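
The integer and fractional forms of min_samples_leaf and min_samples_split resolve to sample counts as follows. This is a minimal sketch of the two formulas quoted above; the function names are hypothetical and are not part of cuML.

```python
import math


def resolve_min_samples_leaf(min_samples_leaf, n_rows):
    """Minimum samples per leaf: an int is used as-is, a float is a fraction."""
    if isinstance(min_samples_leaf, int):
        return min_samples_leaf
    return math.ceil(min_samples_leaf * n_rows)


def resolve_min_samples_split(min_samples_split, n_rows):
    """Minimum samples to split a node; the fractional form never drops below 2."""
    if isinstance(min_samples_split, int):
        return min_samples_split
    return max(2, math.ceil(min_samples_split * n_rows))
```

For example, with n_rows=10 a min_samples_leaf of 0.25 resolves to ceil(2.5) = 3 samples per leaf.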

min_impurity_decreasefloat (default = 0.0)

The minimum decrease in impurity required for a node to be split.

accuracy_metricstring (default = ‘deprecated’)

Decides the metric used to evaluate the performance of the model.

for r-squared : 'r2' (default)
for median of abs error : 'median_ae'
for mean of abs error : 'mean_ae'
for mean square error : 'mse'

Deprecated since version 25.10: accuracy_metric is deprecated and will be removed in 25.12. To evaluate models with metrics other than r2, please call the respective metric function from cuml.metrics directly.
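
For reference, the four metrics named above have the following textbook definitions. The sketch uses only the Python standard library so the arithmetic is easy to follow; it is not the cuml.metrics implementation, and the function names are chosen for this example.

```python
from statistics import mean, median


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mse(y_true, y_pred):
    """Mean of squared errors."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mean_ae(y_true, y_pred):
    """Mean of absolute errors."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def median_ae(y_true, y_pred):
    """Median of absolute errors."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))
```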

max_batch_sizeint (default = 4096)

Maximum number of nodes that can be processed in a given batch.

random_stateint (default = None)

Seed for the random number generator. Unseeded by default.

handlecuml.Handle

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Methods

`fit`(X, y, *[, convert_dtype])	Perform Random Forest Regression on the input data
`predict`(X, *[, convert_dtype, layout, ...])	Predicts the values for X.
`score`(X, y, *[, convert_dtype, layout, ...])	Calculates the accuracy metric score of the model for X.

Notes

For additional docs, see scikit-learn’s RandomForestRegressor.

Examples

>>> import cupy as cp
>>> from cuml.ensemble import RandomForestRegressor as curfr
>>> X = cp.asarray([[0,10],[0,20],[0,30],[0,40]], dtype=cp.float32)
>>> y = cp.asarray([0.0,1.0,2.0,3.0], dtype=cp.float32)
>>> cuml_model = curfr(max_features=1.0, n_bins=128,
...                    min_samples_leaf=1,
...                    min_samples_split=2,
...                    n_estimators=40)
>>> cuml_model.fit(X,y)
RandomForestRegressor()
>>> cuml_score = cuml_model.score(X,y)
>>> print("R2 score of cuml : ", cuml_score)
R2 score of cuml :  0.9076250195503235

fit(X, y, *, convert_dtype=True) → RandomForestRegressor[source]#

Perform Random Forest Regression on the input data

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

predict(X, *, convert_dtype=True, layout='depth_first', default_chunk_size=None, align_bytes=None, predict_model='deprecated') → CumlArray[source]#

Predicts the values for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
layoutstring (default = ‘depth_first’): Specifies the in-memory layout of nodes in FIL forests. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.
default_chunk_sizeint, optional (default = None): Determines how batches are further subdivided for parallel processing. The optimal value depends on hardware, model, and batch size. If None, will be automatically determined.
align_bytesint, optional (default = None): If specified, trees will be padded such that their in-memory size is a multiple of this value. This can improve performance by guaranteeing that memory reads from trees begin on a cache line boundary. Typical values are 0 or 128.
predict_modelstring (default = ‘deprecated’): Deprecated since version 25.10: predict_model is deprecated (and ignored) and will be removed in 25.12. To infer on CPU use model.as_fil to get a FIL instance which may then be used to perform inference on both CPU and GPU.

Returns:

ycuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

score(X, y, *, convert_dtype=True, layout='depth_first', default_chunk_size=None, align_bytes=None, predict_model='deprecated')[source]#

Calculates the accuracy metric score of the model for X. In the 0.16 release, the default scoring metric was changed from mean squared error to r-squared.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool (default = True): When True, automatically convert the input to the data type used to train the model. This may increase memory usage.
layoutstring (default = ‘depth_first’): Specifies the in-memory layout of nodes in FIL forests. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.
default_chunk_sizeint, optional (default = None): Determines how batches are further subdivided for parallel processing. The optimal value depends on hardware, model, and batch size. If None, will be automatically determined.
align_bytesint, optional (default = None): If specified, trees will be padded such that their in-memory size is a multiple of this value. This can improve performance by guaranteeing that memory reads from trees begin on a cache line boundary. Typical values are 0 or 128.
predict_modelstring (default = ‘deprecated’): Deprecated since version 25.10: predict_model is deprecated (and ignored) and will be removed in 25.12. To infer on CPU use model.as_fil to get a FIL instance which may then be used to perform inference on both CPU and GPU.

Returns:

mean_square_errorfloat or
median_abs_errorfloat or
mean_abs_errorfloat

Forest Inferencing#

class cuml.ForestInference(*, treelite_model=None, handle=None, output_type=None, verbose=False, is_classifier=False, layout='depth_first', default_chunk_size=None, align_bytes=None, precision='single', device_id=None)#

ForestInference provides accelerated inference for forest models on both CPU and GPU.

Performance Tuning

FIL offers a number of hyperparameters that can be tuned to obtain optimal performance for a given model, hardware, and batch size. The easiest way to optimize these parameters is using the automated optimize method, which will find the optimum for an indicated batch size. For some use cases, manual adjustment of these parameters is preferred, so available performance hyperparameters are described in detail below.

To obtain optimal performance with this implementation of FIL, the single most important value is the chunk_size parameter passed to the predict method. Essentially, chunk_size determines how many rows to evaluate together at once from a single batch. Larger values reduce global memory accesses on GPU and cache misses on CPU, but smaller values allow for finer-grained parallelism, improving usage of available processing power. The optimal value for this parameter is hard to predict a priori, but in general larger batch sizes benefit from larger chunk sizes and smaller batch sizes benefit from smaller chunk sizes. Having a chunk size larger than the batch size is never optimal.

To determine the optimal chunk size on GPU, test powers of 2 from 1 to 32. Values above 32 and values which are not powers of 2 are not supported.

To determine the optimal chunk size on CPU, test powers of 2 from 1 to 512. Values above 512 are supported, but RAPIDS developers have not yet seen a case where they yield improved performance.
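
The search space described above can be enumerated with a few lines of plain Python. The sketch below lists the powers of 2 worth testing for a given device type and batch size, capping at the batch size because a larger chunk is never optimal, and shows how a batch breaks down into chunks. Both helpers are hypothetical and do not call FIL.

```python
def candidate_chunk_sizes(device_type, batch_size):
    """Powers of 2 to benchmark: up to 32 on GPU, up to 512 on CPU."""
    limit = 32 if device_type == "gpu" else 512
    limit = min(limit, batch_size)  # chunks larger than the batch never help
    sizes = []
    size = 1
    while size <= limit:
        sizes.append(size)
        size *= 2
    return sizes


def split_into_chunks(n_rows, chunk_size):
    """Return (start, stop) row ranges covering a batch of n_rows."""
    return [
        (start, min(start + chunk_size, n_rows))
        for start in range(0, n_rows, chunk_size)
    ]
```

Each candidate would then be timed with realistic batches, which is what the automated optimize method does on your behalf.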

After chunk size, the most important performance parameter is layout, also described below. Testing available layouts is recommended to optimize performance, but the impact is likely to be substantially less than optimizing chunk_size. There is no universal rule for predicting which layout will produce the best performance. On both GPU and CPU, the depth_first layout can improve performance by increasing cache hits during tree traversal. This tends to be the strongest effect for most use cases, so depth_first is used as the default value.

align_bytes is the final performance parameter. This parameter allows trees to be padded with empty nodes until their total in-memory size is a multiple of the given value. In general, if a non-default value is used, it should either be 0 or the cache line byte size for the device being used for execution (64 for CPU or 128 for GPU). If left unpadded, forest data remains more compact in memory, which can improve the frequency of cache hits. On the other hand, padding to the size of the cache line ensures that trees begin on cache line boundaries. It is difficult to predict for any given model which effect will be the greater determinant of performance. If left at the default value of None, trees will be unpadded for GPU execution and padded to 64 bytes for CPU execution. This value has no effect for the layered layout, since trees in this layout overlap in memory.
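
To make the padding trade-off concrete, the sketch below computes the in-memory size of a tree after padding to a multiple of align_bytes, along with the defaults described above (unpadded on GPU, 64 bytes on CPU). The helper names and the byte counts in the usage note are illustrative assumptions, not FIL internals.

```python
def default_align_bytes(device_type):
    """Default used when align_bytes is None: 0 on GPU, 64 on CPU."""
    return 0 if device_type == "gpu" else 64


def padded_tree_bytes(tree_bytes, align_bytes):
    """Round a tree's size up to the next multiple of align_bytes."""
    if not align_bytes:
        return tree_bytes  # 0 means no padding
    # Ceiling division, then scale back up to a whole number of blocks.
    return -(-tree_bytes // align_bytes) * align_bytes
```

A hypothetical 200-byte tree stays at 200 bytes when unpadded but grows to 256 bytes with align_bytes=128, which is the compactness cost paid for starting every tree on a cache line boundary.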

Parameters:

treelite_modeltreelite.Model: The model to be used for inference. This can be trained with XGBoost, LightGBM, cuML, Scikit-Learn, or any other forest model framework so long as it can be loaded into a treelite.Model object (See https://treelite.readthedocs.io/en/latest/treelite-api.html).
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
layout{‘breadth_first’, ‘depth_first’, ‘layered’}, default=’depth_first’: The in-memory layout to be used during inference for nodes of the forest model. This parameter is available purely for runtime optimization. For performance-critical applications, it is recommended that each layout be tested with realistic batch sizes to determine the optimal value.
align_bytesint or None, default=None: Pad each tree with empty nodes until its in-memory size is a multiple of the given value. If None, use 0 for GPU and 64 for CPU.
precision{‘single’, ‘double’, None}, default=’single’: Use the given floating point precision for evaluating the model. If None, use the native precision of the model. Note that single-precision execution is substantially faster than double-precision execution, so double-precision is recommended only for models trained in double precision and when exact conformance between results from FIL and the original training framework is of paramount importance.
device_idint or None, default=None: For GPU execution, the device on which to load and execute this model. If set to None, use the currently active device. For CPU execution, this value is currently ignored.

Attributes:

align_bytes: ForestInference.align_bytes(self)
cpu_forest: ForestInference.cpu_forest(self)
device_id: ForestInference.device_id(self)
forest: ForestInference.forest(self)
gpu_forest: ForestInference.gpu_forest(self)
is_classifier: ForestInference.is_classifier(self)
layout: ForestInference.layout(self)
precision: ForestInference.precision(self)
treelite_model: ForestInference.treelite_model(self)

Methods

`apply`(self, X, *[, preds, chunk_size])	Output the ID of the leaf node for each tree.
`load`(cls, path, *[, is_classifier, ...])	Load a model into FIL from a serialized model file.
`load_from_sklearn`(cls, skl_model, *[, ...])	Load a Scikit-Learn forest model to FIL
`load_from_treelite_model`(cls, tl_model, *[, ...])	Load a Treelite model to FIL
`num_outputs`(self)
`num_trees`(self)
`optimize`(self, *[, data, batch_size, ...])	Find the optimal layout and chunk size for this model
`predict`(self, X, *[, preds, chunk_size, ...])	For classification models, predict the class for each row.
`predict_per_tree`(self, X, *[, preds, chunk_size])	Output prediction of each tree.
`predict_proba`(self, X, *[, preds, chunk_size])	Predict the class probabilities for each row in X.
`set_params`(self, **params)

property align_bytes#

apply(self, X, *, preds=None, chunk_size=None) → CumlArray[source]#

Output the ID of the leaf node for each tree.

Parameters:

X: The input data of shape Rows X Features. This can be a numpy array, cupy array, Pandas/cuDF Dataframe or any other array type accepted by cuML. FIL is optimized for C-major arrays (e.g. numpy/cupy arrays). Inputs whose datatype does not match the precision of the loaded model (float/double) will be converted to the correct datatype before inference. If this input is in a memory location that is inaccessible to the current device type (as set with e.g. the set_fil_device_type context manager), it will be copied to the correct location. This copy will be distributed across as many CUDA streams as are available in the stream pool of the model’s RAFT handle.
preds: If non-None, outputs will be written in-place to this array. Therefore, if given, this should be a C-major array of shape n_rows * n_trees with a datatype (float/double) corresponding to the precision of the model. If None, an output array of the correct shape and type will be allocated and returned.
chunk_sizeint: The number of rows to simultaneously process in one iteration of the inference algorithm. Batches are further broken down into “chunks” of this size when assigning available threads to tasks. The choice of chunk size can have a substantial impact on performance, but the optimal choice depends on model and hardware and is difficult to predict a priori. In general, larger batch sizes benefit from larger chunk sizes, and smaller batch sizes benefit from small chunk sizes. On GPU, valid values are powers of 2 from 1 to 32. On CPU, valid values are any power of 2, but little benefit is expected above a chunk size of 512.

property cpu_forest#: The underlying FIL forest model loaded in CPU-accessible memory

property device_id#

property forest#: The underlying FIL forest model loaded in memory compatible with the current global device_type setting

get_params(deep=True)[source]#: Returns a dict of all params owned by this class. If the child class has appropriately overridden the _get_param_names method and does not need anything beyond what this method already provides, it does not have to override this method.

property gpu_forest#: The underlying FIL forest model loaded in GPU-accessible memory

property is_classifier#

property layout#

classmethod load(cls, path, *, is_classifier=False, precision='single', model_type=None, output_type=None, verbose=False, default_chunk_size=None, align_bytes=None, layout='depth_first', device_id=0, handle=None)[source]#

Load a model into FIL from a serialized model file.

Parameters:

pathstr: The path to the serialized model file. This can be an XGBoost binary or JSON file, a LightGBM text file, or a Treelite checkpoint file. If the model_type parameter is not passed, an attempt will be made to load the file based on its extension.
is_classifierboolean, default=False: True for classification models, False for regressors
precision{‘single’, ‘double’, None}, default=’single’: Use the given floating point precision for evaluating the model. If None, use the native precision of the model. Note that single-precision execution is substantially faster than double-precision execution, so double-precision is recommended only for models trained in double precision and when exact conformance between results from FIL and the original training framework is of paramount importance.
model_type{‘xgboost_ubj’, ‘xgboost_json’, ‘xgboost’, ‘lightgbm’, ‘treelite_checkpoint’, None}, default=None: The serialization format for the model file. If None, a best-effort guess will be made based on the file extension.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
default_chunk_sizeint or None, default=None: If set, predict calls without a specified chunk size will use this default value.
align_bytesint or None, default=None: Pad each tree with empty nodes until its in-memory size is a multiple of the given value. If None, use 0 for GPU and 64 for CPU.
layout{‘breadth_first’, ‘depth_first’, ‘layered’}, default=’depth_first’: The in-memory layout to be used during inference for nodes of the forest model. This parameter is available purely for runtime optimization. For performance-critical applications, it is recommended that available layouts be tested with realistic batch sizes to determine the optimal value.
device_idint, default=0: For GPU execution, the device on which to load and execute this model. For CPU execution, this value is currently ignored.
handlepylibraft.common.handle or None: For GPU execution, the RAFT handle containing the stream or stream pool to use during loading and inference.

classmethod load_from_sklearn(cls, skl_model, *, is_classifier=False, precision='single', model_type=None, output_type=None, verbose=False, default_chunk_size=None, align_bytes=None, layout='depth_first', device_id=0, handle=None)[source]#

Load a Scikit-Learn forest model to FIL

Parameters:

skl_model: The Scikit-Learn forest model to load.
is_classifierboolean, default=False: True for classification models, False for regressors
precision{‘single’, ‘double’, None}, default=’single’: Use the given floating point precision for evaluating the model. If None, use the native precision of the model. Note that single-precision execution is substantially faster than double-precision execution, so double-precision is recommended only for models trained in double precision and when exact conformance between results from FIL and the original training framework is of paramount importance.
model_type{‘xgboost’, ‘xgboost_json’, ‘lightgbm’, ‘treelite_checkpoint’, None}, default=None: The serialization format for the model file. If None, a best-effort guess will be made based on the file extension.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
default_chunk_sizeint or None, default=None: If set, predict calls without a specified chunk size will use this default value.
align_bytesint or None, default=None: Pad each tree with empty nodes until its in-memory size is a multiple of the given value. If None, use 0 for GPU and 64 for CPU.
layout{‘breadth_first’, ‘depth_first’, ‘layered’}, default=’depth_first’: The in-memory layout to be used during inference for nodes of the forest model. This parameter is available purely for runtime optimization. For performance-critical applications, it is recommended that available layouts be tested with realistic batch sizes to determine the optimal value.
mem_type{‘device’, ‘host’, None}, default=’single’: The memory type to use for initially loading the model. If None, the current global memory type setting will be used. If the model is loaded with one memory type and inference is later requested with an incompatible device (e.g. device memory and CPU execution), the model will be lazily loaded to the correct location at that time. In general, it should not be necessary to set this parameter directly (rely instead on the set_fil_device_type context manager), but it can be a useful convenience for some hyperoptimization pipelines.
device_idint, default=0: For GPU execution, the device on which to load and execute this model. For CPU execution, this value is currently ignored.
handlepylibraft.common.handle or None: For GPU execution, the RAFT handle containing the stream or stream pool to use during loading and inference.

classmethod load_from_treelite_model(cls, tl_model, *, is_classifier=False, precision='single', model_type=None, output_type=None, verbose=False, default_chunk_size=None, align_bytes=None, layout='depth_first', device_id=0, handle=None)[source]#

Load a Treelite model to FIL

Parameters:

tl_modeltreelite.Model: The Treelite model to load.
is_classifierboolean, default=False: True for classification models, False for regressors
precision{‘single’, ‘double’, None}, default=’single’: Use the given floating point precision for evaluating the model. If None, use the native precision of the model. Note that single-precision execution is substantially faster than double-precision execution, so double-precision is recommended only for models trained in double precision and when exact conformance between results from FIL and the original training framework is of paramount importance.
model_type{‘xgboost’, ‘xgboost_json’, ‘lightgbm’, ‘treelite_checkpoint’, None}, default=None: The serialization format for the model file. If None, a best-effort guess will be made based on the file extension.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
default_chunk_sizeint or None, default=None: If set, predict calls without a specified chunk size will use this default value.
align_bytesint or None, default=None: Pad each tree with empty nodes until its in-memory size is a multiple of the given value. If None, use 0 for GPU and 64 for CPU.
layout{‘breadth_first’, ‘depth_first’, ‘layered’}, default=’depth_first’: The in-memory layout to be used during inference for nodes of the forest model. This parameter is available purely for runtime optimization. For performance-critical applications, it is recommended that available layouts be tested with realistic batch sizes to determine the optimal value.
mem_type{‘device’, ‘host’, None}, default=’single’: The memory type to use for initially loading the model. If None, the current global memory type setting will be used. If the model is loaded with one memory type and inference is later requested with an incompatible device (e.g. device memory and CPU execution), the model will be lazily loaded to the correct location at that time. In general, it should not be necessary to set this parameter directly (rely instead on the set_fil_device_type context manager), but it can be a useful convenience for some hyperoptimization pipelines.
device_idint, default=0: For GPU execution, the device on which to load and execute this model. For CPU execution, this value is currently ignored.
handlepylibraft.common.handle or None: For GPU execution, the RAFT handle containing the stream or stream pool to use during loading and inference.

num_outputs(self)[source]#

num_trees(self)[source]#

optimize(self, *, data=None, batch_size=1024, unique_batches=10, timeout=0.2, predict_method='predict', max_chunk_size=None, seed=0)[source]#

Find the optimal layout and chunk size for this model

The optimal value for layout and chunk size depends on the model, batch size, and available hardware. In order to get the most realistic performance distribution, example data can be provided. If it is not, random data will be generated based on the indicated batch size. After finding the optimal layout, the model will be reloaded if necessary. The optimal chunk size will be used to set the default chunk size used if none is passed to the predict call.

Parameters:

data: Example data of shape unique_batches x batch_size x features or batch_size x features, or None. If None, random data will be generated instead.
batch_sizeint: If example data is not provided, random data with this many rows per batch will be used.
unique_batchesint: The number of unique batches to generate if random data are used. Increasing this number decreases the chance that the optimal configuration will be skewed by a single batch with unusual performance characteristics.
timeoutfloat: Time in seconds to target for optimization. The optimization loop will be repeatedly run a number of times increasing in the sequence 1, 2, 5, 10, 20, 50, … until the time taken is at least the given value. Note that for very large batch sizes and large models, the total elapsed time may exceed this timeout; it is a soft target for elapsed time. Setting the timeout to zero will run through the indicated number of unique batches exactly once. Defaults to 0.2s.
predict_methodstr: If desired, optimization can occur over one of the prediction method variants (e.g. “predict_per_tree”) rather than the default predict method. To do so, pass the name of the method here.
max_chunk_sizeint or None: The maximum chunk size to explore during optimization. If not set, a value will be picked based on the current device type. Setting this to a lower value will reduce the optimization search time but may not result in optimal performance.
seedint: The random seed used for generating example data if none is provided.
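
The repetition schedule described for timeout (1, 2, 5, 10, 20, 50, …) can be generated as shown below. The second helper rests on an assumption made only for this sketch: that each pass over the batches takes a fixed, known time and that the elapsed time of a single round is what gets compared with the timeout. The real optimize method measures elapsed time itself, and neither function is part of cuML.

```python
import itertools


def iteration_counts():
    """Yield the repetition counts 1, 2, 5, 10, 20, 50, 100, ..."""
    scale = 1
    while True:
        for multiplier in (1, 2, 5):
            yield multiplier * scale
        scale *= 10


def first_count_reaching(timeout, seconds_per_pass):
    """First count in the schedule whose round lasts at least `timeout`.

    Assumes a fixed, positive cost per pass (a simplification).
    """
    if seconds_per_pass <= 0:
        raise ValueError("seconds_per_pass must be positive")
    for count in iteration_counts():
        if count * seconds_per_pass >= timeout:
            return count


# First seven entries of the schedule.
schedule_head = list(itertools.islice(iteration_counts(), 7))
```

Under these assumptions, a timeout of zero is satisfied by the very first round, which matches the statement that the batches are then run through exactly once.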

property precision#

predict(self, X, *, preds=None, chunk_size=None, threshold=None) → CumlArray[source]#

For classification models, predict the class for each row. For regression models, predict the output for each row.

Parameters:

X: The input data of shape Rows X Features. This can be a numpy array, cupy array, Pandas/cuDF Dataframe or any other array type accepted by cuML. FIL is optimized for C-major arrays (e.g. numpy/cupy arrays). Inputs whose datatype does not match the precision of the loaded model (float/double) will be converted to the correct datatype before inference. If this input is in a memory location that is inaccessible to the current device type (as set with e.g. the set_fil_device_type context manager), it will be copied to the correct location. This copy will be distributed across as many CUDA streams as are available in the stream pool of the model’s RAFT handle.
preds: If non-None, outputs will be written in-place to this array. Therefore, if given, this should be a C-major array of shape Rows x 1 with a datatype (float/double) corresponding to the precision of the model. If None, an output array of the correct shape and type will be allocated and returned. For classifiers, in-place prediction offers no performance or memory benefit. For regressors, in-place prediction offers both a performance and memory benefit.
chunk_sizeint: The number of rows to simultaneously process in one iteration of the inference algorithm. Batches are further broken down into “chunks” of this size when assigning available threads to tasks. The choice of chunk size can have a substantial impact on performance, but the optimal choice depends on model and hardware and is difficult to predict a priori. In general, larger batch sizes benefit from larger chunk sizes, and smaller batch sizes benefit from small chunk sizes. On GPU, valid values are powers of 2 from 1 to 32. On CPU, valid values are any power of 2, but little benefit is expected above a chunk size of 512.
thresholdfloat: For binary classifiers, output probabilities above this threshold will be considered positive detections. If None, a threshold of 0.5 will be used for binary classifiers. For multiclass classifiers, the highest probability class is chosen regardless of threshold.
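
The threshold rule can be sketched in plain Python as follows. The input layout is an assumption for this example: each row holds one probability per class, and for binary models the second entry is the positive class. labels_from_probabilities is a hypothetical helper, not a FIL function.

```python
def labels_from_probabilities(proba_rows, threshold=None):
    """Turn per-class probabilities into class labels (illustrative only)."""
    labels = []
    for row in proba_rows:
        if len(row) == 2:
            # Binary: positive when the positive-class probability
            # is above the threshold (0.5 if none is given).
            cutoff = 0.5 if threshold is None else threshold
            labels.append(1 if row[1] > cutoff else 0)
        else:
            # Multiclass: highest-probability class; threshold is ignored.
            labels.append(max(range(len(row)), key=row.__getitem__))
    return labels
```

Raising the threshold makes a binary classifier more conservative about positive detections, while multiclass output is unchanged.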

predict_per_tree(self, X, *, preds=None, chunk_size=None) → CumlArray[source]#

Output prediction of each tree. This function computes one or more margin scores per tree.

Parameters:

X: The input data of shape Rows X Features. This can be a numpy array, cupy array, Pandas/cuDF Dataframe or any other array type accepted by cuML. FIL is optimized for C-major arrays (e.g. numpy/cupy arrays). Inputs whose datatype does not match the precision of the loaded model (float/double) will be converted to the correct datatype before inference. If this input is in a memory location that is inaccessible to the current device type (as set with e.g. the set_fil_device_type context manager), it will be copied to the correct location. This copy will be distributed across as many CUDA streams as are available in the stream pool of the model’s RAFT handle.
preds: If non-None, outputs will be written in-place to this array. Therefore, if given, this should be a C-major array of shape n_rows * n_trees * n_outputs (if vector leaf is used) or shape n_rows * n_trees (if scalar leaf is used), with a datatype (float/double) corresponding to the precision of the model. If None, an output array of the correct shape and type will be allocated and returned.
chunk_sizeint: The number of rows to simultaneously process in one iteration of the inference algorithm. Batches are further broken down into “chunks” of this size when assigning available threads to tasks. The choice of chunk size can have a substantial impact on performance, but the optimal choice depends on model and hardware and is difficult to predict a priori. In general, larger batch sizes benefit from larger chunk sizes, and smaller batch sizes benefit from small chunk sizes. On GPU, valid values are powers of 2 from 1 to 32. On CPU, valid values are any power of 2, but little benefit is expected above a chunk size of 512.
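
As a rough illustration of the chunk_size constraints stated above, the following sketch checks a candidate value before it is passed to a prediction method. The function `is_valid_chunk_size` and its `device` argument are hypothetical names introduced here for illustration; they are not part of the FIL API.

```python
def is_valid_chunk_size(chunk_size, device="gpu"):
    """Hypothetical check of the documented chunk_size rules.

    GPU: powers of 2 from 1 to 32. CPU: any power of 2.
    """
    # A positive integer is a power of two when it has exactly one bit set.
    is_power_of_two = chunk_size >= 1 and (chunk_size & (chunk_size - 1)) == 0
    if device == "gpu":
        return is_power_of_two and chunk_size <= 32
    return is_power_of_two


gpu_candidates = [c for c in range(1, 65) if is_valid_chunk_size(c, "gpu")]
```

Because the best value depends on model and hardware, one reasonable approach is to time inference over the valid candidates and keep the fastest.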

predict_proba(self, X, *, preds=None, chunk_size=None) → CumlArray[source]#

Predict the class probabilities for each row in X.

Parameters:

X: The input data of shape Rows X Features. This can be a numpy array, cupy array, Pandas/cuDF Dataframe or any other array type accepted by cuML. FIL is optimized for C-major arrays (e.g. numpy/cupy arrays). Inputs whose datatype does not match the precision of the loaded model (float/double) will be converted to the correct datatype before inference. If this input is in a memory location that is inaccessible to the current device type (as set with e.g. the set_fil_device_type context manager), it will be copied to the correct location. This copy will be distributed across as many CUDA streams as are available in the stream pool of the model’s RAFT handle.
preds: If non-None, outputs will be written in-place to this array. Therefore, if given, this should be a C-major array of shape Rows x Classes with a datatype (float/double) corresponding to the precision of the model. If None, an output array of the correct shape and type will be allocated and returned.
chunk_sizeint: The number of rows to simultaneously process in one iteration of the inference algorithm. Batches are further broken down into “chunks” of this size when assigning available threads to tasks. The choice of chunk size can have a substantial impact on performance, but the optimal choice depends on model and hardware and is difficult to predict a priori. In general, larger batch sizes benefit from larger chunk sizes, and smaller batch sizes benefit from small chunk sizes. On GPU, valid values are powers of 2 from 1 to 32. On CPU, valid values are any power of 2, but little benefit is expected above a chunk size of 512.

set_params(self, **params)[source]#

property treelite_model#

Coordinate Descent#

class cuml.CD(*, loss='squared_loss', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, normalize=False, max_iter=1000, tol=0.001, shuffle=True, handle=None, output_type=None, verbose=False)#

Coordinate Descent (CD) is a very common optimization algorithm that minimizes along coordinate directions to find the minimum of a function.

cuML’s CD algorithm accepts a numpy matrix or a cuDF DataFrame as the input dataset. The CD algorithm currently works with linear regression and ridge, lasso, and elastic-net penalties.

Parameters:

loss‘squared_loss’: Only ‘squared_loss’ is supported right now. ‘squared_loss’ uses linear regression in its predict step.
alpha: float (default = 0.0001): The constant value which decides the degree of regularization. ‘alpha = 0’ is equivalent to ordinary least squares, solved by the LinearRegression object.
l1_ratio: float (default = 0.15): The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
fit_interceptboolean (default = True): If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.
normalizeboolean (default = False): Whether to normalize the data or not.
max_iterint (default = 1000): The number of times the model should iterate through the entire dataset during training.
tolfloat (default = 1e-3): The tolerance for the optimization: if the updates are smaller than tol, the solver stops.
shuffleboolean (default = True): If True, a randomly chosen coefficient is updated at each iteration instead of looping over the features sequentially. This often leads to significantly faster convergence, especially when tol is higher than 1e-4.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
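
To make the roles of alpha, l1_ratio, max_iter and tol more concrete, here is a minimal pure-Python sketch of cyclic coordinate descent for an elastic-net penalized squared loss. It is not cuML's implementation. It assumes the Scikit-learn style objective 1/(2n) * ||y - Xw||^2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2, always fits an intercept by centering, and visits features in order (the shuffle=False behaviour). The function names are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def coordinate_descent(X, y, alpha=0.0001, l1_ratio=0.15,
                       max_iter=1000, tol=1e-3):
    """Illustrative cyclic coordinate descent (assumed objective, see text)."""
    n, d = len(X), len(X[0])
    # Center the data so the intercept can be recovered at the end.
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    residual = [yi - y_mean for yi in y]
    coef = [0.0] * d
    for _ in range(max_iter):
        largest_update = 0.0
        for j in range(d):
            col = [row[j] for row in Xc]
            denom = sum(v * v for v in col) / n + alpha * (1.0 - l1_ratio)
            if denom == 0.0:
                continue  # constant column with no L2 term: leave at zero
            # Correlation of feature j with the residual that excludes it.
            rho = sum(v * (r + v * coef[j]) for v, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / denom
            step = new_w - coef[j]
            residual = [r - v * step for v, r in zip(col, residual)]
            coef[j] = new_w
            largest_update = max(largest_update, abs(step))
        if largest_update < tol:
            break  # updates smaller than tol: stop
    intercept = y_mean - sum(c * m for c, m in zip(coef, x_mean))
    return coef, intercept


X = [[1.0, 1.0], [1.0, 2.0], [2.0, 2.0], [2.0, 3.0]]
y = [6.0, 8.0, 9.0, 11.0]
coef, intercept = coordinate_descent(X, y, alpha=0.0, tol=1e-10)
```

With alpha=0 this reduces to ordinary least squares, so on the small dataset above the coefficients approach 1 and 2 and the intercept approaches 3.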

Attributes:

coef_

Methods

`fit`(self, X, y[, convert_dtype, sample_weight])	Fit the model with X and y.
`predict`(self, X[, convert_dtype])	Predicts the y for X.

Examples

>>> import cupy as cp
>>> import cudf
>>> from cuml.solvers import CD as cumlCD

>>> cd = cumlCD(alpha=0.0)

>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([1,1,2,2], dtype=cp.float32)
>>> X['col2'] = cp.array([1,2,2,3], dtype=cp.float32)

>>> y = cudf.Series(cp.array([6.0, 8.0, 9.0, 11.0], dtype=cp.float32))

>>> cd.fit(X,y)
CD()
>>> print(cd.coef_)
0 1.001...
1 1.998...
dtype: float32
>>> print(cd.intercept_)
3.00...
>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = cp.array([3,2], dtype=cp.float32)
>>> X_new['col2'] = cp.array([5,5], dtype=cp.float32)

>>> preds = cd.predict(X_new)
>>> print(preds)
0 15.997...
1 14.995...
dtype: float32

fit(self, X, y, convert_dtype=True, sample_weight=None) → 'CD'[source]#

Fit the model with X and y.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.
sample_weightarray-like (device or host) shape = (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

predict(self, X, convert_dtype=True) → CumlArray[source]#

Predicts the y for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

Quasi-Newton#

class cuml.QN(*, loss='sigmoid', fit_intercept=True, l1_strength=0.0, l2_strength=0.0, max_iter=1000, tol=0.0001, delta=None, linesearch_max_iter=50, lbfgs_memory=5, verbose=False, handle=None, output_type=None, warm_start=False, penalty_normalized=True)#

Quasi-Newton methods find zeroes or local maxima and minima of functions. This class uses them to optimize a cost function.

Two algorithms are implemented underneath cuML’s QN class, and which one is executed depends on the following rule:

Orthant-Wise Limited Memory Quasi-Newton (OWL-QN) if there is l1 regularization

Limited Memory BFGS (L-BFGS) otherwise.

cuML’s QN class accepts array-like objects, either on the host (as NumPy arrays) or on the device (as Numba arrays or other __cuda_array_interface__ compliant objects).

Parameters:

loss: ‘sigmoid’, ‘softmax’, ‘l1’, ‘l2’, ‘svc_l1’, ‘svc_l2’, ‘svr_l1’, ‘svr_l2’ (default = ‘sigmoid’).

‘sigmoid’ loss is used for binary logistic regression; ‘softmax’ loss is used for multiclass logistic regression; ‘l1’/‘l2’ loss is used for regression.

fit_intercept: boolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

l1_strength: float (default = 0.0)

l1 regularization strength (if non-zero, will run OWL-QN, else L-BFGS). Use penalty_normalized to control whether the solver divides this by the sample size.

l2_strength: float (default = 0.0)

l2 regularization strength. Use penalty_normalized to control whether the solver divides this by the sample size.

max_iter: int (default = 1000)

Maximum number of iterations taken for the solvers to converge.

tol: float (default = 1e-4)

The training process will stop if

norm(current_loss_grad) <= tol * max(current_loss, tol).

This differs slightly from the gtol-controlled stopping condition in scipy.optimize.minimize(method=’L-BFGS-B’):

norm(current_loss_projected_grad) <= gtol.

Note, sklearn.LogisticRegression() uses the sum of softmax/logistic loss over the input data, whereas cuML uses the average. As a result, Scikit-learn’s loss is usually sample_size times larger than cuML’s. To account for the differences you may divide the tol by the sample size; this would ensure that the cuML solver does not stop earlier than the Scikit-learn solver.

delta: Optional[float] (default = None)

The training process will stop if

abs(current_loss - previous_loss) <= delta * max(current_loss, tol).

When None, it’s set to tol * 0.01; when 0, the check is disabled. Given the current step k, parameter previous_loss here is the loss at the step k - p, where p is a small positive integer set internally.

Note, this parameter corresponds to ftol in scipy.optimize.minimize(method=’L-BFGS-B’), which is set by default to a minuscule 2.2e-9 and is not exposed in sklearn.LogisticRegression(). This condition is meant to protect the solver against doing vanishingly small linesearch steps or zigzagging. You may choose to set delta = 0 to make sure the cuML solver does not stop earlier than the Scikit-learn solver.

linesearch_max_iter: int (default = 50)

Max number of linesearch iterations per outer iteration of the algorithm.

lbfgs_memory: int (default = 5)

Rank of the lbfgs inverse-Hessian approximation. Method will use O(lbfgs_memory * D) memory.

handlecuml.Handle

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

warm_startbool, default=False

When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.

penalty_normalizedbool, default=True

When set to True, l1 and l2 parameters are divided by the sample size. This flag can be used to achieve a behavior compatible with other implementations, such as sklearn’s.
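
The two stopping conditions described for tol and delta can be written out as a short sketch. This is only an illustration of the documented inequalities, not the solver's code. It assumes norm() is the Euclidean norm, and the function names are hypothetical.

```python
import math


def gradient_converged(loss_grad, current_loss, tol=1e-4):
    """norm(current_loss_grad) <= tol * max(current_loss, tol)."""
    grad_norm = math.sqrt(sum(g * g for g in loss_grad))  # assumed L2 norm
    return grad_norm <= tol * max(current_loss, tol)


def loss_stalled(current_loss, previous_loss, tol=1e-4, delta=None):
    """abs(current_loss - previous_loss) <= delta * max(current_loss, tol)."""
    if delta is None:
        delta = tol * 0.01  # documented default when delta is None
    if delta == 0:
        return False  # documented: delta = 0 disables this check
    return abs(current_loss - previous_loss) <= delta * max(current_loss, tol)


stop_on_gradient = gradient_converged([3e-5, 4e-5], current_loss=1.0)
stop_on_stall = loss_stalled(current_loss=1.0, previous_loss=1.0000001)
```

Following the note on Scikit-learn compatibility above, a caller comparing against sklearn.LogisticRegression() might pass tol divided by the sample size and delta=0.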

Attributes:

coef_array, shape (n_classes, n_features): QN.coef_(self)
intercept_array (n_classes, 1): The independent term. If fit_intercept is False, will be 0.

Methods

`fit`(self, X, y[, sample_weight, convert_dtype])	Fit the model with X and y.
`get_num_classes`(self, _num_classes_dim)	Retrieves the number of classes from the classes dimension in the coefficients.
`predict`(self, X[, convert_dtype])	Predicts the y for X.
`score`(self, X, y)

Notes

This class contains implementations of two popular Quasi-Newton methods:

Limited-memory Broyden Fletcher Goldfarb Shanno (L-BFGS) [Nocedal, Wright - Numerical Optimization (1999)]

Orthant-Wise Limited-Memory Quasi-Newton (OWL-QN) [Andrew, Gao - ICML 2007]

Examples

>>> import cudf
>>> import cupy as cp

>>> # Both import methods supported
>>> # from cuml import QN
>>> from cuml.solvers import QN

>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([1,1,2,2], dtype=cp.float32)
>>> X['col2'] = cp.array([1,2,2,3], dtype=cp.float32)
>>> y = cudf.Series(cp.array([0.0, 0.0, 1.0, 1.0], dtype=cp.float32) )

>>> solver = QN()
>>> solver.fit(X,y)
QN()

>>> # Note: for now, the coefficients also include the intercept in the
>>> # last position if fit_intercept=True
>>> print(solver.coef_)
0   37.371...
1   0.949...
dtype: float32
>>> print(solver.intercept_)
0   -57.738...
>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = cp.array([1,5], dtype=cp.float32)
>>> X_new['col2'] = cp.array([2,5], dtype=cp.float32)
>>> preds = solver.predict(X_new)
>>> print(preds)
0    0.0
1    1.0
dtype: float32

property coef_#

fit(self, X, y, sample_weight=None, convert_dtype=True) → 'QN'[source]#

Fit the model with X and y.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
sample_weightarray-like (device or host) shape = (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_num_classes(self, _num_classes_dim)[source]#: Retrieves the number of classes from the classes dimension in the coefficients.

predict(self, X, convert_dtype=True) → CumlArray[source]#

Predicts the y for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

score(self, X, y)[source]#

Support Vector Machines#

class cuml.svm.SVR(Epsilon Support Vector Regression)#

Construct an SVR regressor for training and predictions.

Parameters:

handlecuml.Handle

Cfloat (default = 1.0)

Penalty parameter C

kernelstring (default=’rbf’)

Specifies the kernel function. Possible options: ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’. Currently precomputed kernels are not supported.

degreeint (default=3)

Degree of polynomial kernel function.

gammafloat or string (default = ‘scale’)

Coefficient for rbf, poly, and sigmoid kernels. You can specify the numeric value, or use one of the following options:

‘auto’: gamma will be set to 1 / n_features
‘scale’: gamma will be set to 1 / (n_features * X.var())

coef0float (default = 0.0)

Independent term in kernel function, only significant for poly and sigmoid

tolfloat (default = 1e-3)

Tolerance for stopping criterion.

epsilon: float (default = 0.1)

epsilon parameter of the epsilon-SVR model. There is no penalty associated with points that are predicted within the epsilon-tube around the target values.

cache_sizefloat (default = 1024.0)

Size of the kernel cache during training in MiB. Increase it to improve the training time, at the cost of higher memory footprint. After training the kernel cache is deallocated. During prediction, we also need a temporary space to store kernel matrix elements (this can be significant if n_support is large). The cache_size variable sets an upper limit to the prediction buffer as well.

max_iterint (default = -1)

Limit the number of outer iterations in the solver. If -1 (default) then max_iter=100*n_samples

nochange_stepsint (default = 1000)

We monitor how much the stopping criterion changes during outer iterations. If it does not change (changes less than 1e-3*tol) for nochange_steps consecutive steps, then we stop training.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
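
Two of the parameters above lend themselves to a small worked sketch: the string options for gamma and the epsilon-tube. The code below is a pure-Python illustration, not cuML code. It assumes X.var() means the population variance over all elements of X (NumPy's default), and that the tube corresponds to the standard epsilon-insensitive loss max(0, |y - f(x)| - epsilon). The function names are hypothetical.

```python
def resolve_gamma(gamma, X):
    """Hypothetical helper: turn 'auto'/'scale' into a numeric gamma."""
    n_features = len(X[0])
    if gamma == "auto":
        return 1.0 / n_features
    if gamma == "scale":
        values = [v for row in X for v in row]
        mean = sum(values) / len(values)
        # Population variance over every element (assumed X.var() semantics).
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        return 1.0 / (n_features * variance)
    return float(gamma)


def epsilon_insensitive_loss(target, prediction, epsilon=0.1):
    """No penalty inside the epsilon-tube, linear penalty outside it."""
    return max(0.0, abs(target - prediction) - epsilon)


X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
gamma_scale = resolve_gamma("scale", X)
gamma_auto = resolve_gamma("auto", X)
```

For this single-feature X the element variance is 2, so 'scale' gives 0.5 while 'auto' gives 1.0.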

Attributes:

n_support_int: The total number of support vectors. Note: this will change in the future to represent the number of support vectors for each class (like in Sklearn, see Issue #956)
support_int, shape = [n_support]: Device array of support vector indices
support_vectors_float, shape [n_support, n_cols]: Device array of support vectors
dual_coef_float, shape = [1, n_support]: Device array of coefficients for support vectors
intercept_int: SVMBase.intercept_(self)
fit_status_int: 0 if SVM is correctly fitted
coef_float, shape [1, n_cols]: SVMBase.coef_(self)

Methods

`fit`(self, X, y[, sample_weight, convert_dtype])	Fit the model with X and y.
`predict`(self, X, *[, convert_dtype])	Predicts the values for X.

Notes

For additional docs, see Scikit-learn’s SVR.

The solver uses the SMO method to fit the regressor. We use the Optimized Hierarchical Decomposition [1] variant of the SMO algorithm, similar to [2]

References

[1]

J. Vanek et al. A GPU-Architecture Optimized Hierarchical Decomposition Algorithm for Support Vector Machine Training, IEEE Transactions on Parallel and Distributed Systems, vol 28, no 12, 3330, (2017)

[2]

Z. Wen et al. ThunderSVM: A Fast SVM Library on GPUs and CPUs, Journal of Machine Learning Research, 19, 1-5 (2018)

Examples

>>> import cupy as cp
>>> from cuml.svm import SVR
>>> X = cp.array([[1], [2], [3], [4], [5]], dtype=cp.float32)
>>> y = cp.array([1.1, 4, 5, 3.9, 1.], dtype = cp.float32)
>>> reg = SVR(kernel='rbf', gamma='scale', C=10, epsilon=0.1)
>>> reg.fit(X, y)
SVR()
>>> print("Predicted values:", reg.predict(X))
Predicted values: [1.200474 3.8999617 5.100488 3.7995374 1.0995375]

fit(self, X, y, sample_weight=None, *, convert_dtype=True) → 'SVR'[source]#

Fit the model with X and y.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
sample_weightarray-like (device or host) shape = (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

predict(self, X, *, convert_dtype=True) → CumlArray[source]#

Predicts the values for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

class cuml.svm.SVC(C-Support Vector Classification)#

Construct an SVC classifier for training and predictions.

Parameters:

handlecuml.Handle

Cfloat (default = 1.0)

Penalty parameter C

kernelstring (default=’rbf’)

Specifies the kernel function. Possible options: ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’. Currently precomputed kernels are not supported.

degreeint (default=3)

Degree of polynomial kernel function.

gammafloat or string (default = ‘scale’)

Coefficient for rbf, poly, and sigmoid kernels. You can specify the numeric value, or use one of the following options:

‘auto’: gamma will be set to 1 / n_features
‘scale’: gamma will be set to 1 / (n_features * X.var())

coef0float (default = 0.0)

Independent term in kernel function, only significant for poly and sigmoid

tolfloat (default = 1e-3)

Tolerance for stopping criterion.

cache_sizefloat (default = 1024.0)

class_weightdict or string (default=None)

Weights to modify the parameter C for class i to class_weight[i]*C. The string ‘balanced’ is also accepted, in which case class_weight[i] = n_samples / (n_classes * n_samples_of_class[i])

max_iterint (default = -1)

Limit the number of outer iterations in the solver. If -1 (default) then max_iter=100*n_samples

decision_function_shapestr (‘ovo’ or ‘ovr’, default ‘ovo’)

Multiclass classification strategy. 'ovo' uses OneVsOneClassifier while 'ovr' selects OneVsRestClassifier

nochange_stepsint (default = 1000)

We monitor how much the stopping criterion changes during outer iterations. If it does not change (changes less than 1e-3*tol) for nochange_steps consecutive steps, then we stop training.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

probability: bool (default = False)

Enable or disable probability estimates.

random_state: int (default = None)

Seed for random number generator (used only when probability = True). Currently this argument is not used and a warning will be printed if the user provides it.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
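
The 'balanced' option of class_weight follows the formula given above, class_weight[i] = n_samples / (n_classes * n_samples_of_class[i]). A minimal sketch of that arithmetic, using only the standard library and hypothetical helper names, might look like this:

```python
from collections import Counter


def balanced_class_weight(y):
    """Hypothetical helper: n_samples / (n_classes * n_samples_of_class[i])."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


def effective_C(C, class_weight):
    """Per-class penalty class_weight[i] * C, as described for class_weight."""
    return {label: weight * C for label, weight in class_weight.items()}


weights = balanced_class_weight([0, 0, 0, 1])
penalties = effective_C(1.0, weights)
```

Rare classes receive a larger weight, so errors on them are penalized more heavily through the scaled C.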

Attributes:

n_support_int: The total number of support vectors. Note: this will change in the future to represent the number of support vectors for each class (like in Sklearn, see rapidsai/cuml#956 )
support_int, shape = (n_support): SVC.support_(self)
support_vectors_float, shape (n_support, n_cols): Device array of support vectors
dual_coef_float, shape = (1, n_support): Device array of coefficients for support vectors
intercept_float: SVC.intercept_(self)
fit_status_int: 0 if SVM is correctly fitted
coef_float, shape (1, n_cols): SVMBase.coef_(self)
classes_shape (n_classes_,): SVC.classes_(self)
n_classes_int: Number of classes

Methods

`decision_function`(self, X)	Calculates the decision function values for X.
`fit`(self, X, y[, sample_weight, convert_dtype])	Fit the model with X and y.
`predict`(self, X, *[, convert_dtype])	Predicts the class labels for X.
`predict_log_proba`(self, X)	Predicts the log probabilities for X (returns log(predict_proba(X))).
`predict_proba`(self, X, *[, log])	Predicts the class probabilities for X.

Notes

The solver uses the SMO method to fit the classifier. We use the Optimized Hierarchical Decomposition [1] variant of the SMO algorithm, similar to [2].

For additional docs, see Scikit-learn’s SVC.

References

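[1]

J. Vanek et al. A GPU-Architecture Optimized Hierarchical Decomposition Algorithm for Support Vector Machine Training, IEEE Transactions on Parallel and Distributed Systems, vol 28, no 12, 3330, (2017)
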
[2]

Z. Wen et al. ThunderSVM: A Fast SVM Library on GPUs and CPUs, Journal of Machine Learning Research, 19, 1-5 (2018)

Examples

>>> import cupy as cp
>>> from cuml.svm import SVC
>>> X = cp.array([[1,1], [2,1], [1,2], [2,2], [1,3], [2,3]],
...              dtype=cp.float32)
>>> y = cp.array([-1, -1, 1, -1, 1, 1], dtype=cp.float32)
>>> clf = SVC(kernel='poly', degree=2, gamma='auto', C=1)
>>> clf.fit(X, y)
SVC()
>>> print("Predicted labels:", clf.predict(X))
Predicted labels: [-1. -1.  1. -1.  1.  1.]

decision_function(self, X) → CumlArray[source]#

Calculates the decision function values for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

Returns:

resultscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Decision function values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

fit(self, X, y, sample_weight=None, *, convert_dtype=True) → 'SVC'[source]#

Fit the model with X and y.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix of any dtype. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
sample_weightarray-like (device or host) shape = (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

predict(self, X, *, convert_dtype=True) → CumlArray[source]#

Predicts the class labels for X. The returned y values are the class labels associated with sign(decision_function(X)).

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

predict_log_proba(self, X) → CumlArray[source]#

Predicts the log probabilities for X (returns log(predict_proba(X))).

The model has to be trained with probability=True to use this method.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)

Log of predicted probabilities

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
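
The relationship stated for this method, log(predict_proba(X)), amounts to an element-wise logarithm of the probability matrix. A minimal pure-Python sketch follows. The helper `log_probabilities` is hypothetical, and mapping a zero probability to negative infinity is an assumption made here for illustration.

```python
import math


def log_probabilities(probabilities):
    """Hypothetical helper: element-wise log of an (n_samples, n_classes) list."""
    return [
        # Assumption: a zero probability maps to -inf instead of raising.
        [math.log(p) if p > 0.0 else float("-inf") for p in row]
        for row in probabilities
    ]


log_p = log_probabilities([[0.25, 0.75], [1.0, 0.0]])
```

Exponentiating each row of the result recovers the original probabilities, which still sum to one per sample.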

predict_proba(self, X, *, log=False) → CumlArray[source]#

Predicts the class probabilities for X.

The model has to be trained with probability=True to use this method.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

logboolean (default = False)

Whether to return log probabilities.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)

Predicted probabilities

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

class cuml.svm.LinearSVC(Support Vector Classification with the linear kernel)[source]#

Construct a linear SVM classifier for training and predictions.

Parameters:

handlecuml.Handle

penalty{‘l1’, ‘l2’} (default = ‘l2’)

The regularization term of the target function.

loss{‘hinge’, ‘squared_hinge’} (default = ‘squared_hinge’)

The loss term of the target function.

fit_interceptbool (default = True)

Whether to fit the bias term. Set to False if you expect that the data is already centered.

penalized_interceptbool (default = False)

When true, the bias term is treated the same way as other features; i.e. it’s penalized by the regularization term of the target function. Enabling this feature forces an extra copy of the input data X.

max_iterint (default = 1000)

Maximum number of iterations for the underlying solver.

linesearch_max_iterint (default = 100)

Maximum number of linesearch (inner loop) iterations for the underlying (QN) solver.

lbfgs_memoryint (default = 5)

Number of vectors approximating the Hessian for the underlying QN solver (L-BFGS).

class_weightdict or string (default=None)

Weights to modify the parameter C for class i to class_weight[i]*C. The string ‘balanced’ is also accepted, in which case class_weight[i] = n_samples / (n_classes * n_samples_of_class[i])

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Cfloat (default = 1.0)

The constant scaling factor of the loss term in the target formula: F(X, y) = penalty(X) + C * loss(X, y).

grad_tolfloat (default = 0.0001)

The threshold on the gradient for the underlying QN solver.

change_tolfloat (default = 1e-05)

The threshold on the function change for the underlying QN solver.

tolOptional[float] (default = None)

Tolerance for the stopping criterion. This is a transient helper parameter that, when present, sets both grad_tol and change_tol to the same value. If either of the two *_tol parameters is passed as well, it takes precedence.

probability: bool (default = False)

Enable or disable probability estimates.

multi_class{currently, only ‘ovr’} (default = ‘ovr’)

Multiclass classification strategy. 'ovo' uses OneVsOneClassifier, while 'ovr' selects OneVsRestClassifier. Currently, only 'ovr' is supported.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Attributes:

intercept_float, shape (n_classes_,): The constant in the decision function
coef_float, shape (n_classes_, n_cols): The vectors defining the hyperplanes that separate the classes.
classes_float, shape (n_classes_,): Array of class labels.
probScale_float, shape (n_classes_, 2): Probability calibration constants (for the probabilistic output).
n_classes_int: LinearSVM.n_classes_(self) -> int

Methods

`fit`(self, X, y[, sample_weight, convert_dtype])
`predict`(self, X, *[, convert_dtype])

Notes

The model uses the quasi-Newton (QN) solver to find the solution in the primal space. Thus, in contrast to the generic SVC model, it does not compute the support coefficients/vectors.

Check the solver’s documentation for more details: Quasi-Newton (L-BFGS/OWL-QN).

For additional docs, see scikit-learn’s LinearSVC.
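The 'balanced' option of class_weight follows the formula given in the parameter list. The sketch below reproduces that formula in plain Python so the resulting per-class values of C can be inspected; `balanced_class_weight` is a hypothetical helper and not part of cuML.

```python
from collections import Counter

def balanced_class_weight(y):
    """Return {label: n_samples / (n_classes * n_samples_of_class)}."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Three samples of class 0 and one sample of class 1.
y = [0, 0, 0, 1]
weights = balanced_class_weight(y)

# The weight rescales C per class: class i uses class_weight[i] * C.
C = 1.0
effective_C = {label: w * C for label, w in weights.items()}
```

The minority class receives the larger weight (2.0 here, against 2/3 for the majority class), so errors on it are penalized more heavily.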

Examples

>>> import cupy as cp
>>> from cuml.svm import LinearSVC
>>> X = cp.array([[1, 1], [2, 1], [1, 2], [2, 2], [1, 3], [2, 3]],
...              dtype=cp.float32)
>>> y = cp.array([0, 0, 1, 0, 1, 1], dtype=cp.float32)
>>> clf = LinearSVC(loss='squared_hinge', penalty='l1', C=1)
>>> clf.fit(X, y)
LinearSVC()
>>> print("Predicted labels:", clf.predict(X))
Predicted labels: [0 0 1 0 1 1]
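The interaction between tol, grad_tol and change_tol can be sketched as follows. This is an illustration of the documented precedence only, not the library's code: `resolve_tolerances` is a hypothetical helper, and using None to mean "not passed" is an assumption made for the sketch.

```python
DEFAULT_GRAD_TOL = 1e-4    # documented default for grad_tol
DEFAULT_CHANGE_TOL = 1e-5  # documented default for change_tol

def resolve_tolerances(tol=None, grad_tol=None, change_tol=None):
    """Return the (grad_tol, change_tol) pair the solver would use.

    Assumption: None marks a parameter the caller did not pass.
    """
    if grad_tol is None:
        # tol fills in only when grad_tol was not given explicitly.
        grad_tol = tol if tol is not None else DEFAULT_GRAD_TOL
    if change_tol is None:
        change_tol = tol if tol is not None else DEFAULT_CHANGE_TOL
    return grad_tol, change_tol

only_defaults = resolve_tolerances()
only_tol = resolve_tolerances(tol=1e-3)
explicit_wins = resolve_tolerances(tol=1e-3, grad_tol=1e-6)
```

In the last call, grad_tol keeps its explicit value while change_tol falls back to tol.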

fit(self, X, y, sample_weight=None, *, convert_dtype=True) → 'LinearSVM'[source]#

predict(self, X, *, convert_dtype=True) → CumlArray[source]#

class cuml.svm.LinearSVR(Support Vector Regression with the linear kernel)[source]#

Construct a linear SVM regressor for training and predictions.

Parameters:

handlecuml.Handle

penalty{‘l1’, ‘l2’} (default = ‘l2’)

The regularization term of the target function.

loss{‘epsilon_insensitive’, ‘squared_epsilon_insensitive’} (default = ‘epsilon_insensitive’)

The loss term of the target function.

fit_interceptbool (default = True)

Whether to fit the bias term. Set to False if you expect that the data is already centered.

penalized_interceptbool (default = False)

max_iterint (default = 1000)

Maximum number of iterations for the underlying solver.

linesearch_max_iterint (default = 100)

Maximum number of linesearch (inner loop) iterations for the underlying (QN) solver.

lbfgs_memoryint (default = 5)

Number of vectors approximating the Hessian for the underlying QN solver (L-BFGS).

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Cfloat (default = 1.0)

The constant scaling factor of the loss term in the target formula: F(X, y) = penalty(X) + C * loss(X, y).

grad_tolfloat (default = 0.0001)

The threshold on the gradient for the underlying QN solver.

change_tolfloat (default = 1e-05)

The threshold on the function change for the underlying QN solver.

tolOptional[float] (default = None)

epsilonfloat (default = 0.0)

The epsilon-sensitivity parameter for the SVR loss function.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Attributes:

intercept_float, shape (1,): The constant in the decision function
coef_float, shape (1, n_cols): The coefficients of the linear decision function.

Notes

The model uses the quasi-Newton (QN) solver to find the solution in the primal space. Thus, in contrast to the generic SVC model, it does not compute the support coefficients/vectors.

Check the solver’s documentation for more details: Quasi-Newton (L-BFGS/OWL-QN).

For additional docs, see scikit-learn’s LinearSVR.

Examples

>>> import cupy as cp
>>> from cuml.svm import LinearSVR
>>> X = cp.array([[1], [2], [3], [4], [5]], dtype=cp.float32)
>>> y = cp.array([1.1, 4, 5, 3.9, 8.], dtype=cp.float32)
>>> reg = LinearSVR(loss='epsilon_insensitive', C=10,
...                 epsilon=0.1, verbose=0)
>>> reg.fit(X, y)
LinearSVR()
>>> print("Predicted values:", reg.predict(X))
Predicted values: [1.8993504 3.3995128 4.899675  6.399837  7.899999]
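The two loss options and the target formula F(X, y) = penalty(X) + C * loss(X, y) can be made concrete with a small plain-Python sketch. The loss definitions are the standard ones for SVR; writing the l2 penalty as 0.5 * ||w||^2 and summing the loss over samples are assumptions, since this reference does not state the exact scaling used inside cuML. All function names are hypothetical.

```python
def epsilon_insensitive(y_true, y_pred, epsilon=0.0):
    """Absolute error, ignoring errors smaller than epsilon."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

def squared_epsilon_insensitive(y_true, y_pred, epsilon=0.0):
    """Square of the epsilon-insensitive loss."""
    return epsilon_insensitive(y_true, y_pred, epsilon) ** 2

def objective(w, b, X, y, C=1.0, epsilon=0.0):
    """penalty + C * loss for a linear model with weights w and bias b.

    ASSUMPTION: l2 penalty = 0.5 * ||w||^2, loss summed over samples.
    """
    penalty = 0.5 * sum(wi * wi for wi in w)
    preds = [sum(wi * xi for wi, xi in zip(w, row)) + b for row in X]
    loss = sum(epsilon_insensitive(t, p, epsilon) for t, p in zip(y, preds))
    return penalty + C * loss

y_true = [1.0, 2.0, 3.0]
y_pred = [1.05, 2.5, 1.0]
losses = [epsilon_insensitive(t, p, epsilon=0.1) for t, p in zip(y_true, y_pred)]

# A perfect fit leaves only the penalty term.
F = objective(w=[2.0], b=0.0, X=[[1.0], [2.0]], y=[2.0, 4.0], C=10.0)
```

The first error in `losses` is zero because it falls inside the epsilon tube; larger values of C make the remaining errors count more relative to the penalty.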

Nearest Neighbors Classification#

class cuml.neighbors.KNeighborsClassifier(*, weights='uniform', handle=None, verbose=False, output_type=None, **kwargs)

K-Nearest Neighbors Classifier is an instance-based learning technique that keeps training samples around for prediction, rather than trying to learn a generalizable set of model parameters.

Parameters:

n_neighborsint (default=5): Default number of neighbors to query
algorithmstring (default=’auto’): The query algorithm to use. Currently, only ‘brute’ is supported.
metricstring (default=’euclidean’): Distance metric to use.
weightsstring (default=’uniform’): Sample weights to use. Currently, only the uniform strategy is supported.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

Attributes:

classes_: KNeighborsClassifier.classes_(self)
outputs_2d_: KNeighborsClassifier.outputs_2d_(self)
y

Methods

`fit`(self, X, y, *[, convert_dtype])	Fit a GPU index for k-nearest neighbors classifier model.
`predict`(self, X, *[, convert_dtype])	Use the trained k-nearest neighbors classifier to predict the labels for X.
`predict_proba`(self, X, *[, convert_dtype])	Use the trained k-nearest neighbors classifier to predict the label probabilities for X.

Notes

For additional docs, see scikit-learn’s KNeighborsClassifier.

Examples

>>> from cuml.neighbors import KNeighborsClassifier
>>> from cuml.datasets import make_blobs
>>> from cuml.model_selection import train_test_split

>>> X, y = make_blobs(n_samples=100, centers=5,
...                   n_features=10, random_state=5)
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, train_size=0.80, random_state=5)

>>> knn = KNeighborsClassifier(n_neighbors=10)

>>> knn.fit(X_train, y_train)
KNeighborsClassifier()
>>> knn.predict(X_test)
array([1., 2., 2., 3., 4., 2., 4., 4., 2., 3., 1., 4., 3., 1., 3., 4., 3.,
    4., 1., 3.], dtype=float32)
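To show what predict and predict_proba compute with the uniform weighting strategy, here is a brute-force sketch in plain Python. It is not cuML code: the helper names and the toy data are hypothetical, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict_proba(X_train, y_train, x, n_neighbors, classes):
    """Fraction of the n_neighbors closest training samples in each class."""
    by_distance = sorted(
        (math.dist(row, x), label) for row, label in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in by_distance[:n_neighbors])
    return [votes[c] / n_neighbors for c in classes]

def knn_predict(X_train, y_train, x, n_neighbors, classes):
    """Class with the most votes among the nearest neighbors."""
    proba = knn_predict_proba(X_train, y_train, x, n_neighbors, classes)
    return classes[proba.index(max(proba))]

X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]
classes = [0, 1]

near_origin = knn_predict_proba(X_train, y_train, [0.2, 0.1], 3, classes)
in_between = knn_predict_proba(X_train, y_train, [3.6, 3.6], 5, classes)
label = knn_predict(X_train, y_train, [3.6, 3.6], 5, classes)
```

For the query point [3.6, 3.6], three of the five nearest neighbors belong to class 1, so the probabilities are [0.4, 0.6] and the predicted label is 1.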

property classes_

fit(self, X, y, *, convert_dtype=True) → 'KNeighborsClassifier'[source]

Fit a GPU index for k-nearest neighbors classifier model.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the method will automatically convert the inputs to np.float32.

property outputs_2d_: Whether the output is 2d

predict(self, X, *, convert_dtype=True) → CumlArray[source]

Use the trained k-nearest neighbors classifier to predict the labels for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the method will automatically convert the inputs to np.float32.

Returns:

X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Labels predicted

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

predict_proba(self, X, *, convert_dtype=True) → CumlArray | list[CumlArray][source]

Use the trained k-nearest neighbors classifier to predict the label probabilities for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the method will automatically convert the inputs to np.float32.

Returns:

X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Label probabilities

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

Nearest Neighbors Regression#

class cuml.neighbors.KNeighborsRegressor(*, weights='uniform', handle=None, verbose=False, output_type=None, **kwargs)

K-Nearest Neighbors Regressor is an instance-based learning technique that keeps training samples around for prediction, rather than trying to learn a generalizable set of model parameters.

The K-Nearest Neighbors Regressor will compute the average of the labels for the k closest neighbors and use it as the label.

Parameters:

n_neighborsint (default=5)

Default number of neighbors to query

algorithmstring (default=’auto’)

The query algorithm to use. Valid options are:

'auto': to automatically select brute-force or random ball cover based on data shape and metric
'rbc': for the random ball algorithm, which partitions the data space and uses the triangle inequality to lower the number of potential distances. Currently, this algorithm supports 2d Euclidean and Haversine.
'brute': for brute-force, slow but produces exact results
'ivfflat': for inverted file, which divides the dataset into partitions and searches only the relevant partitions
'ivfpq': for inverted file and product quantization. Same as inverted file, but in addition the vectors are broken into n_features/M sub-vectors that are encoded using intermediary k-means clusterings. This encoding provides partial information that allows faster distance calculations

metricstring (default=’euclidean’)

Distance metric to use.

weightsstring (default=’uniform’)

Sample weights to use. Currently, only the uniform strategy is supported.

handlecuml.Handle

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Attributes:

y

Methods

`fit`(self, X, y, *[, convert_dtype])	Fit a GPU index for k-nearest neighbors regression model.
`predict`(self, X, *[, convert_dtype])	Use the trained k-nearest neighbors regression model to predict the labels for X.

Notes

For additional docs, see scikit-learn’s KNeighborsRegressor.

Examples

>>> from cuml.neighbors import KNeighborsRegressor
>>> from cuml.datasets import make_regression
>>> from cuml.model_selection import train_test_split

>>> X, y = make_regression(n_samples=100, n_features=10,
...                        random_state=5)
>>> X_train, X_test, y_train, y_test = train_test_split(
...   X, y, train_size=0.80, random_state=5)

>>> knn = KNeighborsRegressor(n_neighbors=10)
>>> knn.fit(X_train, y_train)
KNeighborsRegressor()
>>> knn.predict(X_test)
array([ 14.770798  ,  51.8834    ,  66.15657   ,  46.978275  ,
    21.589611  , -14.519918  , -60.25534   , -20.856869  ,
    29.869623  , -34.83317   ,   0.45447388, 120.39675   ,
    109.94834   ,  63.57794   , -17.956171  ,  78.77663   ,
    30.412262  ,  32.575233  ,  74.72834   , 122.276855  ],
dtype=float32)
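The averaging rule described at the top of this class can be sketched in a few lines of plain Python. This is a brute-force illustration with uniform weights and Euclidean distance, not the GPU implementation; `knn_regress` and the toy data are hypothetical.

```python
import math

def knn_regress(X_train, y_train, x, n_neighbors):
    """Mean target of the n_neighbors training samples closest to x."""
    order = sorted(range(len(X_train)),
                   key=lambda i: math.dist(X_train[i], x))
    nearest = order[:n_neighbors]
    return sum(y_train[i] for i in nearest) / n_neighbors

X_train = [[0.0], [1.0], [2.0], [10.0]]
y_train = [0.0, 1.0, 2.0, 10.0]

# Nearest three samples to 1.2 are 1.0, 2.0 and 0.0; the outlier at 10.0
# is ignored.
prediction = knn_regress(X_train, y_train, [1.2], n_neighbors=3)
```

Here `prediction` is (1.0 + 2.0 + 0.0) / 3 = 1.0.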

fit(self, X, y, *, convert_dtype=True) → 'KNeighborsRegressor'[source]

Fit a GPU index for k-nearest neighbors regression model.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the method will automatically convert the inputs to np.float32.

predict(self, X, *, convert_dtype=True) → CumlArray[source]

Use the trained k-nearest neighbors regression model to predict the labels for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the method will automatically convert the inputs to np.float32.

Returns:

X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_features)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

Kernel Ridge Regression#

class cuml.KernelRidge(*, alpha=1, kernel='linear', gamma=None, degree=3, coef0=1, kernel_params=None, output_type=None, handle=None, verbose=False)[source]#

Kernel ridge regression (KRR) performs l2 regularised ridge regression using the kernel trick. The kernel trick allows the estimator to learn a linear function in the space induced by the kernel. This may be a non-linear function in the original feature space (when a non-linear kernel is used). This estimator supports multi-output regression (when y is 2 dimensional). See the sklearn user guide for more information.

Parameters:

alphafloat or array-like of shape (n_targets,), default=1.0: Regularization strength; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. If an array is passed, penalties are assumed to be specific to the targets.
kernelstr or callable, default=”linear”: Kernel mapping used internally. This parameter is directly passed to pairwise_kernel. If kernel is a string, it must be one of the metrics in cuml.metrics.PAIRWISE_KERNEL_FUNCTIONS or “precomputed”. If kernel is “precomputed”, X is assumed to be a kernel matrix. kernel may be a callable numba device function. If so, it is called on each pair of instances (rows) and the resulting value is recorded.
gammafloat, default=None: Gamma parameter for the RBF, laplacian, polynomial, exponential chi2 and sigmoid kernels. Interpretation of the default value is left to the kernel; see the documentation for sklearn.metrics.pairwise. Ignored by other kernels.
degreefloat, default=3: Degree of the polynomial kernel. Ignored by other kernels.
coef0float, default=1: Zero coefficient for polynomial and sigmoid kernels. Ignored by other kernels.
kernel_paramsmapping of str to any, default=None: Additional parameters (keyword arguments) for kernel function passed as callable object.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Attributes:

dual_coef_ndarray of shape (n_samples,) or (n_samples, n_targets): Representation of weight vector(s) in kernel space
X_fit_ndarray of shape (n_samples, n_features): Training data, which is also required for prediction. If kernel == “precomputed” this is instead the precomputed training matrix, of shape (n_samples, n_samples).

Methods

`fit`(X, y[, sample_weight, convert_dtype])
`predict`(X)	Predict using the kernel ridge model.

Examples

>>> import cupy as cp
>>> from cuml.kernel_ridge import KernelRidge
>>> from numba import cuda
>>> import math

>>> n_samples, n_features = 10, 5
>>> rng = cp.random.RandomState(0)
>>> y = rng.randn(n_samples)
>>> X = rng.randn(n_samples, n_features)

>>> model = KernelRidge(kernel="poly").fit(X, y)
>>> pred = model.predict(X)

>>> @cuda.jit(device=True)
... def custom_rbf_kernel(x, y, gamma=None):
...     if gamma is None:
...         gamma = 1.0 / len(x)
...     sum = 0.0
...     for i in range(len(x)):
...         sum += (x[i] - y[i]) ** 2
...     return math.exp(-gamma * sum)

>>> model = KernelRidge(kernel=custom_rbf_kernel,
...                     kernel_params={"gamma": 2.0}).fit(X, y)
>>> pred = model.predict(X)
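The dual_coef_ and X_fit_ attributes correspond to the textbook dual form of kernel ridge regression: solve (K + alpha * I) c = y on the training kernel matrix, then predict with the kernel between new samples and the training data. The sketch below works through that form in plain Python for a linear kernel; it is a CPU illustration under that standard formulation, not a description of cuML's internals, and all helper names are hypothetical.

```python
def linear_kernel(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def krr_fit(X, y, alpha=1.0, kernel=linear_kernel):
    """Dual coefficients c solving (K + alpha * I) c = y."""
    n = len(X)
    K = [[kernel(X[i], X[j]) + (alpha if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, y)

def krr_predict(X_fit, dual_coef, X_new, kernel=linear_kernel):
    """Weighted sum of kernel values against the training samples."""
    return [sum(c * kernel(x, xf) for c, xf in zip(dual_coef, X_fit))
            for x in X_new]

X = [[1.0], [2.0]]
y = [1.0, 2.0]
dual_coef = krr_fit(X, y, alpha=1.0)
pred = krr_predict(X, dual_coef, X)
```

With a linear kernel this matches ordinary ridge regression: the primal weight is 5 / (5 + 1) = 5/6, so the predictions are 5/6 and 5/3 rather than the unregularized 1 and 2.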

fit(X, y, sample_weight=None, *, convert_dtype=True) → KernelRidge[source]#

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
sample_weightarray-like (device or host) shape = (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

predict(X)[source]#

Predict using the kernel ridge model.

Parameters:

Xarray-like of shape (n_samples, n_features): Samples. If kernel == “precomputed” this is instead a precomputed kernel matrix, shape = [n_samples, n_samples_fitted], where n_samples_fitted is the number of samples used in the fitting for this estimator.

Returns:

Carray of shape (n_samples,) or (n_samples, n_targets): Returns predicted values.

Clustering#

K-Means Clustering#

class cuml.KMeans(*, handle=None, n_clusters=8, max_iter=300, tol=0.0001, verbose=False, random_state=None, init='scalable-k-means++', n_init='auto', oversampling_factor=2.0, max_samples_per_batch=32768, output_type=None)#

KMeans is a basic but powerful clustering method which is optimized via Expectation Maximization. It randomly selects K data points in X, and computes which samples are close to these points. For every cluster of points, a mean is computed (hence the name), and this becomes the new centroid.

cuML’s KMeans expects an array-like object or cuDF DataFrame, and supports the scalable KMeans++ initialization method. This method is more stable than randomly selecting K points.

Parameters:

handlecuml.Handle

n_clustersint (default = 8)

The number of centroids or clusters you want.

max_iterint (default = 300)

Maximum number of EM iterations. More iterations can give a more accurate result, at the cost of a longer run time.

tolfloat64 (default = 1e-4)

Stopping criterion when centroid means do not change much.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

random_stateint or None (default = None)

If you want results to be the same when you restart Python, select a state.

init{‘scalable-k-means++’, ‘k-means||’, ‘random’} or an ndarray (default = ‘scalable-k-means++’)

'scalable-k-means++' or 'k-means||': Uses fast and stable scalable kmeans++ initialization.
'random': Choose n_clusters observations (rows) at random from data for the initial centroids.
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

n_init: ‘auto’ or int (default = ‘auto’)

Number of instances the k-means algorithm will be called with different seeds. The final results will be from the instance that produces the lowest inertia out of n_init instances.

When n_init='auto', the number of runs depends on the value of init: 1 if using init="k-means||" or init="scalable-k-means++"; 10 otherwise.

Added in version 25.02: Added ‘auto’ option for n_init.

Changed in version 25.04: Default value for n_init changed from 1 to 'auto'.

oversampling_factorfloat64 (default = 2.0)

The number of points to sample in scalable k-means++ initialization for potential centroids. Increasing this value can lead to better initial centroids at the cost of memory. The total number of centroids sampled in scalable k-means++ is oversampling_factor * n_clusters * 8.

max_samples_per_batchint (default = 32768)

The number of data samples to use for batches of the pairwise distance computation. This computation is done throughout both fit and predict. The default should suit most cases. The total number of elements in the batched pairwise distance computation is max_samples_per_batch * n_clusters. It might become necessary to lower this number when n_clusters becomes prohibitively large.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Attributes:

cluster_centers_array: The coordinates of the final clusters. These represent the “mean” of each data cluster.
labels_array: Which cluster each datapoint belongs to.

Methods

`fit`(self, X[, y, sample_weight, convert_dtype])	Compute k-means clustering with X.
`fit_predict`(self, X[, y, sample_weight])	Compute cluster centers and predict cluster index for each sample.
`fit_transform`(self, X[, y, sample_weight, ...])	Compute clustering and transform X to cluster-distance space.
`predict`(self, X, *[, convert_dtype])	Predict the closest cluster each sample in X belongs to.
`score`(self, X[, y, sample_weight, convert_dtype])	Opposite of the value of X on the K-means objective.
`transform`(self, X, *[, convert_dtype])	Transform X to a cluster-distance space.

Notes

KMeans requires n_clusters to be specified. This means one needs to approximately guess or know how many clusters a dataset has. If one is not sure, one can start with a small number of clusters, and visualize the resulting clusters with PCA, UMAP or T-SNE, and verify that they look appropriate.

Applications of KMeans

The biggest advantage of KMeans is its speed and simplicity. That is why KMeans is many practitioners’ first choice of a clustering algorithm. KMeans has been extensively used when the number of clusters is approximately known, such as in big data clustering tasks, image segmentation and medical clustering.

For additional docs, see scikit-learn’s KMeans.
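The sizing rules stated in the parameter list (the 'auto' resolution of n_init, the number of centroid candidates sampled by scalable k-means++, and the size of one batched distance computation) are simple arithmetic. The sketch below only restates those documented formulas; the three helper functions are hypothetical and not part of cuML.

```python
def resolve_n_init(n_init, init):
    """Number of runs: 1 for the scalable initializations, 10 otherwise."""
    if n_init != "auto":
        return int(n_init)
    if isinstance(init, str) and init in ("k-means||", "scalable-k-means++"):
        return 1
    return 10

def sampled_centroid_candidates(oversampling_factor, n_clusters):
    """Total centroids sampled: oversampling_factor * n_clusters * 8."""
    return oversampling_factor * n_clusters * 8

def batch_elements(max_samples_per_batch, n_clusters):
    """Elements in one batched pairwise distance computation."""
    return max_samples_per_batch * n_clusters

runs_default = resolve_n_init("auto", "scalable-k-means++")
runs_random = resolve_n_init("auto", "random")
candidates = sampled_centroid_candidates(2.0, 8)
elements = batch_elements(32768, 8)
```

With the default settings this gives a single run, 128 sampled candidates, and 262144 distance elements per batch; lowering max_samples_per_batch shrinks the last figure proportionally when n_clusters is large.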

Examples

>>> # Both import methods supported
>>> from cuml import KMeans
>>> from cuml.cluster import KMeans
>>> import cudf
>>> import numpy as np
>>> import pandas as pd
>>>
>>> a = np.asarray([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]],
...                dtype=np.float32)
>>> b = cudf.DataFrame(a)
>>> # Input:
>>> b
    0    1
0  1.0  1.0
1  1.0  2.0
2  3.0  2.0
3  4.0  3.0
>>>
>>> # Calling fit
>>> kmeans_float = KMeans(n_clusters=2, n_init="auto", random_state=1)
>>> kmeans_float.fit(b)
KMeans()
>>>
>>> # Labels:
>>> kmeans_float.labels_
0    0
1    0
2    1
3    1
dtype: int32
>>> # cluster_centers:
>>> kmeans_float.cluster_centers_
    0    1
0  1.0  1.5
1  3.5  2.5
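The assign-then-average loop described at the top of this class can be reproduced on the same four points in plain Python. This is a CPU sketch of the classic iteration with explicitly chosen starting centroids, not cuML's implementation; the helper names are hypothetical, and stopping when the largest centroid shift is at most tol is an assumption about how the criterion is measured.

```python
import math

def assign(X, centers):
    """Index of the closest centroid for every sample."""
    return [min(range(len(centers)), key=lambda k: math.dist(x, centers[k]))
            for x in X]

def update(X, labels, centers):
    """Replace each centroid by the mean of the samples assigned to it."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [x for x, l in zip(X, labels) if l == k]
        if not members:
            new_centers.append(old)  # keep the centroid of an empty cluster
            continue
        new_centers.append([sum(col) / len(members) for col in zip(*members)])
    return new_centers

def kmeans(X, init_centers, max_iter=300, tol=1e-4):
    centers = init_centers
    for _ in range(max_iter):
        labels = assign(X, centers)
        new_centers = update(X, labels, centers)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:  # ASSUMPTION: stop once centroids barely move
            break
    return centers, assign(X, centers)

X = [[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]]
centers, labels = kmeans(X, init_centers=[X[0], X[3]])

# Inertia: sum of squared distances to the assigned centroid.
inertia = sum(math.dist(x, centers[l]) ** 2 for x, l in zip(X, labels))
```

Starting from the first and last rows, the loop converges to the centroids (1.0, 1.5) and (3.5, 2.5) with labels [0, 0, 1, 1], which agrees with the cuML output above.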

fit(self, X, y=None, sample_weight=None, *, convert_dtype=True) → 'KMeans'[source]#

Compute k-means clustering with X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
sample_weightarray-like (device or host) shape = (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

fit_predict(self, X, y=None, sample_weight=None) → CumlArray[source]#

Compute cluster centers and predict cluster index for each sample.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
sample_weightarray-like (device or host) shape = (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Cluster indexes

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

fit_transform(self, X, y=None, sample_weight=None, *, convert_dtype=False) → CumlArray[source]#

Compute clustering and transform X to cluster-distance space.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
sample_weightarray-like (device or host) shape = (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = False): When set to True, the fit_transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_clusters)

Transformed data

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

predict(self, X, *, convert_dtype=True) → CumlArray[source]#

Predict the closest cluster each sample in X belongs to.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Cluster indexes

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

score(self, X, y=None, sample_weight=None, *, convert_dtype=True)[source]#

Opposite of the value of X on the K-means objective.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
sample_weightarray-like (device or host) shape = (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the score method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

scorefloat: Opposite of the value of X on the K-means objective.
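As a rough illustration of what the "opposite of the K-means objective" means, the sketch below negates the sum of squared distances from each sample to its nearest centroid, in plain Python. It assumes the objective is the usual within-cluster sum of squares; the helper name `kmeans_score` and the data are made up for illustration, and this is not cuML's implementation.

```python
def kmeans_score(X, centroids):
    """Return the negated sum of squared distances to the nearest centroid."""
    total = 0.0
    for row in X:
        # Squared Euclidean distance from this sample to every centroid.
        sq_dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centroids]
        total += min(sq_dists)
    return -total


# Hypothetical data: two samples sit 1.0 away from the first centroid,
# the third sample coincides with the second centroid.
X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]]
centroids = [[0.0, 1.0], [10.0, 0.0]]
score = kmeans_score(X, centroids)
```

Under this assumption a score closer to zero indicates a tighter fit of the samples to the centroids.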

transform(self, X, *, convert_dtype=True) → CumlArray[source]#

Transform X to a cluster-distance space.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_clusters)

Transformed data

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
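To make the (n_samples, n_clusters) output shape concrete, here is a minimal plain-Python sketch of a cluster-distance space: each sample becomes a row of distances to every centroid, and the closest cluster is the index of the smallest entry. It assumes Euclidean distance; the function names and data are hypothetical and do not reflect cuML internals.

```python
import math


def to_cluster_distance_space(X, centroids):
    """Map each sample to its distances from every centroid."""
    return [[math.dist(row, c) for c in centroids] for row in X]


def nearest_cluster(X, centroids):
    """Index of the closest centroid for each sample."""
    distances = to_cluster_distance_space(X, centroids)
    return [min(range(len(row)), key=lambda j: row[j]) for row in distances]


# Hypothetical data: 2 samples, 3 centroids, 2 features.
X = [[0.0, 0.0], [3.0, 4.0]]
centroids = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
X_new = to_cluster_distance_space(X, centroids)   # 2 rows, 3 columns
labels = nearest_cluster(X, centroids)
```

The transformed data has one column per cluster, which is why its width depends on the number of clusters rather than on n_features.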

DBSCAN#

class cuml.DBSCAN(*, eps=0.5, handle=None, min_samples=5, metric='euclidean', algorithm='brute', verbose=False, max_mbytes_per_batch=None, output_type=None, calc_core_sample_indices=True)#

DBSCAN is a very powerful yet fast clustering technique that finds clusters where data is concentrated. This allows DBSCAN to generalize to many problems if the datapoints tend to congregate in larger groups.

cuML’s DBSCAN expects an array-like object or cuDF DataFrame, and constructs an adjacency graph to compute the distances between close neighbours.

Parameters:

epsfloat (default = 0.5): The maximum distance between 2 points such that they reside in the same neighborhood.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
min_samplesint (default = 5): The number of samples in a neighborhood such that this group can be considered as an important core point (including the point itself).
metric: {‘euclidean’, ‘cosine’, ‘precomputed’}, default = ‘euclidean’: The metric to use when calculating distances between points. If metric is ‘precomputed’, X is assumed to be a distance matrix and must be square. The input will be modified temporarily when cosine distance is used and the restored input matrix might not match completely due to numerical rounding.
algorithm: {‘brute’, ‘rbc’}, default = ‘brute’: The algorithm to be used for nearest neighbor computations.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
max_mbytes_per_batch(optional) int64: Calculate batch size using no more than this number of megabytes for the pairwise distance computation. This enables the trade-off between runtime and memory usage for making the N^2 pairwise distance computations more tractable for large numbers of samples. If you are experiencing out of memory errors when running DBSCAN, you can set this value based on the memory size of your device. Note: this option does not set the maximum total memory used in the DBSCAN computation and so this value will not be able to be set to the total memory available on the device.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
calc_core_sample_indices(optional) boolean (default = True): Indicates whether the indices of the core samples should be calculated. If True (the default), core_sample_indices_ and components_ will be computed and stored as fitted attributes. Set to False to avoid computing these attributes, removing a small amount of overhead.
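The trade-off behind max_mbytes_per_batch can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation only: it assumes a dense block of 4-byte values, takes 1 MB as 10^6 bytes, and uses hypothetical helper names. cuML's actual batch-size calculation is internal and may differ.

```python
def pairwise_megabytes(n_samples, bytes_per_value=4):
    """Memory for a dense n_samples x n_samples matrix, in MB (10^6 bytes)."""
    return n_samples * n_samples * bytes_per_value / 1e6


def rows_per_batch(n_samples, max_mbytes, bytes_per_value=4):
    """How many rows of the pairwise matrix fit within max_mbytes."""
    rows = int(max_mbytes * 1e6 // (n_samples * bytes_per_value))
    # Always process at least one row, and never more rows than exist.
    return max(1, min(rows, n_samples))


full_mb = pairwise_megabytes(100_000)            # whole matrix at once
batch_rows = rows_per_batch(100_000, 2_000)      # rows per batch under a 2,000 MB budget
```

With these assumptions, 100,000 samples would need about 40,000 MB for the full matrix, while a 2,000 MB budget allows roughly 5,000 rows per batch at the cost of more passes.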

Attributes:

labels_array-like or cuDF series: Which cluster each datapoint belongs to. Noisy samples are labeled as -1. Format depends on cuml global output type and estimator output_type.
core_sample_indices_array-like or cuDF series: The indices of the core samples. Only calculated if calc_core_sample_indices=True.
components_array-like or cuDF series: Copy of each core sample found by training. Only calculated if calc_core_sample_indices=True.

Methods

`fit`(self, X[, y, sample_weight, out_dtype, ...])	Perform DBSCAN clustering from features.
`fit_predict`(self, X[, y, sample_weight, ...])	Performs clustering on X and returns cluster labels.

Notes

DBSCAN is very sensitive to the distance metric it is used with, and a large assumption is that datapoints need to be concentrated in groups for clusters to be constructed.

Applications of DBSCAN

DBSCAN’s main benefit is that the number of clusters is not a hyperparameter, and that it can find non-linearly shaped clusters. This also allows DBSCAN to be robust to noise. DBSCAN has been applied to analyzing particle collisions in the Large Hadron Collider, customer segmentation in marketing analyses, and much more.

For additional docs, see scikit-learn’s DBSCAN.

Examples

>>> # Both import methods supported
>>> from cuml import DBSCAN
>>> from cuml.cluster import DBSCAN
>>>
>>> import cudf
>>> import numpy as np
>>>
>>> gdf_float = cudf.DataFrame()
>>> gdf_float['0'] = np.asarray([1.0,2.0,5.0], dtype = np.float32)
>>> gdf_float['1'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
>>> gdf_float['2'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
>>>
>>> dbscan_float = DBSCAN(eps = 1.0, min_samples = 1)
>>> dbscan_float.fit(gdf_float)
DBSCAN()
>>> dbscan_float.labels_
0    0
1    1
2    2
dtype: int32
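The result above can be reproduced by hand with a small CPU reference. The following is a minimal plain-Python sketch of the DBSCAN idea (core points, eps-neighborhoods, cluster expansion), assuming Euclidean distance and a neighborhood that includes the point itself. The function name `dbscan_labels` is hypothetical and this is not how cuML computes its result; it only shows why every point in the example ends up in its own cluster.

```python
import math


def dbscan_labels(X, eps, min_samples):
    """Naive O(n^2) DBSCAN returning one label per point (-1 means noise)."""
    n = len(X)
    # Neighborhood of each point: all points within eps, including itself.
    neighbors = [[j for j in range(n) if math.dist(X[i], X[j]) <= eps] for i in range(n)]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [-1] * n  # -1 marks noise until a cluster claims the point
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from this unassigned core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels


# Same three rows as the cuDF DataFrame in the example above.
X = [[1.0, 4.0, 4.0], [2.0, 2.0, 2.0], [5.0, 1.0, 1.0]]
labels_min1 = dbscan_labels(X, eps=1.0, min_samples=1)
labels_min2 = dbscan_labels(X, eps=1.0, min_samples=2)
```

All pairwise distances here exceed eps = 1.0, so with min_samples = 1 each point is a core point of its own cluster, while with min_samples = 2 no point is core and every point is labeled as noise.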

fit(self, X, y=None, sample_weight=None, *, out_dtype='int32', convert_dtype=True) → 'DBSCAN'[source]#

Perform DBSCAN clustering from features.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
sample_weightarray-like (device or host) shape = (n_samples,), default=None: Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. If None, all samples are given a weight of 1. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.
out_dtypedtype, default=“int32”: Determines the precision of the output labels array. Valid values are {“int32”, np.int32, “int64”, np.int64}.
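The sample_weight semantics described above can be sketched as a weighted neighborhood count: a point is treated as core when the weights of all points within eps of it (itself included) sum to at least min_samples. This is a plain-Python illustration under that assumption, with a hypothetical helper `core_mask` and made-up data; it is not cuML's implementation.

```python
import math


def core_mask(X, eps, min_samples, sample_weight=None):
    """Flag core points using the summed weight of each eps-neighborhood."""
    n = len(X)
    # None is treated as weight 1 for every sample.
    w = [1.0] * n if sample_weight is None else list(sample_weight)
    mask = []
    for i in range(n):
        # Sum the weights of all points within eps of point i (itself included).
        total = sum(w[j] for j in range(n) if math.dist(X[i], X[j]) <= eps)
        mask.append(total >= min_samples)
    return mask


# Hypothetical data: two nearby points and one isolated point.
X = [[0.0, 0.0], [0.5, 0.0], [5.0, 5.0]]
heavy = core_mask(X, eps=1.0, min_samples=3, sample_weight=[1.0, 1.0, 3.0])
plain = core_mask(X, eps=1.0, min_samples=2)
negative = core_mask(X, eps=1.0, min_samples=2, sample_weight=[1.0, -1.0, 1.0])
```

In this sketch the isolated point becomes core on its own once its weight reaches min_samples, and the negative weight on the second point cancels its neighbor's contribution so that neither of the two nearby points remains core.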

fit_predict(self, X, y=None, sample_weight=None, *, out_dtype='int32') → CumlArray[source]#

Performs clustering on X and returns cluster labels.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features)

yarray-like (device or host) shape = (n_samples, 1)

sample_weightarray-like (device or host) shape = (n_samples,), default=None

Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. If None, all samples are given a weight of 1. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

out_dtypedtype, default=“int32”

Determines the precision of the output labels array. Valid values are {“int32”, np.int32, “int64”, np.int64}.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Cluster labels

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

Agglomerative Clustering#

class cuml.AgglomerativeClustering(*, n_clusters=2, metric=None, linkage='single', handle=None, verbose=False, connectivity='knn', n_neighbors=10, output_type=None)#

Agglomerative Clustering

Recursively merges the pair of clusters that minimally increases a given linkage distance.

Parameters:

handlecuml.Handle

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

n_clustersint (default = 2)

The number of clusters to find.

metricstr, default=None

Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, or “cosine”. If set to None then “euclidean” is used. If connectivity is “knn” only “euclidean” is accepted.

Added in version 24.06.

linkage{“single”}, default=”single”

Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pairs of clusters that minimize this criterion.

‘single’ uses the minimum of the distances between all observations of the two sets.

n_neighborsint (default = 10)

The number of neighbors to compute when connectivity = “knn”

connectivity{“pairwise”, “knn”}, (default = “knn”)

The type of connectivity matrix to compute.

‘pairwise’ will compute the entire fully-connected graph of pairwise distances between each set of points. This is the fastest to compute and can be very fast for smaller datasets but requires O(n^2) space.
‘knn’ will sparsify the fully-connected connectivity matrix to save memory and enable much larger inputs. “n_neighbors” will control the amount of memory used and the graph will be connected automatically in the event “n_neighbors” was not large enough to connect it.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
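For intuition about the ‘single’ linkage criterion, the following plain-Python sketch repeatedly merges the two clusters whose closest pair of observations is nearest, until n_clusters remain. It is a naive reference that computes all pairwise distances, assumes Euclidean distance, and uses a hypothetical function name; cuML's GPU implementation and its connectivity options work differently.

```python
import math


def single_linkage_labels(X, n_clusters):
    """Naive single-linkage agglomerative clustering; returns one label per row."""
    clusters = [[i] for i in range(len(X))]  # start with every point on its own
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: minimum distance over all cross-cluster pairs.
                d = min(math.dist(X[i], X[j]) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge the closest pair of clusters
        del clusters[b]
    labels = [0] * len(X)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical 1-D data with two well-separated groups.
X = [[0.0], [1.0], [2.0], [10.0], [11.0]]
labels = single_linkage_labels(X, n_clusters=2)
```

Because only the closest pair matters, single linkage chains the first three points together through their unit gaps and keeps the distant pair as a separate cluster.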

Attributes:

children_
labels_

Methods

`fit`(self, X[, y, convert_dtype])	Fit the hierarchical clustering from features.
`fit_predict`(self, X[, y])	Fit the hierarchical clustering from features and return cluster labels.

fit(self, X, y=None, *, convert_dtype=True) → 'AgglomerativeClustering'[source]#

Fit the hierarchical clustering from features.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

fit_predict(self, X, y=None) → CumlArray[source]#

Fit the hierarchical clustering from features and return cluster labels.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Cluster indexes

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

HDBSCAN#

class cuml.cluster.hdbscan.HDBSCAN(*, min_cluster_size=5, min_samples=None, cluster_selection_epsilon=0.0, max_cluster_size=0, metric='euclidean', alpha=1.0, p=None, cluster_selection_method='eom', allow_single_cluster=False, gen_min_span_tree=False, handle=None, verbose=False, output_type=None, prediction_data=False)#

HDBSCAN Clustering

Recursively merges the pair of clusters that minimally increases a given linkage distance.

Note that while the algorithm is generally deterministic and should provide matching results between RAPIDS and the Scikit-learn Contrib versions, the construction of the k-nearest neighbors graph and minimum spanning tree can introduce differences between the two algorithms, especially when several nearest neighbors around a point might have the same distance. While the differences in the minimum spanning trees alone might be subtle, they can (and often will) lead to some points being assigned different cluster labels between the two implementations.

Parameters:

handlecuml.Handle

alphafloat, optional (default=1.0)

A distance scaling parameter as used in robust single linkage.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

min_cluster_sizeint, optional (default = 5)

The minimum number of samples in a group for that group to be considered a cluster; groupings smaller than this size will be left as noise.

min_samplesint, optional (default=None)

The number of samples in a neighborhood for a point to be considered as a core point. This includes the point itself. If ‘None’, it defaults to the min_cluster_size.

cluster_selection_epsilonfloat, optional (default=0.0)

A distance threshold. Clusters below this value will be merged. Note that this should not be used if we want to predict the cluster labels for new points in future (e.g. using approximate_predict), as the approximate_predict function is not aware of this argument.

max_cluster_sizeint, optional (default=0)

A limit to the size of clusters returned by the eom algorithm. Has no effect when using leaf clustering (where clusters are usually small regardless) and can also be overridden in rare cases by a high value for cluster_selection_epsilon. Note that this should not be used if we want to predict the cluster labels for new points in future (e.g. using approximate_predict), as the approximate_predict function is not aware of this argument.

metricstring, optional (default=’euclidean’)

The metric to use when calculating distance between instances in a feature array. Allowed values: ‘euclidean’.

pint, optional (default=None)

p value to use if using the minkowski metric.

cluster_selection_methodstring, optional (default=’eom’)

The method used to select clusters from the condensed tree. The standard approach for HDBSCAN* is to use an Excess of Mass algorithm to find the most persistent clusters. Alternatively you can instead select the clusters at the leaves of the tree – this provides the most fine grained and homogeneous clusters. Options are:

eom

leaf

allow_single_clusterbool, optional (default=False)

By default HDBSCAN* will not produce a single cluster. Setting this to True overrides that behavior and allows single-cluster results if you feel this is a valid result for your dataset.

gen_min_span_treebool, optional (default=False)

Whether to populate the minimum_spanning_tree_ member for utilizing plotting tools. This requires the hdbscan CPU Python package to be installed.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

prediction_databool, optional (default=False)

Whether to generate extra cached data for predicting labels or membership vectors for new unseen points later. If you wish to persist the clustering object for later re-use you probably want to set this to True.

Attributes:

labels_ndarray, shape (n_samples, ): Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.
probabilities_ndarray, shape (n_samples, ): The strength with which each sample is a member of its assigned cluster. Noise points have probability zero; points in clusters have values assigned proportional to the degree that they persist as part of the cluster.
cluster_persistence_ndarray, shape (n_clusters, ): A score of how persistent each cluster is. A score of 1.0 represents a perfectly stable cluster that persists over all distance scales, while a score of 0.0 represents a perfectly ephemeral cluster. These scores can be used to gauge the relative coherence of the clusters output by the algorithm.
condensed_tree_CondensedTree object: HDBSCAN.condensed_tree_(self)
single_linkage_tree_SingleLinkageTree object: HDBSCAN.single_linkage_tree_(self)
minimum_spanning_tree_MinimumSpanningTree object: HDBSCAN.minimum_spanning_tree_(self)

Methods

`fit`(self, X[, y, convert_dtype])	Fit HDBSCAN model from features.
`fit_predict`(self, X[, y])	Fit the HDBSCAN model from features and return cluster labels.
`generate_prediction_data`(self)	Create data that caches intermediate results used for predicting the label of new/unseen points.

property condensed_tree_#

property dtype#

fit(self, X, y=None, *, convert_dtype=True) → 'HDBSCAN'[source]#

Fit HDBSCAN model from features.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

fit_predict(self, X, y=None) → CumlArray[source]#

Fit the HDBSCAN model from features and return cluster labels.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

Returns:

predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Cluster indexes

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

generate_prediction_data(self)[source]#: Create data that caches intermediate results used for predicting the label of new/unseen points. This data is only useful if you are intending to use functions from hdbscan.prediction.

property minimum_spanning_tree_#

property prediction_data_#

property single_linkage_tree_#

cuml.cluster.hdbscan.all_points_membership_vectors(clusterer, batch_size=4096)[source]#

Predict soft cluster membership vectors for all points in the original dataset the clusterer was trained on. This function is more efficient by making use of the fact that all points are already in the condensed tree, and processing in bulk.

Parameters:

clustererHDBSCAN: A clustering object that has been fit to the data and had prediction_data=True set.
batch_sizeint, optional, default=min(4096, n_rows): Lowers memory requirement by computing distance-based membership in smaller batches of points in the training data. For example, a batch size of 1,000 computes distance based memberships for 1,000 points at a time. The default batch size is 4,096.

Returns:

membership_vectorsarray (n_samples, n_clusters): The probability that point i of the original dataset is a member of cluster j is in membership_vectors[i, j].
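The effect of batch_size can be pictured as splitting the training rows into consecutive chunks that are processed one at a time. The sketch below only shows the index arithmetic, using a hypothetical helper `batch_ranges`; how cuML schedules the work on the GPU is not shown and may differ.

```python
def batch_ranges(n_rows, batch_size=4096):
    """Split row indices 0..n_rows into consecutive (start, stop) batches."""
    if n_rows <= 0:
        return []
    size = min(batch_size, n_rows)  # mirrors the documented default min(4096, n_rows)
    return [(start, min(start + size, n_rows)) for start in range(0, n_rows, size)]


ranges = batch_ranges(10_000)                      # default batch size
small = batch_ranges(1_000)                        # fewer rows than one batch
fine = batch_ranges(10_000, batch_size=1_000)      # smaller batches, more passes
```

A smaller batch size lowers the peak memory needed per pass at the cost of more passes over the data.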

cuml.cluster.hdbscan.membership_vector(clusterer, points_to_predict, batch_size=4096, convert_dtype=True)[source]#

Predict soft cluster membership. The result produces a vector for each point in points_to_predict that gives a probability that the given point is a member of a cluster for each of the selected clusters of the clusterer.

Parameters:

clustererHDBSCAN: A clustering object that has been fit to the data and either had prediction_data=True set, or called the generate_prediction_data method after the fact.
points_to_predictarray, or array-like (n_samples, n_features): The new data points to predict cluster labels for. They should have the same dimensionality as the original dataset over which clusterer was fit.
batch_sizeint, optional, default=min(4096, n_points_to_predict): Lowers memory requirement by computing distance-based membership in smaller batches of points in the prediction data. For example, a batch size of 1,000 computes distance based memberships for 1,000 points at a time. The default batch size is 4,096.

Returns:

membership_vectorsarray (n_samples, n_clusters): The probability that point i is a member of cluster j is in membership_vectors[i, j].
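As an illustration of how the returned matrix is indexed, the sketch below takes a small made-up membership matrix and derives a hard label per point from the most probable cluster. The values, the helper `hard_labels`, and the `threshold` cutoff for calling a point noise are all hypothetical choices for this example and are not part of cuML.

```python
# Hypothetical membership vectors: 3 points (rows), 2 clusters (columns).
membership_vectors = [
    [0.70, 0.20],
    [0.05, 0.90],
    [0.01, 0.02],
]


def hard_labels(vectors, threshold=0.1):
    """Pick the most probable cluster per point, or -1 if nothing is likely."""
    labels = []
    for row in vectors:
        # Column index j with the largest membership_vectors[i][j].
        best = max(range(len(row)), key=lambda j: row[j])
        labels.append(best if row[best] >= threshold else -1)
    return labels


labels = hard_labels(membership_vectors)
p_point1_cluster1 = membership_vectors[1][1]  # probability that point 1 is in cluster 1
```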

cuml.cluster.hdbscan.approximate_predict(clusterer, points_to_predict, convert_dtype=True)[source]#

Predict the cluster label of new points. The returned labels will be those of the original clustering found by clusterer, and therefore are not (necessarily) the cluster labels that would be found by clustering the original data combined with points_to_predict, hence the ‘approximate’ label.

If you simply wish to assign new points to an existing clustering in the ‘best’ way possible, this is the function to use. If you want to predict how points_to_predict would cluster with the original data under HDBSCAN the most efficient existing approach is to simply recluster with the new point(s) added to the original dataset.

Parameters:

clustererHDBSCAN: A clustering object that has been fit to the data and had prediction_data=True set.
points_to_predictarray, or array-like (n_samples, n_features): The new data points to predict cluster labels for. They should have the same dimensionality as the original dataset over which clusterer was fit.

Returns:

labelsarray (n_samples,): The predicted labels of the points_to_predict
probabilitiesarray (n_samples,): The soft cluster scores for each of the points_to_predict

Dimensionality Reduction and Manifold Learning#

Principal Component Analysis#

class cuml.PCA(*, copy=True, handle=None, iterated_power=15, n_components=None, svd_solver='auto', tol=1e-07, verbose=False, whiten=False, output_type=None)#

PCA (Principal Component Analysis) is a fundamental dimensionality reduction technique used to combine features in X in linear combinations such that each new component captures the most information or variance of the data. n_components is usually small, say 3, so that the result can be used for data visualization, data compression and exploratory analysis.

cuML’s PCA expects an array-like object or cuDF DataFrame, and provides two algorithms, Full and Jacobi. Full (default) uses a full eigendecomposition then selects the top K eigenvectors. The Jacobi algorithm is much faster as it iteratively tries to correct the top K eigenvectors, but might be less accurate.

Parameters:

copyboolean (default = True): If True, then copies data then removes mean from data. False might cause data to be overwritten with its mean centered version.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
iterated_powerint (default = 15): Used in Jacobi solver. The more iterations, the more accurate, but slower.
n_componentsint (default = None): The number of top K singular vectors / values you want. Must be <= number(columns). If n_components is not set, then all components are kept:

n_components = min(n_samples, n_features)
svd_solver‘full’ or ‘jacobi’ or ‘auto’ (default = ‘auto’): Full uses an eigendecomposition of the covariance matrix then discards components. Jacobi is much faster as it iteratively corrects, but is less accurate.
tolfloat (default = 1e-7): Used if svd_solver = “jacobi”. A smaller tolerance can increase accuracy, but will slow down the algorithm’s convergence.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
whitenboolean (default = False): If True, de-correlates the components. This is done by dividing them by the corresponding singular values then multiplying by sqrt(n_samples). Whitening allows each component to have unit variance and removes multi-collinearity. It might be beneficial for downstream tasks like LinearRegression where correlated features cause problems.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

Attributes:

components_array: The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X)
explained_variance_array: How much of the variance in the data each component explains, given by S**2 / (n_samples - 1)
explained_variance_ratio_array: How much in % the variance is explained given by S**2/sum(S**2)
singular_values_array: The top K singular values. Remember all singular values >= 0
mean_array: The column-wise mean of X. Used to mean-center the data first.
noise_variance_float: From Bishop 1999’s Textbook. Used in later tasks like calculating the estimated covariance of X.
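The relationship between these attributes can be checked with a few lines of arithmetic. The sketch below starts from the rounded singular values printed in the Examples for this class (fit on 3 samples) and assumes that the explained variance divides S**2 by n_samples - 1 and that the listed singular values account for all of the variance; both assumptions are consistent with the printed output but are not taken from cuML's source.

```python
singular_values = [4.125, 0.989]  # rounded values from the PCA example output
n_samples = 3                     # the example DataFrame has 3 rows

squared = [s ** 2 for s in singular_values]

# Variance explained by each component (assumed divisor: n_samples - 1).
explained_variance = [sq / (n_samples - 1) for sq in squared]

# Share of the total variance explained by each component: S**2 / sum(S**2).
explained_variance_ratio = [sq / sum(squared) for sq in squared]
```

Up to rounding of the inputs, these values line up with the explained variance (about 8.51 and 0.489) and the ratios (about 0.9456 and 0.054) shown in the example output.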

Methods

`fit`(self, X[, y, convert_dtype])	Fit the model with X.
`fit_transform`(self, X[, y])	Fit the model with X and apply the dimensionality reduction on X.
`inverse_transform`(self, X, *[, ...])	Transform data back to its original space.
`transform`(self, X, *[, convert_dtype])	Apply dimensionality reduction to X.

Notes

PCA considers linear combinations of features, specifically those that maximize global variance structure. This means PCA is fantastic for global structure analyses, but weak for local relationships. Consider UMAP or T-SNE for a locally important embedding.

Applications of PCA

PCA is used extensively in practice for data visualization and data compression. It has been used to visualize extremely large word embeddings like Word2Vec and GloVe in 2 or 3 dimensions, to explore large datasets of everyday objects and images, and to distinguish cancerous cells from healthy cells.

For additional docs, see scikit-learn’s PCA.

Examples

>>> # Both import methods supported
>>> from cuml import PCA
>>> from cuml.decomposition import PCA

>>> import cudf
>>> import cupy as cp

>>> gdf_float = cudf.DataFrame()
>>> gdf_float['0'] = cp.asarray([1.0,2.0,5.0], dtype = cp.float32)
>>> gdf_float['1'] = cp.asarray([4.0,2.0,1.0], dtype = cp.float32)
>>> gdf_float['2'] = cp.asarray([4.0,2.0,1.0], dtype = cp.float32)

>>> pca_float = PCA(n_components = 2)
>>> pca_float.fit(gdf_float)
PCA()

>>> print(f'components: {pca_float.components_}')
components: 0           1           2
0  0.69225764  -0.5102837 -0.51028395
1 -0.72165036 -0.48949987  -0.4895003
>>> print(f'explained variance: {pca_float.explained_variance_}')
explained variance: 0   8.510...
1 0.489...
dtype: float32
>>> exp_var = pca_float.explained_variance_ratio_
>>> print(f'explained variance ratio: {exp_var}')
explained variance ratio: 0   0.9456...
1 0.054...
dtype: float32

>>> print(f'singular values: {pca_float.singular_values_}')
singular values: 0 4.125...
1 0.989...
dtype: float32
>>> print(f'mean: {pca_float.mean_}')
mean: 0 2.666...
1 2.333...
2 2.333...
dtype: float32
>>> trans_gdf_float = pca_float.transform(gdf_float)
>>> print(f'Transformed: {trans_gdf_float}')
Transformed: 0           1
0   -2.8547091 -0.42891636
1 -0.121316016  0.80743366
2    2.9760244 -0.37851727
>>> input_gdf_float = pca_float.inverse_transform(trans_gdf_float)
>>> print(f'Input: {input_gdf_float}')
Input: 0         1         2
0 1.0 4.0 4.0
1 2.0 2.0 2.0
2 5.0 1.0 1.0

fit(self, X, y=None, *, convert_dtype=True) → 'PCA'[source]#

Fit the model with X. y is currently ignored.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

fit_transform(self, X, y=None) → CumlArray[source]#

Fit the model with X and apply the dimensionality reduction on X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

Returns:

transcuDF, CuPy or NumPy object depending on cuML’s output type configuration, cupyx.scipy.sparse for sparse output, shape = (n_samples, n_components)

Transformed values

For more information on how to configure cuML’s dense output type, refer to: Output Data Type Configuration.

inverse_transform(self, X, *, convert_dtype=False, return_sparse=False, sparse_tol=1e-10) → CumlArray[source]#

Transform data back to its original space.

In other words, return an input X_original whose transform would be X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = False): When set to True, the inverse_transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
return_sparsebool, optional (default = False): Ignored when the model is not fit on a sparse matrix. If True, the method will convert the result to a cupyx.scipy.sparse.csr_matrix object. NOTE: Currently, there is a loss of information when converting to a CSR matrix (cusolver bug). The default will be switched to True once this is solved.
sparse_tolfloat, optional (default = 1e-10): Ignored when return_sparse=False. When return_sparse=True, values in the inverse transform below this parameter are clipped to 0.

Returns:

X_invcuDF, CuPy or NumPy object depending on cuML’s output type configuration, cupyx.scipy.sparse for sparse output, shape = (n_samples, n_features)

Transformed values

For more information on how to configure cuML’s dense output type, refer to: Output Data Type Configuration.
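The relationship between transform and inverse_transform can be sketched with NumPy. This is a CPU illustration of the underlying linear algebra (without whitening), not cuML's code path. The toy data has rank 2 after centering, so two components reconstruct it exactly, while one component leaves a residual equal to the discarded singular value.

```python
import numpy as np

X = np.array([[1.0, 4.0, 4.0],
              [2.0, 2.0, 2.0],
              [5.0, 1.0, 1.0]])
mean = X.mean(axis=0)
U, S, VT = np.linalg.svd(X - mean, full_matrices=False)

def transform(X, k):
    # Project the centered data onto the first k components.
    return (X - mean) @ VT[:k].T

def inverse_transform(T, k):
    # Map the reduced coordinates back and restore the mean.
    return T @ VT[:k] + mean

# Two components: the round trip recovers the input.
X_rec2 = inverse_transform(transform(X, 2), 2)

# One component: the reconstruction is lossy.
X_rec1 = inverse_transform(transform(X, 1), 1)
err = np.linalg.norm(X - X_rec1)
print(err, S[1])  # the error matches the discarded singular value
```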

transform(self, X, *, convert_dtype=True) → CumlArray[source]#

Apply dimensionality reduction to X.

X is projected on the first principal components previously extracted from a training set.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

transcuDF, CuPy or NumPy object depending on cuML’s output type configuration, cupyx.scipy.sparse for sparse output, shape = (n_samples, n_components)

Transformed values

For more information on how to configure cuML’s dense output type, refer to: Output Data Type Configuration.

Incremental PCA#

class cuml.IncrementalPCA(*, handle=None, n_components=None, whiten=False, copy=True, batch_size=None, verbose=False, output_type=None)[source]#

Based on sklearn.decomposition.IncrementalPCA from scikit-learn 0.23.1

Incremental principal components analysis (IPCA). Linear dimensionality reduction using Singular Value Decomposition of the data, keeping only the most significant singular vectors to project the data to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD. Depending on the size of the input data, this algorithm can be much more memory efficient than a PCA, and allows sparse input. This algorithm has constant memory complexity, on the order of batch_size * n_features, enabling use of np.memmap files without loading the entire file into memory. For sparse matrices, the input is converted to dense in batches (in order to be able to subtract the mean) which avoids storing the entire dense matrix at any one time. The computational overhead of each SVD is O(batch_size * n_features ** 2), but only 2 * batch_size samples remain in memory at a time. There will be n_samples / batch_size SVD computations to get the principal components, versus 1 large SVD of complexity O(n_samples * n_features ** 2) for PCA.

Parameters:

handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
n_componentsint or None, (default=None): Number of components to keep. If n_components is None, then n_components is set to min(n_samples, n_features).
whitenbool, optional: If True, de-correlates the components. This is done by dividing them by the corresponding singular values then multiplying by sqrt(n_samples). Whitening allows each component to have unit variance and removes multi-collinearity. It might be beneficial for downstream tasks like LinearRegression where correlated features cause problems.
copybool, (default=True): If False, X will be overwritten. copy=False can be used to save memory but is unsafe for general use.
batch_sizeint or None, (default=None): The number of samples to use for each batch. Only used when calling fit. If batch_size is None, then batch_size is inferred from the data and set to 5 * n_features, to provide a balance between approximation accuracy and memory consumption.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

Attributes:

components_array, shape (n_components, n_features): Components with maximum variance.
explained_variance_array, shape (n_components,): Variance explained by each of the selected components.
explained_variance_ratio_array, shape (n_components,): Percentage of variance explained by each of the selected components. If all components are stored, the sum of explained variances is equal to 1.0.
singular_values_array, shape (n_components,): The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.
mean_array, shape (n_features,): Per-feature empirical mean, aggregate over calls to partial_fit.
var_array, shape (n_features,): Per-feature empirical variance, aggregate over calls to partial_fit.
noise_variance_float: The estimated noise covariance following the Probabilistic PCA model from [4].
n_components_int: The estimated number of components. Relevant when n_components=None.
n_samples_seen_int: The number of samples processed by the estimator. Will be reset on new calls to fit, but increments across partial_fit calls.
batch_size_int: Inferred batch size from batch_size.
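The idea behind mean_ and var_ being aggregated over calls to partial_fit can be illustrated with a running-statistics update in NumPy. This is a sketch of one standard way to merge per-batch statistics (a pairwise update of count, mean, and sum of squared deviations); the helper name update_mean_var is hypothetical, and the use of population variance (ddof=0) is an assumption of this sketch rather than a statement about cuML's internals.

```python
import numpy as np

def update_mean_var(count, mean, m2, batch):
    """Merge one batch into running per-feature statistics.

    count: samples seen so far, mean: running mean,
    m2: running sum of squared deviations from the mean.
    """
    n_b = batch.shape[0]
    mean_b = batch.mean(axis=0)
    m2_b = ((batch - mean_b) ** 2).sum(axis=0)
    delta = mean_b - mean
    total = count + n_b
    new_mean = mean + delta * n_b / total
    new_m2 = m2 + m2_b + delta**2 * count * n_b / total
    return total, new_mean, new_m2

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))

count, mean, m2 = 0, np.zeros(4), np.zeros(4)
for start in range(0, X.shape[0], 200):
    # Each slice plays the role of one partial_fit call.
    count, mean, m2 = update_mean_var(count, mean, m2, X[start:start + 200])

var = m2 / count  # population variance (ddof=0)
print(count, mean, var)
```

The aggregated values agree with the statistics computed on the full array at once, which is what makes batch-wise fitting possible without holding all samples in memory.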

Methods

`fit`(X[, y, convert_dtype])	Fit the model with X, using minibatches of size batch_size.
`partial_fit`(X[, y, check_input])	Incremental fit with X.
`transform`(X, *[, convert_dtype])	Apply dimensionality reduction to X.

Notes

Implements the incremental PCA model from [1]. This model is an extension of the Sequential Karhunen-Loeve Transform from [2]. We have specifically abstained from an optimization used by authors of both papers, a QR decomposition used in specific situations to reduce the algorithmic complexity of the SVD. The source for this technique is [3]. This technique has been omitted because it is advantageous only when decomposing a matrix with n_samples >= 5/3 * n_features where n_samples and n_features are the matrix rows and columns, respectively. In addition, it hurts the readability of the implemented algorithm. This would be a good opportunity for future optimization, if it is deemed necessary.

References

[1]

D. Ross, J. Lim, R. Lin, M. Yang. Incremental Learning for Robust Visual Tracking, International Journal of Computer Vision, Volume 77, Issue 1-3, pp. 125-141, May 2008.

[2]

A. Levy and M. Lindenbaum, Sequential Karhunen-Loeve Basis Extraction and its Application to Images, IEEE Transactions on Image Processing, Volume 9, Number 8, pp. 1371-1374, August 2000.

[3]

G. Golub and C. Van Loan. Matrix Computations, Third Edition, Chapter 5, Section 5.4.4, pp. 252-253.

[4]

C. Bishop, 1999. “Pattern Recognition and Machine Learning”, Section 12.2.1, pp. 574

Examples

>>> from cuml.decomposition import IncrementalPCA
>>> import cupy as cp
>>> import cupyx
>>>
>>> X = cupyx.scipy.sparse.random(1000, 4, format='csr',
...                               density=0.07, random_state=5)
>>> ipca = IncrementalPCA(n_components=2, batch_size=200)
>>> ipca.fit(X)
IncrementalPCA()
>>>
>>> # Components:
>>> ipca.components_
array([[ 0.23698335, -0.06073393,  0.04310868,  0.9686547 ],
       [ 0.27040346, -0.57185116,  0.76248786, -0.13594291]])
>>>
>>> # Singular Values:
>>> ipca.singular_values_
array([5.06637586, 4.59406975])
>>>
>>> # Explained Variance:
>>> ipca.explained_variance_
array([0.02569386, 0.0211266 ])
>>>
>>> # Explained Variance Ratio:
>>> ipca.explained_variance_ratio_
array([0.30424536, 0.25016372])
>>>
>>> # Mean:
>>> ipca.mean_
array([0.02693948, 0.0326928 , 0.03818463, 0.03861492])
>>>
>>> # Noise Variance:
>>> ipca.noise_variance_.item()
0.0037122774558343763

fit(X, y=None, *, convert_dtype=True) → IncrementalPCA[source]#

Fit the model with X, using minibatches of size batch_size.

Parameters:

Xarray-like or sparse matrix, shape (n_samples, n_features): Training data, where n_samples is the number of samples and n_features is the number of features.
yIgnored

Returns:

selfobject: Returns the instance itself.
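The minibatching performed by fit can be pictured as slicing the rows of X into consecutive ranges. The sketch below is plain Python with hypothetical helper names (resolve_batch_size, iter_batches). It applies the documented default of 5 * n_features when batch_size is None; how a short trailing batch is handled inside cuML is not covered here.

```python
def resolve_batch_size(batch_size, n_features):
    # Documented default: 5 * n_features when batch_size is None.
    return 5 * n_features if batch_size is None else batch_size

def iter_batches(n_samples, batch_size):
    """Yield (start, stop) row ranges that cover range(n_samples)."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n_samples, n_features = 1000, 4

# Default batch size: 5 * 4 = 20 rows per batch, so 50 batches.
default_batches = list(iter_batches(n_samples, resolve_batch_size(None, n_features)))

# Explicit batch size, as in the example above: 200 rows per batch, so 5 batches.
explicit_batches = list(iter_batches(n_samples, resolve_batch_size(200, n_features)))

print(len(default_batches), len(explicit_batches))
```

Each range corresponds to one SVD of cost O(batch_size * n_features ** 2), which is where the n_samples / batch_size count in the class description comes from.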

partial_fit(X, y=None, *, check_input=True) → IncrementalPCA[source]#

Incremental fit with X. All of X is processed as a single batch.

Parameters:

Xarray-like or sparse matrix, shape (n_samples, n_features): Training data, where n_samples is the number of samples and n_features is the number of features.
check_inputbool: Run check_array on X.
yIgnored

Returns:

selfobject: Returns the instance itself.

transform(X, *, convert_dtype=False) → CumlArray[source]#

Apply dimensionality reduction to X.

X is projected on the first principal components previously extracted from a training set, using minibatches of size batch_size if X is sparse.

Parameters:

Xarray-like or sparse matrix, shape (n_samples, n_features): New data, where n_samples is the number of samples and n_features is the number of features.
convert_dtypebool, optional (default = False): When set to True, the transform method will automatically convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

X_newarray-like, shape (n_samples, n_components)

Truncated SVD#

class cuml.TruncatedSVD(*, algorithm='full', handle=None, n_components=1, n_iter=15, random_state=None, tol=1e-07, verbose=False, output_type=None)#

TruncatedSVD is used to compute the top K singular values and vectors of a large matrix X. It is much faster when n_components is small, such as in the use of PCA when 3 components are used for 3D visualization.

cuML’s TruncatedSVD expects an array-like object or cuDF DataFrame and provides two algorithms: Full and Jacobi. Full (default) uses a full eigendecomposition then selects the top K singular vectors. The Jacobi algorithm is much faster as it iteratively tries to correct the top K singular vectors, but might be less accurate.

Parameters:

algorithm‘full’ or ‘jacobi’ or ‘auto’ (default = ‘full’): Full uses an eigendecomposition of the covariance matrix then discards components. Jacobi is much faster as it iteratively corrects, but is less accurate.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
n_componentsint (default = 1): The number of top K singular vectors / values you want. Must be <= number(columns).
n_iterint (default = 15): Used in Jacobi solver. The more iterations, the more accurate, but slower.
random_stateint / None (default = None): If you want results to be the same when you restart Python, select a state.
tolfloat (default = 1e-7): Used if algorithm = “jacobi”. A smaller tolerance can increase accuracy, but will slow down the algorithm’s convergence.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

Attributes:

components_array: The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X)
explained_variance_array: How much each component explains the variance in the data given by S**2
explained_variance_ratio_array: How much in % the variance is explained given by S**2/sum(S**2)
singular_values_array: The top K singular values. Remember all singular values >= 0
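The attributes above can be reproduced on the CPU with NumPy for the small matrix used in the example below. This is a sketch, not cuML's implementation. Unlike PCA, no centering is applied. The explained-variance values printed in the example are consistent with the per-column population variance of the transformed data, so the sketch computes them that way; that reading is an assumption. Component signs may differ from cuML's output.

```python
import numpy as np

X = np.array([[1.0, 4.0, 4.0],
              [2.0, 2.0, 2.0],
              [5.0, 1.0, 1.0]])
n_components = 2

# SVD of the raw (uncentered) matrix.
U, S, VT = np.linalg.svd(X, full_matrices=False)
components = VT[:n_components]
singular_values = S[:n_components]

# transform: project X onto the top components.
X_transformed = X @ components.T

# Variance of each projected column, and its share of the total variance of X.
explained_variance = X_transformed.var(axis=0)
explained_variance_ratio = explained_variance / X.var(axis=0).sum()

# inverse_transform: map back to the original feature space.
X_restored = X_transformed @ components

print(singular_values)
print(explained_variance)
print(explained_variance_ratio)
```

Because this matrix has rank 2, two components are enough for X_restored to match X, which agrees with the round trip shown in the example.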

Methods

`fit`(self, X[, y])	Fit LSI model on training cudf DataFrame X.
`fit_transform`(self, X[, y, convert_dtype])	Fit LSI model to X and perform dimensionality reduction on X.
`inverse_transform`(self, X, *[, convert_dtype])	Transform X back to its original space.
`transform`(self, X, *[, convert_dtype])	Perform dimensionality reduction on X.

Notes

TruncatedSVD (the randomized version [Jacobi]) is fantastic when the number of components you want is much smaller than the number of features. The approximation to the largest singular values and vectors is very robust; however, this method loses a lot of accuracy when you want many components.

Applications of TruncatedSVD

TruncatedSVD is also known as Latent Semantic Indexing (LSI) which tries to find topics of a word count matrix. If X previously was centered with mean removal, TruncatedSVD is the same as TruncatedPCA. TruncatedSVD is also used in information retrieval tasks, recommendation systems and data compression.

For additional documentation, see scikit-learn’s TruncatedSVD docs.

Examples

>>> # Both import methods supported
>>> from cuml import TruncatedSVD
>>> from cuml.decomposition import TruncatedSVD

>>> import cudf
>>> import cupy as cp

>>> gdf_float = cudf.DataFrame()
>>> gdf_float['0'] = cp.asarray([1.0,2.0,5.0], dtype=cp.float32)
>>> gdf_float['1'] = cp.asarray([4.0,2.0,1.0], dtype=cp.float32)
>>> gdf_float['2'] = cp.asarray([4.0,2.0,1.0], dtype=cp.float32)

>>> tsvd_float = TruncatedSVD(n_components = 2, algorithm = "jacobi",
...                           n_iter = 20, tol = 1e-9)
>>> tsvd_float.fit(gdf_float)
TruncatedSVD()
>>> print(f'components: {tsvd_float.components_}')
components:           0         1         2
0  0.587259  0.572331  0.572331
1  0.809399 -0.415255 -0.415255
>>> exp_var = tsvd_float.explained_variance_
>>> print(f'explained variance: {exp_var}')
explained variance: 0    0.494...
1    5.505...
dtype: float32
>>> exp_var_ratio = tsvd_float.explained_variance_ratio_
>>> print(f'explained variance ratio: {exp_var_ratio}')
explained variance ratio: 0    0.082...
1    0.917...
dtype: float32
>>> sing_values = tsvd_float.singular_values_
>>> print(f'singular values: {sing_values}')
singular values: 0    7.439...
1    4.081...
dtype: float32

>>> trans_gdf_float = tsvd_float.transform(gdf_float)
>>> print(f'Transformed matrix: {trans_gdf_float}')
Transformed matrix:           0         1
0  5.165910 -2.512643
1  3.463844 -0.042223
2  4.080960  3.216484
>>> input_gdf_float = tsvd_float.inverse_transform(trans_gdf_float)
>>> print(f'Input matrix: {input_gdf_float}')
Input matrix:      0    1    2
0  1.0  4.0  4.0
1  2.0  2.0  2.0
2  5.0  1.0  1.0

fit(self, X, y=None) → 'TruncatedSVD'[source]#

Fit LSI model on training cudf DataFrame X. y is currently ignored.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

fit_transform(self, X, y=None, *, convert_dtype=True) → CumlArray[source]#

Fit LSI model to X and perform dimensionality reduction on X. y is currently ignored.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the fit_transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

transcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_components)

Reduced version of X

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

inverse_transform(self, X, *, convert_dtype=False) → CumlArray[source]#

Transform X back to its original space. Returns X_original whose transform would be X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = False): When set to True, the inverse_transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

X_originalcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_features)

X in original space

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

transform(self, X, *, convert_dtype=True) → CumlArray[source]#

Perform dimensionality reduction on X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_components)

Reduced version of X

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

UMAP#

class cuml.UMAP(*, n_neighbors=15, n_components=2, metric='euclidean', metric_kwds=None, n_epochs=None, learning_rate=1.0, min_dist=0.1, spread=1.0, set_op_mix_ratio=1.0, local_connectivity=1.0, repulsion_strength=1.0, negative_sample_rate=5, transform_queue_size=4.0, init='spectral', a=None, b=None, target_n_neighbors=-1, target_weight=0.5, target_metric='categorical', hash_input=False, random_state=None, precomputed_knn=None, callback=None, handle=None, verbose=False, build_algo='auto', build_kwds=None, output_type=None)#

Uniform Manifold Approximation and Projection

Finds a low dimensional embedding of the data that approximates an underlying manifold.

Adapted from lmcinnes/umap umap.py

The UMAP algorithm is outlined in [1]. This implementation follows the GPU-accelerated version as described in [2].

Parameters:

n_neighbors: float (optional, default 15)

The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.

n_components: int (optional, default 2)

The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any

metric: string (default=’euclidean’).

Distance metric to use. Supported distances are [‘l1’, ‘cityblock’, ‘taxicab’, ‘manhattan’, ‘euclidean’, ‘l2’, ‘sqeuclidean’, ‘canberra’, ‘minkowski’, ‘chebyshev’, ‘linf’, ‘cosine’, ‘correlation’, ‘hellinger’, ‘hamming’, ‘jaccard’]. Metrics that take arguments (such as minkowski) can have arguments passed via the metric_kwds dictionary. Note: The ‘jaccard’ distance metric is only supported for sparse inputs.

metric_kwds: dict (optional, default=None)

Metric argument

n_epochs: int (optional, default None)

The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).

learning_rate: float (optional, default 1.0)

The initial learning rate for the embedding optimization.

init: string (optional, default ‘spectral’)

How to initialize the low dimensional embedding. Options are:

‘spectral’: use a spectral embedding of the fuzzy 1-skeleton
‘random’: assign initial embedding positions at random.

min_dist: float (optional, default 0.1)

The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.

spread: float (optional, default 1.0)

The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.

set_op_mix_ratio: float (optional, default 1.0)

Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial set. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.

local_connectivity: int (optional, default 1)

The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.

repulsion_strength: float (optional, default 1.0)

Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.

negative_sample_rate: int (optional, default 5)

The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.

transform_queue_size: float (optional, default 4.0)

For transform operations (embedding new points using a trained model), this will control how aggressively to search for nearest neighbors. Larger values will result in slower performance but more accurate nearest neighbor evaluation.

a: float (optional, default None)

More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.

b: float (optional, default None)

More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.
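How a and b can follow from min_dist and spread is sketched below. In the reference umap-learn implementation, the two values are obtained by fitting the curve 1 / (1 + a * x**(2 * b)) to a target that equals 1 for x < min_dist and decays as exp(-(x - min_dist) / spread) beyond it. Whether cuML follows exactly the same procedure is an assumption here. The helper fit_ab is hypothetical, and a coarse NumPy grid search is swapped in for the non-linear least-squares fit to keep the sketch dependency-free.

```python
import numpy as np

def fit_ab(spread=1.0, min_dist=0.1):
    """Approximate a and b by a coarse grid search (illustrative only)."""
    x = np.linspace(0.0, 3.0 * spread, 300)
    # Target membership curve: flat at 1 up to min_dist, exponential decay after.
    target = np.where(x < min_dist, 1.0, np.exp(-(x - min_dist) / spread))

    # Candidate values; the ranges suit the default parameters only.
    a_grid = np.arange(1.0, 2.5, 0.02)
    b_grid = np.arange(0.5, 1.5, 0.01)
    A = a_grid[:, None, None]
    B = b_grid[None, :, None]

    # Evaluate the smooth curve for every (a, b) pair and score it.
    curve = 1.0 / (1.0 + A * x[None, None, :] ** (2.0 * B))
    sse = ((curve - target) ** 2).sum(axis=-1)
    i, j = np.unravel_index(np.argmin(sse), sse.shape)
    return float(a_grid[i]), float(b_grid[j])

a, b = fit_ab(spread=1.0, min_dist=0.1)
print(a, b)
```

Passing a and b explicitly bypasses this derivation, which is why they are described as more specific controls than min_dist and spread.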

hash_input: bool, optional (default = False)

UMAP can hash the training input so that exact embeddings are returned when transform is called on the same data upon which the model was trained. This enables consistent behavior between calling model.fit_transform(X) and calling model.fit(X).transform(X). Note that the CPU-based UMAP reference implementation does this by default. This feature is made optional in the GPU version due to the significant overhead in copying memory to the host for computing the hash.

precomputed_knnarray / sparse array / tuple, optional (device or host)

Either one of a tuple (indices, distances) of arrays of shape (n_samples, n_neighbors), a pairwise distances dense array of shape (n_samples, n_samples) or a KNN graph sparse array (preferably CSR/COO). This feature allows the precomputation of the KNN outside of UMAP and also allows the use of a custom distance function. This function should match the metric used to train the UMAP embeddings.

random_stateint, RandomState instance or None, optional (default=None)

random_state is the seed used by the random number generator during embedding initialization and during sampling used by the optimizer. Note: Unfortunately, achieving a high amount of parallelism during the optimization stage often comes at the expense of determinism, since many floating-point additions are being made in parallel without a deterministic ordering. This causes slightly different results across training sessions, even when the same seed is used for random number generation. Setting a random_state will enable consistency of trained embeddings, allowing for reproducible results to 3 digits of precision, but will do so at the expense of potentially slower training and increased memory usage.

callback: An instance of GraphBasedDimRedCallback class

Used to intercept the internal state of embeddings while they are being trained. Example of callback usage:

from cuml.internals import GraphBasedDimRedCallback

class CustomCallback(GraphBasedDimRedCallback):
    def on_preprocess_end(self, embeddings):
        print(embeddings.copy_to_host())

    def on_epoch_end(self, embeddings):
        print(embeddings.copy_to_host())

    def on_train_end(self, embeddings):
        print(embeddings.copy_to_host())

handlecuml.Handle or pylibraft.common.DeviceResourcesSNMG

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created. Using pylibraft.common.DeviceResourcesSNMG as the handle will run batched knn graph building using multiple GPUs. This will only be valid when build_algo=nn_descent and nnd_n_clusters > 1.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

build_algo: string (default=’auto’)

How to build the knn graph. Supported build algorithms are [‘auto’, ‘brute_force_knn’, ‘nn_descent’]. ‘auto’ chooses to run with brute force knn if number of data rows is smaller than or equal to 50K. Otherwise, runs with nn descent.

build_kwds: dict (optional, default=None)

Dictionary of parameters to configure the build algorithm. Default values:

nnd_graph_degree (int, default=64): Graph degree used for NN Descent. Must be ≥ n_neighbors.
nnd_intermediate_graph_degree (int, default=128): Intermediate graph degree for NN Descent. Must be > nnd_graph_degree.
nnd_max_iterations (int, default=20): Max NN Descent iterations.
nnd_termination_threshold (float, default=0.0001): Stricter threshold leads to better convergence but longer runtime.
nnd_n_clusters (int, default=1): Number of clusters for data partitioning. Higher values reduce memory usage at the cost of accuracy. When nnd_n_clusters > 1, UMAP can process data larger than device memory.
nnd_overlap_factor (int, default=2): Number of clusters each data point belongs to. Valid only when nnd_n_clusters > 1. Must be < ‘nnd_n_clusters’.

Hints:

Increasing nnd_graph_degree and nnd_max_iterations may improve accuracy.
The ratio nnd_overlap_factor / nnd_n_clusters impacts memory usage. Approximately (nnd_overlap_factor / nnd_n_clusters) * num_rows_in_entire_data rows will be loaded onto device memory at once. E.g., 2/20 uses less device memory than 2/10.
A larger nnd_overlap_factor results in better accuracy of the final knn graph. For example, while using a similar amount of device memory, (nnd_overlap_factor / nnd_n_clusters) = 4/20 will have better accuracy than 2/10, at the cost of performance.
Start with nnd_overlap_factor = 2 and gradually increase (2->3->4 …) for better accuracy.
Start with nnd_n_clusters = 4 and increase (4 → 8 → 16…) for less GPU memory usage. This is independent from nnd_overlap_factor as long as ‘nnd_overlap_factor’ < ‘nnd_n_clusters’.
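The constraints and the memory hint listed for build_kwds can be sketched in plain Python. This is only an illustration of the documented rules: check_build_kwds and approx_rows_on_device are hypothetical helper names, not part of cuML, and the row count is the approximation quoted in the hints, not a measurement.

```python
def check_build_kwds(n_neighbors, build_kwds=None):
    """Merge user options with the documented defaults and check the stated constraints."""
    # Default values copied from the build_kwds parameter description.
    opts = {
        "nnd_graph_degree": 64,
        "nnd_intermediate_graph_degree": 128,
        "nnd_max_iterations": 20,
        "nnd_termination_threshold": 0.0001,
        "nnd_n_clusters": 1,
        "nnd_overlap_factor": 2,
    }
    opts.update(build_kwds or {})
    if opts["nnd_graph_degree"] < n_neighbors:
        raise ValueError("nnd_graph_degree must be >= n_neighbors")
    if opts["nnd_intermediate_graph_degree"] <= opts["nnd_graph_degree"]:
        raise ValueError("nnd_intermediate_graph_degree must be > nnd_graph_degree")
    # The overlap factor only applies when the data is partitioned.
    if opts["nnd_n_clusters"] > 1 and opts["nnd_overlap_factor"] >= opts["nnd_n_clusters"]:
        raise ValueError("nnd_overlap_factor must be < nnd_n_clusters")
    return opts


def approx_rows_on_device(num_rows, opts):
    """Approximate rows resident on the device at once, per the hint above."""
    if opts["nnd_n_clusters"] <= 1:
        return num_rows  # no partitioning: the whole dataset is loaded
    return num_rows * opts["nnd_overlap_factor"] // opts["nnd_n_clusters"]


opts_20 = check_build_kwds(15, {"nnd_n_clusters": 20})
opts_10 = check_build_kwds(15, {"nnd_n_clusters": 10})
rows_20 = approx_rows_on_device(1_000_000, opts_20)
rows_10 = approx_rows_on_device(1_000_000, opts_10)
```

With one million rows, the 2/20 setting keeps roughly half as many rows on the device as 2/10, which matches the hint that 2/20 uses less device memory.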

Attributes:

embedding_

Methods

`find_ab_params`(spread, min_dist)
`fit`(self, X[, y, convert_dtype, knn_graph])	Fit X into an embedded space.
`fit_transform`(self, X[, y, convert_dtype, ...])	Fit X into an embedded space and return that transformed output.
`transform`(self, X, *[, convert_dtype])	Transform X into the existing embedded space and return that transformed output.
`validate_hyperparams`(self)

Notes

This module is heavily based on Leland McInnes’ reference UMAP package. However, there are a number of differences and features that are not yet implemented in cuml.umap:

Using a pre-computed pairwise distance matrix (under consideration for future releases)
Manual initialization of initial embedding positions

In addition to these missing features, you should expect to see the final embeddings differing between cuml.umap and the reference UMAP.

References

[1]

Leland McInnes, John Healy, James Melville UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

[2]

Corey Nolet, Victor Lafargue, Edward Raff, Thejaswi Nanditale, Tim Oates, John Zedlewski, Joshua Patterson Bringing UMAP Closer to the Speed of Light with GPU Acceleration

find_ab_params(spread, min_dist)[source]#

fit(self, X, y=None, *, convert_dtype=True, knn_graph=None) → 'UMAP'[source]#

Fit X into an embedded space.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the method will automatically convert the inputs to np.float32.
knn_grapharray / sparse array / tuple, optional (device or host): Either a tuple (indices, distances) of arrays of shape (n_samples, n_neighbors), a dense pairwise distance array of shape (n_samples, n_samples), or a KNN graph sparse array (preferably CSR/COO). This feature allows the KNN to be precomputed outside of UMAP and also allows the use of a custom distance function. This function should match the metric used to train the UMAP embeddings. Takes precedence over the precomputed_knn parameter.
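To make the (indices, distances) tuple form concrete, here is a minimal brute-force sketch in plain Python. brute_force_knn is a hypothetical helper, and the nested lists would still need to be converted to arrays (for example NumPy or CuPy) before being passed on. Whether each point should appear as its own first neighbor is an assumption of this sketch, not something the description above settles.

```python
import math


def brute_force_knn(points, n_neighbors):
    """Return (indices, distances), each with shape (n_samples, n_neighbors)."""
    indices, distances = [], []
    for p in points:
        # Euclidean distance from p to every point; p itself is included at distance 0.
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points))
        nearest = ranked[:n_neighbors]
        indices.append([j for _, j in nearest])
        distances.append([d for d, _ in nearest])
    return indices, distances


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
knn_indices, knn_dists = brute_force_knn(points, n_neighbors=2)
```

Swapping math.dist for another distance is how a custom metric would enter such a precomputed graph; the metric configured on the estimator should then match it.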

fit_transform(self, X, y=None, *, convert_dtype=True, knn_graph=None) → CumlArray[source]#

Fit X into an embedded space and return that transformed output.

There is a subtle difference between calling fit_transform(X) and calling fit().transform(). Calling fit_transform(X) will train the embeddings on X and return the embeddings. Calling fit(X).transform(X) will train the embeddings on X and then run a second optimization.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features)

yarray-like (device or host) shape = (n_samples, 1)

convert_dtypebool, optional (default = True)

When set to True, the method will automatically convert the inputs to np.float32.

knn_graphsparse array-like (device or host)

shape=(n_samples, n_samples) A sparse array containing the k-nearest neighbors of X, where the columns are the nearest neighbor indices for each row and the values are their distances. It is important that k >= n_neighbors, so that UMAP can model the neighbors from this graph instead of building its own internally. Passing the knn_graph parameter provides UMAP with the user's own run of the KNN algorithm. This allows a custom distance function to be used (sometimes useful on certain datasets), whereas UMAP uses euclidean by default. The custom distance function should match the metric used to train the UMAP embeddings. Storing and reusing a knn_graph also speeds up UMAP when performing a grid search. Acceptable formats: sparse SciPy ndarray, CuPy device ndarray. CSR/COO are preferred; other formats will go through conversion to CSR.

Returns:

X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_components)

Embedding of the data in low-dimensional space.

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

transform(self, X, *, convert_dtype=True) → CumlArray[source]#

Transform X into the existing embedded space and return that transformed output.

Please refer to the reference UMAP implementation for information on the differences between fit_transform() and running fit().transform().

Specifically, the transform() function is stochastic: lmcinnes/umap#158

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the method will automatically convert the inputs to np.float32.

Returns:

X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_components)

Embedding of the data in low-dimensional space.

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

validate_hyperparams(self)[source]#

cuml.manifold.umap.fuzzy_simplicial_set(X, n_neighbors, random_state=None, metric='euclidean', metric_kwds=None, knn_indices=None, knn_dists=None, set_op_mix_ratio=1.0, local_connectivity=1.0, verbose=False)[source]#

Given a set of data X, a neighborhood size, and a measure of distance compute the fuzzy simplicial set (here represented as a fuzzy graph in the form of a sparse matrix) associated to the data. This is done by locally approximating geodesic distance at each point, creating a fuzzy simplicial set for each such point, and then combining all the local fuzzy simplicial sets into a global one via a fuzzy union.

Parameters:

X: array of shape (n_samples, n_features): The data to be modelled as a fuzzy simplicial set.
n_neighbors: int: The number of neighbors to use to approximate geodesic distance. Larger numbers induce more global estimates of the manifold that can miss finer detail, while smaller values will focus on fine manifold structure to the detriment of the larger picture.
random_state: numpy RandomState or equivalent: A state capable of being used as a numpy random state.
metric: string (default=’euclidean’): Distance metric to use. Supported distances are [‘l1’, ‘cityblock’, ‘taxicab’, ‘manhattan’, ‘euclidean’, ‘l2’, ‘sqeuclidean’, ‘canberra’, ‘minkowski’, ‘chebyshev’, ‘linf’, ‘cosine’, ‘correlation’, ‘hellinger’, ‘hamming’, ‘jaccard’]. Metrics that take arguments (such as minkowski) can have arguments passed via the metric_kwds dictionary. Note: The ‘jaccard’ distance metric is only supported for sparse inputs.
metric_kwds: dict (optional, default=None): Metric argument
knn_indices: array of shape (n_samples, n_neighbors) (optional): If the k-nearest neighbors of each point has already been calculated you can pass them in here to save computation time. This should be an array with the indices of the k-nearest neighbors as a row for each data point.
knn_dists: array of shape (n_samples, n_neighbors) (optional): If the k-nearest neighbors of each point has already been calculated you can pass them in here to save computation time. This should be an array with the distances of the k-nearest neighbors as a row for each data point.
set_op_mix_ratio: float (optional, default 1.0): Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial sets. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
local_connectivity: int (optional, default 1): The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.
verbose: bool (optional, default False): Whether to report information on the current progress of the algorithm.
Returns:
fuzzy_simplicial_set: coo_matrix: A fuzzy simplicial set represented as a sparse matrix. The (i, j) entry of the matrix represents the membership strength of the 1-simplex between the ith and jth sample points.
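The role of set_op_mix_ratio can be illustrated on a tiny dense matrix. Under the product t-norm, the fuzzy intersection of two memberships a and b is a * b and the fuzzy union is a + b - a * b; the sketch below interpolates between the two, following the reference UMAP implementation. That cuML combines the local sets in exactly this way is an assumption, and combine_local_sets is a hypothetical name; nested lists stand in for the sparse matrix.

```python
def combine_local_sets(a, set_op_mix_ratio=1.0):
    """Symmetrize directed membership strengths a[i][j] into a global fuzzy graph."""
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            intersection = a[i][j] * a[j][i]             # product t-norm
            union = a[i][j] + a[j][i] - intersection     # matching fuzzy union
            out[i][j] = (set_op_mix_ratio * union
                         + (1.0 - set_op_mix_ratio) * intersection)
    return out


# Directed strengths: point 0 sees point 1 with 0.5, point 1 sees point 0 with 0.2.
local = [[0.0, 0.5],
         [0.2, 0.0]]
pure_union = combine_local_sets(local, 1.0)
pure_intersection = combine_local_sets(local, 0.0)
halfway = combine_local_sets(local, 0.5)
```

The result is symmetric for any ratio, so the (i, j) and (j, i) entries describe the same 1-simplex.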

cuml.manifold.umap.simplicial_set_embedding(data, graph, n_components=2, initial_alpha=1.0, a=None, b=None, gamma=1.0, negative_sample_rate=5, n_epochs=None, init='spectral', random_state=None, metric='euclidean', metric_kwds=None, output_metric='euclidean', output_metric_kwds=None, convert_dtype=True, verbose=False)[source]#

Perform a fuzzy simplicial set embedding, using a specified initialisation method and then minimizing the fuzzy set cross entropy between the 1-skeletons of the high and low dimensional fuzzy simplicial sets.

Parameters:

data: array of shape (n_samples, n_features)

The source data to be embedded by UMAP.

graph: sparse matrix

The 1-skeleton of the high dimensional fuzzy simplicial set as represented by a graph for which we require a sparse matrix for the (weighted) adjacency matrix.

n_components: int

The dimensionality of the euclidean space into which to embed the data.

initial_alpha: float

Initial learning rate for the SGD.

a: float

Parameter of differentiable approximation of right adjoint functor

b: float

Parameter of differentiable approximation of right adjoint functor

gamma: float

Weight to apply to negative samples.

negative_sample_rate: int (optional, default 5)

n_epochs: int (optional, default 0)

The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If 0 is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).

init: string

How to initialize the low dimensional embedding. Options are:

‘spectral’: use a spectral embedding of the fuzzy 1-skeleton
‘random’: assign initial embedding positions at random.
An array-like with initial embedding positions.

random_state: numpy RandomState or equivalent

A state capable of being used as a numpy random state.

metric: string (default=’euclidean’).

metric_kwds: dict (optional, default=None)

Metric argument

output_metric: function

Function returning the distance between two points in embedding space and the gradient of the distance wrt the first argument.

output_metric_kwds: dict

Key word arguments to be passed to the output_metric function.

verbose: bool (optional, default False)

Whether to report information on the current progress of the algorithm.

Returns:

embedding: array of shape (n_samples, n_components)

The optimized embedding of graph into an n_components-dimensional euclidean space.
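The parameters a and b shape a smooth curve, 1 / (1 + a * x^(2b)), that stands in for a membership function which is 1 up to min_dist and decays exponentially with scale spread beyond it. The sketch below follows the approach of the reference UMAP package, but replaces its nonlinear least-squares fit with a coarse grid search so that it runs on the standard library alone. Treat it as an illustration under that assumption, not as what find_ab_params does internally; fit_ab_grid and the grid ranges are hypothetical.

```python
import math


def target_curve(x, spread, min_dist):
    # Desired membership strength: 1 inside min_dist, exponential decay beyond it.
    return 1.0 if x < min_dist else math.exp(-(x - min_dist) / spread)


def smooth_curve(x, a, b):
    # Differentiable family parameterized by a and b.
    return 1.0 / (1.0 + a * x ** (2.0 * b))


def fit_ab_grid(spread=1.0, min_dist=0.1, steps=300):
    """Pick (a, b) on a coarse grid minimizing the mean squared error to the target."""
    xs = [3.0 * spread * (i + 1) / steps for i in range(steps)]
    ys = [target_curve(x, spread, min_dist) for x in xs]
    best = None
    for ai in range(10, 61):        # a from 0.50 to 3.00 in steps of 0.05
        for bi in range(10, 31):    # b from 0.50 to 1.50 in steps of 0.05
            a, b = ai * 0.05, bi * 0.05
            mse = sum((smooth_curve(x, a, b) - y) ** 2
                      for x, y in zip(xs, ys)) / steps
            if best is None or mse < best[0]:
                best = (mse, a, b)
    return best


mse, a, b = fit_ab_grid()
```

Smaller min_dist values push the fitted curve to drop off sooner, which is why tightly packed embeddings go with small min_dist.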

Random Projections#

class cuml.random_projection.GaussianRandomProjection(n_components='auto', *, eps=0.1, random_state=None, output_type=None, handle=None, verbose=False)[source]#

Reduce dimensionality through Gaussian random projection.

The components of the random matrix are drawn from N(0, 1 / n_components).

Parameters:

n_componentsint or ‘auto’, default=’auto’

Dimensionality of the target projection space.

n_components can be automatically adjusted according to the number of samples in the dataset and the bound given by the Johnson-Lindenstrauss lemma. In that case the quality of the embedding is controlled by the eps parameter.

It should be noted that the Johnson-Lindenstrauss lemma can yield very conservative estimates of the required number of components, as it makes no assumption on the structure of the dataset.

epsfloat, default=0.1

Parameter to control the quality of the embedding according to the Johnson-Lindenstrauss lemma when n_components is set to ‘auto’. The value should be strictly positive.

Smaller values lead to better embedding and higher number of dimensions (n_components) in the target projection space.

random_stateint, RandomState instance or None, default=None

Controls the pseudo random number generator used to generate the projection matrix at fit time.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

handlecuml.Handle

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Attributes:

n_components_int: Concrete number of components computed when n_components=”auto”.
components_array of shape (n_components, n_features): Random matrix used for the projection.
n_features_in_int: Number of features seen during fit.

Notes

Inspired by Scikit-learn’s implementation: https://scikit-learn.org/stable/modules/random_projection.html

Currently passing a sparse array to transform may result in close (but not exactly identical) results due to cupy/cupy#9323.

Examples

>>> from cuml.random_projection import GaussianRandomProjection
>>> from cuml.datasets import make_blobs
>>> X, _ = make_blobs(n_samples=200, n_features=1000, random_state=42)
>>> model = GaussianRandomProjection(n_components=50, random_state=42)
>>> X_new = model.fit_transform(X)
>>> X_new.shape
(200, 50)
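As a CPU-only illustration of what the projection does, the sketch below draws a matrix with entries from N(0, 1 / n_components) and checks that the distance between two vectors is roughly preserved. It uses only the standard library; gaussian_projection_matrix and project are hypothetical helpers, and the sketch makes no claim about how cuML generates or applies its matrix on the GPU.

```python
import math
import random


def gaussian_projection_matrix(n_components, n_features, seed=0):
    """Matrix of shape (n_components, n_features) with N(0, 1 / n_components) entries."""
    rng = random.Random(seed)
    sigma = math.sqrt(1.0 / n_components)  # standard deviation for variance 1 / n_components
    return [[rng.gauss(0.0, sigma) for _ in range(n_features)]
            for _ in range(n_components)]


def project(x, components):
    # One dot product per component row.
    return [sum(w * v for w, v in zip(row, x)) for row in components]


rng = random.Random(1)
n_features, n_components = 500, 200
u = [rng.random() for _ in range(n_features)]
v = [rng.random() for _ in range(n_features)]
R = gaussian_projection_matrix(n_components, n_features)

# Ratio of projected distance to original distance; values near 1 mean little distortion.
ratio = math.dist(project(u, R), project(v, R)) / math.dist(u, v)
```

The variance 1 / n_components is what keeps the expected squared distance unchanged after projection.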

class cuml.random_projection.SparseRandomProjection(n_components='auto', *, density='auto', eps=0.1, dense_output=False, random_state=None, output_type=None, handle=None, verbose=False)[source]#

Reduce dimensionality through sparse random projection.

Sparse random matrix is an alternative to dense random projection matrix that guarantees similar embedding quality while being much more memory efficient and allowing faster computation of the projected data.

If we note s = 1 / density the components of the random matrix are drawn from:

-sqrt(s) / sqrt(n_components)   with probability 1 / (2s)
 0                              with probability 1 - 1 / s
+sqrt(s) / sqrt(n_components)   with probability 1 / (2s)
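The distribution above can be sampled directly. The following standard-library sketch builds such a matrix as nested lists and resolves density='auto' to 1 / sqrt(n_features), as described for the density parameter. sparse_projection_matrix is a hypothetical helper; a real implementation would store the result in a sparse format instead of dense lists.

```python
import math
import random


def sparse_projection_matrix(n_components, n_features, density="auto", seed=0):
    """Sample a (n_components, n_features) matrix from the three-valued distribution."""
    if density == "auto":
        density = 1.0 / math.sqrt(n_features)
    s = 1.0 / density
    value = math.sqrt(s) / math.sqrt(n_components)
    rng = random.Random(seed)
    rows = []
    for _ in range(n_components):
        row = []
        for _ in range(n_features):
            u = rng.random()
            if u < 1.0 / (2.0 * s):
                row.append(-value)    # probability 1 / (2s)
            elif u < 1.0 / s:
                row.append(value)     # probability 1 / (2s)
            else:
                row.append(0.0)       # probability 1 - 1 / s
        rows.append(row)
    return rows, density


matrix, density = sparse_projection_matrix(n_components=50, n_features=400)
nonzero = sum(1 for row in matrix for x in row if x != 0.0)
```

With 400 features the automatic density is 0.05, so about one entry in twenty is non-zero.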

Parameters:

n_componentsint or ‘auto’, default=’auto’

Dimensionality of the target projection space.

It should be noted that the Johnson-Lindenstrauss lemma can yield very conservative estimates of the required number of components, as it makes no assumption on the structure of the dataset.

densityfloat or ‘auto’, default=’auto’

Ratio in the range (0, 1] of non-zero components in the random projection matrix.

If density = ‘auto’, the value is set to the minimum density as recommended by Ping Li et al.: 1 / sqrt(n_features).

epsfloat, default=0.1

Parameter to control the quality of the embedding according to the Johnson-Lindenstrauss lemma when n_components is set to ‘auto’. This value should be strictly positive.

Smaller values lead to better embedding and higher number of dimensions (n_components) in the target projection space.

dense_outputbool, default=False

If True, ensure that the output of the random projection is a dense array even if the input and random projection matrix are both sparse. If False, the projected data uses a sparse representation if the input is sparse.

random_stateint, RandomState instance or None, default=None

Controls the pseudo random number generator used to generate the projection matrix at fit time.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

handlecuml.Handle

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Attributes:

n_components_int: Concrete number of components computed when n_components=”auto”.
components_sparse matrix of shape (n_components, n_features): Random matrix used for the projection.
density_float in range 0.0 - 1.0: Concrete density computed when density = “auto”.
n_features_in_int: Number of features seen during fit.

Notes

Inspired by Scikit-learn’s implementation: https://scikit-learn.org/stable/modules/random_projection.html

Currently passing a dense array to transform may result in close (but not exactly identical) results due to cupy/cupy#9323.

Examples

>>> from cuml.random_projection import SparseRandomProjection
>>> from cuml.datasets import make_blobs
>>> X, _ = make_blobs(n_samples=200, n_features=1000, random_state=42)
>>> model = SparseRandomProjection(n_components=50, random_state=42)
>>> X_new = model.fit_transform(X)
>>> X_new.shape
(200, 50)

cuml.random_projection.johnson_lindenstrauss_min_dim(n_samples, eps=0.1)[source]#

Find a ‘safe’ number of components to randomly project to.

The Johnson–Lindenstrauss lemma states that high-dimensional data can be embedded into lower dimension while preserving the distances.

This function finds the minimum number of components to guarantee that the embedding is inside the eps error tolerance.

Parameters:

n_samplesint: Number of samples.
epsfloat in (0,1) (default = 0.1): Maximum distortion rate as defined by the Johnson-Lindenstrauss lemma.

Returns:

n_componentsint: The minimal number of components to guarantee with good probability an eps-embedding with n_samples.
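For intuition, the bound can be computed by hand. The sketch below uses the formula applied by scikit-learn's function of the same name, n_components >= 4 * ln(n_samples) / (eps^2 / 2 - eps^3 / 3). That cuML evaluates exactly this expression is an assumption based on the two libraries sharing the API; jl_min_dim is a hypothetical stand-in name.

```python
import math


def jl_min_dim(n_samples, eps=0.1):
    """Minimum number of components for an eps-embedding of n_samples points."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must be in (0, 1)")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    denominator = eps ** 2 / 2.0 - eps ** 3 / 3.0
    # Truncate to an integer, as scikit-learn does.
    return int(4.0 * math.log(n_samples) / denominator)


loose = jl_min_dim(1_000_000, eps=0.5)
tight = jl_min_dim(1_000_000, eps=0.1)
```

The bound depends only on n_samples and eps, not on the number of input features, and it grows quickly as eps shrinks.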

TSNE#

class cuml.TSNE(*, n_components=2, perplexity=30.0, early_exaggeration=12.0, late_exaggeration=1.0, learning_rate=200.0, n_iter=1000, n_iter_without_progress=300, min_grad_norm=1e-07, metric='euclidean', metric_params=None, init='random', verbose=False, random_state=None, method='fft', angle=0.5, learning_rate_method='adaptive', n_neighbors=90, perplexity_max_iter=100, exaggeration_iter=250, pre_momentum=0.5, post_momentum=0.8, square_distances=True, precomputed_knn=None, handle=None, output_type=None)#

t-SNE (T-Distributed Stochastic Neighbor Embedding) is an extremely powerful dimensionality reduction technique that aims to maintain local distances between data points. It is extremely robust to whatever dataset you give it, and is used in many areas including cancer research, music analysis and neural network weight visualizations.

cuML’s t-SNE supports three algorithms: the original exact algorithm, the Barnes-Hut approximation and the fast Fourier transform interpolation approximation. The latter two are derived from CannyLabs’ open-source CUDA code and produce extremely fast embeddings when n_components = 2. The exact algorithm is more accurate, but too slow to use on large datasets.

Parameters:

n_componentsint (default 2): The output dimensionality size. Currently only 2 is supported.
perplexityfloat (default 30.0): Larger datasets require a larger value. Consider choosing different perplexity values from 5 to 50 and see the output differences.
early_exaggerationfloat (default 12.0): Controls the space between clusters. Not critical to tune this.
late_exaggerationfloat (default 1.0): Controls the space between clusters. It may be beneficial to increase this slightly to improve cluster separation. This will be applied after exaggeration_iter iterations (FFT only).
learning_ratefloat (default 200.0): The learning rate usually between (10, 1000). If this is too high, t-SNE could look like a cloud / ball of points.
n_iterint (default 1000): The more epochs, the more stable/accurate the final embedding.
n_iter_without_progressint (default 300): Currently unused. When the KL Divergence becomes too small after some iterations, terminate t-SNE early.
min_grad_normfloat (default 1e-07): The minimum gradient norm for when t-SNE will terminate early. Used in the ‘exact’ and ‘fft’ algorithms. Consider reducing if the embeddings are unsatisfactory. It’s recommended to use a smaller value for smaller datasets.
metricstr (default=’euclidean’): Distance metric to use. Supported distances are [‘l1’, ‘cityblock’, ‘manhattan’, ‘euclidean’, ‘l2’, ‘sqeuclidean’, ‘minkowski’, ‘chebyshev’, ‘cosine’, ‘correlation’]
initstr ‘random’ or ‘pca’ (default ‘random’): Currently supports random or pca initialization.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
random_stateint (default None): Setting this can make repeated runs look more similar. Note, however, that this highly parallelized t-SNE implementation is not completely deterministic between runs, even with the same random_state.
methodstr ‘fft’, ‘barnes_hut’ or ‘exact’ (default ‘fft’): ‘barnes_hut’ and ‘fft’ are fast approximations. ‘exact’ is more accurate but slower.
anglefloat (default 0.5): Valid values are between 0.0 and 1.0, which trade off speed and accuracy, respectively. Generally, these values are set between 0.2 and 0.8. (Barnes-Hut only.)
learning_rate_methodstr ‘adaptive’, ‘none’ or None (default ‘adaptive’): Either adaptive or None. ‘adaptive’ tunes the learning rate, early exaggeration, perplexity and n_neighbors automatically based on input size.
n_neighborsint (default 90): The number of datapoints you want to use in the attractive forces. Smaller values are better for preserving local structure, whilst larger values can improve global structure preservation. Default is 3 * 30 (perplexity)
perplexity_max_iterint (default 100): The number of epochs the best gaussian bands are found for.
exaggeration_iterint (default 250): To promote the growth of clusters, set this higher.
pre_momentumfloat (default 0.5): During the exaggeration iteration, more forcefully apply gradients.
post_momentumfloat (default 0.8): During the late phases, less forcefully apply gradients.
square_distancesboolean, default=True: Whether TSNE should square the distance values. Internally, this will be used to compute a kNN graph using the provided metric and then squaring it when True. If a knn_graph is passed to fit or fit_transform methods, all the distances will be squared when True. For example, if a knn_graph was obtained using ‘sqeuclidean’ metric, the distances will still be squared when True. Note: This argument should likely be set to False for distance metrics other than ‘euclidean’ and ‘l2’.
precomputed_knnarray / sparse array / tuple, optional (device or host): Either a tuple (indices, distances) of arrays of shape (n_samples, n_neighbors), a dense pairwise distance array of shape (n_samples, n_samples), or a KNN graph sparse array (preferably CSR/COO). This feature allows the KNN to be precomputed outside of TSNE and also allows the use of a custom distance function. This function should match the metric used to train the TSNE embeddings.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
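The relationship between perplexity and perplexity_max_iter is easier to see in code. In the textbook t-SNE procedure, each point gets a Gaussian bandwidth chosen by bisection so that the perplexity (2 raised to the entropy) of its neighbor distribution matches the requested value. The sketch below shows that search for a single point using the standard library; it illustrates the general technique only, the function names and the search bounds are hypothetical, and it is not a description of cuML's GPU kernels.

```python
import math


def perplexity(sq_dists, beta):
    """Perplexity of p_j proportional to exp(-beta * d_j) over one point's neighbors."""
    d_min = min(sq_dists)  # shift by the minimum for numerical stability
    weights = [math.exp(-beta * (d - d_min)) for d in sq_dists]
    total = sum(weights)
    entropy = 0.0
    for w in weights:
        p = w / total
        if p > 0.0:
            entropy -= p * math.log2(p)
    return 2.0 ** entropy


def find_beta(sq_dists, target, max_iter=100, tol=1e-5):
    """Bisect on beta = 1 / (2 * sigma^2) until the perplexity matches the target."""
    lo, hi = 0.0, 1e3
    beta = (lo + hi) / 2.0
    for _ in range(max_iter):
        beta = (lo + hi) / 2.0
        current = perplexity(sq_dists, beta)
        if abs(current - target) < tol:
            break
        if current > target:
            lo = beta   # distribution too flat: sharpen it
        else:
            hi = beta   # distribution too peaked: widen it
    return beta


# Squared distances from one point to its 20 nearest neighbors.
sq_dists = [float(i * i) for i in range(1, 21)]
beta = find_beta(sq_dists, target=5.0)
achieved = perplexity(sq_dists, beta)
```

A perplexity of 5 here means the point effectively attends to about five of its twenty neighbors, which is why the requested perplexity should stay below n_neighbors.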

Attributes:

kl_divergence_float: TSNE.kl_divergence_(self)
n_iter_int: Number of iterations run.

Methods

`fit`(self, X[, y, convert_dtype, knn_graph])	Fit X into an embedded space.
`fit_transform`(self, X[, y, convert_dtype, ...])	Fit X into an embedded space and return that transformed output.

References

[1]

van der Maaten, L.J.P. t-Distributed Stochastic Neighbor Embedding

[2]

van der Maaten, L.J.P.; Hinton, G.E. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9:2579-2605, 2008.

[3]

George C. Linderman, Manas Rachh, Jeremy G. Hoskins, Stefan Steinerberger, Yuval Kluger Efficient Algorithms for t-distributed Stochastic Neighborhood Embedding

Tip

Maaten and Linderman showed that t-SNE can be very sensitive to its starting conditions (i.e. random initialization), and that parallel versions of t-SNE can generate vastly different results between runs. You can run t-SNE multiple times to settle on the best configuration. Note that using the same random_state across runs does not guarantee similar results each time.

Note

The CUDA implementation is derived from the excellent CannyLabs open source implementation here: CannyLab/tsne-cuda. The CannyLabs code is licensed according to the conditions in cuml/cpp/src/tsne/cannylabs_tsne_license.txt. A full description of their approach is available in their article t-SNE-CUDA: GPU-Accelerated t-SNE and its Applications to Modern Data (https://arxiv.org/abs/1807.11824).

fit(self, X, y=None, *, convert_dtype=True, knn_graph=None) → 'TSNE'[source]#

Fit X into an embedded space.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the method will automatically convert the inputs to np.float32.
knn_grapharray / sparse array / tuple, optional (device or host): Either a tuple (indices, distances) of arrays of shape (n_samples, n_neighbors), a dense pairwise distance array of shape (n_samples, n_samples), or a KNN graph sparse array (preferably CSR/COO). This feature allows the KNN to be precomputed outside of TSNE and also allows the use of a custom distance function. This function should match the metric used to train the TSNE embeddings. Takes precedence over the precomputed_knn parameter.

fit_transform(self, X, y=None, *, convert_dtype=True, knn_graph=None) → CumlArray[source]#

Fit X into an embedded space and return that transformed output.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the method will automatically convert the inputs to np.float32.

Returns:

X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_components)

Embedding of the data in low-dimensional space.

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

property kl_divergence_#

Spectral Embedding#

class cuml.manifold.SpectralEmbedding(n_components=2, affinity='nearest_neighbors', random_state=None, n_neighbors=None, handle=None, verbose=False, output_type=None)#

Spectral embedding for non-linear dimensionality reduction.

Forms an affinity matrix given by the specified function and applies spectral decomposition to the corresponding graph laplacian. The resulting transformation is given by the value of the eigenvectors for each data point.

Note : Laplacian Eigenmaps is the actual algorithm implemented here.

Parameters:

n_componentsint, default=2

The dimension of the projected subspace.

affinity{‘nearest_neighbors’, ‘precomputed’}, default=’nearest_neighbors’

How to construct the affinity matrix.

‘nearest_neighbors’ : construct the affinity matrix by computing a graph of nearest neighbors.
‘precomputed’ : interpret X as a precomputed affinity matrix.

random_stateint, RandomState instance or None, default=None

A pseudo random number generator used for the initialization. Use an int to make the results deterministic across calls.

n_neighborsint or None, default=None

Number of nearest neighbors for nearest_neighbors graph building. If None, n_neighbors will be set to max(n_samples/10, 1).

handlecuml.Handle

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Attributes:

embedding_cupy.ndarray of shape (n_samples, n_components): Spectral embedding of the training matrix.
n_neighbors_int: Number of nearest neighbors effectively used.

Methods

`fit`(self, X[, y])	Fit the model from data in X.
`fit_transform`(self, X[, y])	Fit the model from data in X and transform X.

Notes

Spectral Embedding (Laplacian Eigenmaps) is most useful when the graph has one connected component. If the graph has many components, the first few eigenvectors will simply uncover the connected components of the graph.
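
The note above can be illustrated with a minimal pure-Python sketch (it is not cuML's implementation). For the unnormalized graph Laplacian L = D - A, the indicator vector of each connected component is an eigenvector with eigenvalue 0, so for a disconnected graph the leading eigenvectors only encode component membership. The adjacency matrix and the helper names `laplacian` and `matvec` are hypothetical and exist only for this illustration.

```python
# Illustrative sketch only; cuML computes the embedding on the GPU.

def laplacian(adj):
    """Return the unnormalized graph Laplacian L = D - A as nested lists."""
    n = len(adj)
    lap = [[-adj[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        lap[i][i] += sum(adj[i])  # node degree on the diagonal
    return lap


def matvec(mat, vec):
    """Multiply a dense matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


# Two components: nodes {0, 1, 2} form a triangle, nodes {3, 4} share an edge.
adj = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
L = laplacian(adj)

# Indicator vectors of the two components.
comp_a = [1, 1, 1, 0, 0]
comp_b = [0, 0, 0, 1, 1]

# Both are mapped to the zero vector: eigenvectors with eigenvalue 0.
zero_a = matvec(L, comp_a)
zero_b = matvec(L, comp_b)
```

Because both indicator vectors sit in the null space, an embedding built from the smallest eigenvalues of such a graph separates the components but says little about structure inside each one.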

Examples

>>> import cupy as cp
>>> from cuml.manifold import SpectralEmbedding
>>> X = cp.random.rand(100, 20, dtype=cp.float32)
>>> embedding = SpectralEmbedding(n_components=2, random_state=42)
>>> X_transformed = embedding.fit_transform(X)
>>> X_transformed.shape
(100, 2)

fit(self, X, y=None) → 'SpectralEmbedding'[source]#

Fit the model from data in X.

Parameters:

Xarray-like or sparse matrix of shape (n_samples, n_features) or (n_samples, n_samples): Training vector, where n_samples is the number of samples and n_features is the number of features. If affinity is ‘precomputed’, X is the affinity matrix. Supported formats for precomputed affinity: scipy sparse (CSR, CSC, COO), cupy sparse (CSR, CSC, COO), dense numpy arrays, or dense cupy arrays.
yIgnored: Not used, present for API consistency by convention.

Returns:

selfobject: Returns the instance itself.

fit_transform(self, X, y=None) → CumlArray[source]#

Fit the model from data in X and transform X.

Parameters:

Xarray-like or sparse matrix of shape (n_samples, n_features) or (n_samples, n_samples): Training vector, where n_samples is the number of samples and n_features is the number of features. If affinity is ‘precomputed’, X is the affinity matrix. Supported formats for precomputed affinity: scipy sparse (CSR, CSC, COO), cupy sparse (CSR, CSC, COO), dense numpy arrays, or dense cupy arrays.
yIgnored: Not used, present for API consistency by convention.

Returns:

X_newcupy.ndarray of shape (n_samples, n_components): Spectral embedding of the training matrix.

cuml.manifold.spectral_embedding(A, *, n_components=8, affinity='nearest_neighbors', random_state=None, n_neighbors=None, norm_laplacian=True, drop_first=True, handle=None)[source]#

Project the sample on the first eigenvectors of the graph Laplacian.

The adjacency matrix is used to compute a normalized graph Laplacian whose spectrum (especially the eigenvectors associated to the smallest eigenvalues) has an interpretation in terms of minimal number of cuts necessary to split the graph into comparably sized components.

Note : Laplacian Eigenmaps is the actual algorithm implemented here.

Parameters:

Aarray-like or sparse matrix of shape (n_samples, n_features) or (n_samples, n_samples)

If affinity is ‘nearest_neighbors’, this is the input data and a k-NN graph will be constructed. If affinity is ‘precomputed’, this is the affinity matrix. Supported formats for precomputed affinity: scipy sparse (CSR, CSC, COO), cupy sparse (CSR, CSC, COO), dense numpy arrays, or dense cupy arrays.

n_componentsint, default=8

The dimension of the projection subspace.

affinity{‘nearest_neighbors’, ‘precomputed’}, default=’nearest_neighbors’

How to construct the affinity matrix.

‘nearest_neighbors’ : construct the affinity matrix by computing a graph of nearest neighbors.
‘precomputed’ : interpret A as a precomputed affinity matrix.

random_stateint, RandomState instance or None, default=None

A pseudo random number generator used for the initialization. Use an int to make the results deterministic across calls.

n_neighborsint or None, default=None

Number of nearest neighbors for nearest_neighbors graph building. If None, n_neighbors will be set to max(n_samples/10, 1). Only used when A has shape (n_samples, n_features).

norm_laplacianbool, default=True

If True, then compute symmetric normalized Laplacian.

drop_firstbool, default=True

Whether to drop the first eigenvector. For spectral embedding, this should be True as the first eigenvector should be constant vector for connected graph, but for spectral clustering, this should be kept as False to retain the first eigenvector.

handlecuml.Handle

Returns:

embeddingcupy.ndarray of shape (n_samples, n_components): The reduced samples.

Notes

Examples

>>> import cupy as cp
>>> from cuml.manifold import spectral_embedding
>>> X = cp.random.rand(100, 20, dtype=cp.float32)
>>> embedding = spectral_embedding(X, n_components=2, random_state=42)
>>> embedding.shape
(100, 2)
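
As a hedged illustration of the norm_laplacian and drop_first parameters, the pure-Python sketch below builds the symmetric normalized Laplacian I - D^(-1/2) A D^(-1/2) for a small connected graph and checks that the vector of square-rooted degrees is mapped to zero. Rescaled by D^(-1/2), that vector is constant, which is why the first eigenvector carries no information for embedding. The graph and the helper name `normalized_laplacian` are assumptions made for this example, not cuML internals.

```python
import math


def normalized_laplacian(adj):
    """Symmetric normalized Laplacian I - D^(-1/2) A D^(-1/2) as nested lists."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    lap = []
    for i in range(n):
        row = []
        for j in range(n):
            identity = 1.0 if i == j else 0.0
            row.append(identity - adj[i][j] / math.sqrt(deg[i] * deg[j]))
        lap.append(row)
    return lap, deg


# Connected path graph 0 - 1 - 2 - 3.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
lap, deg = normalized_laplacian(adj)

# Trivial eigenvector of the normalized Laplacian: sqrt(degree) per node.
trivial = [math.sqrt(d) for d in deg]
residual = [sum(l * v for l, v in zip(row, trivial)) for row in lap]

# Rescaling by D^(-1/2) yields a constant vector.
rescaled = [v / math.sqrt(d) for v, d in zip(trivial, deg)]
```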

Neighbors#

Nearest Neighbors#

class cuml.neighbors.NearestNeighbors(*, n_neighbors=5, verbose=False, handle=None, algorithm='auto', metric='euclidean', p=2, algo_params=None, metric_params=None, n_jobs=None, output_type=None)#

NearestNeighbors queries neighborhoods from a given set of datapoints. Currently, cuML supports k-NN queries, which define the neighborhood as the closest k neighbors to each query point.

Parameters:

n_neighborsint (default=5)

Default number of neighbors to query

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

handlecuml.Handle

algorithmstring (default=’auto’)

The query algorithm to use. Valid options are:

'auto': to automatically select brute-force or random ball cover based on data shape and metric
'rbc': for the random ball algorithm, which partitions the data space and uses the triangle inequality to lower the number of potential distances. Currently, this algorithm supports Haversine (2d) and Euclidean in 2d and 3d.
'brute': for brute-force, slow but produces exact results
'ivfflat': for inverted file, divide the dataset in partitions and perform search on relevant partitions only
'ivfpq': for inverted file and product quantization. This works like the inverted file; in addition, the vectors are split into n_features/M sub-vectors that are encoded by means of intermediary k-means clusterings. This encoding provides partial information, allowing faster distance calculations.

metricstring (default=’euclidean’)

Distance metric to use. Supported distances are [‘l1’, ‘cityblock’, ‘taxicab’, ‘manhattan’, ‘euclidean’, ‘l2’, ‘braycurtis’, ‘canberra’, ‘minkowski’, ‘chebyshev’, ‘jensenshannon’, ‘cosine’, ‘correlation’]

pfloat (default=2)

Parameter for the Minkowski metric. When p = 1, this is equivalent to manhattan distance (l1), and euclidean distance (l2) for p = 2. For arbitrary p, minkowski distance (lp) is used.

algo_paramsdict, optional (default=None)

Used to configure the nearest neighbor algorithm to be used. If set to None, parameters will be generated automatically. Parameters for algorithm 'brute' when inputs are sparse:

batch_size_index : (int) number of rows in each batch of index array

batch_size_query : (int) number of rows in each batch of query array

Parameters for algorithm 'ivfflat':

nlist: (int) number of cells to partition dataset into

nprobe: (int) at query time, number of cells used for search

Parameters for algorithm 'ivfpq':

nlist: (int) number of cells to partition dataset into

nprobe: (int) at query time, number of cells used for search

M: (int) number of subquantizers

n_bits: (int) bits allocated per subquantizer

usePrecomputedTables : (bool) whether to use precomputed tables

metric_paramsdict, optional (default = None)

This is currently ignored.

n_jobsint (default = None)

Ignored, here for scikit-learn API compatibility.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Attributes:

effective_metric_: NearestNeighbors.effective_metric_(self)
effective_metric_params_: NearestNeighbors.effective_metric_params_(self)

Methods

`fit`(self, X[, y, convert_dtype])	Fit GPU index for performing nearest neighbor queries.
`kneighbors`(self[, X, n_neighbors, ...])	Query the GPU index for the k nearest neighbors of column vectors in X.
`kneighbors_graph`(self[, X, n_neighbors, mode])	Find the k nearest neighbors of column vectors in X and return as a sparse matrix in CSR format.

Notes

For an additional example see the NearestNeighbors notebook.

For additional docs, see scikit-learn’s NearestNeighbors.

Pickling NearestNeighbors instances is supported for all algorithms. However, for RBC, IVFPQ or IVFFlat the index will currently be rebuilt upon load rather than serialized as part of the pickled binary. For approximate indices like IVFPQ or IVFFlat this may result in small differences between the original and reloaded models, as the generated indices may differ.
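
To make the roles of the nlist and nprobe entries in algo_params more concrete, here is a small pure-Python sketch of the inverted-file idea. It is an illustration under simplifying assumptions, not the index cuML builds: the centroids are simply the first nlist points (real IVF indexes typically train them with k-means), and the helper names `build_ivf` and `search_ivf` are hypothetical.

```python
import math


def build_ivf(data, nlist):
    """Partition `data` into `nlist` cells around fixed centroids.

    Assumption: the first `nlist` points serve as centroids.
    """
    centroids = data[:nlist]
    cells = [[] for _ in range(nlist)]
    for idx, point in enumerate(data):
        nearest = min(range(nlist), key=lambda c: math.dist(point, centroids[c]))
        cells[nearest].append(idx)
    return centroids, cells


def search_ivf(query, data, centroids, cells, k, nprobe):
    """Scan only the `nprobe` cells whose centroids are closest to `query`."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [idx for c in order[:nprobe] for idx in cells[c]]
    candidates.sort(key=lambda idx: math.dist(query, data[idx]))
    return candidates[:k]


data = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.2), (9.5, 10.1), (0.1, 0.9), (10.2, 9.7)]
centroids, cells = build_ivf(data, nlist=2)
query = (0.2, 0.2)

# Probing every cell is equivalent to an exact brute-force search.
exact = search_ivf(query, data, centroids, cells, k=3, nprobe=2)
# Probing a single cell is cheaper but only sees part of the dataset.
approx = search_ivf(query, data, centroids, cells, k=3, nprobe=1)
```

A larger nprobe trades speed for recall; when nprobe equals nlist the search degenerates to brute force over all points.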

Examples

>>> import cudf
>>> from cuml.neighbors import NearestNeighbors
>>> from cuml.datasets import make_blobs

>>> X, _ = make_blobs(n_samples=5, centers=5,
...                   n_features=10, random_state=42)

>>> # build a cudf Dataframe
>>> X_cudf = cudf.DataFrame(X)

>>> # fit model
>>> model = NearestNeighbors(n_neighbors=3)
>>> model.fit(X)
NearestNeighbors()

>>> # get 3 nearest neighbors
>>> distances, indices = model.kneighbors(X_cudf)

>>> # print results
>>> print(indices)
   0  1  2
0  0  3  1
1  1  3  0
2  2  4  0
3  3  0  1
4  4  2  0
>>> print(distances)
        0          1          2
0  0.007812  24.786566  26.399996
1  0.000000  24.786566  30.045017
2  0.007812   5.458400  27.051241
3  0.000000  26.399996  27.543869
4  0.000000   5.458400  29.583437
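
The relationship between the p parameter and the named metrics can be checked with a short pure-Python sketch. The function `minkowski` below is a hypothetical helper written for this illustration; it is not part of cuML.

```python
def minkowski(u, v, p=2):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


u, v = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(u, v, p=1)  # |3| + |4|
euclidean = minkowski(u, v, p=2)  # sqrt(3**2 + 4**2)
```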

property effective_metric_#

property effective_metric_params_#

fit(self, X, y=None, *, convert_dtype=True) → 'NearestNeighbors'[source]#

Fit GPU index for performing nearest neighbor queries.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

kneighbors(self, X=None, n_neighbors=None, return_distance=True, *, convert_dtype=True, two_pass_precision=False) → Union[CumlArray, Tuple[CumlArray, CumlArray]][source]#

Query the GPU index for the k nearest neighbors of column vectors in X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
n_neighborsInteger: Number of neighbors to search. If not provided, the n_neighbors from the model instance is used.
return_distanceBoolean (default = True): If False, distances will not be returned.
convert_dtypebool, optional (default = True): When set to True, the kneighbors method will automatically convert the inputs to np.float32.
two_pass_precisionbool, optional (default = False): When set to True, a slow second pass will be used to improve the precision of results returned for searches using L2-derived metrics. FAISS uses the Euclidean distance decomposition trick to compute distances in this case, which may result in numerical errors for certain data. In particular, when several samples are close to the query sample (relative to typical inter-sample distances), numerical instability may cause the computed distance between the query and itself to be larger than the computed distance between the query and another sample. As a result, the query is not returned as the nearest neighbor to itself. If this flag is set to true, distances to the query vectors will be recomputed with high precision for all retrieved samples, and the results will be re-sorted accordingly. Note that for large values of k or large numbers of query vectors, this correction becomes impractical in terms of both runtime and memory. It should be used with care and only when strictly necessary (when precise results are critical and samples may be tightly clustered).

Returns:

distancescuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_neighbors): The distances of the k-nearest neighbors for each column vector in X
indicescuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_neighbors): The indices of the k-nearest neighbors for each column vector in X
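
The correction that two_pass_precision describes can be pictured as a refinement step over the candidates returned by a fast first pass. The sketch below is a simplified pure-Python illustration, not the FAISS or cuML code path; the helper `refine` and the hard-coded first-pass ordering are assumptions chosen to mimic a case where rounding error ranked a close neighbor ahead of the query itself.

```python
import math


def refine(query, candidate_ids, data):
    """Second pass: recompute exact distances for the retrieved candidates
    and re-sort them. `candidate_ids` comes from a faster first pass."""
    exact = [(math.dist(query, data[i]), i) for i in candidate_ids]
    exact.sort()
    return [d for d, _ in exact], [i for _, i in exact]


data = [(1.0, 1.0), (1.0, 1.001), (5.0, 5.0)]
query = data[0]

# Hypothetical first-pass ordering in which index 1 was ranked ahead of
# the query's own index 0.
first_pass_ids = [1, 0, 2]

distances, indices = refine(query, first_pass_ids, data)
```

The second pass touches every retrieved candidate for every query, which is why its cost grows with both k and the number of query vectors.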

kneighbors_graph(self, X=None, n_neighbors=None, mode='connectivity') → SparseCumlArray[source]#

Find the k nearest neighbors of column vectors in X and return as a sparse matrix in CSR format.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
n_neighborsInteger: Number of neighbors to search. If not provided, the n_neighbors from the model instance is used
modestring (default=’connectivity’): Values in connectivity matrix: ‘connectivity’ returns the connectivity matrix with ones and zeros, ‘distance’ returns the edges as the distances between points with the requested metric.

Returns:

Asparse graph in CSR format, shape = (n_samples, n_samples_fit): n_samples_fit is the number of samples in the fitted data where A[i, j] is assigned the weight of the edge that connects i to j. Values will either be ones/zeros or the selected distance metric. Return types are either cupy’s CSR sparse graph (device) or numpy’s CSR sparse graph (host)
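
For readers unfamiliar with the CSR layout of the returned graph, the following pure-Python sketch builds the three CSR arrays for a brute-force k-NN graph in both modes. It is a conceptual illustration only; the helper `knn_graph_csr` and its return format are assumptions and do not mirror cuML's sparse array types.

```python
import math


def knn_graph_csr(data, k, mode="connectivity"):
    """Return (values, col_indices, row_ptr) for a brute-force k-NN graph."""
    values, col_indices, row_ptr = [], [], [0]
    for point in data:
        # Rank all fitted samples by distance to this row's point.
        ranked = sorted(range(len(data)), key=lambda j: math.dist(point, data[j]))
        for j in ranked[:k]:
            col_indices.append(j)
            values.append(1.0 if mode == "connectivity" else math.dist(point, data[j]))
        row_ptr.append(len(col_indices))
    return values, col_indices, row_ptr


data = [(0.0,), (1.0,), (3.0,)]
values, cols, ptr = knn_graph_csr(data, k=2)
dist_values, _, _ = knn_graph_csr(data, k=2, mode="distance")
```

Row i of the graph occupies the slice ptr[i]:ptr[i + 1] of both cols and values, so every row holds exactly k stored entries.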

Nearest Neighbors Classification#

class cuml.neighbors.KNeighborsClassifier(*, weights='uniform', handle=None, verbose=False, output_type=None, **kwargs)#

K-Nearest Neighbors Classifier is an instance-based learning technique, that keeps training samples around for prediction, rather than trying to learn a generalizable set of model parameters.

Parameters:

n_neighborsint (default=5): Default number of neighbors to query
algorithmstring (default=’auto’): The query algorithm to use. Currently, only ‘brute’ is supported.
metricstring (default=’euclidean’): Distance metric to use.
weightsstring (default=’uniform’): Sample weights to use. Currently, only the uniform strategy is supported.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

Attributes:

classes_: KNeighborsClassifier.classes_(self)
outputs_2d_: KNeighborsClassifier.outputs_2d_(self)
y

Methods

`fit`(self, X, y, *[, convert_dtype])	Fit a GPU index for k-nearest neighbors classifier model.
`predict`(self, X, *[, convert_dtype])	Use the trained k-nearest neighbors classifier to predict the labels for X.
`predict_proba`(self, X, *[, convert_dtype])	Use the trained k-nearest neighbors classifier to predict the label probabilities for X.

Notes

For additional docs, see scikit-learn’s KNeighborsClassifier.

Examples

>>> from cuml.neighbors import KNeighborsClassifier
>>> from cuml.datasets import make_blobs
>>> from cuml.model_selection import train_test_split

>>> X, y = make_blobs(n_samples=100, centers=5,
...                   n_features=10, random_state=5)
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, train_size=0.80, random_state=5)

>>> knn = KNeighborsClassifier(n_neighbors=10)

>>> knn.fit(X_train, y_train)
KNeighborsClassifier()
>>> knn.predict(X_test)
array([1., 2., 2., 3., 4., 2., 4., 4., 2., 3., 1., 4., 3., 1., 3., 4., 3.,
    4., 1., 3.], dtype=float32)

property classes_#

fit(self, X, y, *, convert_dtype=True) → 'KNeighborsClassifier'[source]#

Fit a GPU index for k-nearest neighbors classifier model.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the method will automatically convert the inputs to np.float32.

property outputs_2d_#: Whether the output is 2d

predict(self, X, *, convert_dtype=True) → CumlArray[source]#

Use the trained k-nearest neighbors classifier to predict the labels for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the method will automatically convert the inputs to np.float32.

Returns:

X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Labels predicted

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

predict_proba(self, X, *, convert_dtype=True) → CumlArray | list[CumlArray][source]#

Use the trained k-nearest neighbors classifier to predict the label probabilities for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the method will automatically convert the inputs to np.float32.

Returns:

X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Label probabilities

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
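
As a rough mental model for predict_proba with uniform weights, each class probability can be read as the fraction of the k nearest training samples carrying that label. The pure-Python sketch below follows that assumption (it matches the usual uniform-vote definition but is not taken from cuML's source); the helper `knn_predict_proba` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict_proba(x, train_X, train_y, classes, k):
    """Fraction of the k nearest training samples that carry each label."""
    ranked = sorted(range(len(train_X)), key=lambda i: math.dist(x, train_X[i]))
    votes = Counter(train_y[i] for i in ranked[:k])
    return [votes[c] / k for c in classes]


train_X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.2, 0.1)]
train_y = [0, 0, 1, 1, 0]
classes = [0, 1]

proba = knn_predict_proba((0.3, 0.3), train_X, train_y, classes, k=4)
# The predicted label is the class with the largest vote share.
label = classes[proba.index(max(proba))]
```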

Nearest Neighbors Regression#

class cuml.neighbors.KNeighborsRegressor(*, weights='uniform', handle=None, verbose=False, output_type=None, **kwargs)#

K-Nearest Neighbors Regressor is an instance-based learning technique, that keeps training samples around for prediction, rather than trying to learn a generalizable set of model parameters.

The K-Nearest Neighbors Regressor will compute the average of the labels for the k closest neighbors and use it as the label.
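
The averaging rule described above can be written down in a few lines of pure Python. This is a conceptual sketch with uniform weights and a brute-force search, not the GPU implementation; the helper `knn_regress` and the toy data are hypothetical.

```python
import math


def knn_regress(x, train_X, train_y, k):
    """Predict the mean target of the k training samples closest to x."""
    ranked = sorted(range(len(train_X)), key=lambda i: math.dist(x, train_X[i]))
    nearest = ranked[:k]
    return sum(train_y[i] for i in nearest) / k


train_X = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [1.0, 2.0, 3.0, 50.0]

# The three closest samples have targets 2.0, 3.0 and 1.0; the outlier at
# x = 10 does not contribute.
prediction = knn_regress((1.2,), train_X, train_y, k=3)
```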

Parameters:

n_neighborsint (default=5)

Default number of neighbors to query

algorithmstring (default=’auto’)

The query algorithm to use. Valid options are:

'auto': to automatically select brute-force or random ball cover based on data shape and metric
'rbc': for the random ball algorithm, which partitions the data space and uses the triangle inequality to lower the number of potential distances. Currently, this algorithm supports 2d Euclidean and Haversine.
'brute': for brute-force, slow but produces exact results
'ivfflat': for inverted file, divide the dataset in partitions and perform search on relevant partitions only
'ivfpq': for inverted file and product quantization. This works like the inverted file; in addition, the vectors are split into n_features/M sub-vectors that are encoded by means of intermediary k-means clusterings. This encoding provides partial information, allowing faster distance calculations.

metricstring (default=’euclidean’)

Distance metric to use.

weightsstring (default=’uniform’)

Sample weights to use. Currently, only the uniform strategy is supported.

handlecuml.Handle

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Attributes:

y

Methods

`fit`(self, X, y, *[, convert_dtype])	Fit a GPU index for k-nearest neighbors regression model.
`predict`(self, X, *[, convert_dtype])	Use the trained k-nearest neighbors regression model to predict the labels for X.

Notes

For additional docs, see scikit-learn’s KNeighborsRegressor.

Examples

>>> from cuml.neighbors import KNeighborsRegressor
>>> from cuml.datasets import make_regression
>>> from cuml.model_selection import train_test_split

>>> X, y = make_regression(n_samples=100, n_features=10,
...                        random_state=5)
>>> X_train, X_test, y_train, y_test = train_test_split(
...   X, y, train_size=0.80, random_state=5)

>>> knn = KNeighborsRegressor(n_neighbors=10)
>>> knn.fit(X_train, y_train)
KNeighborsRegressor()
>>> knn.predict(X_test)
array([ 14.770798  ,  51.8834    ,  66.15657   ,  46.978275  ,
    21.589611  , -14.519918  , -60.25534   , -20.856869  ,
    29.869623  , -34.83317   ,   0.45447388, 120.39675   ,
    109.94834   ,  63.57794   , -17.956171  ,  78.77663   ,
    30.412262  ,  32.575233  ,  74.72834   , 122.276855  ],
dtype=float32)

fit(self, X, y, *, convert_dtype=True) → 'KNeighborsRegressor'[source]#

Fit a GPU index for k-nearest neighbors regression model.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the method will automatically convert the inputs to np.float32.

predict(self, X, *, convert_dtype=True) → CumlArray[source]#

Use the trained k-nearest neighbors regression model to predict the labels for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the method will automatically convert the inputs to np.float32.

Returns:

X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_features)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

Kernel Density Estimation#

class cuml.neighbors.KernelDensity(*, bandwidth=1.0, kernel='gaussian', metric='euclidean', metric_params=None, output_type=None, handle=None, verbose=False)[source]#

Kernel Density Estimation. Computes a non-parametric density estimate from a finite data sample, smoothing the estimate according to a bandwidth parameter.

Parameters:

bandwidthfloat, default=1.0: The bandwidth of the kernel.
kernel{‘gaussian’, ‘tophat’, ‘epanechnikov’, ‘exponential’, ‘linear’, ‘cosine’}, default=’gaussian’: The kernel to use.
metricstr, default=’euclidean’: The distance metric to use. Note that not all metrics are valid with all algorithms, and that the normalization of the density output is correct only for the Euclidean distance metric.
metric_paramsdict, default=None: Additional parameters to be passed to the tree for use with the metric.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Methods

`fit`(X[, y, sample_weight, convert_dtype])	Fit the Kernel Density model on the data.
`sample`([n_samples, random_state])	Generate random samples from the model.
`score`(X[, y])	Compute the total log-likelihood under the model.
`score_samples`(X, *[, convert_dtype])	Compute the log-likelihood of each sample under the model.

Examples

>>> from cuml.neighbors import KernelDensity
>>> import cupy as cp
>>> rng = cp.random.RandomState(42)
>>> X = rng.random_sample((100, 3))
>>> kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X)
>>> log_density = kde.score_samples(X[:3])

fit(X, y=None, sample_weight=None, *, convert_dtype=True)[source]#

Fit the Kernel Density model on the data.

Parameters:

Xarray-like of shape (n_samples, n_features): List of n_features-dimensional data points. Each row corresponds to a single data point.
yNone: Ignored.
sample_weightarray-like of shape (n_samples,), default=None: List of sample weights attached to the data X.

Returns:

selfobject: Returns the instance itself.

sample(n_samples=1, random_state=None)[source]#

Generate random samples from the model. Currently, this is implemented only for gaussian and tophat kernels, and the Euclidean metric.

Parameters:

n_samplesint, default=1: Number of samples to generate.
random_stateint, cupy RandomState instance or None, default=None

Returns:

Xcupy array of shape (n_samples, n_features): List of samples.
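
For the gaussian kernel with the Euclidean metric, drawing from a kernel density estimate can be understood as picking a training point and perturbing it with Gaussian noise whose standard deviation equals the bandwidth. The pure-Python sketch below follows that textbook procedure under the assumption of uniform sample weights; it is not cuML's sampling code, and the helper `sample_gaussian_kde` is hypothetical.

```python
import random


def sample_gaussian_kde(train_X, bandwidth, n_samples=1, seed=None):
    """Draw samples from a Gaussian KDE: pick a training point uniformly,
    then perturb each coordinate with N(0, bandwidth**2) noise."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        center = rng.choice(train_X)
        samples.append(tuple(rng.gauss(c, bandwidth) for c in center))
    return samples


train_X = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]
draws = sample_gaussian_kde(train_X, bandwidth=0.5, n_samples=4, seed=42)
```

Passing the same seed reproduces the same draws, which mirrors the purpose of the random_state argument.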

score(X, y=None) → float[source]#

Compute the total log-likelihood under the model.

Parameters:

Xarray-like of shape (n_samples, n_features): List of n_features-dimensional data points. Each row corresponds to a single data point.
yNone: Ignored.

Returns:

logprobfloat: Total log-likelihood of the data in X. This is normalized to be a probability density, so the value will be low for high-dimensional data.

score_samples(X, *, convert_dtype=True) → CumlArray[source]#

Compute the log-likelihood of each sample under the model.

Parameters:

Xarray-like of shape (n_samples, n_features): An array of points to query. Last dimension should match dimension of training data (n_features).

Returns:

densityndarray of shape (n_samples,): Log-likelihood of each sample in X. These are normalized to be probability densities, so values will be low for high-dimensional data.
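
To clarify what the per-sample log-likelihood means for the gaussian kernel, the pure-Python sketch below evaluates the standard kernel density formula with uniform sample weights and the Euclidean metric. It is an illustration rather than cuML's implementation; the helper `gaussian_kde_score_samples` is hypothetical, and a production version would use a log-sum-exp formulation to avoid underflow.

```python
import math


def gaussian_kde_score_samples(train_X, query_X, bandwidth):
    """Log-density of each query point under a Gaussian KDE (uniform weights)."""
    n, d = len(train_X), len(train_X[0])
    # Log of the Gaussian normalization constant (2 * pi * h**2) ** (-d / 2).
    log_norm = -0.5 * d * math.log(2.0 * math.pi * bandwidth ** 2)
    scores = []
    for q in query_X:
        kernel_sum = sum(
            math.exp(-0.5 * (math.dist(q, x) / bandwidth) ** 2) for x in train_X
        )
        scores.append(math.log(kernel_sum / n) + log_norm)
    return scores


train_X = [(0.0,)]
per_sample = gaussian_kde_score_samples(train_X, [(0.0,), (1.0,)], bandwidth=1.0)
# A total log-likelihood aggregates the per-sample values by summation.
total = sum(per_sample)
```

With a single training point at the origin and unit bandwidth, the estimate reduces to the standard normal density, which makes the values easy to check by hand.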

Time Series#

HoltWinters#

class cuml.ExponentialSmoothing(endog, *, seasonal='additive', seasonal_periods=2, start_periods=2, ts_num=1, eps=0.00224, handle=None, verbose=False, output_type=None)#

Implements a HoltWinters time series analysis model, which is used both for forecasting future entries in a time series and for exponential smoothing, where weights are assigned to historical data with exponentially decreasing impact. This is done by analyzing three components of the data: level, trend, and seasonality.
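
The interplay of level, trend, and seasonality can be sketched with the textbook additive Holt-Winters recursions. The pure-Python code below is a simplified illustration, not cuML's implementation: the smoothing parameters alpha, beta, and gamma are fixed by hand here (an assumption; the eps parameter below indicates the model itself tunes its parameters by gradient descent), the seed values come from a simple rule over the first two seasons, and the helper name `holt_winters_additive` is hypothetical.

```python
def holt_winters_additive(y, m, alpha, beta, gamma, h):
    """Additive Holt-Winters with fixed smoothing parameters.

    Requires len(y) >= 2 * m. Returns (forecasts, sse), where sse is the sum
    of squared one-step-ahead errors over the smoothed part of the series.
    """
    # Simple seed values taken from the first two seasons (an assumption).
    first, second = y[:m], y[m:2 * m]
    level = sum(first) / m
    trend = (sum(second) - sum(first)) / (m * m)
    season = [value - level for value in first]

    sse = 0.0
    for t in range(m, len(y)):
        seasonal = season[t % m]  # seasonal term from one period ago
        prediction = level + trend + seasonal
        sse += (y[t] - prediction) ** 2

        prev_level = level
        level = alpha * (y[t] - seasonal) + (1 - alpha) * (prev_level + trend)
        season[t % m] = gamma * (y[t] - prev_level - trend) + (1 - gamma) * seasonal
        trend = beta * (level - prev_level) + (1 - beta) * trend

    n = len(y)
    forecasts = [level + (i + 1) * trend + season[(n + i) % m] for i in range(h)]
    return forecasts, sse


# Upward-trending series that alternates low/high with period 2.
series = [10.0, 14.0, 11.0, 15.0, 12.0, 16.0, 13.0, 17.0]
forecasts, sse = holt_winters_additive(series, m=2, alpha=0.5, beta=0.5, gamma=0.5, h=4)
```

The returned sse corresponds to the kind of quantity that score reports in this model, as noted under Known Differences.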

Parameters:

endogarray-like (device or host): The endogenous dataset to be operated on. Acceptable formats: cuDF DataFrame, cuDF Series, NumPy ndarray, Numba device ndarray, and CUDA array interface compliant arrays like CuPy. Note: cuDF DataFrame inputs are assumed to hold data in columns, while all other data types are assumed to hold data in rows.
seasonal‘additive’, ‘add’, ‘multiplicative’, ‘mul’ (default = ‘additive’): Whether the seasonal trend should be calculated additively or multiplicatively.
seasonal_periodsint (default=2): The seasonality of the data (how often it repeats). For monthly data this should be 12, for weekly data, this should be 7.
start_periodsint (default=2): Number of seasons to be used for seasonal seed values
ts_numint (default=1): The number of different time series that were passed in the endog param.
epsnp.number > 0 (default=2.24e-3): The accuracy that gradient descent should achieve. Note that changing this value may affect the forecasted results.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

Attributes:

SSE
forecasted_points
level
season
trend

Methods

`fit`(self)	Perform fitting on the given `endog` dataset.
`forecast`(self[, h, index])	Forecasts future points based on the fitted model.
`get_level`(self[, index])	Returns the level component of the model.
`get_season`(self[, index])	Returns the season component of the model.
`get_trend`(self[, index])	Returns the trend component of the model.
`score`(self[, index])	Returns the score of the model.

Notes

Known Limitations: This version of ExponentialSmoothing currently provides only a limited number of features compared to the statsmodels.holtwinters.ExponentialSmoothing model. Notably, it lacks:

predict: no support for in-sample prediction (rapidsai/cuml#875).
hessian: no support for returning the Hessian matrix (rapidsai/cuml#880).
information: no support for returning the Fisher matrix (rapidsai/cuml#880).
loglike: no support for returning the log-likelihood (rapidsai/cuml#880).

Additionally, be warned that there may exist floating point instability issues in this model. Small values in endog may lead to faulty results. See rapidsai/cuml#888 for more information.

Known Differences: This version of ExponentialSmoothing differs from statsmodels in some other minor ways:

Cannot pass a trend component or damped trend component.
This version can take the additional parameters eps, start_periods, ts_num, and handle.
score() returns the SSE rather than the gradient of the log-likelihood (rapidsai/cuml#876).
This version provides get_level(), get_trend(), and get_season().

Examples

>>> from cuml import ExponentialSmoothing
>>> import cudf
>>> import cupy as cp
>>> data = cudf.Series([1, 2, 3, 4, 5, 6,
...                     7, 8, 9, 10, 11, 12,
...                     2, 3, 4, 5, 6, 7,
...                     8, 9, 10, 11, 12, 13,
...                     3, 4, 5, 6, 7, 8, 9,
...                     10, 11, 12, 13, 14],
...                     dtype=cp.float64)
>>> cu_hw = ExponentialSmoothing(data, seasonal_periods=12).fit()
>>> cu_pred = cu_hw.forecast(4)
>>> print('Forecasted points:', cu_pred)
Forecasted points :
0    4.000143766093652
1    5.000000163513641
2    6.000000000174092
3    7.000000000000178
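
The level, trend, and seasonality components named in the class description can be illustrated with the textbook additive Holt-Winters recurrences. The sketch below is plain Python and is not cuML's implementation: the function name `holt_winters_additive`, the fixed smoothing weights `alpha`, `beta`, and `gamma`, and the simple seed-value scheme are assumptions made for illustration, whereas cuML estimates its parameters by optimization.

```python
def holt_winters_additive(y, m, alpha=0.5, beta=0.5, gamma=0.5, h=1):
    """Textbook additive Holt-Winters with fixed weights (illustrative only).

    y: list of observations, m: seasonal period, h: number of forecasts.
    """
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    # Seed values (an assumed, simple scheme): means of the first two seasons.
    first = sum(y[:m]) / m
    second = sum(y[m:2 * m]) / m
    level = first
    trend = (second - first) / m
    season = [y[i] - first for i in range(m)]

    # Update level, trend, and season for every later observation.
    for t in range(m, len(y)):
        prev_level = level
        s_old = season[t % m]
        level = alpha * (y[t] - s_old) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[t % m] = gamma * (y[t] - level) + (1 - gamma) * s_old

    # Forecast: extrapolate the trend and reuse the matching seasonal term.
    n = len(y)
    return [level + (k + 1) * trend + season[(n + k) % m] for k in range(h)]


forecast = holt_winters_additive([1.0, 2.0, 3.0] * 4, m=3, h=3)
print(forecast)
```

For a purely periodic series with no trend, the seasonal pattern is reproduced in the forecast; real data generally needs fitted weights rather than the fixed ones assumed here.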

fit(self) → 'ExponentialSmoothing'[source]#: Perform fitting on the given endog dataset. Calculates the level, trend, season, and SSE components.

forecast(self, h=1, index=None)[source]#

Forecasts future points based on the fitted model.

Parameters:

hint (default=1): The number of points for each series to be forecasted.
indexint (default=None): The index of the time series from which you want forecasted points. If None, then a cudf.DataFrame of the forecasted points from all time series is returned.

Returns:

predscudf.DataFrame or cudf.Series: Series of forecasted points if index is provided. DataFrame of all forecasted points if index=None.

get_level(self, index=None)[source]#

Returns the level component of the model.

Parameters:

indexint (default=None): The index of the time series from which the level will be returned. If None, then all level components are returned in a cudf.Series.

Returns:

levelcudf.Series or cudf.DataFrame: The level component of the fitted model

get_season(self, index=None)[source]#

Returns the season component of the model.

Parameters:

indexint (default=None): The index of the time series from which the season will be returned. If None, then all season components are returned in a cudf.Series.

Returns:

season: cudf.Series or cudf.DataFrame: The season component of the fitted model

get_trend(self, index=None)[source]#

Returns the trend component of the model.

Parameters:

indexint (default=None): The index of the time series from which the trend will be returned. If None, then all trend components are returned in a cudf.Series.

Returns:

trendcudf.Series or cudf.DataFrame: The trend component of the fitted model.

score(self, index=None)[source]#

Returns the score of the model.

Note

Currently returns the SSE, rather than the gradient of the LogLikelihood. rapidsai/cuml#876

Parameters:

indexint (default=None): The index of the time series from which the SSE will be returned. If None, then all SSEs are returned in a cudf Series.

Returns:

scorenp.float32, np.float64, or cudf.Series: The SSE of the fitted model.
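
Since score() reports the SSE, it may help to recall what that quantity is: the sum of squared differences between observed and fitted values, where lower is better. A minimal sketch in plain Python follows; the helper name `sse` is hypothetical and the snippet does not call cuML.

```python
def sse(observed, fitted):
    """Sum of squared errors between two equally long sequences."""
    if len(observed) != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    return sum((o - f) ** 2 for o, f in zip(observed, fitted))


print(sse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```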

ARIMA#

class cuml.tsa.ARIMA(endog, *, order: Tuple[int, int, int] = (1, 1, 1), seasonal_order: Tuple[int, int, int, int] = (0, 0, 0, 0), exog=None, fit_intercept=True, simple_differencing=True, handle=None, verbose=False, output_type=None, convert_dtype=True)#

Implements a batched ARIMA model for in- and out-of-sample time-series prediction, with support for seasonality (SARIMA).

ARIMA stands for Auto-Regressive Integrated Moving Average. See https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average

This class can fit an ARIMA(p,d,q) or ARIMA(p,d,q)(P,D,Q)_s model to a batch of time series of the same length (or various lengths, using missing values at the start for padding). The implementation is designed to give the best performance when using large batches of time series.
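
The padding convention mentioned above (series of various lengths, padded with missing values at the start) can be sketched in plain Python. The helper name `pad_batch` is hypothetical; it only builds the column-per-series layout with leading NaN values and does not call cuML.

```python
import math


def pad_batch(series):
    """Left-pad each series with NaN to a common length, one series per column."""
    n_obs = max(len(s) for s in series)
    padded = [[math.nan] * (n_obs - len(s)) + list(s) for s in series]
    # Transpose so that row t holds observation t of every series.
    return [list(row) for row in zip(*padded)]


rows = pad_batch([[1.0, 2.0, 3.0], [4.0, 5.0]])
for row in rows:
    print(row)
```

The resulting rows could then be converted to any of the accepted array formats before being passed as endog.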

Parameters:

endogdataframe or array-like (device or host): Endogenous variable, assumed to have each time series in columns. Acceptable formats: cuDF DataFrame, cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy. Missing values are accepted, represented by NaN.
orderTuple[int, int, int] (default=(1,1,1)): The ARIMA order (p, d, q) of the model
seasonal_orderTuple[int, int, int, int] (default=(0,0,0,0)): The seasonal ARIMA order (P, D, Q, s) of the model
exogdataframe or array-like (device or host) (default=None): Exogenous variables, assumed to have each time series in columns, such that variables associated with the same batch member are adjacent (number of columns: n_exog * batch_size). Acceptable formats: cuDF DataFrame, cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy. Missing values are not supported.
fit_interceptbool or int (default = True): Whether to include a constant trend mu in the model.
simple_differencingbool or int (default = True): If True, the data is differenced before being passed to the Kalman filter. If False, differencing is part of the state-space model. In some cases this setting can be ignored: computing forecasts with confidence intervals will force it to False; fitting with the CSS method will force it to True. Note that forecasts are always for the original series, whereas statsmodels computes forecasts for the differenced series when simple_differencing is True.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
convert_dtypeboolean: When set to True, the model will automatically convert the inputs to np.float64.

Attributes:

orderARIMAOrder: The ARIMA order of the model (p, d, q, P, D, Q, s, k, n_exog)
d_ydevice array: Time series data on device
n_obsint: Number of observations
batch_sizeint: Number of time series in the batch
dtypenumpy.dtype: Floating-point type of the data and parameters
niternumpy.ndarray: After fitting, contains the number of iterations before convergence for each time series.

Methods

`fit`(self[, start_params, opt_disp, h, ...])	Fit the ARIMA model to each time series.
`forecast`(self, nsteps[, level, exog])	Forecast the given model `nsteps` into the future.
`get_fit_params`(self)	Get all the fit parameters.
`get_params`(self[, deep])	ARIMA is unable to be cloned at this time.
`pack`(self)	Pack parameters of the model into a linearized vector `x`
`predict`(self[, start, end, level, exog, ...])	Compute in-sample and/or out-of-sample prediction for each series
`set_fit_params`(self, params[, convert_dtype])	Set all the fit parameters.
`set_params`(self, **params)	ARIMA is unable to be cloned at this time.
`unpack`(self, x[, convert_dtype])	Unpack linearized parameter vector `x` into the separate parameter arrays of the model

Notes

Performance: Let \(r = \max(p + s \cdot P,\; q + s \cdot Q + 1)\). The device memory used for most operations is \(O(\mathtt{batch\_size} \cdot \mathtt{n\_obs} + \mathtt{batch\_size} \cdot r^2)\). The execution time is a linear function of n_obs and batch_size (if batch_size is large), but grows very fast with r.

The performance is optimized for very large batch sizes (e.g. thousands of series).
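
To make the performance note concrete, the quantity r and the leading-order memory term can be evaluated for a given model. This is a plain-Python sketch of the formulas stated above; the helper names `arima_r` and `memory_scale` are hypothetical, and the result is an element count up to a constant factor, not a byte count.

```python
def arima_r(order, seasonal_order):
    """r = max(p + s*P, q + s*Q + 1) for order (p, d, q) and (P, D, Q, s)."""
    p, _, q = order
    P, _, Q, s = seasonal_order
    return max(p + s * P, q + s * Q + 1)


def memory_scale(batch_size, n_obs, r):
    """Leading-order term batch_size*n_obs + batch_size*r^2."""
    return batch_size * n_obs + batch_size * r * r


r = arima_r((0, 1, 1), (0, 1, 1, 4))
print(r, memory_scale(1000, 100, r))
```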

References

This class is heavily influenced by the Python library statsmodels, particularly statsmodels.tsa.statespace.sarimax.SARIMAX. See https://www.statsmodels.org/stable/statespace.html.

Additionally the following book is a useful reference: “Time Series Analysis by State Space Methods”, J. Durbin, S.J. Koopman, 2nd Edition (2012).

Examples

>>> import cupy as cp
>>> from cuml.tsa.arima import ARIMA

>>> # Create seasonal data with a trend, a seasonal pattern and noise
>>> n_obs = 100
>>> cp.random.seed(12)
>>> x = cp.linspace(0, 1, n_obs)
>>> pattern = cp.array([[0.05, 0.0], [0.07, 0.03],
...                     [-0.03, 0.05], [0.02, 0.025]])
>>> noise = cp.random.normal(scale=0.01, size=(n_obs, 2))
>>> y = (cp.column_stack((0.5*x, -0.25*x)) + noise
...     + cp.tile(pattern, (25, 1)))

>>> # Fit a seasonal ARIMA model
>>> model = ARIMA(y,
...               order=(0,1,1),
...               seasonal_order=(0,1,1,4),
...               fit_intercept=False)
>>> model.fit()
ARIMA(...)
>>> # Forecast
>>> fc = model.forecast(10)
>>> print(fc)
[[ 0.55204599 -0.25681163]
 [ 0.57430705 -0.2262438 ]
 [ 0.48120315 -0.20583011]
 [ 0.535594   -0.24060046]
 [ 0.57207541 -0.26695497]
 [ 0.59433647 -0.23638713]
 [ 0.50123257 -0.21597344]
 [ 0.55562342 -0.25074379]
 [ 0.59210483 -0.27709831]
 [ 0.61436589 -0.24653047]]

property aic: CumlArray#: Akaike Information Criterion

property aicc: CumlArray#: Corrected Akaike Information Criterion

property bic: CumlArray#: Bayesian Information Criterion

property complexity#: Model complexity (number of parameters)
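
For reference, the textbook definitions of the three criteria can be written in terms of the log-likelihood (llf) and the number of parameters (complexity). The sketch below is plain Python with hypothetical helper names; which observation count `n` cuML uses (for example, whether differenced observations are excluded) is not stated here, so treat `n` as an assumption.

```python
import math


def aic(llf, k):
    """Akaike Information Criterion: 2k - 2*llf."""
    return 2 * k - 2 * llf


def aicc(llf, k, n):
    """AIC with the small-sample correction 2k(k+1)/(n-k-1)."""
    return aic(llf, k) + 2 * k * (k + 1) / (n - k - 1)


def bic(llf, k, n):
    """Bayesian Information Criterion: k*ln(n) - 2*llf."""
    return k * math.log(n) - 2 * llf


print(aic(-10.0, 3), aicc(-10.0, 3, 10), bic(-10.0, 3, 100))
```

Lower values indicate a better trade-off between fit and model complexity for all three criteria.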

fit(self, start_params: Optional[Mapping[str, object]] = None, opt_disp: int = -1, h: float = 1e-8, maxiter: int = 1000, method='ml', truncate: int = 0, convert_dtype: bool = True) → 'ARIMA'[source]#

Fit the ARIMA model to each time series.

Parameters:

start_paramsMapping[str, array-like] (optional)

A mapping (e.g. a dictionary) of parameter names and associated arrays. The key names are in {“mu”, “ar”, “ma”, “sar”, “sma”, “sigma2”}. The shape of the arrays is (batch_size,) for the mu and sigma2 parameters and (n, batch_size) for any other type, where n is the corresponding number of parameters of this type. Pass None for automatic estimation (recommended).

opt_dispint

Fit diagnostic level (for L-BFGS solver):

-1 for no output (default)
0<n<100 for output every n steps
n>100 for more detailed output

hfloat (default=1e-8)

Finite-differencing step size. The gradient is computed using forward finite differencing: \(g = \frac{f(x + \mathtt{h}) - f(x)}{\mathtt{h}} + O(\mathtt{h})\)

maxiterint (default=1000)

Maximum number of iterations of L-BFGS-B

methodstr (default=”ml”)

Estimation method - “css”, “css-ml” or “ml”. CSS uses a sum-of-squares approximation. ML estimates the log-likelihood with statespace methods. CSS-ML starts with CSS and refines with ML.

truncateint (default=0)

When using CSS, start the sum of squares after a given number of observations
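
The forward finite-differencing formula given for h can be sketched in plain Python. The helper name `forward_diff_gradient` is hypothetical and the snippet is independent of cuML; it only shows how one extra function evaluation per coordinate yields a gradient estimate with an error of order h.

```python
def forward_diff_gradient(f, x, h=1e-8):
    """Approximate the gradient of f at x with forward differences."""
    fx = f(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h  # perturb one coordinate at a time
        grad.append((f(shifted) - fx) / h)
    return grad


g = forward_diff_gradient(lambda v: v[0] ** 2 + 3.0 * v[1], [1.0, 2.0])
print(g)
```

A very small h reduces the truncation error but amplifies floating-point round-off, which is why the step size is exposed as a parameter.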

forecast(self, nsteps: int, level=None, exog=None) → CumlArray | Tuple[CumlArray, CumlArray, CumlArray][source]#

Forecast the given model nsteps into the future.

Parameters:

nstepsint: The number of steps to forecast beyond end of the given series
levelfloat or None (default = None): Confidence level for prediction intervals, or None to return only the point forecasts. 0 < level < 1
exogdataframe or array-like (device or host) (default=None): Future values for exogenous variables. Assumed to have each time series in columns, such that variables associated with the same batch member are adjacent. Shape = (nsteps, n_exog * batch_size)

Returns:

y_fcarray-like: Forecasts. Shape = (nsteps, batch_size)
lowerarray-like (device) (optional): Lower limit of the prediction interval if level != None. Shape = (nsteps, batch_size)
upperarray-like (device) (optional): Upper limit of the prediction interval if level != None. Shape = (nsteps, batch_size)

Examples

from cuml.tsa.arima import ARIMA

# ys: a batch of time series, one series per column
model = ARIMA(ys, order=(1,1,1))
model.fit()
y_fc = model.forecast(10)
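
The exog column layout described in the parameters (variables of the same batch member adjacent, n_exog * batch_size columns in total) can be made explicit with a small index helper. This is a plain-Python sketch with hypothetical helper names and zero-based indices; it does not call cuML.

```python
def exog_columns(series_idx, n_exog):
    """Column indices holding the exogenous variables of one batch member."""
    start = series_idx * n_exog
    return list(range(start, start + n_exog))


def check_exog_shape(shape, nsteps, n_exog, batch_size):
    """True if shape matches the documented (nsteps, n_exog * batch_size)."""
    return tuple(shape) == (nsteps, n_exog * batch_size)


print(exog_columns(1, 2))
print(check_exog_shape((10, 6), nsteps=10, n_exog=2, batch_size=3))
```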

get_fit_params(self) → Dict[str, CumlArray][source]#

Get all the fit parameters. Not to be confused with get_params. Note: pack() can be used to get a compact vector of the parameters.

Returns:

params: Dict[str, array-like]: A dictionary of parameter names and associated arrays. The key names are in {“mu”, “ar”, “ma”, “sar”, “sma”, “sigma2”}. The shape of the arrays is (batch_size,) for mu and sigma2 and (n, batch_size) for any other type, where n is the corresponding number of parameters of this type.

get_params(self, deep=True)[source]#: ARIMA is unable to be cloned at this time. The methods _get_param_names(), get_params() and set_params() will raise NotImplementedError.

property llf#: Log-likelihood of a fit model. Shape: (batch_size,)

pack(self) → np.ndarray[source]#

Pack parameters of the model into a linearized vector x

Returns:

xnumpy ndarray: Packed parameter array, grouped by series. Shape: (n_params * batch_size,)
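
The phrase "grouped by series" can be illustrated with a plain-Python sketch that flattens a parameter dictionary of the documented shapes into one vector. The helper name `pack_params` is hypothetical, and the ordering of parameter types within each series is an assumption made for illustration, not a statement about cuML's internal layout.

```python
def pack_params(params, batch_size,
                order=("mu", "ar", "ma", "sar", "sma", "sigma2")):
    """Flatten parameters so that all values of one series are contiguous."""
    x = []
    for b in range(batch_size):
        for name in order:  # assumed ordering within a series
            if name not in params:
                continue
            values = params[name]
            if name in ("mu", "sigma2"):
                x.append(values[b])                 # shape (batch_size,)
            else:
                x.extend(row[b] for row in values)  # shape (n, batch_size)
    return x


params = {"mu": [0.1, 0.2],
          "ar": [[0.5, 0.6], [0.3, 0.4]],
          "sigma2": [1.0, 2.0]}
print(pack_params(params, batch_size=2))
```

With four parameters per series and two series, the packed vector has n_params * batch_size = 8 entries.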

predict(self, start=0, end=None, level=None, exog=None, convert_dtype=True) → CumlArray | Tuple[CumlArray, CumlArray, CumlArray][source]#

Compute in-sample and/or out-of-sample prediction for each series

Parameters:

startint (default = 0): Index where to start the predictions (0 <= start <= num_samples)
endint (default = None): Index where to end the predictions, excluded (end > start), or None to predict until the last observation
levelfloat or None (default = None): Confidence level for prediction intervals, or None to return only the point forecasts. 0 < level < 1
exogdataframe or array-like (device or host): Future values for exogenous variables. Assumed to have each time series in columns, such that variables associated with the same batch member are adjacent. Shape = (end - n_obs, n_exog * batch_size)

Returns:

y_parray-like (device): Predictions. Shape = (end - start, batch_size)
lower: array-like (device) (optional): Lower limit of the prediction interval if level != None. Shape = (end - start, batch_size)
upper: array-like (device) (optional): Upper limit of the prediction interval if level != None. Shape = (end - start, batch_size)

Examples

from cuml.tsa.arima import ARIMA

model = ARIMA(ys, order=(1,1,1))
model.fit()
y_pred = model.predict()

set_fit_params(self, params: Mapping[str, object], convert_dtype=True)[source]#

Set all the fit parameters. Not to be confused with set_params. Note: unpack() can be used to load a compact vector of the parameters.

Parameters:

params: Mapping[str, array-like]: A dictionary of parameter names and associated arrays. The key names are in {“mu”, “ar”, “ma”, “sar”, “sma”, “sigma2”}. The shape of the arrays is (batch_size,) for mu and sigma2 and (n, batch_size) for any other type, where n is the corresponding number of parameters of this type.

set_params(self, **params)[source]#: ARIMA is unable to be cloned at this time. The methods _get_param_names(), get_params() and set_params() will raise NotImplementedError.

unpack(self, x: list | np.ndarray, convert_dtype=True)[source]#

Unpack linearized parameter vector x into the separate parameter arrays of the model

Parameters:

xarray-like: Packed parameter array, grouped by series. Shape: (n_params * batch_size,)

class cuml.tsa.auto_arima.AutoARIMA(endog, *, handle=None, simple_differencing=True, verbose=False, output_type=None, convert_dtype=True)#

Implements a batched auto-ARIMA model for in- and out-of-sample time-series prediction.

This interface offers a highly customizable search, with functionality similar to the forecast and fable packages in R. It provides an abstraction around the underlying ARIMA models to predict and forecast as if using a single model.

Parameters:

endogdataframe or array-like (device or host): The time series data, assumed to have each time series in columns. Acceptable formats: cuDF DataFrame, cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
simple_differencing: bool or int, default=True: If True, the data is differenced before being passed to the Kalman filter. If False, differencing is part of the state-space model. See additional notes in the ARIMA docs.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
convert_dtypeboolean: When set to True, the model will automatically convert the inputs to np.float64.

Attributes:

d_y

Methods

`fit`(self[, h, maxiter, method, truncate])	Fits the selected models for their respective series
`forecast`(self, nsteps[, level])	Forecast `nsteps` into the future.
`predict`(self[, start, end, level])	Compute in-sample and/or out-of-sample prediction for each series
`search`(self[, s, d, D, p, q, P, Q, ...])	Searches through the specified model space and associates each series to the most appropriate model.
`summary`(self)	Display a quick summary of the models selected by `search`

Notes

The interface was influenced by the R fable package. See https://fable.tidyverts.org/reference/ARIMA.html

References

A useful (though outdated) reference is the paper:

[1] Rob J. Hyndman, Yeasmin Khandakar, 2008. “Automatic Time Series Forecasting: The ‘forecast’ Package for R”, Journal of Statistical Software 27

Examples

from cuml.tsa.auto_arima import AutoARIMA

model = AutoARIMA(y)
model.search(s=12, d=(0, 1), D=(0, 1), p=(0, 2, 4), q=(0, 2, 4),
             P=range(2), Q=range(2), method="css", truncate=100)
model.fit(method="css-ml")
fc = model.forecast(20)

fit(self, h: float = 1e-8, maxiter: int = 1000, method='ml', truncate: int = 0)[source]#

Fits the selected models for their respective series

Parameters:

hfloat: Finite-differencing step size used to compute gradients in ARIMA
maxiterint: Maximum number of iterations of L-BFGS-B
methodstr: Estimation method - “css”, “css-ml” or “ml”. CSS uses a fast sum-of-squares approximation. ML estimates the log-likelihood with statespace methods. CSS-ML starts with CSS and refines with ML.
truncateint: When using CSS, start the sum of squares after a given number of observations for better performance (but often a worse fit)

forecast(self, nsteps: int, level=None) → Union[CumlArray, Tuple[CumlArray, CumlArray, CumlArray]][source]#

Forecast nsteps into the future.

Parameters:

nstepsint: The number of steps to forecast beyond end of the given series
level: float or None (default = None): Confidence level for prediction intervals, or None to return only the point forecasts. 0 < level < 1

Returns:

y_fcarray-like: Forecasts. Shape = (nsteps, batch_size)
lower: array-like (device) (optional): Lower limit of the prediction interval if level != None. Shape = (nsteps, batch_size)
upper: array-like (device) (optional): Upper limit of the prediction interval if level != None. Shape = (nsteps, batch_size)

predict(self, start=0, end=None, level=None) → Union[CumlArray, Tuple[CumlArray, CumlArray, CumlArray]][source]#

Compute in-sample and/or out-of-sample prediction for each series

Parameters:

start: int: Index where to start the predictions (0 <= start <= num_samples)
end: int (default = None): Index where to end the predictions, excluded (end > start)
level: float or None (default = None): Confidence level for prediction intervals, or None to return only the point forecasts. 0 < level < 1

Returns:

y_parray-like (device): Predictions. Shape = (end - start, batch_size)
lower: array-like (device) (optional): Lower limit of the prediction interval if level != None. Shape = (end - start, batch_size)
upper: array-like (device) (optional): Upper limit of the prediction interval if level != None. Shape = (end - start, batch_size)

search(self, s=None, d=range(3), D=range(2), p=range(1, 4), q=range(1, 4), P=range(3), Q=range(3), fit_intercept='auto', ic='aicc', test='kpss', seasonal_test='seas', h: float = 1e-8, maxiter: int = 1000, method='auto', truncate: int = 0)[source]#

Searches through the specified model space and associates each series to the most appropriate model.

Parameters:

sint: Seasonal period. None or 0 for non-seasonal time series
dint, sequence or generator: Possible values for d (simple difference)
Dint, sequence or generator: Possible values for D (seasonal difference)
pint, sequence or generator: Possible values for p (AR order)
qint, sequence or generator: Possible values for q (MA order)
Pint, sequence or generator: Possible values for P (seasonal AR order)
Qint, sequence or generator: Possible values for Q (seasonal MA order)
fit_interceptint, sequence, generator or “auto”: Whether to fit an intercept. “auto” chooses based on the model parameters: it uses an intercept if and only if d + D <= 1
icstr: Which information criterion to use for the model selection. Currently supported: AIC, AICc, BIC
teststr: Which stationarity test to use to choose d. Currently supported: KPSS
seasonal_teststr: Which seasonality test to use to choose D. Currently supported: seas
hfloat: Finite-differencing step size used to compute gradients in ARIMA
maxiterint: Maximum number of iterations of L-BFGS-B
methodstr: Estimation method - “auto”, “css”, “css-ml” or “ml”. CSS uses a fast sum-of-squares approximation. ML estimates the log-likelihood with statespace methods. CSS-ML starts with CSS and refines with ML. “auto” will use CSS for long seasonal time series, ML otherwise.
truncateint: When using CSS, start the sum of squares after a given number of observations for better performance. Recommended for long time series when truncating doesn’t lose too much information.
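
The size of the model space defined by these parameters can be estimated with a plain-Python sketch. It assumes, as the parameter descriptions suggest, that d and D are fixed per series by the stationarity and seasonality tests and that the remaining orders are enumerated exhaustively; the helper names `as_candidates` and `candidate_models` are hypothetical and this is not cuML's search code.

```python
from itertools import product


def as_candidates(value):
    """Accept a single int or any iterable of ints."""
    return [value] if isinstance(value, int) else list(value)


def candidate_models(d, D, p=range(1, 4), q=range(1, 4),
                     P=range(3), Q=range(3), fit_intercept="auto"):
    """Enumerate (p, q, P, Q, k) candidates for already chosen d and D."""
    models = []
    grids = [as_candidates(v) for v in (p, q, P, Q)]
    for pi, qi, Pi, Qi in product(*grids):
        if fit_intercept == "auto":
            ks = [int(d + D <= 1)]  # documented rule for "auto"
        else:
            ks = as_candidates(fit_intercept)
        for k in ks:
            models.append((pi, qi, Pi, Qi, int(k)))
    return models


print(len(candidate_models(d=1, D=0)))
```

With the default ranges this yields 3 * 3 * 3 * 3 = 81 candidates per (d, D) pair, each of which would then be scored with the chosen information criterion.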

summary(self)[source]#: Display a quick summary of the models selected by search

Model Explainability#

SHAP Kernel Explainer#

class cuml.explainer.KernelExplainer(*, model, data, nsamples='auto', link='identity', verbose=False, random_state=None, is_gpu_model=None, handle=None, dtype=None, output_type=None)#

GPU accelerated version of SHAP’s kernel explainer.

cuML’s SHAP based explainers accelerate the algorithmic part of SHAP. They are optimized to be used with fast GPU based models, like those in cuML. By creating the datasets and performing the internal calculations on the GPU, while minimizing data copies and transfers, they can accelerate explanations significantly. They can also be used with CPU based models, where speedups can still be achieved, although these can be capped by factors like data transfers and the speed of the models.

KernelExplainer is based on the Python SHAP package’s KernelExplainer class: slundberg/shap

Current characteristics of the GPU version:

Unlike the SHAP package, nsamples is a parameter at the initialization of the explainer and there is a small initialization time.

Only tabular data is supported for now, via passing the background dataset explicitly.

Sparse data support is planned for the near future.

Further optimizations are in progress. For example, if the background dataset has constant value columns and the observation has the same value in some entries, the number of evaluations of the function can be reduced (this will come in the next version).

Parameters:

modelfunction: Function that takes a matrix of samples (n_samples, n_features) and computes the output for those samples with shape (n_samples). Function must use either CuPy or NumPy arrays as input/output.
dataDense matrix containing floats or doubles.: cuML’s kernel SHAP supports tabular data for now, so it expects a background dataset, as opposed to a shap.masker object. The background dataset to use for integrating out features. To determine the impact of a feature, that feature is set to “missing” and the change in the model output is observed. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
nsamplesint or ‘auto’ (default = ‘auto’): Number of times to re-evaluate the model when explaining each prediction. More samples lead to lower variance estimates of the SHAP values. The “auto” setting uses nsamples = 2 * data.shape[1] + 2048.
linkfunction or str (default = ‘identity’): The link function used to map between the output units of the model and the SHAP value units. From the SHAP package: by default it is identity, but logit can be useful so that expectations are computed in probability units while explanations remain in the (more naturally additive) log-odds units. For more details on how link functions work see any overview of link functions for generalized linear models.
random_state: int, RandomState instance or None (default = None): Seed for the random number generator for dataset creation. Note: due to the design of the sampling algorithm, concurrency can affect results, so fully deterministic execution is currently not guaranteed.
is_gpu_modelbool or None (default = None): If None, the explainer will try to infer whether model can take GPU data (as CuPy arrays); otherwise it will use NumPy arrays to call model. Set to True to force the explainer to use GPU data, set to False to force the explainer to use NumPy data.
handlepylibraft.common.handle (default = None): Specifies the handle that holds internal CUDA state for computations in this model. A new one is created if it is None. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams.
dtypenp.float32 or np.float64 (default = None): Parameter to specify the precision of data to generate to call the model. If not specified, the explainer will try to get the dtype of the model, if it cannot be queried, then it will default to np.float32.
output_type‘cupy’ or ‘numpy’ (default = ‘numpy’): Parameter to specify the type of data to output. If not specified, the explainer will default to ‘numpy’ for the time being to improve compatibility.
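
The "auto" rule for nsamples documented above is simple enough to state as code. The helper name `resolve_nsamples` is hypothetical and the snippet is a plain-Python restatement of the documented formula, not cuML's internal logic.

```python
def resolve_nsamples(nsamples, n_features):
    """Return the number of model evaluations per explained row."""
    if nsamples == "auto":
        # Documented rule: twice the number of features plus 2048.
        return 2 * n_features + 2048
    return int(nsamples)


print(resolve_nsamples("auto", n_features=10))
```

For the ten-feature background dataset used in the example below, the automatic setting corresponds to 2068 evaluations per explained row.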

Methods

shap_values(self, X[, l1_reg, as_list])

Interface to estimate the SHAP values for a set of samples.

Examples

>>> from cuml import SVR
>>> from cuml import make_regression
>>> from cuml import train_test_split
>>>
>>> from cuml.explainer import KernelExplainer
>>>
>>> X, y = make_regression(
...     n_samples=102,
...     n_features=10,
...     noise=0.1,
...     random_state=42)
>>>
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X,
...     y,
...     test_size=2,
...     random_state=42)
>>>
>>> model = SVR().fit(X_train, y_train)
>>>
>>> cu_explainer = KernelExplainer(
...     model=model.predict,
...     data=X_train,
...     is_gpu_model=True,
...     random_state=42)
>>>
>>> cu_shap_values = cu_explainer.shap_values(X_test)
>>> cu_shap_values
array([[-0.41163236, -0.29839307, -0.31082764, -0.21910861, 0.20798518,
      1.525831  , -0.07726735, -0.23897147, -0.5901833 , -0.03319931],
    [-0.37491834, -0.22581327, -1.2146976 ,  0.03793442, -0.24420738,
      -0.4875331 , -0.05438256, 0.16568947, -1.9978098 , -0.19110584]],
    dtype=float32)

shap_values(self, X, l1_reg='auto', as_list=True)[source]#

Interface to estimate the SHAP values for a set of samples. Corresponds to the SHAP package’s legacy interface, and is our main API currently.

Parameters:

XDense matrix containing floats or doubles.: Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
l1_regstr (default: ‘auto’): The l1 regularization to use for feature selection.
as_listbool (default = True): Set to True to return a list of arrays for multi-dimensional models (like predict_proba functions) to match the SHAP package behavior. Set to False to return them as an array of arrays.

Returns:

shap_valuesarray or list

SHAP Permutation Explainer#

class cuml.explainer.PermutationExplainer(*, model, data, masker_type='independent', link='identity', handle=None, is_gpu_model=None, random_state=None, dtype=None, output_type=None, verbose=False)#

GPU accelerated version of SHAP’s PermutationExplainer

PermutationExplainer is algorithmically similar to, and based on, the Python SHAP package’s PermutationExplainer: slundberg/shap

This method approximates the Shapley values by iterating through permutations of the inputs. From the SHAP library docs: it guarantees local accuracy (additivity) by iterating completely through entire permutations of the features in both forward and reverse directions.
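
The forward and reverse passes described above can be sketched for a single sample and a single background row in plain Python. This is an illustrative reading of the technique, not cuML's or SHAP's implementation; the function name `permutation_shap` and its arguments are hypothetical.

```python
import random


def permutation_shap(f, x, background, npermutations=10, seed=0):
    """Estimate Shapley values by walking permutations in both directions."""
    rng = random.Random(seed)
    n = len(x)
    phi = [0.0] * n
    for _ in range(npermutations):
        perm = list(range(n))
        rng.shuffle(perm)
        # Forward pass, then the same permutation reversed.
        for order in (perm, perm[::-1]):
            current = list(background)
            prev = f(current)
            for j in order:
                current[j] = x[j]    # switch feature j from background to x
                value = f(current)
                phi[j] += value - prev
                prev = value
    return [v / (2 * npermutations) for v in phi]


def model(v):
    return 2.0 * v[0] + v[1] * v[2]


phi = permutation_shap(model, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
print(phi, sum(phi))
```

Because every pass walks a complete permutation, the contributions telescope and the values sum to model(x) minus model(background), which is the local accuracy property quoted above.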

Current characteristics of the GPU version:

Only tabular data is supported for now, via passing the background dataset explicitly.

Hierarchical clustering for Owen values is planned for the near future.

Sparse data support is planned for the near future.

Setting the random seed:

This explainer uses CuPy to generate the permutations that are used, so to have reproducible results use CuPy’s seeding mechanism.

Parameters:

modelfunction: A callable python object that executes the model given a set of input data samples.
maskerDense matrix containing floats or doubles.: cuML’s permutation SHAP supports tabular data for now, so it expects a background dataset, as opposed to a shap.masker object. To respect a hierarchical structure of the data, use the (temporary) parameter masker_type Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
masker_type: {‘independent’, ‘partition’} default = ‘independent’: If ‘independent’ is used, then this is equivalent to SHAP’s independent masker and the algorithm is fully GPU accelerated. If ‘partition’ then it is equivalent to SHAP’s Partition masker, which respects a hierarchical structure in the background data.
linkfunction or str (default = ‘identity’): The link function used to map between the output units of the model and the SHAP value units. From the SHAP package: The link function used to map between the output units of the model and the SHAP value units. By default it is identity, but logit can be useful so that expectations are computed in probability units while explanations remain in the (more naturally additive) log-odds units. For more details on how link functions work see any overview of link functions for generalized linear models.
gpu_model: bool or None (default = None): If None, the explainer will try to infer whether the model can take GPU data (as CuPy arrays); otherwise it will use NumPy arrays to call the model. Set to True to force the explainer to use GPU data, or to False to force it to use NumPy data.
handlepylibraft.common.handle (default = None): Specifies the handle that holds internal CUDA state for computations in this model, a new one is created if it is None. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams.
dtype: np.float32 or np.float64 (default = None): Precision of the data generated to call the model. If not specified, the explainer tries to get the dtype of the model; if it cannot be queried, it defaults to np.float32.
output_type‘cupy’ or ‘numpy’ (default = ‘numpy’): Parameter to specify the type of data to output. If not specified, the explainer will default to ‘numpy’ for the time being to improve compatibility.
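As a rough illustration of the `link` parameter, the sketch below compares the identity link with a logit link on probability outputs. The function names `identity` and `logit` are defined here for the example only; how cuML applies the link internally is not shown in this reference.

```python
import math


def identity(value):
    # Identity link: SHAP values stay in the model's output units.
    return value


def logit(probability):
    # Logit link: maps a probability in (0, 1) to log-odds.
    return math.log(probability / (1.0 - probability))


model_outputs = [0.5, 0.9]
in_output_units = [identity(p) for p in model_outputs]
in_log_odds = [logit(p) for p in model_outputs]
print(in_output_units, in_log_odds)
```

With a logit link, explanations are expressed in log-odds, where contributions add more naturally than they do in probability units.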

Methods

shap_values(self, X[, npermutations, as_list])

Interface to estimate the SHAP values for a set of samples.

Examples

>>> from cuml import SVR
>>> from cuml import make_regression
>>> from cuml import train_test_split

>>> from cuml.explainer import PermutationExplainer

>>> X, y = make_regression(
...     n_samples=102,
...     n_features=10,
...     noise=0.1,
...     random_state=42)
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X,
...     y,
...     test_size=2,
...     random_state=42)
>>> model = SVR().fit(X_train, y_train)

>>> cu_explainer = PermutationExplainer(
...     model=model.predict,
...     data=X_train,
...     random_state=42)

>>> cu_shap_values = cu_explainer.shap_values(X_test)
>>> cu_shap_values
array([[ 0.16611198, 0.74156773, 0.05906528,  0.30015892, 2.5425286 ,
        0.0970122 , 0.12258395, 2.1998262 , -0.02968234, -0.8669155 ],
    [-0.10587756,  0.77705824, -0.08259875, -0.71874434,  1.781551  ,
        -0.05454511, 0.11826539, -1.1734306 , -0.09629871, 0.4571011]],
    dtype=float32)

shap_values(self, X, npermutations=10, as_list=True, **kwargs)[source]#

Interface to estimate the SHAP values for a set of samples. Corresponds to the SHAP package’s legacy interface, and is our main API currently.

Parameters:

XDense matrix containing floats or doubles.: Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
npermutationsint (default = 10): Number of times to cycle through all the features, re-evaluating the model at each step. Each cycle evaluates the model function 2 * (# features + 1) times on a data matrix of (# background data samples) rows. An exception to this is when PermutationExplainer can avoid evaluating the model because a feature’s value is the same in X and the background dataset (which is common for example with sparse features).
as_listbool (default = True): Set to True to return a list of arrays for multi-dimensional models (like predict_proba functions) to match the SHAP package shap_values API behavior. Set to False to return them as an array of arrays.

Returns:

shap_valuesarray or list
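The cost stated for `npermutations` can be turned into a quick estimate. The helper below is a hypothetical convenience, not part of cuML; it assumes the stated count applies to each explained row and ignores the shortcut for features whose values match the background.

```python
def model_evaluations(n_features, npermutations=10):
    # Each cycle calls the model 2 * (n_features + 1) times.
    return npermutations * 2 * (n_features + 1)


def rows_evaluated(n_features, n_background, npermutations=10):
    # Every call runs on a matrix with one row per background sample.
    return model_evaluations(n_features, npermutations) * n_background


calls = model_evaluations(n_features=10)
rows = rows_evaluated(n_features=10, n_background=100)
print(calls, rows)
```

This gives an upper bound that can help when choosing `npermutations` for an expensive model.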

Multi-Node, Multi-GPU Algorithms#

DBSCAN Clustering#

class cuml.dask.cluster.DBSCAN(*, client=None, verbose=False, **kwargs)[source]#

Multi-Node Multi-GPU implementation of DBSCAN.

The whole dataset is copied to all the workers but the work is then divided by giving “ownership” of a subset to each worker: each worker computes a clustering by considering the relationships between those points and the rest of the dataset, and partial results are merged at the end to obtain the final clustering.
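The idea of giving each worker "ownership" of a subset can be pictured as splitting the sample indices into contiguous ranges. The sketch below is an assumption about one possible split, written only to illustrate the concept; the helper `ownership_ranges` is hypothetical and the actual partitioning used by cuML is not described here.

```python
def ownership_ranges(n_samples, n_workers):
    # Split range(n_samples) into n_workers contiguous, near-equal ranges.
    base, extra = divmod(n_samples, n_workers)
    ranges = []
    start = 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


ranges = ownership_ranges(n_samples=10, n_workers=3)
print(ranges)
```

Each worker would then compare its owned points against the full dataset, and the partial clusterings would be merged afterwards.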

Parameters:

clientdask.distributed.Client: Dask client to use
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
min_samples: int (default = 5): The number of samples in a neighborhood (including the point itself) required for a point to be considered a core point.
max_mbytes_per_batch: (optional) int64: Calculate batch size using no more than this number of megabytes for the pairwise distance computation. This enables a trade-off between runtime and memory usage, making the N^2 pairwise distance computations more tractable for large numbers of samples. If you are experiencing out-of-memory errors when running DBSCAN, you can set this value based on the memory size of your device. Note: this option does not set the maximum total memory used in the DBSCAN computation, so it cannot be set to the total memory available on the device.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
calc_core_sample_indices: (optional) boolean (default = True): Indicates whether the indices of the core samples should be calculated. If the attribute core_sample_indices_ will not be used, setting this to False avoids unnecessary kernel launches.

Methods

`fit`(X[, out_dtype])	Fit a multi-node multi-GPU DBSCAN model
`fit_predict`(X[, out_dtype])	Performs clustering on X and returns cluster labels.

Notes

For additional docs, see the documentation of the single-GPU DBSCAN model

fit(X, out_dtype='int32')[source]#

Fit a multi-node multi-GPU DBSCAN model

Parameters:

Xarray-like (device or host): Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
out_dtype: dtype (default = “int32”): Determines the precision of the output labels array. Valid values are {“int32”, np.int32, “int64”, np.int64}.

fit_predict(X, out_dtype='int32')[source]#

Performs clustering on X and returns cluster labels.

Parameters:

Xarray-like (device or host): Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
out_dtype: dtype (default = “int32”): Determines the precision of the output labels array. Valid values are {“int32”, np.int32, “int64”, np.int64}.

Returns:

labels: array-like (device or host): Integer array of labels

K-Means Clustering#

class cuml.dask.cluster.KMeans(*, client=None, verbose=False, **kwargs)[source]#

Multi-Node Multi-GPU implementation of KMeans.

This version minimizes data transfer by sharing only the centroids between workers in each iteration.

Predictions are embarrassingly parallel and use cuML’s single-GPU version.

For more information on this implementation, refer to the documentation for single-GPU K-Means.
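To illustrate why sharing only centroids keeps data transfer small, here is a plain-Python sketch of one distributed Lloyd iteration: each worker reduces its local points to per-cluster sums and counts, and only those small summaries are combined. The helpers `local_stats` and `merge_and_update` are hypothetical and do not reflect cuML's internal code.

```python
def local_stats(points, centroids):
    # Runs on one worker: assign local points and accumulate sums and counts.
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in points:
        nearest = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
        )
        counts[nearest] += 1
        for j in range(dim):
            sums[nearest][j] += point[j]
    return sums, counts


def merge_and_update(partials, centroids):
    # Combines the per-worker summaries into new centroids.
    k, dim = len(centroids), len(centroids[0])
    updated = []
    for c in range(k):
        total = sum(counts[c] for _, counts in partials)
        if total == 0:
            updated.append(list(centroids[c]))  # keep empty clusters unchanged
            continue
        updated.append(
            [sum(sums[c][j] for sums, _ in partials) / total for j in range(dim)]
        )
    return updated


worker_a = [[0.0, 0.0], [0.0, 2.0]]
worker_b = [[10.0, 10.0], [10.0, 12.0]]
centroids = [[1.0, 1.0], [9.0, 9.0]]
partials = [local_stats(worker_a, centroids), local_stats(worker_b, centroids)]
new_centroids = merge_and_update(partials, centroids)
print(new_centroids)
```

Only `k` sums and counts leave each worker per iteration, regardless of how many rows the worker holds.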

Parameters:

handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
n_clustersint (default = 8): The number of centroids or clusters you want.
max_iter: int (default = 300): The maximum number of EM iterations. More iterations can give a more accurate result but take longer.
tolfloat (default = 1e-4): Stopping criterion when centroid means do not change much.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
random_stateint or None (default = None): If you want results to be the same when you restart Python, select a state.
init: {‘scalable-k-means++’, ‘k-means||’, ‘random’ or an ndarray} (default = ‘scalable-k-means++’): ‘scalable-k-means++’ or ‘k-means||’: Uses fast and stable scalable k-means++ initialization. ‘random’: Chooses n_clusters observations (rows) at random from the data for the initial centroids. If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
oversampling_factor: int (default = 2): The number of points to sample in scalable k-means++ initialization for potential centroids. Increasing this value can lead to better initial centroids at the cost of memory. The total number of centroids sampled in scalable k-means++ is oversampling_factor * n_clusters * 8.
max_samples_per_batch: int (default = 32768): The number of data samples to use for batches of the pairwise distance computation. This computation is done during both fit and predict. The default should suit most cases. The total number of elements in the batched pairwise distance computation is max_samples_per_batch * n_clusters. It might become necessary to lower this number when n_clusters becomes prohibitively large.
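The two size formulas given for `oversampling_factor` and `max_samples_per_batch` can be evaluated directly. The helper `kmeans_sizes` is a hypothetical convenience written for this example.

```python
def kmeans_sizes(n_clusters=8, oversampling_factor=2, max_samples_per_batch=32768):
    # Candidate centroids sampled during scalable k-means++ initialization.
    sampled_centroids = oversampling_factor * n_clusters * 8
    # Elements in one batch of the pairwise distance computation.
    batch_elements = max_samples_per_batch * n_clusters
    return sampled_centroids, batch_elements


sampled_centroids, batch_elements = kmeans_sizes()
print(sampled_centroids, batch_elements)
```

Because the batch size scales with n_clusters, a large cluster count may call for a smaller `max_samples_per_batch`.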

Attributes:

cluster_centers_: cuDF DataFrame or CuPy ndarray: The coordinates of the final clusters. This represents the “mean” of each data cluster.

Methods

`fit`(X[, sample_weight])	Fit a multi-node multi-GPU KMeans model
`fit_predict`(X[, sample_weight, delayed])	Compute cluster centers and predict cluster index for each sample.
`fit_transform`(X[, sample_weight, delayed])	Calls fit followed by transform using a distributed KMeans model
`predict`(X[, delayed])	Predict labels for the input
`score`(X[, sample_weight])	Computes the inertia score for the trained KMeans centroids.
`transform`(X[, delayed])	Transforms the input into the learned centroid space

fit(X, sample_weight=None)[source]#

Fit a multi-node multi-GPU KMeans model

Parameters:

X: Dask cuDF DataFrame or CuPy backed Dask Array: Training data to cluster.
sample_weight: Dask cuDF DataFrame or CuPy backed Dask Array, shape = (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, CUDA array interface compliant arrays like CuPy.

fit_predict(X, sample_weight=None, delayed=True)[source]#

Compute cluster centers and predict cluster index for each sample.

Parameters:

XDask cuDF DataFrame or CuPy backed Dask Array: Data to predict

Returns:

result: Dask cuDF DataFrame or CuPy backed Dask Array: Distributed object containing predictions

fit_transform(X, sample_weight=None, delayed=True)[source]#

Calls fit followed by transform using a distributed KMeans model

Parameters:

X: Dask cuDF DataFrame or CuPy backed Dask Array: Data to fit and transform
delayedbool (default = True): Whether to execute as a delayed task or eager.

Returns:

result: Dask cuDF DataFrame or CuPy backed Dask Array: Distributed object containing the transformed data

predict(X, delayed=True)[source]#

Predict labels for the input

Parameters:

XDask cuDF DataFrame or CuPy backed Dask Array: Data to predict
delayedbool (default = True): Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.

Returns:

result: Dask cuDF DataFrame or CuPy backed Dask Array: Distributed object containing predictions

score(X, sample_weight=None)[source]#

Computes the inertia score for the trained KMeans centroids.

Parameters:

X: dask_cudf.DataFrame: DataFrame to compute the score for

Returns:

Inertia score

transform(X, delayed=True)[source]#

Transforms the input into the learned centroid space

Parameters:

X: Dask cuDF DataFrame or CuPy backed Dask Array: Data to transform
delayedbool (default = True): Whether to execute as a delayed task or eager.

Returns:

result: Dask cuDF DataFrame or CuPy backed Dask Array: Distributed object containing the transformed data

Nearest Neighbors#

class cuml.dask.neighbors.NearestNeighbors(*, client=None, streams_per_handle=0, **kwargs)[source]#

Multi-node Multi-GPU NearestNeighbors Model.

Parameters:

n_neighborsint (default=5): Default number of neighbors to query
batch_size: int (optional, default 2000000): Maximum number of query rows processed at once. This parameter can greatly affect the throughput of the algorithm. The optimal setting of this value will vary for different layouts and index to query ratios, but it will require batch_size * n_features * 4 bytes of additional memory on each worker hosting index partitions.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
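The extra memory implied by `batch_size` follows from the formula batch_size * n_features * 4 bytes. The helper below is a hypothetical calculator for that formula; the 64-feature example is an arbitrary assumption.

```python
def batch_memory_bytes(n_features, batch_size=2_000_000):
    # Additional memory per worker hosting index partitions, in bytes.
    return batch_size * n_features * 4


extra_bytes = batch_memory_bytes(n_features=64)
print(f"{extra_bytes / 1e6:.0f} MB")
```

If that figure is too large for the workers in use, a smaller `batch_size` reduces it proportionally.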

Methods

`fit`(X)	Fit a multi-node multi-GPU Nearest Neighbors index
`get_neighbors`(n_neighbors)	Returns the default n_neighbors, initialized from the constructor, if n_neighbors is None.
`kneighbors`([X, n_neighbors, ...])	Query the distributed nearest neighbors index

fit(X)[source]#

Fit a multi-node multi-GPU Nearest Neighbors index

Parameters:

Xdask_cudf.Dataframe

Returns:

self: NearestNeighbors model

get_neighbors(n_neighbors)[source]#

Returns the default n_neighbors, initialized from the constructor, if n_neighbors is None.

Parameters:

n_neighborsint: Number of neighbors

Returns:

n_neighbors: int: Default n_neighbors if parameter n_neighbors is None

kneighbors(X=None, n_neighbors=None, return_distance=True, _return_futures=False)[source]#

Query the distributed nearest neighbors index

Parameters:

Xdask_cudf.Dataframe: Vectors to query. If not provided, neighbors of each indexed point are returned.
n_neighborsint: Number of neighbors to query for each row in X. If not provided, the n_neighbors on the model are used.
return_distanceboolean (default=True): If false, only indices are returned

Returns:

rettuple (dask_cudf.DataFrame, dask_cudf.DataFrame): First dask-cuDF DataFrame contains distances, second contains the indices.

class cuml.dask.neighbors.KNeighborsRegressor(*, client=None, streams_per_handle=0, verbose=False, **kwargs)[source]#

Multi-node Multi-GPU K-Nearest Neighbors Regressor Model.

K-Nearest Neighbors Regressor is an instance-based learning technique that keeps training samples around for prediction, rather than trying to learn a generalizable set of model parameters.
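Instance-based prediction can be sketched with a brute-force search over stored samples. This is a single-process illustration with uniform neighbor weighting (an assumption), not the distributed cuML implementation; `knn_regress` is a hypothetical helper.

```python
def knn_regress(train_X, train_y, query, n_neighbors=5):
    # Squared Euclidean distance from the query to every stored sample.
    scored = [
        (sum((a - b) ** 2 for a, b in zip(row, query)), target)
        for row, target in zip(train_X, train_y)
    ]
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:n_neighbors]
    # Predict the mean target of the nearest stored samples.
    return sum(target for _, target in nearest) / len(nearest)


train_X = [[0.0], [1.0], [2.0], [10.0]]
train_y = [0.0, 1.0, 2.0, 10.0]
prediction = knn_regress(train_X, train_y, query=[1.1], n_neighbors=3)
print(prediction)
```

No parameters are learned during "fitting"; the stored samples themselves are the model.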

Parameters:

n_neighborsint (default=5): Default number of neighbors to query
batch_size: int (optional, default 2000000): Maximum number of query rows processed at once. This parameter can greatly affect the throughput of the algorithm. The optimal setting of this value will vary for different layouts and index to query ratios, but it will require batch_size * n_features * 4 bytes of additional memory on each worker hosting index partitions.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Methods

`fit`(X, y)	Fit a multi-node multi-GPU K-Nearest Neighbors Regressor index
`predict`(X[, convert_dtype])	Predict outputs for a query from previously stored index and outputs.
`score`(X, y)	Provide score by comparing predictions and ground truth.

fit(X, y)[source]#

Fit a multi-node multi-GPU K-Nearest Neighbors Regressor index

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Index data. Acceptable formats: dask CuPy/NumPy/Numba Array
yarray-like (device or host) shape = (n_samples, n_features): Index output data. Acceptable formats: dask CuPy/NumPy/Numba Array

Returns:

selfKNeighborsRegressor model

predict(X, convert_dtype=True)[source]#

Predict outputs for a query from previously stored index and outputs. The process is done in a multi-node multi-GPU fashion.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Query data. Acceptable formats: dask cuDF, dask CuPy/NumPy/Numba Array
convert_dtypebool, optional (default = True): When set to True, the predict method will automatically convert the data to the right formats.

Returns:

predictionsDask futures or Dask CuPy Arrays

score(X, y)[source]#

Provide score by comparing predictions and ground truth.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Query test data. Acceptable formats: dask CuPy/NumPy/Numba Array
yarray-like (device or host) shape = (n_samples, n_features): Outputs test data. Acceptable formats: dask CuPy/NumPy/Numba Array

Returns:

score

class cuml.dask.neighbors.KNeighborsClassifier(*, client=None, streams_per_handle=0, verbose=False, **kwargs)[source]#

Multi-node Multi-GPU K-Nearest Neighbors Classifier Model.

K-Nearest Neighbors Classifier is an instance-based learning technique that keeps training samples around for prediction, rather than trying to learn a generalizable set of model parameters.

Parameters:

n_neighborsint (default=5): Default number of neighbors to query
batch_size: int (optional, default 2000000): Maximum number of query rows processed at once. This parameter can greatly affect the throughput of the algorithm. The optimal setting of this value will vary for different layouts and index to query ratios, but it will require batch_size * n_features * 4 bytes of additional memory on each worker hosting index partitions.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Methods

`fit`(X, y)	Fit a multi-node multi-GPU K-Nearest Neighbors Classifier index
`predict`(X[, convert_dtype])	Predict labels for a query from previously stored index and index labels.
`predict_proba`(X[, convert_dtype])	Predict class probabilities for a query from previously stored index and index labels.
`score`(X, y[, convert_dtype])	Provide score by comparing predictions and ground truth.

fit(X, y)[source]#

Fit a multi-node multi-GPU K-Nearest Neighbors Classifier index

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Index data. Acceptable formats: dask CuPy/NumPy/Numba Array
yarray-like (device or host) shape = (n_samples, n_features): Index labels data. Acceptable formats: dask CuPy/NumPy/Numba Array

Returns:

selfKNeighborsClassifier model

predict(X, convert_dtype=True)[source]#

Predict labels for a query from previously stored index and index labels. The process is done in a multi-node multi-GPU fashion.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Query data. Acceptable formats: dask cuDF, dask CuPy/NumPy/Numba Array
convert_dtypebool, optional (default = True): When set to True, the predict method will automatically convert the data to the right formats.

Returns:

predictionsDask futures or Dask CuPy Arrays

predict_proba(X, convert_dtype=True)[source]#

Predict class probabilities for a query from previously stored index and index labels.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Query data. Acceptable formats: dask cuDF, dask CuPy/NumPy/Numba Array
convert_dtypebool, optional (default = True): When set to True, the predict method will automatically convert the data to the right formats.

Returns:

probabilitiesDask futures or Dask CuPy Arrays

score(X, y, convert_dtype=True)[source]#

Provide score by comparing predictions and ground truth. The process is done in a multi-node multi-GPU fashion.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Query test data. Acceptable formats: dask CuPy/NumPy/Numba Array
yarray-like (device or host) shape = (n_samples, n_features): Labels test data. Acceptable formats: dask CuPy/NumPy/Numba Array

Returns:

score

Principal Component Analysis#

class cuml.dask.decomposition.PCA(*, client=None, verbose=False, **kwargs)[source]#

cuML’s multi-node multi-GPU (MNMG) PCA expects a dask-cuDF object as input and provides 2 algorithms, Full and Jacobi. Full (default) uses a full eigendecomposition and then selects the top K eigenvectors. The Jacobi algorithm can be much faster because it iteratively corrects the top K eigenvectors, but it might be less accurate.

Parameters:

handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
n_componentsint (default = 1): The number of top K singular vectors / values you want. Must be <= number(columns).
svd_solver: ‘full’, ‘jacobi’, ‘auto’: ‘full’: Run exact full SVD and select the components by postprocessing. ‘jacobi’: Iteratively compute the SVD of the covariance matrix. ‘auto’: For compatibility with scikit-learn; alias for ‘jacobi’.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
whitenboolean (default = False): If True, de-correlates the components. This is done by dividing them by the corresponding singular values then multiplying by sqrt(n_samples). Whitening allows each component to have unit variance and removes multi-collinearity. It might be beneficial for downstream tasks like LinearRegression where correlated features cause problems.

Attributes:

components_array: The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X)
explained_variance_array: How much each component explains the variance in the data given by S**2
explained_variance_ratio_array: How much in % the variance is explained given by S**2/sum(S**2)
singular_values_array: The top K singular values. Remember all singular values >= 0
mean_array: The column wise mean of X. Used to mean - center the data first.
noise_variance_float: From Bishop 1999’s Textbook. Used in later tasks like calculating the estimated covariance of X.
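The relation S**2 / sum(S**2) quoted for explained_variance_ratio_ can be checked on a few made-up singular values. The helper `explained_variance_ratio` and the input values are assumptions for illustration; the sum here runs over the values passed in.

```python
def explained_variance_ratio(singular_values):
    # Variance captured by each component is proportional to S**2.
    squared = [s ** 2 for s in singular_values]
    total = sum(squared)
    return [value / total for value in squared]


ratios = explained_variance_ratio([3.0, 2.0, 1.0])
print(ratios)
```

The ratios are fractions that sum to 1 over the supplied singular values; multiply by 100 to read them as percentages.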

Methods

`fit`(X)	Fit the model with X.
`fit_transform`(X)	Fit the model with X and apply the dimensionality reduction on X.
`inverse_transform`(X[, delayed])	Transform data back to its original space.
`transform`(X[, delayed])	Apply dimensionality reduction to X.

Notes

Applications of PCA

PCA is used extensively in practice for data visualization and data compression. It has been used to visualize extremely large word embeddings like Word2Vec and GloVe in 2 or 3 dimensions, to visualize large datasets of everyday objects and images, and to distinguish cancerous cells from healthy cells.

For additional docs, see scikit-learn’s PCA.

Examples

>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client, wait
>>> import cupy as cp
>>> from cuml.dask.decomposition import PCA
>>> from cuml.dask.datasets import make_blobs

>>> cluster = LocalCUDACluster(threads_per_worker=1)
>>> client = Client(cluster)

>>> nrows = 6
>>> ncols = 3
>>> n_parts = 2

>>> X_cudf, _ = make_blobs(n_samples=nrows, n_features=ncols,
...                        centers=1, n_parts=n_parts,
...                        cluster_std=0.01, random_state=10,
...                        dtype=cp.float32)

>>> blobs = X_cudf.compute()
>>> print(blobs)
[[8.688037  3.122401  1.2581943]
 [8.705028  3.1070278 1.2705998]
 [8.70239   3.1102846 1.2716919]
 [8.695665  3.1042147 1.2635932]
 [8.681095  3.0980906 1.2745825]
 [8.705454  3.100002  1.2657361]]

>>> cumlModel = PCA(n_components = 1, whiten=False)
>>> XT = cumlModel.fit_transform(X_cudf)
>>> print(XT.compute())
[[-1.7516235e-02]
 [ 7.8094802e-03]
 [ 4.2757220e-03]
 [-6.7228684e-05]
 [-5.0618490e-03]
 [ 1.0557819e-02]]
>>> client.close()
>>> cluster.close()

fit(X)[source]#

Fit the model with X.

Parameters:

Xdask cuDF input

fit_transform(X)[source]#

Fit the model with X and apply the dimensionality reduction on X.

Parameters:

Xdask cuDF

Returns:

X_newdask cuDF

inverse_transform(X, delayed=True)[source]#

Transform data back to its original space.

In other words, return an input X_original whose transform would be X.

Parameters:

Xdask cuDF

Returns:

X_originaldask cuDF

transform(X, delayed=True)[source]#

Apply dimensionality reduction to X.

X is projected on the first principal components previously extracted from a training set.

Parameters:

Xdask cuDF

Returns:

X_newdask cuDF

Random Forest#

class cuml.dask.ensemble.RandomForestClassifier(*, workers=None, client=None, verbose=False, n_estimators=100, random_state=None, ignore_empty_partitions=False, **kwargs)[source]#

Experimental API implementing a multi-GPU Random Forest classifier model which fits multiple decision tree classifiers in an ensemble. This uses Dask to partition data over multiple GPUs (possibly on different nodes).

Currently, this API makes the following assumptions:

The set of Dask workers used between instantiation, fit, and predict are all consistent
Training data comes in the form of cuDF dataframes or Dask Arrays distributed so that each worker has at least one partition.

Future versions of the API will support more flexible data distribution and additional input types.

The distributed algorithm uses an embarrassingly-parallel approach. For a forest with N trees being built on w workers, each worker simply builds N/w trees on the data it has available locally. In many cases, partitioning the data so that each worker builds trees on a subset of the total dataset works well, but it generally requires the data to be well-shuffled in advance. Alternatively, callers can replicate all of the data across workers so that rf.fit receives w partitions, each containing the same data. This would produce results approximately identical to single-GPU fitting.
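The N/w split can be sketched as follows. How a remainder is distributed when N is not divisible by w is not stated here, so the even spreading below is an assumption, and `trees_per_worker` is a hypothetical helper.

```python
def trees_per_worker(n_estimators, n_workers):
    # Give every worker N // w trees and spread the remainder one by one.
    base, extra = divmod(n_estimators, n_workers)
    return [base + (1 if worker < extra else 0) for worker in range(n_workers)]


allocation = trees_per_worker(n_estimators=100, n_workers=8)
print(allocation)
```

Since each worker only sees its local partition, well-shuffled data matters more as the number of workers grows.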

Please check the single-GPU implementation of Random Forest classifier for more information about the underlying algorithm.

Parameters:

n_estimatorsint (default = 100)

total number of trees in the forest (not per-worker)

handlecuml.Handle

split_criterionint or string (default = 0 ('gini'))

The criterion used to split nodes.

0 or 'gini' for gini impurity

1 or 'entropy' for information gain (entropy)

2 or 'mse' for mean squared error

4 or 'poisson' for poisson half deviance

5 or 'gamma' for gamma half deviance

6 or 'inverse_gaussian' for inverse gaussian deviance

2, 'mse', 4, 'poisson', 5, 'gamma', 6, 'inverse_gaussian' not valid for classification

bootstrapboolean (default = True)

Control bootstrapping.

If True, each tree in the forest is built on a bootstrapped sample with replacement.

If False, the whole dataset is used to build each tree.

max_samplesfloat (default = 1.0)

Ratio of dataset rows used while fitting each tree.

max_depthint (default = 16)

Maximum tree depth. Must be greater than 0. Unlimited depth (i.e., until leaves are pure) is not supported.

Note

This default differs from scikit-learn’s random forest, which defaults to unlimited depth.

max_leavesint (default = -1)

Maximum leaf nodes per tree. Soft constraint. Unlimited if -1.

max_featuresfloat (default = ‘auto’)

Ratio of number of features (columns) to consider per node split.

If type int then max_features is the absolute count of features to be used.

If type float then max_features is a fraction.

If 'auto' then max_features = 1.0 (all n_features are considered).

If 'sqrt' then max_features=1/sqrt(n_features).

If 'log2' then max_features=log2(n_features)/n_features.

If None, then max_features = 1.0.
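The rules above can be collected into one function that resolves `max_features` to a fraction of the columns. This is a reading of the listed rules, not cuML's internal code; `max_features_fraction` is a hypothetical helper.

```python
import math


def max_features_fraction(max_features, n_features):
    # None and 'auto' both mean that every feature is considered.
    if max_features is None or max_features == "auto":
        return 1.0
    if max_features == "sqrt":
        return 1.0 / math.sqrt(n_features)
    if max_features == "log2":
        return math.log2(n_features) / n_features
    if isinstance(max_features, int):
        # An int is an absolute count of features.
        return max_features / n_features
    # A float is already a fraction.
    return float(max_features)


for setting in (None, "auto", "sqrt", "log2", 4, 0.5):
    print(setting, max_features_fraction(setting, n_features=16))
```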

n_binsint (default = 128)

Maximum number of bins used by the split algorithm per feature.

min_samples_leafint or float (default = 1)

The minimum number of samples (rows) in each leaf node.

If type int, then min_samples_leaf represents the minimum number.

If float, then min_samples_leaf represents a fraction and ceil(min_samples_leaf * n_rows) is the minimum number of samples for each leaf node.

min_samples_splitint or float (default = 2)

The minimum number of samples required to split an internal node.

If type int, then min_samples_split represents the minimum number.

If type float, then min_samples_split represents a fraction and ceil(min_samples_split * n_rows) is the minimum number of samples for each split.
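Both `min_samples_leaf` and `min_samples_split` follow the same int-or-fraction rule, which can be sketched as below. The helper `resolve_min_samples` is hypothetical.

```python
import math


def resolve_min_samples(value, n_rows):
    # A float is a fraction of the rows, rounded up; an int is used as-is.
    if isinstance(value, float):
        return math.ceil(value * n_rows)
    return value


leaf_minimum = resolve_min_samples(0.01, n_rows=1050)
split_minimum = resolve_min_samples(2, n_rows=1050)
print(leaf_minimum, split_minimum)
```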

n_streamsint (default = 4)

Number of parallel streams used for forest building

workersoptional, list of strings

Dask addresses of workers to use for computation. If None, all available Dask workers will be used.

random_stateint (default = None)

Seed for the random number generator. Unseeded by default.

ignore_empty_partitions: Boolean (default = False)

Specify behavior when a worker does not hold any data while splitting. When True, it returns the results from workers with data (the number of trained estimators will be less than n_estimators). When False, throws a RuntimeError. This is an experimental parameter, and may be removed in the future.

Methods

`fit`(X, y[, convert_dtype, broadcast_data])	Fit the input data with a Random Forest classifier
`get_params`([deep])	Returns the value of all parameters required to configure this estimator as a dictionary.
`predict`(X[, threshold, convert_dtype, ...])	Predicts the labels for X.
`predict_proba`(X[, delayed])	Predicts the probability of each class for X.
`set_params`(**params)	Sets the value of parameters required to configure this estimator; it functions similarly to sklearn’s set_params.

partial_inference

Examples

For usage examples, please see the RAPIDS notebooks repository: rapidsai/cuml

fit(X, y, convert_dtype=False, broadcast_data=False)[source]#

Fit the input data with a Random Forest classifier

IMPORTANT: X is expected to be partitioned with at least one partition on each Dask worker being used by the forest (self.workers).

If a worker has multiple data partitions, they will be concatenated before fitting, which will lead to additional memory usage. To minimize memory consumption, ensure that each worker has exactly one partition.

When persisting data, you can use cuml.dask.common.utils.persist_across_workers to simplify this:

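import dask_cudf
from cuml.dask.common.utils import persist_across_workers

# One partition per worker keeps memory use low during fit.
X_dask_cudf = dask_cudf.from_cudf(X_cudf, npartitions=n_workers)
y_dask_cudf = dask_cudf.from_cudf(y_cudf, npartitions=n_workers)
X_dask_cudf, y_dask_cudf = persist_across_workers(dask_client,
                                                  [X_dask_cudf,
                                                   y_dask_cudf])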

This is equivalent to calling persist with the data and workers:

X_dask_cudf, y_dask_cudf = dask_client.persist([X_dask_cudf,
                                                y_dask_cudf],
                                               workers={
                                               X_dask_cudf:workers,
                                               y_dask_cudf:workers
                                               })

Parameters:

XDask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features): Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).
yDask cuDF dataframe or CuPy backed Dask Array (n_rows, 1): Labels of training examples. y must be partitioned the same way as X
convert_dtypebool, optional (default = False): When set to True, the fit method will, when necessary, convert y to be of dtype int32. This will increase memory used for the method.
broadcast_data: bool, optional (default = False): When set to True, the whole dataset is broadcast to all workers for training; otherwise each worker trains on its own partition.

get_params(deep=True)[source]#

Returns the value of all parameters required to configure this estimator as a dictionary.

Parameters:

deep : boolean (default = True)

predict(X, threshold=0.5, convert_dtype=True, predict_model='deprecated', layout='depth_first', default_chunk_size=None, align_bytes=None, delayed=True, broadcast_data=False)[source]#

Predicts the labels for X.

Parameters:

X : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features): Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).
threshold : float (default = 0.5): Threshold used for classification.
convert_dtype : bool, optional (default = True): When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
predict_model : string (default = ‘deprecated’): Deprecated since version 25.10: predict_model is deprecated (and ignored) and will be removed in 25.12. The default of predict_model="GPU" should suffice in all situations. When inferring on small datasets you may also want to try setting broadcast_data=True.
layout : string (default = ‘depth_first’): Specifies the in-memory layout of nodes in FIL forests. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.
default_chunk_size : int, optional (default = None): Determines how batches are further subdivided for parallel processing. The optimal value depends on hardware, model, and batch size. If None, the value is determined automatically.
align_bytes : int, optional (default = None): If specified, trees will be padded such that their in-memory size is a multiple of this value. This can improve performance by guaranteeing that memory reads from trees begin on a cache line boundary. Typical values are 0 or 128.
delayed : bool (default = True): Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.
broadcast_data : bool (default = False): If False, the trees are merged into a single model before the workers perform inference on their share of the prediction workload. If True, the trees are not merged; instead, each worker infers on the whole prediction workload using its available trees, and the results are reduced on the client. This may be advantageous when the model is larger than the data used for inference.

Returns:

y : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, 1): The predicted class labels.
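
As a rough illustration of what the threshold parameter controls, the sketch below turns positive-class probabilities into binary labels. It is plain Python rather than cuML code: the helper name apply_threshold is hypothetical, and the strict ">" comparison is an assumption made for illustration, so the library may treat ties differently.

```python
def apply_threshold(positive_probs, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels.

    Assumption: a sample is labeled 1 only when its probability is
    strictly greater than the threshold.
    """
    return [1 if p > threshold else 0 for p in positive_probs]


probs = [0.10, 0.45, 0.55, 0.90]
default_labels = apply_threshold(probs)                 # threshold = 0.5
strict_labels = apply_threshold(probs, threshold=0.8)   # fewer positives
```

Raising the threshold trades recall on the positive class for precision; lowering it does the opposite.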

predict_proba(X, delayed=True, **kwargs)[source]#

Predicts the probability of each class for X.

See documentation of predict for notes on performance.

Parameters:

X : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features): Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).
delayed : bool (default = True): Whether to do a lazy prediction (True) or an eager prediction (False).
**kwargs : dict: Additional predict parameters passed to the underlying model’s predict method. See RandomForestClassifier.predict_proba documentation for a full list.

Returns:

y : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, n_classes)

set_params(**params)[source]#

Sets the values of the parameters required to configure this estimator. It functions similarly to scikit-learn's set_params.

Parameters:

params : dict of new params.

class cuml.dask.ensemble.RandomForestRegressor(*, workers=None, client=None, verbose=False, n_estimators=100, random_state=None, ignore_empty_partitions=False, **kwargs)[source]#

Currently, this API makes the following assumptions:

The set of Dask workers used across instantiation, fit, and predict is consistent.
Training data comes in the form of cuDF dataframes or Dask Arrays, distributed so that each worker has at least one partition.

Future versions of the API will support more flexible data distribution and additional input types. User-facing APIs are expected to change in upcoming versions.

Please check the single-GPU implementation of Random Forest regressor for more information about the underlying algorithm.

Parameters:

n_estimators : int (default = 100)

Total number of trees in the forest (not per-worker).

handle : cuml.Handle

split_criterion : int or string (default = 2 ('mse'))

The criterion used to split nodes.

0 or 'gini' for gini impurity

1 or 'entropy' for information gain (entropy)

2 or 'mse' for mean squared error

4 or 'poisson' for poisson half deviance

5 or 'gamma' for gamma half deviance

6 or 'inverse_gaussian' for inverse gaussian deviance

0 ('gini') and 1 ('entropy') are not valid for regression.

bootstrap : boolean (default = True)

Control bootstrapping.

If True, each tree in the forest is built on a bootstrapped sample with replacement.

If False, the whole dataset is used to build each tree.

max_samples : float (default = 1.0)

Ratio of dataset rows used while fitting each tree.

max_depth : int (default = 16)

Maximum tree depth. Must be greater than 0. Unlimited depth (i.e., until leaves are pure) is not supported.

Note

This default differs from scikit-learn’s random forest, which defaults to unlimited depth.

max_leaves : int (default = -1)

Maximum leaf nodes per tree. Soft constraint. Unlimited if -1.

max_features : int, float, or string (default = ‘auto’)

Ratio of number of features (columns) to consider per node split.

If type int, then max_features is the absolute count of features to be used.

If type float, then max_features is a fraction.

If 'auto', then max_features = 1.0 (all n_features are used).

If 'sqrt', then max_features = 1/sqrt(n_features).

If 'log2', then max_features = log2(n_features)/n_features.

If None, then max_features = 1.0.

n_bins : int (default = 128)

Maximum number of bins used by the split algorithm per feature.

min_samples_leaf : int or float (default = 1)

The minimum number of samples (rows) in each leaf node.

If type int, then min_samples_leaf represents the minimum number.

If type float, then min_samples_leaf represents a fraction and ceil(min_samples_leaf * n_rows) is the minimum number of samples for each leaf node.

min_samples_split : int or float (default = 2)

The minimum number of samples required to split an internal node.

If type int, then min_samples_split represents the minimum number.

If type float, then min_samples_split represents a fraction and ceil(min_samples_split * n_rows) is the minimum number of samples for each split.

accuracy_metric : string (default = ‘deprecated’)

Decides the metric used to evaluate the performance of the model.

for r-squared : 'r2' (default)
for median of abs error : 'median_ae'
for mean of abs error : 'mean_ae'
for mean squared error : 'mse'

n_streams : int (default = 4)

Number of parallel streams used for forest building.

workers : optional, list of strings

Dask addresses of workers to use for computation. If None, all available Dask workers will be used.

random_state : int (default = None)

Seed for the random number generator. Unseeded by default.

ignore_empty_partitions : boolean (default = False)
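
The max_features, min_samples_leaf, and min_samples_split rules listed above reduce to a little arithmetic. The sketch below restates them in plain Python so the resulting numbers are easy to check by hand; the helper names resolve_max_features and resolve_min_samples are hypothetical and are not part of cuML.

```python
import math


def resolve_max_features(max_features, n_features):
    """Return the fraction of features considered at each node split."""
    if max_features is None or max_features == "auto":
        return 1.0
    if max_features == "sqrt":
        return 1.0 / math.sqrt(n_features)
    if max_features == "log2":
        return math.log2(n_features) / n_features
    if isinstance(max_features, int):
        # An int is an absolute feature count; express it as a fraction.
        return max_features / n_features
    return float(max_features)


def resolve_min_samples(value, n_rows):
    """Return the absolute sample count for min_samples_leaf or min_samples_split."""
    if isinstance(value, int):
        return value
    # A float is a fraction of the rows, rounded up.
    return math.ceil(value * n_rows)


sqrt_fraction = resolve_max_features("sqrt", n_features=16)
log2_fraction = resolve_max_features("log2", n_features=16)
leaf_minimum = resolve_min_samples(0.01, n_rows=1050)
```

With 16 features, both 'sqrt' and 'log2' happen to select a quarter of the columns per split; with 1050 rows, a min_samples_leaf of 0.01 rounds up to 11 samples.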

Methods

`fit`(X, y[, convert_dtype, broadcast_data])	Fit the input data with a Random Forest regression model
`get_params`([deep])	Returns the value of all parameters required to configure this estimator as a dictionary.
`predict`(X[, convert_dtype, predict_model, ...])	Predicts the regressor outputs for X.
`set_params`(**params)	Sets the values of the parameters required to configure this estimator; it functions similarly to scikit-learn's set_params.

partial_inference

fit(X, y, convert_dtype=False, broadcast_data=False)[source]#

Fit the input data with a Random Forest regression model

IMPORTANT: X is expected to be partitioned with at least one partition on each Dask worker being used by the forest (self.workers).

When persisting data, you can use cuml.dask.common.utils.persist_across_workers to simplify this:

X_dask_cudf = dask_cudf.from_cudf(X_cudf, npartitions=n_workers)
y_dask_cudf = dask_cudf.from_cudf(y_cudf, npartitions=n_workers)
X_dask_cudf, y_dask_cudf = persist_across_workers(dask_client,
                                                  [X_dask_cudf,
                                                   y_dask_cudf])

This is equivalent to calling persist with the data and workers:

X_dask_cudf, y_dask_cudf = dask_client.persist([X_dask_cudf,
                                                y_dask_cudf],
                                               workers={
                                               X_dask_cudf:workers,
                                               y_dask_cudf:workers
                                               })

Parameters:

X : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features): Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).
y : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, 1): Labels of training examples. y must be partitioned the same way as X.
convert_dtype : bool, optional (default = False): When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.
broadcast_data : bool, optional (default = False): When set to True, the whole dataset is broadcast to all workers for training; otherwise each worker is trained on its own partitions.

get_params(deep=True)[source]#

Returns the value of all parameters required to configure this estimator as a dictionary.

Parameters:

deep : boolean (default = True)

predict(X, convert_dtype=True, predict_model='deprecated', layout='depth_first', default_chunk_size=None, align_bytes=None, delayed=True, broadcast_data=False)[source]#

Predicts the regressor outputs for X.

Parameters:

X : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features): Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).
convert_dtype : bool, optional (default = True): When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
predict_model : string (default = ‘deprecated’): Deprecated since version 25.10: predict_model is deprecated (and ignored) and will be removed in 25.12. The default of predict_model="GPU" should suffice in all situations. When inferring on small datasets you may also want to try setting broadcast_data=True.
layout : string (default = ‘depth_first’): Specifies the in-memory layout of nodes in FIL forests. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.
default_chunk_size : int, optional (default = None): Determines how batches are further subdivided for parallel processing. The optimal value depends on hardware, model, and batch size. If None, the value is determined automatically.
align_bytes : int, optional (default = None): If specified, trees will be padded such that their in-memory size is a multiple of this value. This can improve performance by guaranteeing that memory reads from trees begin on a cache line boundary. Typical values are 0 or 128.
delayed : bool (default = True): Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.
broadcast_data : bool (default = False): If False, the trees are merged into a single model before the workers perform inference on their share of the prediction workload. If True, the trees are not merged; instead, each worker infers on the whole prediction workload using its available trees, and the results are reduced on the client. This may be advantageous when the model is larger than the data used for inference.

Returns:

y : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)
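
To make the broadcast_data=True path more concrete, the sketch below shows one way per-worker regression outputs could be reduced on the client. It is plain Python, not cuML's implementation: the helper reduce_partial_predictions is hypothetical, and weighting each worker's output by its tree count is an assumption chosen so that the result equals the average over all trees in the forest.

```python
def reduce_partial_predictions(partials):
    """Combine per-worker regression outputs into one forest prediction.

    `partials` is a list of (n_trees, predictions) pairs, where each
    `predictions` list holds that worker's average over its own trees.
    """
    total_trees = sum(n_trees for n_trees, _ in partials)
    n_rows = len(partials[0][1])
    combined = []
    for row in range(n_rows):
        # Weight by tree count so every tree contributes equally.
        weighted = sum(n_trees * preds[row] for n_trees, preds in partials)
        combined.append(weighted / total_trees)
    return combined


worker_a = (60, [1.0, 2.0])   # 60 trees, predictions for two rows
worker_b = (40, [2.0, 4.0])   # 40 trees, same two rows
forest_prediction = reduce_partial_predictions([worker_a, worker_b])
```

Because every worker sees the whole input, this path moves the data rather than the trees, which is why it can help when the model is larger than the inference data.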

set_params(**params)[source]#

Sets the values of the parameters required to configure this estimator. It functions similarly to scikit-learn's set_params.

Parameters:

params : dict of new params.

Truncated SVD#

class cuml.dask.decomposition.TruncatedSVD(*, client=None, **kwargs)[source]#

Parameters:

handle : cuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
n_components : int (default = 1): The number of top K singular vectors / values you want. Must be <= the number of columns.
svd_solver : ‘full’, ‘jacobi’: Only the ‘full’ algorithm is supported, since it is significantly faster on GPU than the other solvers, including randomized SVD.
verbose : int or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Attributes:

components_ : array: The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X).
explained_variance_ : array: How much of the variance in the data each component explains, given by S**2.
explained_variance_ratio_ : array: The proportion of the variance explained by each component, given by S**2/sum(S**2).
singular_values_ : array: The top K singular values. All singular values are >= 0.
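
The attribute formulas above can be reproduced on a small host array with NumPy. This is only a sketch of the arithmetic, not of the distributed cuML implementation; the variable names are illustrative, and taking the sum in the ratio over all singular values (not just the top K) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3)).astype(np.float32)
n_components = 2

# Thin SVD: X = U @ diag(S) @ VT, with S sorted in descending order.
U, S, VT = np.linalg.svd(X, full_matrices=False)

components = VT.T[:, :n_components]          # top-K right singular vectors
singular_values = S[:n_components]
explained_variance = singular_values ** 2    # S**2, as in the list above
explained_variance_ratio = explained_variance / np.sum(S ** 2)

# Projecting X onto the components gives the reduced representation.
X_reduced = X @ components
```

Since X @ V_k equals U_k scaled by the top K singular values, X_reduced has one column per retained component.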

Methods

`fit`(X[, _transform])	Fit the model with X.
`fit_transform`(X)	Fit the model with X and apply the dimensionality reduction on X.
`inverse_transform`(X[, delayed])	Transform data back to its original space.
`transform`(X[, delayed])	Apply dimensionality reduction to `X`.

Examples

>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client, wait
>>> import cupy as cp
>>> from cuml.dask.decomposition import TruncatedSVD
>>> from cuml.dask.datasets import make_blobs

>>> cluster = LocalCUDACluster(threads_per_worker=1)
>>> client = Client(cluster)

>>> nrows = 6
>>> ncols = 3
>>> n_parts = 2

>>> X_cudf, _ = make_blobs(n_samples=nrows, n_features=ncols,
...                        centers=1, n_parts=n_parts,
...                        cluster_std=1.8, random_state=10,
...                        dtype=cp.float32)
>>> in_blobs = X_cudf.compute()
>>> print(in_blobs)
[[ 6.953966    6.2313757   0.84974563]
 [10.012338    3.4641726   3.0827546 ]
 [ 9.537406    4.0504313   3.2793145 ]
 [ 8.32713     2.957846    1.8215517 ]
 [ 5.7044296   1.855514    3.7996366 ]
 [10.089077    2.1995444   2.2072687 ]]
>>> cumlModel = TruncatedSVD(n_components = 1)
>>> XT = cumlModel.fit_transform(X_cudf)
>>> result = XT.compute()
>>> print(result)
[[ 8.699628   0.         0.       ]
 [11.018815   0.         0.       ]
 [10.8554535  0.         0.       ]
 [ 9.000192   0.         0.       ]
 [ 6.7628784  0.         0.       ]
 [10.40526    0.         0.       ]]
>>> client.close()
>>> cluster.close()

fit(X, _transform=False)[source]#

Fit the model with X.

Parameters:

X : dask cuDF input

fit_transform(X)[source]#

Fit the model with X and apply the dimensionality reduction on X.

Parameters:

X : dask cuDF

Returns:

X_new : dask cuDF

inverse_transform(X, delayed=True)[source]#

Transform data back to its original space.

In other words, return an input X_original whose transform would be X.

Parameters:

X : dask cuDF

Returns:

X_original : dask cuDF

transform(X, delayed=True)[source]#

Apply dimensionality reduction to X.

X is projected on the first principal components previously extracted from a training set.

Parameters:

X : dask cuDF

Returns:

X_new : dask cuDF

Manifold#

class cuml.dask.manifold.UMAP(*, model, client=None, **kwargs)[source]#

Uniform Manifold Approximation and Projection

Finds a low dimensional embedding of the data that approximates an underlying manifold.

Adapted from lmcinnes/umap umap.py

Methods

transform(X[, convert_dtype])

Transform X into the existing embedded space and return that transformed output.

Notes

This module is heavily based on Leland McInnes’ reference UMAP package [1].

However, there are a number of differences and features that are not yet implemented in cuml.umap:

Using a non-Euclidean distance metric (support for a fixed set of non-Euclidean metrics is planned for an upcoming release).
Using a pre-computed pairwise distance matrix (under consideration for future releases)
Manual initialization of initial embedding positions

In addition to these missing features, you should expect to see the final embeddings differing between cuml.umap and the reference UMAP.

Known issue: If a UMAP model has not yet been fit, it cannot be pickled

References

[1] Leland McInnes, John Healy, James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.

Examples

>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client
>>> import dask.array as da
>>> from cuml.datasets import make_blobs
>>> from cuml.manifold import UMAP
>>> from cuml.dask.manifold import UMAP as MNMG_UMAP
>>> import numpy as np

>>> cluster = LocalCUDACluster(threads_per_worker=1)
>>> client = Client(cluster)

>>> X, y = make_blobs(1000, 10, centers=42, cluster_std=0.1,
...                   dtype=np.float32, random_state=10)

>>> local_model = UMAP(random_state=10, verbose=0)

>>> selection = np.random.RandomState(10).choice(1000, 100)
>>> X_train = X[selection]
>>> y_train = y[selection]
>>> local_model.fit(X_train, y=y_train)
UMAP()

>>> distributed_model = MNMG_UMAP(model=local_model)
>>> distributed_X = da.from_array(X, chunks=(500, -1))
>>> embedding = distributed_model.transform(distributed_X)
>>> result = embedding.compute()
>>> print(result)
[[  4.1684933    4.1890593 ]
 [  5.0110254   -5.2143383 ]
 [  1.7776365  -17.665699  ]
 ...
 [ -6.6378727   -0.15353012]
 [ -3.1891193   -0.83906937]
 [ -0.5042019    2.1454725 ]]
>>> client.close()
>>> cluster.close()

transform(X, convert_dtype=True)[source]#

Transform X into the existing embedded space and return that transformed output.

Please refer to the reference UMAP implementation for information on the differences between fit_transform() and running fit() followed by transform().

Specifically, the transform() function is stochastic: lmcinnes/umap#158

Parameters:

X : array-like (device or host), shape = (n_samples, n_features): New data to be transformed. Acceptable formats: dask cuDF, dask CuPy/NumPy/Numba Array

Returns:

X_new : array, shape (n_samples, n_components): Embedding of the new data in low-dimensional space.

Linear Models#

class cuml.dask.linear_model.LinearRegression(*, client=None, verbose=False, **kwargs)[source]#

LinearRegression is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.

cuML’s dask Linear Regression (multi-node multi-GPU) expects a dask cuDF DataFrame and provides an eigendecomposition-based algorithm, Eig, to fit a linear model. (SVD, which is more stable than Eig, will be added in an upcoming version.) The Eig algorithm is usually preferred when X is a tall and skinny matrix. As the number of features in X increases, the accuracy of the Eig algorithm drops.

This is an experimental implementation of dask Linear Regression. It supports input X that has more than one column. Single column input X will be supported after SVD algorithm is added in an upcoming version.
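
To show the idea behind an eigendecomposition-based solver, the NumPy sketch below solves the least-squares normal equations through an eigendecomposition of X^T X on a small host array. It is not cuML's distributed implementation; the function name fit_linear_eig and the centering used for the intercept are assumptions made for illustration.

```python
import numpy as np


def fit_linear_eig(X, y, fit_intercept=True):
    """Least-squares fit via an eigendecomposition of X^T X."""
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if fit_intercept:
        # Centering removes the global mean; the intercept is restored below.
        x_mean, y_mean = X.mean(axis=0), y.mean()
        Xc, yc = X - x_mean, y - y_mean
    else:
        Xc, yc = X, y
    # X^T X is symmetric, so eigh applies: X^T X = V diag(w) V^T.
    w, V = np.linalg.eigh(Xc.T @ Xc)
    # Solve (X^T X) beta = X^T y in the eigenbasis.
    coef = V @ ((V.T @ (Xc.T @ yc)) / w)
    intercept = y_mean - x_mean @ coef if fit_intercept else 0.0
    return coef, intercept


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # tall and skinny: many rows, few columns
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0
coef, intercept = fit_linear_eig(X, y)
```

The division by the eigenvalues w is where accuracy can degrade: small eigenvalues, which become more likely as the number of features grows, amplify rounding error.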

Parameters:

algorithm : ‘eig’: Eig uses an eigendecomposition of the covariance matrix, and is much faster. SVD is slower, but guaranteed to be stable.
fit_intercept : boolean (default = True): LinearRegression adds an additional term c to correct for the global mean of y, modeling the response as “x * beta + c”. If False, the model expects that you have centered the data.
normalize : boolean (default = False): If True, the predictors in X will be normalized by dividing by their L2 norm. If False, no scaling will be done.

Attributes:

coef_ : cuDF series, shape (n_features): The estimated coefficients for the linear regression model.
intercept_ : array: The independent term. If fit_intercept is False, will be 0.

Methods

`fit`(X, y)	Fit the model with X and y.
`predict`(X[, delayed])	Make predictions for X and return a dask collection.

fit(X, y)[source]#

Fit the model with X and y.

Parameters:

X : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features): Features for regression
y : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, 1): Labels (outcome values)

predict(X, delayed=True)[source]#

Make predictions for X and return a dask collection.

Parameters:

X : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features): Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).
delayed : bool (default = True): Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.

Returns:

y : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)

class cuml.dask.linear_model.Ridge(*, client=None, verbose=False, **kwargs)[source]#

cuML’s dask Ridge (multi-node multi-GPU) expects a dask cuDF DataFrame and provides an eigendecomposition-based algorithm, Eig, to fit a linear model. (SVD, which is more stable than Eig, will be added in an upcoming version.) The Eig algorithm is usually preferred when X is a tall and skinny matrix. As the number of features in X increases, the accuracy of the Eig algorithm drops.

This is an experimental implementation of dask Ridge Regression. It supports input X that has more than one column. Single column input X will be supported after SVD algorithm is added in an upcoming version.

Parameters:

alpha : float (default = 1.0): Regularization strength - must be a positive float. Larger values specify stronger regularization. Array input will be supported later.
solver : {‘eig’}: Eig uses an eigendecomposition of the covariance matrix, and is much faster. Other solvers will be supported in the future.
fit_intercept : boolean (default = True): If True, Ridge adds an additional term c to correct for the global mean of y, modeling the response as “x * beta + c”. If False, the model expects that you have centered the data.
normalize : boolean (default = False): If True, the predictors in X will be normalized by dividing by their L2 norm. If False, no scaling will be done.

Attributes:

coef_ : array, shape (n_features): The estimated coefficients for the linear regression model.
intercept_ : array: The independent term. If fit_intercept is False, will be 0.
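
The effect of alpha is easiest to see in the eigenbasis of X^T X, where the regularization simply adds alpha to every eigenvalue before inversion. The NumPy sketch below illustrates that closed form on a small host array without an intercept; it is not the cuML implementation, and the function name fit_ridge_eig is hypothetical.

```python
import numpy as np


def fit_ridge_eig(X, y, alpha=1.0):
    """Ridge coefficients via an eigendecomposition of X^T X (no intercept)."""
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    w, V = np.linalg.eigh(X.T @ X)
    # (X^T X + alpha I)^-1 = V diag(1 / (w + alpha)) V^T
    return V @ ((V.T @ (X.T @ y)) / (w + alpha))


rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5])
weak = fit_ridge_eig(X, y, alpha=1e-8)     # close to ordinary least squares
strong = fit_ridge_eig(X, y, alpha=1e4)    # coefficients shrunk toward zero
```

Adding alpha to each eigenvalue also keeps the divisor away from zero, which is why larger values make the solve better conditioned as well as more strongly regularized.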

Methods

`fit`(X, y)	Fit the model with X and y.
`predict`(X[, delayed])	Make predictions for X and return a dask collection.

fit(X, y)[source]#

Fit the model with X and y.

Parameters:

X : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features): Features for regression
y : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, 1): Labels (outcome values)

predict(X, delayed=True)[source]#

Make predictions for X and return a dask collection.

Parameters:

X : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features): Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).
delayed : bool (default = True): Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.

Returns:

y : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)

class cuml.dask.linear_model.Lasso(*, client=None, **kwargs)[source]#

cuML’s Lasso expects an array-like object or cuDF DataFrame and uses coordinate descent to fit a linear model.

Parameters:

alpha : float (default = 1.0): Constant that multiplies the L1 term. alpha = 0 is equivalent to ordinary least squares, solved by the LinearRegression class. For numerical reasons, using alpha = 0 with the Lasso class is not advised; use the LinearRegression class instead.
fit_intercept : boolean (default = True): If True, Lasso tries to correct for the global mean of y. If False, the model expects that you have centered the data.
normalize : boolean (default = False): If True, the predictors in X will be normalized by dividing by their L2 norm. If False, no scaling will be done.
max_iter : int (default = 1000): The maximum number of iterations.
tol : float (default = 1e-3): The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
selection : {‘cyclic’, ‘random’} (default = ‘cyclic’): If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence, especially when tol is higher than 1e-4.

Attributes:

coef_ : array, shape (n_features): The estimated coefficients for the linear regression model.
intercept_ : array: The independent term. If fit_intercept is False, will be 0.

For additional docs, see scikit-learn’s Lasso: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
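
For readers unfamiliar with coordinate descent, the pure-Python sketch below runs the 'cyclic' variant on a tiny dense problem. It assumes the scikit-learn objective (1 / (2 n)) * ||y - X w||^2 + alpha * ||w||_1, fits no intercept, and stops on the largest coefficient update only, without the dual-gap check described for tol. The function names are hypothetical and this is not cuML's solver.

```python
def soft_threshold(value, limit):
    """Shrink value toward zero by limit (the proximal step for the L1 term)."""
    if value > limit:
        return value - limit
    if value < -limit:
        return value + limit
    return 0.0


def lasso_cyclic_cd(X, y, alpha=1.0, max_iter=1000, tol=1e-3):
    """Cyclic coordinate descent for the Lasso objective on list-of-lists data."""
    n_rows, n_cols = len(X), len(X[0])
    coef = [0.0] * n_cols
    residual = list(y)  # equals y - X w, with w = 0 initially
    for _ in range(max_iter):
        largest_update = 0.0
        for j in range(n_cols):  # 'cyclic': visit features in order
            column = [row[j] for row in X]
            norm_sq = sum(v * v for v in column)
            if norm_sq == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(c * r for c, r in zip(column, residual)) + coef[j] * norm_sq
            new_value = soft_threshold(rho, alpha * n_rows) / norm_sq
            delta = new_value - coef[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, column)]
                coef[j] = new_value
            largest_update = max(largest_update, abs(delta))
        if largest_update < tol:  # simplified stopping rule
            break
    return coef


X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y = [3.0, 0.5, -3.0, -0.5]          # y = 3 * x1 + 0.5 * x2
sparse_coef = lasso_cyclic_cd(X, y, alpha=0.5)
dense_coef = lasso_cyclic_cd(X, y, alpha=0.0)
```

With alpha = 0.5 the weak second feature is driven exactly to zero while the first is shrunk, which is the sparsity-inducing behavior the L1 term is meant to provide.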

Methods

`fit`(X, y)	Fit the model with X and y.
`predict`(X[, delayed])	Predicts the y for X.

fit(X, y)[source]#

Fit the model with X and y.

Parameters:

X : Dask cuDF DataFrame or CuPy backed Dask Array: Dense matrix (floats or doubles) of shape (n_samples, n_features).
y : Dask cuDF DataFrame or CuPy backed Dask Array: Dense matrix (floats or doubles) of shape (n_samples, 1).

predict(X, delayed=True)[source]#

Predicts the y for X.

Parameters:

X : Dask cuDF DataFrame or CuPy backed Dask Array: Dense matrix (floats or doubles) of shape (n_samples, n_features).
delayed : bool (default = True): Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.

Returns:

y : Dask cuDF DataFrame or CuPy backed Dask Array: Dense matrix (floats or doubles) of shape (n_samples, 1).

class cuml.dask.linear_model.ElasticNet(*, client=None, **kwargs)[source]#

cuML’s ElasticNet expects an array-like object or cuDF DataFrame and uses coordinate descent to fit a linear model.

Parameters:

alpha : float (default = 1.0): Constant that multiplies the penalty terms. alpha = 0 is equivalent to ordinary least squares, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the ElasticNet object is not advised; use the LinearRegression object instead.
l1_ratio : float (default = 0.5): The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
fit_intercept : boolean (default = True): If True, ElasticNet tries to correct for the global mean of y. If False, the model expects that you have centered the data.
normalize : boolean (default = False): If True, the predictors in X will be normalized by dividing by their L2 norm. If False, no scaling will be done.
max_iter : int (default = 1000): The maximum number of iterations.
tol : float (default = 1e-3): The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
selection : {‘cyclic’, ‘random’} (default = ‘cyclic’): If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence, especially when tol is higher than 1e-4.
handle : cuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
output_type : {‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

Attributes:

coef_ : array, shape (n_features): The estimated coefficients for the linear regression model.
intercept_ : array: The independent term. If fit_intercept is False, will be 0.

For additional docs, see scikit-learn’s ElasticNet: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html
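
The way alpha and l1_ratio combine can be checked numerically. The sketch below evaluates the penalty term in the scikit-learn convention, alpha * (l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||_2^2); that cuML uses exactly the same scaling is an assumption here, and the function name elastic_net_penalty is hypothetical.

```python
def elastic_net_penalty(coef, alpha=1.0, l1_ratio=0.5):
    """Penalty term in the scikit-learn ElasticNet convention (assumed here)."""
    if not 0.0 <= l1_ratio <= 1.0:
        raise ValueError("l1_ratio must lie in [0, 1]")
    l1 = sum(abs(c) for c in coef)
    l2_sq = sum(c * c for c in coef)
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2_sq)


coef = [3.0, -4.0]
pure_l2 = elastic_net_penalty(coef, l1_ratio=0.0)   # 0.5 * (9 + 16)
pure_l1 = elastic_net_penalty(coef, l1_ratio=1.0)   # 3 + 4
mixed = elastic_net_penalty(coef, l1_ratio=0.5)     # halfway blend of the two
```

At the two extremes the penalty reduces to a pure L2 or pure L1 term, matching the description of l1_ratio above.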

Methods

`fit`(X, y)	Fit the model with X and y.
`predict`(X[, delayed])	Predicts the y for X.

fit(X, y)[source]#

Fit the model with X and y.

Parameters:

X : Dask cuDF DataFrame or CuPy backed Dask Array: Dense matrix (floats or doubles) of shape (n_samples, n_features).
y : Dask cuDF DataFrame or CuPy backed Dask Array: Dense matrix (floats or doubles) of shape (n_samples, 1).

predict(X, delayed=True)[source]#

Predicts the y for X.

Parameters:

X : Dask cuDF DataFrame or CuPy backed Dask Array: Dense matrix (floats or doubles) of shape (n_samples, n_features).
delayed : bool (default = True): Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.

Returns:

y : Dask cuDF DataFrame or CuPy backed Dask Array: Dense matrix (floats or doubles) of shape (n_samples, 1).

Naive Bayes#

class cuml.dask.naive_bayes.MultinomialNB(*, client=None, verbose=False, **kwargs)[source]#

Distributed Naive Bayes classifier for multinomial models

Methods

`fit`(X, y[, classes])	Fit distributed Naive Bayes classifier model
`predict`(X)	Use distributed Naive Bayes model to predict the classes for a given set of data samples.
`score`(X, y)	Compute accuracy score

Examples

Load the 20 newsgroups dataset from Scikit-learn and train a Naive Bayes classifier.

>>> import cupy as cp

>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import CountVectorizer

>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client
>>> import dask
>>> from cuml.dask.common import to_sparse_dask_array
>>> from cuml.dask.naive_bayes import MultinomialNB

>>> # Create a local CUDA cluster
>>> cluster = LocalCUDACluster()
>>> client = Client(cluster)

>>> # Load corpus
>>> twenty_train = fetch_20newsgroups(subset='train',
...                           shuffle=True, random_state=42)

>>> cv = CountVectorizer()
>>> xformed = cv.fit_transform(twenty_train.data).astype(cp.float32)
>>> X = to_sparse_dask_array(xformed, client)
>>> y = dask.array.from_array(twenty_train.target, asarray=False,
...                       fancy=False).astype(cp.int32)

>>> # Train model
>>> model = MultinomialNB()
>>> model.fit(X, y)
<cuml.dask.naive_bayes.naive_bayes.MultinomialNB object at 0x...>

>>> # Compute accuracy on training set
>>> model.score(X, y)
array(0.924...)
>>> client.close()
>>> cluster.close()

fit(X, y, classes=None)[source]#

Fit distributed Naive Bayes classifier model

Parameters:

X : dask.Array with blocks containing dense or sparse cupy arrays
y : dask.Array with blocks containing cupy.ndarray
classes : array-like containing unique class labels

Returns:

cuml.dask.naive_bayes.MultinomialNB current model instance

predict(X)[source]#

Use distributed Naive Bayes model to predict the classes for a given set of data samples.

Parameters:

Xdask.Array with blocks containing dense or sparse cupy arrays

Returns:

dask.Array containing predicted classes

score(X, y)[source]#

Compute accuracy score

Parameters:

X : Dask.Array: Features to predict. Note: it is assumed that the chunk sizes and shape of X are known. For a fully delayed Array, they can be computed by calling X.compute_chunk_sizes().
y : Dask.Array: Labels to use for computing accuracy. Note: it is assumed that the chunk sizes and shape of y are known. For a fully delayed Array, they can be computed by calling y.compute_chunk_sizes().

Returns:

score : float: The resulting accuracy score.
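
For reference, the accuracy reported by score is the fraction of samples whose predicted class matches the true label. The plain-Python sketch below shows that computation on host lists; the function name accuracy is hypothetical and the code does not touch Dask or the GPU.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


labels = [0, 1, 2, 1]
predicted = [0, 1, 1, 1]
score = accuracy(labels, predicted)   # three of four predictions match
```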

Solvers#

class cuml.dask.solvers.CD(*, client=None, **kwargs)[source]#

Model-Parallel Multi-GPU Linear Regression Model.

Methods

`fit`(X, y)	Fit the model with X and y.
`predict`(X[, delayed])	Make predictions for X and return a dask collection.

fit(X, y)[source]#

Fit the model with X and y.

Parameters:

X : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features): Features for regression
y : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, 1): Labels (outcome values)

predict(X, delayed=True)[source]#

Make predictions for X and return a dask collection.

Parameters:

X : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features): Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).
delayed : bool (default = True): Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.

Returns:

y : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)

Dask Base Classes and Mixins#

class cuml.dask.common.base.BaseEstimator(*, client=None, verbose=False, **kwargs)[source]#

Methods

get_combined_model()

Return single-GPU model for serialization

get_combined_model()[source]#

Return single-GPU model for serialization

Returns:

model: Trained single-GPU model, or None if the model has not yet been trained.

class cuml.dask.common.base.DelayedParallelFunc[source]#

class cuml.dask.common.base.DelayedPredictionMixin[source]#

class cuml.dask.common.base.DelayedTransformMixin[source]#

class cuml.dask.common.base.DelayedInverseTransformMixin[source]#

cuml.accel#

cuml.accel.install(disable_uvm: bool = False, log_level: Literal['error', 'warn', 'info', 'debug'] = 'warn') → None[source]#

Enable cuml.accel.

Parameters:

disable_uvmbool, optional: Whether to disable UVM.
log_level{“error”, “warn”, “info”, “debug”}, optional: The log level to set for the cuml.accel logger. Defaults to "warn". Set to "info" or "debug" to get more information about which methods cuml.accel accelerated in a given run.

cuml.accel.enabled() → bool[source]#: Returns whether the accelerator is enabled.

cuml.accel.profile(quiet: bool = False) → Iterator[ProfileResults][source]#

Profile a section of code.

This will collect stats on all accelerated (or potentially-accelerated) method and function calls within the context, and output a report summarizing what methods cuml.accel was able to accelerate, and what methods required a CPU fallback.

cuml.accel.profile provides programmatic access to this profiler. Alternatively, you may use the --profile flag when running under the CLI, or the %cuml.accel.profile IPython magic when running in IPython or a notebook environment.

Parameters:

quietbool, optional: Set to True to skip printing the report automatically upon exiting the context.

Returns:

resultsProfileResults: A record of the profile results within the context.

Examples

As part of cuml.accel, the profiler only works if the accelerator is installed. You may accomplish this programmatically with cuml.accel.install, or through an alternative method like the CLI (python -m cuml.accel) or the IPython magic (%load_ext cuml.accel).

>>> import cuml
>>> cuml.accel.install()

Once the accelerator is active, you’re free to start running some scikit-learn code. The profiler helps you understand when cuml.accel was able to accelerate your code, and when it needed to fall back to the CPU.

>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import Ridge

To profile only certain sections of your code, wrap them in a profile context manager.

>>> with cuml.accel.profile():
...     X, y = make_regression()
...     model = Ridge()
...     model.fit(X, y)
...     model.predict(X)
...
cuml.accel profile
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Function      ┃ GPU calls ┃ GPU time ┃ CPU calls ┃ CPU time ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
│ Ridge.fit     │         1 │    167ms │         0 │       0s │
│ Ridge.predict │         1 │    1.2ms │         0 │       0s │
├───────────────┼───────────┼──────────┼───────────┼──────────┤
│ Total         │         2 │  168.2ms │         0 │       0s │
└───────────────┴───────────┴──────────┴───────────┴──────────┘

cuml.accel.is_proxy(instance_or_class) → bool[source]#: Check if an instance or class is a proxy object created by the accelerator.

Experimental#

Warning

The cuml.experimental module contains features that are still under development. It is not recommended to depend on features in this module as they may change in future releases.

Note

Due to the nature of this module, it is not imported by default by the root cuml package. Each experimental submodule must be imported separately.

Linear Models#

class cuml.experimental.linear_model.Lars(*, fit_intercept=True, normalize=True, handle=None, verbose=False, output_type=None, copy_X=True, fit_path=True, n_nonzero_coefs=500, eps=None, precompute='auto')#

Least Angle Regression

Least Angle Regression (LAR or LARS) is a model selection algorithm. It builds up the model using the following algorithm:

We start with all the coefficients equal to zero.
At each step we select the predictor that has the largest absolute correlation with the residual.
We take the largest step possible in the direction which is equiangular with all the predictors selected so far. The largest step is determined such that using this step a new predictor will have as much correlation with the residual as any of the currently active predictors.
Stop if max_iter reached or all the predictors are used, or if the correlation between any unused predictor and the residual is lower than a tolerance.
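The selection step described above can be sketched in plain Python. This is a minimal illustration of choosing the predictor most correlated with the residual, not the solver used by cuML; the function name `most_correlated_predictor` and the toy data are hypothetical. The data is assumed to be centered already, with each column scaled to a unit sum of squares, so the inner product of a column with the residual serves as the correlation.

```python
def most_correlated_predictor(X, residual, active):
    """Return (index, correlation) for the inactive column of X whose
    inner product with the residual has the largest absolute value."""
    n_rows, n_cols = len(X), len(X[0])
    best_j, best_c = None, 0.0
    for j in range(n_cols):
        if j in active:
            continue  # already selected in an earlier step
        c_j = sum(X[i][j] * residual[i] for i in range(n_rows))
        if best_j is None or abs(c_j) > abs(best_c):
            best_j, best_c = j, c_j
    return best_j, best_c


# Hypothetical toy data: columns sum to 0 and have unit sum of squares.
X = [[-0.5, 0.5], [-0.5, -0.5], [0.5, -0.5], [0.5, 0.5]]
y = [-3.0, -1.0, 1.0, 3.0]

# All coefficients start at zero, so the initial residual equals y.
j, c = most_correlated_predictor(X, y, active=set())
```

A full LARS iteration would then move along the equiangular direction of the active predictors until another predictor reaches the same absolute correlation; that part is omitted from this sketch.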

The solver is based on [1]. The equations referred to in the comments correspond to the equations in the paper.

Note

This algorithm assumes that the offset is removed from X and y, and each feature is normalized:

\[\sum_i y_i = 0, \quad \sum_i x_{i,j} = 0, \quad \sum_i x_{i,j}^2 = 1 \quad \text{for } j = 0, \ldots, n_{col}-1\]
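The following sketch shows one way to bring data into the form these conditions describe: center y, then center each feature column and divide it by the Euclidean norm of the centered values. The helper names `center` and `center_and_scale` and the sample values are hypothetical, and this is not necessarily how the estimator normalizes data internally.

```python
import math


def center(values):
    """Subtract the mean so that the values sum to zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def center_and_scale(column):
    """Center a feature column and scale it to a unit sum of squares."""
    centered = center(column)
    norm = math.sqrt(sum(v * v for v in centered))
    if norm == 0.0:
        raise ValueError("a constant column cannot be scaled to unit norm")
    return [v / norm for v in centered]


# Hypothetical values, for illustration only.
y = center([1.0, 2.0, 6.0])
x = center_and_scale([2.0, 4.0, 9.0])
```

After these steps, `y` and `x` sum to zero and the squares of `x` sum to one, up to floating-point rounding.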

Parameters:

fit_interceptboolean (default = True): If True, Lars tries to correct for the global mean of y. If False, the model expects that you have centered the data.
normalizeboolean (default = False): This parameter is ignored when fit_intercept is set to False. If True, the predictors in X will be normalized by removing their mean and dividing by their variance. If False, then the solver expects that the data is already normalized.

Changed in version 24.06: The default of normalize changed from True to False.
copy_Xboolean (default = True): The solver permutes the columns of X. Set copy_X to True to prevent changing the input data.
fit_pathboolean (default = True): Whether to return all the coefficients along the regularization path in the coef_path_ attribute.
precomputebool, ‘auto’, or array-like with shape = (n_features, n_features). (default = ‘auto’): Whether to precompute the Gram matrix. The user can provide the Gram matrix as an argument.
n_nonzero_coefsint (default 500): The maximum number of coefficients to fit. This gives an upper limit of how many features we select for prediction. This number is also an upper limit of the number of iterations.
handlecuml.Handle: Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
verboseint or boolean, default=False: Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None: Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

Attributes:

alphas_array of floats or doubles, shape = [n_alphas + 1]: The maximum correlation at each step.
active_array of ints, shape = [n_alphas]: The indices of the active variables at the end of the path.
beta_array of floats or doubles, shape = [n_alphas]: The active regression coefficients (same as coef_ but zeros omitted).
coef_path_array of floats or doubles, shape = [n_alphas, n_alphas + 1]: The coefficients along the regularization path. Stored only if fit_path is True. Note that we only store coefficients for indices in the active set (i.e. coef_path_[:,-1] == coef_[active_])
coef_array, shape (n_features): The estimated coefficients for the regression model.
intercept_scalar, float or double: The independent term. If fit_intercept is False, will be 0.
n_iter_int: The number of iterations taken by the solver.
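The relationship between the dense coefficient vector and the active-set attributes can be illustrated in plain Python. The values below are hypothetical, and the sketch assumes that the active coefficients are stored in the same order as the active indices, as the relation coef_path_[:,-1] == coef_[active_] suggests.

```python
n_features = 5
active = [3, 0]      # hypothetical active_ indices, in selection order
beta = [1.25, -0.5]  # hypothetical beta_ values, aligned with active

# Scatter the active coefficients into a dense vector; all other entries stay 0.
coef = [0.0] * n_features
for index, value in zip(active, beta):
    coef[index] = value

# Gathering the dense vector at the active indices recovers beta.
recovered = [coef[index] for index in active]
```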

Methods

`fit`(self, X, y[, convert_dtype])	Fit the model with X and y.
`predict`(self, X[, convert_dtype])	Predicts `y` values for `X`.

Notes

For additional information, see scikit-learn’s OLS documentation.

References

[1]

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. “Least Angle Regression.” The Annals of Statistics 32.2 (2004): 407-499.

fit(self, X, y, convert_dtype=True) → 'Lars'[source]#

Fit the model with X and y.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix. If datatype is other than floats or doubles, then the data will be converted to float which increases memory utilization. Set the parameter convert_dtype to False to avoid this, then the method will throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
yarray-like (device or host) shape = (n_samples, 1): Dense matrix of any dtype. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
convert_dtypebool, optional (default = True): When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

predict(self, X, convert_dtype=True) → CumlArray[source]#

Predicts y values for X.

Parameters:

Xarray-like (device or host) shape = (n_samples, n_features): Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, and CUDA array interface compliant arrays like CuPy.
convert_dtypebool, optional (default = True): When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns:

y: cuDF DataFrame: Dense vector (floats or doubles) of shape (n_samples, 1)

Model Explainability#

class cuml.explainer.TreeExplainer(model, *, data=None, convert_dtype=True)#

Model explainer that calculates Shapley values for the predictions of tree-based models. Shapley values are a method of attributing various input features to a given model prediction.

Uses GPUTreeShap [1] as a back-end to accelerate computation using GPUs.

Different variants of Shapley values exist based on different interpretations of marginalising out (or conditioning on) features. For the “tree_path_dependent” approach, see [2].

For the “interventional” approach, see [3].

We also provide two variants of feature interactions. For the “shapley-interactions” variant of interactions, see [2]; for the “shapley-taylor” variant, see [4].

[1]

Mitchell, Rory, Eibe Frank, and Geoffrey Holmes. “GPUTreeShap: massively parallel exact calculation of SHAP scores for tree ensembles.” PeerJ Computer Science 8 (2022): e880.

[2] (1,2)

Lundberg, Scott M., et al. “From local explanations to global understanding with explainable AI for trees.” Nature machine intelligence 2.1 (2020): 56-67.

[3]

Janzing, Dominik, Lenon Minorics, and Patrick Blöbaum. “Feature relevance quantification in explainable AI: A causal problem.” International Conference on artificial intelligence and statistics. PMLR, 2020.

[4]

Sundararajan, Mukund, Kedar Dhamdhere, and Ashish Agarwal. “The Shapley Taylor Interaction Index.” International Conference on Machine Learning. PMLR, 2020.

Parameters:

modelmodel object: The tree based machine learning model. XGBoost, LightGBM, cuml random forest and sklearn random forest models are supported. Categorical features in XGBoost or LightGBM models are natively supported.
dataarray or DataFrame: Optional background dataset to use for marginalising out features. If this argument is supplied, an “interventional” approach is used. Computation time increases with the size of this background dataset; consider starting with 100 to 1000 examples. If this argument is not supplied, statistics from the tree model are used to marginalise out features (“tree_path_dependent”).

Attributes:

expected_value: object

Methods

`shap_interaction_values`(self, X[, method, ...])	Estimate the SHAP interaction values for a set of samples.
`shap_values`(self, X[, convert_dtype])	Estimate the SHAP values for a set of samples.

Examples

>>> import numpy as np
>>> import cuml
>>> from cuml.explainer import TreeExplainer
>>> X = np.array([[0.0, 2.0], [1.0, 0.5]])
>>> y = np.array([0, 1])
>>> model = cuml.ensemble.RandomForestRegressor().fit(X, y)
>>> explainer = TreeExplainer(model=model)
>>> shap_values = explainer.shap_values(X)

expected_value#: object

shap_interaction_values(self, X, method='shapley-interactions', convert_dtype=True) → CumlArray[source]#

Estimate the SHAP interaction values for a set of samples. For a given row, the SHAP values plus the expected_value attribute sum up to the raw model prediction. ‘Raw model prediction’ means before the application of a link function, for example, the SHAP values of an XGBoost binary classification are in the additive logit space as opposed to probability space.

Interventional feature marginalisation is not supported.

Parameters:

X: A matrix of samples (# samples x # features) on which to explain the model’s output.
method: One of [‘shapley-interactions’, ‘shapley-taylor’]

Returns:

array: Returns a matrix of SHAP values of shape (# classes x # samples x # features x # features).

shap_values(self, X, convert_dtype=True) → CumlArray[source]#

Estimate the SHAP values for a set of samples. For a given row, the SHAP values plus the expected_value attribute sum up to the raw model prediction. ‘Raw model prediction’ means before the application of a link function, for example, the SHAP values of an XGBoost binary classification will be in the additive logit space as opposed to probability space.

Parameters:

X: A matrix of samples (# samples x # features) on which to explain the model’s output.

Returns:

array: Returns a matrix of SHAP values of shape (# classes x # samples x # features).
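The additivity property described above can be checked with a few lines of plain Python. The numbers are hypothetical and stand in for one row of SHAP values and the expected_value of a binary classifier whose raw output is a logit; applying the logistic function as the inverse link is an assumption about that example model.

```python
import math

expected_value = -0.4        # hypothetical base value, in logit space
shap_row = [0.9, -0.2, 0.3]  # hypothetical SHAP values for one sample

# SHAP values plus the expected value give the raw (pre-link) prediction.
raw_prediction = expected_value + sum(shap_row)

# Applying the inverse link (here the logistic function) maps the raw
# prediction from additive logit space to probability space.
probability = 1.0 / (1.0 + math.exp(-raw_prediction))
```

The SHAP values add up in logit space, not in probability space, so the individual contributions should not be read as changes in probability.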

