cuML API Reference

Module Configuration

Output Data Type Configuration

cuml.common.memory_utils.set_global_output_type(output_type)[source]

Method to set the global output type of cuML’s single-GPU estimators. It will be used by all estimators unless overridden in their initialization with their own output_type parameter. It can also be overridden by the context manager method using_output_type().

Parameters
output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’} (default = ‘input’)

Desired output type of results and attributes of the estimators.

  • 'input' will mean that the parameters and methods will mirror the format of the data sent to the estimators/methods as much as possible. Specifically:

    Input type                            Output type
    ------------------------------------  ------------------------
    cuDF DataFrame or Series              cuDF DataFrame or Series
    NumPy arrays                          NumPy arrays
    Pandas DataFrame or Series            NumPy arrays
    Numba device arrays                   Numba device arrays
    CuPy arrays                           CuPy arrays
    Other __cuda_array_interface__ objs   CuPy arrays

  • 'cudf' will return cuDF Series for single dimensional results and DataFrames for the rest.

  • 'cupy' will return CuPy arrays.

  • 'numpy' will return NumPy arrays.

Notes

'cupy' and 'numba' options (as well as 'input' when using Numba and CuPy ndarrays for input) have the least overhead. cuDF adds memory consumption and processing time needed to build the Series and DataFrames. 'numpy' has the biggest overhead due to the need to transfer data to CPU memory.

Examples

>>> import cuml
>>> import cupy as cp
>>>
>>> ary = [[1.0, 4.0, 4.0], [2.0, 2.0, 2.0], [5.0, 1.0, 1.0]]
>>> ary = cp.asarray(ary)
>>> prev_output_type = cuml.global_settings.output_type
>>> cuml.set_global_output_type('cudf')
>>> dbscan_float = cuml.DBSCAN(eps=1.0, min_samples=1)
>>> dbscan_float.fit(ary)
DBSCAN()
>>>
>>> # cuML output type
>>> dbscan_float.labels_
0    0
1    1
2    2
dtype: int32
>>> type(dbscan_float.labels_)
<class 'cudf.core.series.Series'>
>>> cuml.set_global_output_type(prev_output_type)

cuml.common.memory_utils.using_output_type(output_type)[source]

Context manager method to set cuML’s global output type inside a with statement. The output type is reset to its prior value once the with block has finished executing.

Parameters
output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’} (default = ‘input’)

Desired output type of results and attributes of the estimators.

  • 'input' will mean that the parameters and methods will mirror the format of the data sent to the estimators/methods as much as possible. Specifically:

    Input type                            Output type
    ------------------------------------  ------------------------
    cuDF DataFrame or Series              cuDF DataFrame or Series
    NumPy arrays                          NumPy arrays
    Pandas DataFrame or Series            NumPy arrays
    Numba device arrays                   Numba device arrays
    CuPy arrays                           CuPy arrays
    Other __cuda_array_interface__ objs   CuPy arrays

  • 'cudf' will return cuDF Series for single dimensional results and DataFrames for the rest.

  • 'cupy' will return CuPy arrays.

  • 'numpy' will return NumPy arrays.

Examples

>>> import cuml
>>> import cupy as cp
>>>
>>> ary = [[1.0, 4.0, 4.0], [2.0, 2.0, 2.0], [5.0, 1.0, 1.0]]
>>> ary = cp.asarray(ary)
>>>
>>> with cuml.using_output_type('cudf'):
...     dbscan_float = cuml.DBSCAN(eps=1.0, min_samples=1)
...     dbscan_float.fit(ary)
...
...     print("cuML output inside 'with' context")
...     print(dbscan_float.labels_)
...     print(type(dbscan_float.labels_))
...
DBSCAN()
cuML output inside 'with' context
0    0
1    1
2    2
dtype: int32
<class 'cudf.core.series.Series'>
>>> # use cuml again outside the context manager
>>> dbscan_float2 = cuml.DBSCAN(eps=1.0, min_samples=1)
>>> dbscan_float2.fit(ary)
DBSCAN()
>>>
>>> # cuML default output
>>> dbscan_float2.labels_
array([0, 1, 2], dtype=int32)
>>> type(dbscan_float2.labels_)
<class 'cupy._core.core.ndarray'>

Verbosity Levels

cuML follows a verbosity model similar to Scikit-learn’s: the verbose parameter can be a boolean or a numeric value, and higher numeric values mean more verbosity. The exact values can be set directly or through the cuml.common.logger module. They are:

Verbosity Levels

Numeric value  cuml.common.logger value           Verbosity level
-------------  ---------------------------------  ---------------------------------------------------------------
0              cuml.common.logger.level_off       Disables all log messages.
1              cuml.common.logger.level_critical  Enables only critical messages.
2              cuml.common.logger.level_error     Enables all messages up to and including errors.
3              cuml.common.logger.level_warn      Enables all messages up to and including warnings.
4 or False     cuml.common.logger.level_info      Enables all messages up to and including information messages.
5 or True      cuml.common.logger.level_debug     Enables all messages up to and including debug messages.
6              cuml.common.logger.level_trace     Enables all messages up to and including trace messages.
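
As an illustration of the table, the following plain-Python sketch maps a verbose value to its numeric level and logger name. `resolve_verbosity` and `LEVELS` are hypothetical helpers written for this example, not part of cuML; the mapping of False to 4 and True to 5 is taken directly from the table.

```python
# Numeric levels from the table; names mirror cuml.common.logger.level_*.
LEVELS = {
    0: "level_off",
    1: "level_critical",
    2: "level_error",
    3: "level_warn",
    4: "level_info",
    5: "level_debug",
    6: "level_trace",
}


def resolve_verbosity(verbose):
    """Map a bool or int verbose value to (numeric level, logger name)."""
    # bool must be checked first: in Python, True and False are also ints.
    if isinstance(verbose, bool):
        level = 5 if verbose else 4
    else:
        level = int(verbose)
    if level not in LEVELS:
        raise ValueError(f"verbose must be a bool or an int in 0..6, got {verbose!r}")
    return level, LEVELS[level]


debug_level = resolve_verbosity(True)
default_level = resolve_verbosity(False)
```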

Preprocessing, Metrics, and Utilities

Model Selection and Data Splitting

cuml.model_selection.train_test_split(X, y=None, test_size: Optional[Union[float, int]] = None, train_size: Optional[Union[float, int]] = None, shuffle: bool = True, random_state: Optional[Union[int, cupy.random._generator.RandomState, numpy.random.mtrand.RandomState]] = None, stratify=None)[source]

Partitions device data into four collated objects, mimicking Scikit-learn’s train_test_split.

Parameters
X : cudf.DataFrame or cuda_array_interface compliant device array

Data to split, has shape (n_samples, n_features)

y : str, cudf.Series or cuda_array_interface compliant device array

Set of labels for the data, either a series of shape (n_samples) or the string label of a column in X (if it is a cuDF DataFrame) containing the labels

train_size : float or int, optional

If float, represents the proportion [0, 1] of the data to be assigned to the training set. If an int, represents the number of instances to be assigned to the training set. Defaults to 0.8

shuffle : bool, optional

Whether or not to shuffle inputs before splitting

random_state : int, CuPy RandomState or NumPy RandomState, optional

If shuffle is true, seeds the generator. Unseeded by default

stratify : cudf.Series or cuda_array_interface compliant device array, optional

When passed, the input is split using this as the column to stratify on. Default=None

Returns
X_train, X_test, y_train, y_test : cudf.DataFrame or array-like objects

Partitioned dataframes if X and y were cuDF objects. If y was provided as a column name, the column was dropped from X. Partitioned Numba device arrays if X and y were Numba device arrays. Partitioned CuPy arrays for any other input.

Examples

>>> import cudf
>>> from cuml.model_selection import train_test_split
>>> # Generate some sample data
>>> df = cudf.DataFrame({'x': range(10),
...                      'y': [0, 1] * 5})
>>> print(f'Original data: {df.shape[0]} elements')
Original data: 10 elements
>>> # Suppose we want an 80/20 split
>>> X_train, X_test, y_train, y_test = train_test_split(df, 'y',
...                                                     train_size=0.8)
>>> print(f'X_train: {X_train.shape[0]} elements')
X_train: 8 elements
>>> print(f'X_test: {X_test.shape[0]} elements')
X_test: 2 elements
>>> print(f'y_train: {y_train.shape[0]} elements')
y_train: 8 elements
>>> print(f'y_test: {y_test.shape[0]} elements')
y_test: 2 elements

>>> # Alternatively, if our labels are stored separately
>>> labels = df['y']
>>> df = df.drop(['y'], axis=1)
>>> # we can also do
>>> X_train, X_test, y_train, y_test = train_test_split(df, labels,
...                                                     train_size=0.8)
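
To make the train_size semantics concrete, here is a minimal host-side sketch over plain Python lists. `split_sketch` is a hypothetical function, not the cuML implementation; in particular, truncating the product for a float train_size is an assumption, and stratification is left out.

```python
import random


def split_sketch(rows, labels, train_size=0.8, shuffle=True, random_state=None):
    """Return (X_train, X_test, y_train, y_test) for two equal-length lists."""
    n = len(rows)
    # A float is read as a proportion, an int as an absolute row count.
    n_train = int(n * train_size) if isinstance(train_size, float) else train_size
    order = list(range(n))
    if shuffle:
        # random_state seeds the generator; None leaves it unseeded.
        random.Random(random_state).shuffle(order)
    train_idx, test_idx = order[:n_train], order[n_train:]
    return (
        [rows[i] for i in train_idx],
        [rows[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


X_train, X_test, y_train, y_test = split_sketch(
    list(range(10)), [0, 1] * 5, train_size=0.8, random_state=0
)
```

Rows and labels are indexed with the same shuffled order, so each label stays paired with its row in both partitions.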

Feature and Label Encoding (Single-GPU)

class cuml.preprocessing.LabelEncoder.LabelEncoder(*, handle_unknown='error', handle=None, verbose=False, output_type=None)[source]

An nvcategory based implementation of ordinal label encoding

Parameters
handle_unknown : {‘error’, ‘ignore’}, default=’error’

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform or inverse transform, the resulting encoding will be null.

handle : cuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verbose : int or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Examples

Converting a categorical implementation to a numerical one

>>> from cudf import DataFrame, Series
>>> from cuml.preprocessing import LabelEncoder
>>> data = DataFrame({'category': ['a', 'b', 'c', 'd']})

>>> # There are two functionally equivalent ways to do this
>>> le = LabelEncoder()
>>> le.fit(data.category)  # le = le.fit(data.category) also works
LabelEncoder()
>>> encoded = le.transform(data.category)

>>> print(encoded)
0    0
1    1
2    2
3    3
dtype: uint8

>>> # This method is preferred
>>> le = LabelEncoder()
>>> encoded = le.fit_transform(data.category)

>>> print(encoded)
0    0
1    1
2    2
3    3
dtype: uint8

>>> # We can assign this to a new column
>>> data = data.assign(encoded=encoded)
>>> print(data.head())
   category  encoded
0         a        0
1         b        1
2         c        2
3         d        3

>>> # We can also encode more data
>>> test_data = Series(['c', 'a'])
>>> encoded = le.transform(test_data)
>>> print(encoded)
0    2
1    0
dtype: uint8

>>> # After fitting, ordinal labels can be converted back to string
>>> # labels with inverse_transform()
>>> ord_label = Series([0, 0, 1, 2, 1])
>>> str_label = le.inverse_transform(ord_label)
>>> print(str_label)
0    a
1    a
2    b
3    c
4    b
dtype: object
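
The encoding logic shown in the example can be sketched on plain Python lists. `LabelEncoderSketch` is a hypothetical class written for illustration only; assigning codes by sorted order is an assumption that happens to match the outputs above, and the GPU implementation is not reproduced here.

```python
class LabelEncoderSketch:
    """Ordinal label encoding over plain Python lists (illustration only)."""

    def __init__(self, handle_unknown="error"):
        self.handle_unknown = handle_unknown

    def fit(self, y):
        # Sorted unique categories; a category's position is its code.
        self.classes_ = sorted(set(y))
        self._codes = {c: i for i, c in enumerate(self.classes_)}
        return self  # returning self allows method chaining

    def transform(self, y):
        out = []
        for value in y:
            if value in self._codes:
                out.append(self._codes[value])
            elif self.handle_unknown == "ignore":
                out.append(None)  # unknown category encodes to null
            else:
                raise KeyError(f"unseen category: {value!r}")
        return out

    def inverse_transform(self, codes):
        return [self.classes_[c] for c in codes]


le = LabelEncoderSketch().fit(["a", "b", "c", "d"])
encoded = le.transform(["c", "a"])
decoded = le.inverse_transform([0, 0, 1, 2, 1])
```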

Methods

fit(y[, _classes])

Fit a LabelEncoder (nvcategory) instance to a set of categories

fit_transform(y[, z])

Simultaneously fit and transform an input

get_param_names(self)

Returns a list of hyperparameter names owned by this class.

inverse_transform(y)

Revert ordinal label to original label

transform(y)

Transform an input into its categorical keys.

fit(y, _classes=None)[source]

Fit a LabelEncoder (nvcategory) instance to a set of categories

Parameters
y : cudf.Series

Series containing the categories to be encoded. Its elements may or may not be unique.

_classes : int or None

Passed by the dask client when dask LabelEncoder is used.

Returns
self : LabelEncoder

A fitted instance of itself to allow method chaining

fit_transform(y: cudf.core.series.Series, z=None) -> cudf.core.series.Series[source]

Simultaneously fit and transform an input

This is functionally equivalent to (but faster than) LabelEncoder().fit(y).transform(y)

get_param_names(self)[source]

Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.

inverse_transform(y: cudf.core.series.Series) -> cudf.core.series.Series[source]

Revert ordinal label to original label

Parameters
y : cudf.Series, dtype=int32

Ordinal labels to be reverted

Returns
reverted : cudf.Series

Reverted labels

transform(y: cudf.core.series.Series) -> cudf.core.series.Series[source]

Transform an input into its categorical keys.

This is intended for use with small inputs relative to the size of the dataset. For fitting and transforming an entire dataset, prefer fit_transform.

Parameters
y : cudf.Series

Input keys to be transformed. Its values should match the categories given to fit

Returns
encoded : cudf.Series

The ordinally encoded input series

Raises
KeyError

if a category appears that was not seen in fit

class cuml.preprocessing.LabelBinarizer(*, neg_label=0, pos_label=1, sparse_output=False, handle=None, verbose=False, output_type=None)[source]

A multi-class dummy encoder for labels.

Parameters
neg_label : integer (default=0)

label to be used as the negative binary label

pos_label : integer (default=1)

label to be used as the positive binary label

sparse_output : bool (default=False)

whether to return sparse arrays for transformed output

handle : cuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verbose : int or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Examples

Create an array with labels and dummy encode them

>>> import cupy as cp
>>> import cupyx
>>> from cuml.preprocessing import LabelBinarizer

>>> labels = cp.asarray([0, 5, 10, 7, 2, 4, 1, 0, 0, 4, 3, 2, 1],
...                     dtype=cp.int32)

>>> lb = LabelBinarizer()
>>> encoded = lb.fit_transform(labels)
>>> print(str(encoded))
[[1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 1 0]
 [0 0 1 0 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 1 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 0 0 1 0 0 0 0]
 [0 0 1 0 0 0 0 0]
 [0 1 0 0 0 0 0 0]]
>>> decoded = lb.inverse_transform(encoded)
>>> print(str(decoded))
[ 0  5 10  7  2  4  1  0  0  4  3  2  1]

Attributes
classes_

Methods

fit(y)

Fit label binarizer

fit_transform(y)

Fit label binarizer and transform multi-class labels to their dummy-encoded representation.

get_param_names(self)

Returns a list of hyperparameter names owned by this class.

inverse_transform(y[, threshold])

Transform binary labels back to original multi-class labels

transform(y)

Transform multi-class labels to their dummy-encoded representation labels.

fit(y) -> cuml.preprocessing.label.LabelBinarizer[source]

Fit label binarizer

Parameters
y : array of shape [n_samples,] or [n_samples, n_classes]

Target values. A 2-d matrix should contain only 0 and 1 and represents multilabel classification.

Returns
self : returns an instance of self.

fit_transform(y) -> cuml.common.array_sparse.SparseCumlArray[source]

Fit label binarizer and transform multi-class labels to their dummy-encoded representation.

Parameters
y : array of shape [n_samples,] or [n_samples, n_classes]

Returns
arr : array with encoded labels

get_param_names(self)[source]

Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.

inverse_transform(y, threshold=None) -> cuml.common.array.CumlArray[source]

Transform binary labels back to original multi-class labels

Parameters
y : array of shape [n_samples, n_classes]

threshold : float

This value is currently ignored.

Returns
arr : array with original labels

transform(y) -> cuml.common.array_sparse.SparseCumlArray[source]

Transform multi-class labels to their dummy-encoded representation labels.

Parameters
y : array of shape [n_samples,] or [n_samples, n_classes]

Returns
arr : array with encoded labels

cuml.preprocessing.label_binarize(y, classes, neg_label=0, pos_label=1, sparse_output=False) -> cuml.common.array_sparse.SparseCumlArray[source]

A stateless helper function to dummy encode multi-class labels.

Parameters
y : array-like of size [n_samples,] or [n_samples, n_classes]

classes : the set of unique classes in the input

neg_label : integer

The negative value for transformed output

pos_label : integer

The positive value for transformed output

sparse_output : bool

Whether to return a sparse array
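
A stateless dummy encoding of one-dimensional labels can be sketched in plain Python as follows. `label_binarize_sketch` is a hypothetical helper that works on lists and always returns dense rows; it only illustrates the role of classes, neg_label and pos_label and is not the cuML implementation.

```python
def label_binarize_sketch(y, classes, neg_label=0, pos_label=1):
    """Dummy-encode 1-d labels: one column per class, one row per sample."""
    column = {c: j for j, c in enumerate(classes)}
    rows = []
    for label in y:
        row = [neg_label] * len(classes)
        row[column[label]] = pos_label
        rows.append(row)
    return rows


labels = [0, 5, 10, 7, 2]
classes = sorted(set(labels))
encoded = label_binarize_sketch(labels, classes)
# Inverse: the column holding pos_label identifies the class.
decoded = [classes[row.index(1)] for row in encoded]
```
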
class cuml.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float32'>, handle_unknown='error', handle=None, verbose=False, output_type=None)[source]

Encode categorical features as a one-hot numeric array. The input to this estimator should be a cuDF.DataFrame or a cupy.ndarray, denoting the unique values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter). By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.

Note

A one-hot encoding of y labels should use a LabelBinarizer instead.

Parameters
categories : ‘auto’, a cupy.ndarray or a cudf.DataFrame, default=’auto’

Categories (unique values) per feature:

  • ‘auto’ : Determine categories automatically from the training data.

  • DataFrame/ndarray : categories[col] holds the categories expected in the feature col.

drop : ‘first’, None, a dict or a list, default=None

Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.

  • None : retain all features (the default).

  • ‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.

  • dict/list : drop[col] is the category in feature col that should be dropped.

sparse : bool, default=True

This feature is not fully supported by cupy yet, causing incorrect values when computing one hot encodings. See https://github.com/cupy/cupy/issues/3223

dtype : number type, default=np.float32

Desired datatype of transform’s output.

handle_unknown : {‘error’, ‘ignore’}, default=’error’

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

handle : cuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verbose : int or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Attributes
drop_idx_ : array of shape (n_features,)

drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature. None if all the transformed features will be retained.

Methods

fit(X[, y])

Fit OneHotEncoder to X.

fit_transform(X[, y])

Fit OneHotEncoder to X, then transform X.

get_feature_names([input_features])

Return feature names for output features.

get_param_names(self)

Returns a list of hyperparameter names owned by this class.

inverse_transform(X)

Convert the data back to the original representation.

transform(X)

Transform X using one-hot encoding.

property categories_

Returns categories used for the one hot encoding in the correct order.

fit(X, y=None)[source]

Fit OneHotEncoder to X.

Parameters
X : cudf.DataFrame or cupy.ndarray, shape = (n_samples, n_features)

The data to determine the categories of each feature.

y : None

Ignored. This parameter exists for compatibility only.

Returns
self

fit_transform(X, y=None)[source]

Fit OneHotEncoder to X, then transform X. Equivalent to fit(X).transform(X).

Parameters
X : cudf.DataFrame or cupy.ndarray, shape = (n_samples, n_features)

The data to encode.

Returns
X_out : sparse matrix if sparse=True else a 2-d array

Transformed input.

get_feature_names(input_features=None)[source]

Return feature names for output features.

Parameters
input_features : list of str of shape (n_features,)

String names for input features if available. By default, “x0”, “x1”, … “xn_features” is used.

Returns
output_feature_names : ndarray of shape (n_output_features,)

Array of feature names.

get_param_names(self)[source]

Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.

inverse_transform(X)[source]

Convert the data back to the original representation. In case unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category.

The return type is the same as the type of the input used by the first call to fit on this estimator instance.

Parameters
X : array-like or sparse matrix, shape [n_samples, n_encoded_features]

The transformed data.

Returns
X_tr : cudf.DataFrame or cupy.ndarray

Inverse transformed array.

transform(X)[source]

Transform X using one-hot encoding.

Parameters
X : cudf.DataFrame or cupy.ndarray

The data to encode.

Returns
X_out : sparse matrix if sparse=True else a 2-d array

Transformed input.
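
The interaction of categories='auto', drop='first' and handle_unknown='ignore' can be illustrated with a dense, plain-Python sketch. `one_hot_fit` and `one_hot_transform` are hypothetical helpers operating on lists of tuples; raising KeyError for unknown values is an assumption made for this example, and the real estimator works on cuDF or CuPy inputs.

```python
def one_hot_fit(rows):
    """Collect the sorted unique values of each column (categories='auto')."""
    return [sorted(set(col)) for col in zip(*rows)]


def one_hot_transform(rows, categories, drop_first=False, handle_unknown="error"):
    """Encode each row as the concatenation of per-feature indicator vectors."""
    out = []
    for row in rows:
        encoded = []
        for value, cats in zip(row, categories):
            if value not in cats and handle_unknown != "ignore":
                raise KeyError(f"unknown category: {value!r}")
            # An unknown value matches nothing, so its columns stay all zero.
            bits = [1.0 if value == c else 0.0 for c in cats]
            encoded.extend(bits[1:] if drop_first else bits)
        out.append(encoded)
    return out


X = [("a", 1), ("b", 2), ("a", 3)]
cats = one_hot_fit(X)
dense = one_hot_transform(X, cats)
dropped = one_hot_transform(X, cats, drop_first=True)
unknown = one_hot_transform([("c", 1)], cats, handle_unknown="ignore")
```

With drop_first=True the first category of every feature becomes the implicit baseline, so a row belonging to it is encoded as all zeros for that feature.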

class cuml.preprocessing.TargetEncoder.TargetEncoder(n_folds=4, smooth=0, seed=42, split_method='interleaved', output_type='auto', stat='mean')[source]

A cudf based implementation of target encoding [1], which replaces one or multiple categorical variables, ‘Xs’, with the average of corresponding values of the target variable, ‘Y’. The input data is grouped by the columns Xs and the aggregated mean value of Y of each group is calculated to replace each value of Xs. Several optimizations are applied to prevent label leakage and parallelize the execution.

Parameters
n_folds : int (default=4)

Default number of folds for fitting training data. To prevent label leakage in fit, we split data into n_folds and encode one fold using the target variables of the remaining folds.

smooth : int or float (default=0)

Count of samples to smooth the encoding. 0 means no smoothing.

seed : int (default=42)

Random seed

split_method : {‘random’, ‘continuous’, ‘interleaved’}, (default=’interleaved’)

Method to split train data into n_folds.

  • ‘random’: random split.

  • ‘continuous’: consecutive samples are grouped into one fold.

  • ‘interleaved’: samples are assigned to each fold in a round robin way.

  • ‘customize’: customize splitting by providing a fold_ids array in fit() or fit_transform() functions.

output_type : {‘cupy’, ‘numpy’, ‘auto’}, default = ‘auto’

The data type of output. If ‘auto’, it matches input data.

stat : {‘mean’, ‘var’}, default = ‘mean’

The statistic used in encoding, mean or variance of the target.

References

[1] https://maxhalford.github.io/blog/target-encoding/

Examples

Converting a categorical implementation to a numerical one

>>> from cudf import DataFrame, Series
>>> from cuml.preprocessing import TargetEncoder
>>> train = DataFrame({'category': ['a', 'b', 'b', 'a'],
...                    'label': [1, 0, 1, 1]})
>>> test = DataFrame({'category': ['a', 'c', 'b', 'a']})

>>> encoder = TargetEncoder()
>>> train_encoded = encoder.fit_transform(train.category, train.label)
>>> test_encoded = encoder.transform(test.category)
>>> print(train_encoded)
[1. 1. 0. 1.]
>>> print(test_encoded)
[1.   0.75 0.5  1.  ]
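
The numbers in this example can be reproduced with a small plain-Python sketch of out-of-fold mean encoding. The function names are hypothetical and the code is not the cuML implementation. It assumes n_folds is at least 2, smooth=0, stat='mean' and the 'interleaved' split; falling back to the overall target mean for an unseen category is an assumption chosen because it agrees with the 0.75 printed above.

```python
def fold_ids_interleaved(n_samples, n_folds):
    """Assign samples to folds round-robin: 0, 1, ..., n_folds - 1, 0, 1, ..."""
    return [i % n_folds for i in range(n_samples)]


def mean(values):
    return sum(values) / len(values)


def target_encode_fit_transform(x, y, n_folds=4):
    """Encode each training sample from the targets of the other folds only."""
    folds = fold_ids_interleaved(len(x), n_folds)
    encoded = []
    for cat, fold in zip(x, folds):
        others = [y[j] for j in range(len(x)) if folds[j] != fold and x[j] == cat]
        # Assumed fallback: the out-of-fold mean over all categories.
        fallback = [y[j] for j in range(len(x)) if folds[j] != fold]
        encoded.append(mean(others) if others else mean(fallback))
    return encoded


def target_encode_transform(x_train, y_train, x_test):
    """Encode test data from per-category means over the full training set."""
    global_mean = mean(y_train)
    groups = {}
    for cat, target in zip(x_train, y_train):
        groups.setdefault(cat, []).append(target)
    return [mean(groups[c]) if c in groups else global_mean for c in x_test]


x_train, y_train = ["a", "b", "b", "a"], [1, 0, 1, 1]
train_encoded = target_encode_fit_transform(x_train, y_train)
test_encoded = target_encode_transform(x_train, y_train, ["a", "c", "b", "a"])
```

Because a training sample never sees its own target, the second 'b' row (label 1) is encoded as 0.0, the label of the other 'b' row. This is the label-leakage protection described for n_folds.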

Methods

fit(x, y[, fold_ids])

Fit a TargetEncoder instance to a set of categories

fit_transform(x, y[, fold_ids])

Simultaneously fit and transform an input

get_params([deep])

Returns a dict of all params owned by this class.

transform(x)

Transform an input into its categorical keys.

get_param_names

fit(x, y, fold_ids=None)[source]

Fit a TargetEncoder instance to a set of categories

Parameters
x : cudf.Series or cudf.DataFrame or cupy.ndarray

Categories to be encoded. Its elements may or may not be unique.

y : cudf.Series or cupy.ndarray

Series containing the target variable.

fold_ids : cudf.Series or cupy.ndarray

Series containing the indices of the customized folds. Its values should be integers in range [0, N-1] to split data into N folds. If None, fold_ids is generated based on split_method.

Returns
self : TargetEncoder

A fitted instance of itself to allow method chaining

fit_transform(x, y, fold_ids=None)[source]

Simultaneously fit and transform an input

This is functionally equivalent to (but faster than) TargetEncoder().fit(x, y).transform(x)

Parameters
x : cudf.Series or cudf.DataFrame or cupy.ndarray

Categories to be encoded. Its elements may or may not be unique.

y : cudf.Series or cupy.ndarray

Series containing the target variable.

fold_ids : cudf.Series or cupy.ndarray

Series containing the indices of the customized folds. Its values should be integers in range [0, N-1] to split data into N folds. If None, fold_ids is generated based on split_method.

Returns
encoded : cupy.ndarray

The ordinally encoded input series

get_params(deep=False)[source]

Returns a dict of all params owned by this class.

transform(x)[source]

Transform an input into its categorical keys.

This is intended for test data. For fitting and transforming the training data, prefer fit_transform.

Parameters
x : cudf.Series

Input keys to be transformed. Its values do not have to match the categories given to fit.

Returns
encoded : cupy.ndarray

The ordinally encoded input series

Text Preprocessing (Single-GPU)

class cuml.preprocessing.text.stem.PorterStemmer(mode='NLTK_EXTENSIONS')[source]

A word stemmer based on the Porter stemming algorithm.

Porter, M. “An algorithm for suffix stripping.” Program 14.3 (1980): 130-137.

See http://www.tartarus.org/~martin/PorterStemmer/ for the homepage of the algorithm.

Martin Porter has endorsed several modifications to the Porter algorithm since writing his original paper, and those extensions are included in the implementations on his website. Additionally, others have proposed further improvements to the algorithm, including NLTK contributors. Currently, only the following mode is supported: PorterStemmer.NLTK_EXTENSIONS

  • Implementation that includes further improvements devised by NLTK contributors or taken from other modified implementations found on the web.

Parameters
mode : str, default=“NLTK_EXTENSIONS”

Mode of stemming. Only NLTK_EXTENSIONS is currently supported.

Examples

>>> import cudf
>>> from cuml.preprocessing.text.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> word_str_ser =  cudf.Series(['revival','singing','adjustable'])
>>> print(stemmer.stem(word_str_ser))
0     reviv
1      sing
2    adjust
dtype: object

Methods

stem(word_str_ser)

Stem Words using Porter stemmer

stem(word_str_ser)[source]

Stem Words using Porter stemmer

Parameters
word_str_ser : cudf.Series

A string series of words to stem

Returns
stemmed_ser : cudf.Series

Stemmed words strings series

Feature and Label Encoding (Dask-based Multi-GPU)

class cuml.dask.preprocessing.LabelBinarizer(*, client=None, **kwargs)[source]

A distributed version of LabelBinarizer for one-hot encoding a collection of labels.

Examples

Create an array with labels and dummy encode them

>>> import cupy as cp
>>> import cupyx
>>> from cuml.dask.preprocessing import LabelBinarizer

>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client
>>> import dask

>>> cluster = LocalCUDACluster()
>>> client = Client(cluster)

>>> labels = cp.asarray([0, 5, 10, 7, 2, 4, 1, 0, 0, 4, 3, 2, 1],
...                     dtype=cp.int32)
>>> labels = dask.array.from_array(labels)

>>> lb = LabelBinarizer()
>>> encoded = lb.fit_transform(labels)
>>> print(encoded.compute())
[[1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 1 0]
 [0 0 1 0 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 1 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 0 0 1 0 0 0 0]
 [0 0 1 0 0 0 0 0]
 [0 1 0 0 0 0 0 0]]
>>> decoded = lb.inverse_transform(encoded)
>>> print(decoded.compute())
[ 0  5 10  7  2  4  1  0  0  4  3  2  1]
>>> client.close()
>>> cluster.close()

Methods

fit(y)

Fit label binarizer

fit_transform(y)

Fit the label encoder and return transformed labels

inverse_transform(y[, threshold])

Invert a set of encoded labels back to original labels

transform(y)

Transform and return encoded labels

fit(y)[source]

Fit label binarizer

Parameters
y : Dask.Array of shape [n_samples,] or [n_samples, n_classes], chunked by row

Target values. A 2-d matrix should contain only 0 and 1 and represents multilabel classification.

Returns
self : returns an instance of self.

fit_transform(y)[source]

Fit the label encoder and return transformed labels

Parameters
y : Dask.Array of shape [n_samples,] or [n_samples, n_classes]

Target values. A 2-d matrix should contain only 0 and 1 and represents multilabel classification.

Returns
arr : Dask.Array backed by CuPy arrays containing encoded labels

inverse_transform(y, threshold=None)[source]

Invert a set of encoded labels back to original labels

Parameters
y : Dask.Array of shape [n_samples, n_classes] containing encoded labels

threshold : float

This value is currently ignored.

Returns
arr : Dask.Array backed by CuPy arrays containing original labels

transform(y)[source]

Transform and return encoded labels

Parameters
y : Dask.Array of shape [n_samples,] or [n_samples, n_classes]

Returns
arr : Dask.Array backed by CuPy arrays containing encoded labels

class cuml.dask.preprocessing.OneHotEncoder(*, client=None, verbose=False, **kwargs)[source]

Encode categorical features as a one-hot numeric array. The input to this transformer should be a dask_cuDF.DataFrame or cupy dask.Array, denoting the values taken on by categorical features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter). By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.

Parameters
categories : ‘auto’, cupy.ndarray or cudf.DataFrame, default=’auto’

Categories (unique values) per feature. All categories are expected to fit on one GPU.

  • ‘auto’ : Determine categories automatically from the training data.

  • DataFrame/ndarray : categories[col] holds the categories expected in the feature col.

drop : ‘first’, None or a dict, default=None

Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.

  • None : retain all features (the default).

  • ‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.

  • Dict : drop[col] is the category in feature col that should be dropped.

sparse : bool, default=False

This feature was deactivated and will raise an exception when True, because sparse matrices are not fully supported by cupy yet, which causes incorrect values when computing one hot encodings. See https://github.com/cupy/cupy/issues/3223

dtype : number type, default=np.float

Desired datatype of transform’s output.

handle_unknown{‘error’, ‘ignore’}, default=’error’

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
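The interaction between drop and handle_unknown can be easier to see on a toy example. The following is a minimal pure-Python sketch of the documented semantics for a single feature; it is not cuML's implementation, and the helper name one_hot_rows and the example categories are hypothetical.

```python
# Pure-Python illustration of the documented semantics; NOT cuML's implementation.
def one_hot_rows(values, categories, drop_first=False, handle_unknown="error"):
    """Encode one categorical feature as a list of 0.0/1.0 rows."""
    # drop='first' removes the column of the first category.
    kept = categories[1:] if drop_first else categories
    rows = []
    for value in values:
        if value not in categories:
            if handle_unknown == "error":
                raise ValueError(f"unknown category: {value!r}")
            # handle_unknown='ignore': the columns for this feature are all zeros.
            rows.append([0.0] * len(kept))
            continue
        rows.append([1.0 if value == cat else 0.0 for cat in kept])
    return rows


categories = ["bird", "cat", "dog"]  # hypothetical categories for one feature
full = one_hot_rows(["cat", "dog"], categories)
dropped = one_hot_rows(["bird", "dog"], categories, drop_first=True)
ignored = one_hot_rows(["fish"], categories, handle_unknown="ignore")
```

In this sketch a dropped category and an ignored unknown category both encode as an all-zero row, which is worth keeping in mind when combining the two options.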

Methods

fit(X)

Fit a multi-node multi-gpu OneHotEncoder to X.

fit_transform(X[, delayed])

Fit OneHotEncoder to X, then transform X.

inverse_transform(X[, delayed])

Convert the data back to the original representation.

transform(X[, delayed])

Transform X using one-hot encoding.

fit(X)[source]

Fit a multi-node multi-gpu OneHotEncoder to X.

Parameters
XDask cuDF DataFrame or CuPy backed Dask Array

The data to determine the categories of each feature.

Returns
self
fit_transform(X, delayed=True)[source]

Fit OneHotEncoder to X, then transform X. Equivalent to fit(X).transform(X).

Parameters
XDask cuDF DataFrame or CuPy backed Dask Array

The data to encode.

delayedbool (default = True)

Whether to execute as a delayed task or eagerly.

Returns
outDask cuDF DataFrame or CuPy backed Dask Array

Distributed object containing the transformed data

inverse_transform(X, delayed=True)[source]

Convert the data back to the original representation. In case unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category.

Parameters
XCuPy backed Dask Array, shape [n_samples, n_encoded_features]

The transformed data.

delayedbool (default = True)

Whether to execute as a delayed task or eagerly.

Returns
X_trDask cuDF DataFrame or CuPy backed Dask Array

Distributed object containing the inverse transformed array.
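As a rough illustration of the rule that an all-zero encoding maps back to None, here is a pure-Python sketch for a single feature. The helper decode_one_hot and the category list are hypothetical and are not part of cuML.

```python
# Pure-Python illustration only; NOT cuML's implementation.
def decode_one_hot(row, categories):
    """Return the category whose column is set, or None for an all-zero row."""
    for flag, category in zip(row, categories):
        if flag:
            return category
    return None  # unknown category: every column is zero


categories = ["bird", "cat", "dog"]  # hypothetical categories for one feature
decoded = [decode_one_hot(row, categories) for row in ([0, 1, 0], [0, 0, 0])]
```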

transform(X, delayed=True)[source]

Transform X using one-hot encoding.

Parameters
XDask cuDF DataFrame or CuPy backed Dask Array

The data to encode.

delayedbool (default = True)

Whether to execute as a delayed task or eagerly.

Returns
outDask cuDF DataFrame or CuPy backed Dask Array

Distributed object containing the transformed input.

Feature Extraction (Single-GPU)

class cuml.feature_extraction.text.CountVectorizer(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float32'>, delimiter=' ')[source]

Convert a collection of text documents to a matrix of token counts

If you do not provide an a-priori dictionary then the number of features will be equal to the vocabulary size found by analyzing the data.

Parameters
lowercaseboolean, True by default

Convert all characters to lowercase before tokenizing.

preprocessorcallable or None (default)

Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.

stop_wordsstring {‘english’}, list, or None (default)

If ‘english’, a built-in stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the input documents. If None, no stop words will be used. max_df can be set to a value to automatically detect and filter stop words based on intra corpus document frequency of terms.

ngram_rangetuple (min_n, max_n), default=(1, 1)

The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

analyzerstring, {‘word’, ‘char’, ‘char_wb’}

Whether the features should be made of word n-grams or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

max_dffloat in range [0.0, 1.0] or int, default=1.0

When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float, the parameter represents a proportion of documents; if an integer, absolute counts. This parameter is ignored if vocabulary is not None.

min_dffloat in range [0.0, 1.0] or int, default=1

When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If a float, the parameter represents a proportion of documents; if an integer, absolute counts. This parameter is ignored if vocabulary is not None.

max_featuresint or None, default=None

If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.

vocabularycudf.Series, optional

If not given, a vocabulary is determined from the input documents.

binaryboolean, default=False

If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.

dtypetype, optional

Type of the matrix returned by fit_transform() or transform().

delimiterstr, whitespace by default

String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.
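To make the ngram_range description concrete, the following pure-Python sketch extracts word n-grams for a given (min_n, max_n). It assumes plain whitespace tokenization, which is a simplification, and the helper word_ngrams is hypothetical rather than part of cuML.

```python
# Pure-Python illustration of ngram_range; NOT cuML's implementation.
def word_ngrams(text, ngram_range=(1, 1), lowercase=True):
    """Return all word n-grams with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    if lowercase:
        text = text.lower()
    tokens = text.split()  # assumption: tokens are separated by whitespace
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


unigrams = word_ngrams("The quick fox", (1, 1))
uni_and_bi = word_ngrams("The quick fox", (1, 2))
bigrams = word_ngrams("The quick fox", (2, 2))
```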

Attributes
vocabulary_cudf.Series[str]

Array mapping from feature integer indices to feature name.

stop_words_cudf.Series[str]
Terms that were ignored because they either:
  • occurred in too many documents (max_df)

  • occurred in too few documents (min_df)

  • were cut off by feature selection (max_features).

This is only available if no vocabulary was given.
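The max_df and min_df thresholds can be sketched on the CPU as follows. This is an illustration of the documented rule (terms strictly above max_df or strictly below min_df are ignored; floats are proportions, integers are absolute counts), not cuML's code; the helper name and the tiny corpus are hypothetical, and max_features is left out for brevity.

```python
# Pure-Python illustration of max_df / min_df filtering; NOT cuML's implementation.
def filter_by_document_frequency(docs, max_df=1.0, min_df=1):
    """Return (vocabulary, ignored_terms) for whitespace-tokenized documents."""
    n_docs = len(docs)
    # Floats are proportions of documents; integers are absolute counts.
    max_count = max_df * n_docs if isinstance(max_df, float) else max_df
    min_count = min_df * n_docs if isinstance(min_df, float) else min_df

    # Document frequency: number of documents that contain each term.
    df = {}
    for doc in docs:
        for term in set(doc.lower().split()):
            df[term] = df.get(term, 0) + 1

    vocabulary = sorted(t for t, c in df.items() if min_count <= c <= max_count)
    ignored = sorted(t for t in df if t not in vocabulary)
    return vocabulary, ignored


docs = ["red apple", "red pear", "red plum", "green apple"]
vocabulary, ignored = filter_by_document_frequency(docs, max_df=0.5, min_df=2)
```

Here "red" appears in three of four documents and exceeds max_df=0.5, while "pear", "plum" and "green" fall below min_df=2, so only "apple" is kept.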

Methods

fit(raw_documents)

Build a vocabulary of all tokens in the raw documents.

fit_transform(raw_documents)

Build the vocabulary and return document-term matrix.

get_feature_names()

Array mapping from feature integer indices to feature name.

inverse_transform(X)

Return terms per document with nonzero entries in X.

transform(raw_documents)

Transform documents to document-term matrix.

fit(raw_documents)[source]

Build a vocabulary of all tokens in the raw documents.

Parameters
raw_documentscudf.Series

A Series of string documents

Returns
self
fit_transform(raw_documents)[source]

Build the vocabulary and return document-term matrix.

Equivalent to self.fit(raw_documents).transform(raw_documents), but preprocesses the documents only once.

Parameters
raw_documentscudf.Series

A Series of string documents

Returns
Xcupy csr array of shape (n_samples, n_features)

Document-term matrix.

get_feature_names()[source]

Array mapping from feature integer indices to feature name.

Returns
feature_namesSeries

A list of feature names.

inverse_transform(X)[source]

Return terms per document with nonzero entries in X.

Parameters
Xarray-like of shape (n_samples, n_features)

Document-term matrix.

Returns
X_invlist of cudf.Series of shape (n_samples,)

List of Series of terms.

transform(raw_documents)[source]

Transform documents to document-term matrix.

Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.

Parameters
raw_documentscudf.Series

A Series of string documents

Returns
Xcupy csr array of shape (n_samples, n_features)

Document-term matrix.

class cuml.feature_extraction.text.HashingVectorizer(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, dtype=<class 'numpy.float32'>, delimiter=' ')[source]

Convert a collection of text documents to a matrix of token occurrences

It turns a collection of text documents into a cupyx.scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm=’l1’ or projected on the euclidean unit sphere if norm=’l2’.

This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index mapping.

This strategy has several advantages:

  • it is very memory-efficient and scales to large datasets, as there is no need to store a vocabulary dictionary in memory; this matters even more on GPUs, which are often memory constrained

  • it is fast to pickle and un-pickle as it holds no state besides the constructor parameters

  • it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.

There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):

  • there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.

  • there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).

  • no IDF weighting as this would render the transformer stateful.

The hash function employed is the signed 32-bit version of Murmurhash3.
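The hashing trick described above can be sketched in a few lines of standard-library Python. This is only an illustration: it uses CRC32 from zlib as a stand-in hash because MurmurHash3 is not in the standard library, and the way the index and sign are derived from the hash is an assumption modeled on common feature-hashing implementations, not a statement about cuML's internals.

```python
import zlib


def signed_hash32(token):
    """Stand-in signed 32-bit hash (CRC32); the documented hash is MurmurHash3."""
    h = zlib.crc32(token.encode("utf-8"))
    return h - 2**32 if h >= 2**31 else h


def hash_vectorize(tokens, n_features=16, alternate_sign=True):
    """Map tokens to a fixed-width row without storing any vocabulary."""
    row = [0.0] * n_features
    for token in tokens:
        h = signed_hash32(token)
        index = abs(h) % n_features  # distinct tokens may collide here
        # Assumption: the sign of the hash decides the sign of the update.
        sign = -1.0 if (alternate_sign and h < 0) else 1.0
        row[index] += sign
    return row


row = hash_vectorize("this is the first document".split())
```

Because the row width is fixed by n_features and nothing is learned from the data, the same function can be applied to any batch of documents independently, which is what makes the approach stateless.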

Parameters
lowercasebool, default=True

Convert all characters to lowercase before tokenizing.

preprocessorcallable or None (default)

Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.

stop_wordsstring {‘english’}, list, default=None

If ‘english’, a built-in stop word list for English is used. There are several known issues with ‘english’ and you should consider an alternative. If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.

ngram_rangetuple (min_n, max_n), default=(1, 1)

The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

analyzerstring, {‘word’, ‘char’, ‘char_wb’}

Whether the features should be made of word n-grams or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

n_featuresint, default=(2 ** 20)

The number of features (columns) in the output matrices. Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners.

binarybool, default=False.

If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.

norm{‘l1’, ‘l2’}, default=’l2’

Norm used to normalize term vectors. None for no normalization.

alternate_signbool, default=True

When True, an alternating sign is added to the features so as to approximately conserve the inner product in the hashed space, even for small n_features. This approach is similar to sparse random projection.

dtypetype, optional

Type of the matrix returned by fit_transform() or transform().

delimiterstr, whitespace by default

String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.

Examples

import cudf
from cuml.feature_extraction.text import HashingVectorizer

# The vectorizer expects a cudf.Series of strings (see fit and transform)
corpus = cudf.Series([
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
])
# 2**4 = 16 hashed feature columns
vectorizer = HashingVectorizer(n_features=2**4)
X = vectorizer.fit_transform(corpus)
print(X.shape)

Output:

(4, 16)

Methods

fit(X[, y])

This method only checks the input type and the model parameter.

fit_transform(X[, y])

Transform a sequence of documents to a document-term matrix.

partial_fit(X[, y])

Does nothing: this transformer is stateless. This method exists only to mark the fact that this transformer can work in a streaming setup.

transform(raw_documents)

Transform documents to document-term matrix.

fit(X, y=None)[source]

This method only checks the input type and the model parameters. It does not do anything meaningful, as this transformer is stateless.

Parameters
Xcudf.Series

A Series of string documents

fit_transform(X, y=None)[source]

Transform a sequence of documents to a document-term matrix.

Parameters
Xiterable over raw text documents, length = n_samples

Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed.

yany

Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

Returns
Xsparse CuPy CSR matrix of shape (n_samples, n_features)

Document-term matrix.

partial_fit(X, y=None)[source]

Does nothing: this transformer is stateless. This method exists only to mark the fact that this transformer can work in a streaming setup.

Parameters
Xcudf.Series

A Series of string documents
transform(raw_documents)[source]

Transform documents to document-term matrix.

Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.

Parameters
raw_documentscudf.Series

A Series of string documents

Returns
Xsparse CuPy CSR matrix of shape (n_samples, n_features)

Document-term matrix.

class cuml.feature_extraction.text.TfidfVectorizer(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float32'>, delimiter=' ', norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)[source]

Convert a collection of raw documents to a matrix of TF-IDF features.

Equivalent to CountVectorizer followed by TfidfTransformer.

Parameters
lowercaseboolean, True by default

Convert all characters to lowercase before tokenizing.

preprocessorcallable or None (default)

Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.

stop_wordsstring {‘english’}, list, or None (default)

If ‘english’, a built-in stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the input documents. If None, no stop words will be used. max_df can be set to a value to automatically detect and filter stop words based on intra corpus document frequency of terms.

ngram_rangetuple (min_n, max_n), default=(1, 1)

The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

analyzerstring, {‘word’, ‘char’, ‘char_wb’}, default=’word’

Whether the features should be made of word n-grams or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

max_dffloat in range [0.0, 1.0] or int, default=1.0

When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float, the parameter represents a proportion of documents; if an integer, absolute counts. This parameter is ignored if vocabulary is not None.

min_dffloat in range [0.0, 1.0] or int, default=1

When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If a float, the parameter represents a proportion of documents; if an integer, absolute counts. This parameter is ignored if vocabulary is not None.

max_featuresint or None, default=None

If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.

vocabularycudf.Series, optional

If not given, a vocabulary is determined from the input documents.

binaryboolean, default=False

If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.

dtypetype, optional

Type of the matrix returned by fit_transform() or transform().

delimiterstr, whitespace by default

String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.

norm{‘l1’, ‘l2’}, default=’l2’
Each output row will have unit norm, either:
  • ‘l2’: Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.

  • ‘l1’: Sum of absolute values of vector elements is 1.

use_idfbool, default=True

Enable inverse-document-frequency reweighting.

smooth_idfbool, default=True

Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.

sublinear_tfbool, default=False

Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
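The weighting options above can be combined as in the following pure-Python sketch for a single document row. The idf formulas used here, ln((1 + n) / (1 + df)) + 1 with smoothing and ln(n / df) + 1 without, are those of scikit-learn's TfidfTransformer; that cuML uses exactly the same expressions is an assumption. The helper tfidf_row and its inputs are hypothetical.

```python
import math


# Pure-Python illustration; NOT cuML's implementation.
def tfidf_row(counts, doc_freq, n_docs, smooth_idf=True, sublinear_tf=False, norm="l2"):
    """Weight one document's term counts by idf, then normalize the row."""
    weights = []
    for tf, df in zip(counts, doc_freq):
        if sublinear_tf and tf > 0:
            tf = 1.0 + math.log(tf)  # sublinear tf scaling
        if smooth_idf:
            # Assumed formula (scikit-learn convention): add one to n and to df.
            idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        else:
            idf = math.log(n_docs / df) + 1.0
        weights.append(tf * idf)
    if norm == "l2":
        scale = math.sqrt(sum(w * w for w in weights))
    elif norm == "l1":
        scale = sum(abs(w) for w in weights)
    else:
        scale = 1.0  # norm=None: leave the weights unnormalized
    return [w / scale for w in weights] if scale else weights


# Term counts for one document and document frequencies over a 3-document corpus.
row = tfidf_row(counts=[2, 1, 0], doc_freq=[3, 1, 2], n_docs=3)
```

With smoothing, a term that occurs in every document gets an idf of exactly 1, so it is kept but not boosted.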

Notes

The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling.

This class is largely based on scikit-learn 0.23.1’s TfidfVectorizer code, which is provided under the BSD-3 license.

Attributes
idf_array of shape (n_features)

The inverse document frequency (IDF) vector; only defined if use_idf is True.

vocabulary_cudf.Series[str]

Array mapping from feature integer indices to feature name.

stop_words_cudf.Series[str]
Terms that were ignored because they either:
  • occurred in too many documents (max_df)

  • occurred in too few documents (min_df)

  • were cut off by feature selection (max_features).

This is only available if no vocabulary was given.

Methods

fit(raw_documents)

Learn vocabulary and idf from training set.

fit_transform(raw_documents)

Learn vocabulary and idf, return document-term matrix.

get_feature_names()

Array mapping from feature integer indices to feature name.

transform(raw_documents)

Transform documents to document-term matrix.

fit(raw_documents)[source]

Learn vocabulary and idf from training set.

Parameters
raw_documentscudf.Series

A Series of string documents

Returns
selfobject

Fitted vectorizer.

fit_transform(raw_documents)[source]

Learn vocabulary and idf, return document-term matrix. This is equivalent to fit followed by transform, but more efficiently implemented.

Parameters
raw_documentscudf.Series

A Series of string documents

Returns
Xcupy csr array of shape (n_samples, n_features)

Tf-idf-weighted document-term matrix.

get_feature_names()[source]

Array mapping from feature integer indices to feature name.

Returns
feature_namesSeries

A list of feature names.

transform(raw_documents)[source]

Transform documents to document-term matrix. Uses the vocabulary and document frequencies (df) learned by fit (or fit_transform).

Parameters
raw_documentscudf.Series

A Series of string documents

Returns
Xcupy csr array of shape (n_samples, n_features)

Tf-idf-weighted document-term matrix.

Feature Extraction (Dask-based Multi-GPU)

class cuml.dask.feature_extraction.text.TfidfTransformer(*, client=None, verbose=False, **kwargs)[source]

Distributed TF-IDF transformer

Examples

>>> import cupy as cp
>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client
>>> from cuml.dask.common import to_sparse_dask_array
>>> from cuml.dask.naive_bayes import MultinomialNB
>>> import dask
>>> from cuml.dask.feature_extraction.text import TfidfTransformer

>>> # Create a local CUDA cluster
>>> cluster = LocalCUDACluster()
>>> client = Client(cluster)

>>> # Load corpus
>>> twenty_train = fetch_20newsgroups(subset='train',
...                         shuffle=True, random_state=42)
>>> cv = CountVectorizer()
>>> xformed = cv.fit_transform(twenty_train.data).astype(cp.float32)
>>> X = to_sparse_dask_array(xformed, client)

>>> y = dask.array.from_array(twenty_train.target, asarray=False,
...                     fancy=False).astype(cp.int32)

>>> multi_gpu_transformer = TfidfTransformer()
>>> X_transformed = multi_gpu_transformer.fit_transform(X)
>>> X_transformed.compute_chunk_sizes()
dask.array<...>

>>> model = MultinomialNB()
>>> model.fit(X_transformed, y)
<cuml.dask.naive_bayes.naive_bayes.MultinomialNB object at 0x...>
>>> result = model.score(X_transformed, y)
>>> print(result) 
array(0.93264981)
>>> client.close()
>>> cluster.close()

Methods

fit(X)

Fit distributed TFIDF Transformer

fit_transform(X)

Fit distributed TFIDFTransformer and then transform the given set of data samples.

transform(X)

Use distributed TFIDFTransformer to transform the given set of data samples.

fit(X)[source]

Fit distributed TFIDF Transformer

Parameters
Xdask.Array with blocks containing dense or sparse cupy arrays
Returns
cuml.dask.feature_extraction.text.TfidfTransformer instance
fit_transform(X)[source]

Fit distributed TFIDFTransformer and then transform the given set of data samples.

Parameters
Xdask.Array with blocks containing dense or sparse cupy arrays
Returns
dask.Array with blocks containing transformed sparse cupy arrays
transform(X)[source]

Use distributed TFIDFTransformer to transform the given set of data samples.

Parameters
Xdask.Array with blocks containing dense or sparse cupy arrays
Returns
dask.Array with blocks containing transformed sparse cupy arrays

Dataset Generation (Single-GPU)

random_state

Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.

cuml.datasets.make_blobs(n_samples=100, n_features=2, centers=None, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None, return_centers=False, order='F', dtype='float32')[source]

Generate isotropic Gaussian blobs for clustering.

Parameters
n_samplesint or array-like, optional (default=100)

If int, it is the total number of points equally divided among clusters. If array-like, each element of the sequence indicates the number of samples per cluster.

n_featuresint, optional (default=2)

The number of features for each sample.

centersint or array of shape [n_centers, n_features], optional (default=None)

The number of centers to generate, or the fixed center locations. If n_samples is an int and centers is None, 3 centers are generated. If n_samples is array-like, centers must be either None or an array of length equal to the length of n_samples.

cluster_stdfloat or sequence of floats, optional (default=1.0)

The standard deviation of the clusters.

center_boxpair of floats (min, max), optional (default=(-10.0, 10.0))

The bounding box for each cluster center when centers are generated at random.

shuffleboolean, optional (default=True)

Shuffle the samples.

random_stateint, RandomState instance, default=None

Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.

return_centersbool, optional (default=False)

If True, then return the centers of each cluster

order: str, optional (default=’F’)

The order of the generated samples

dtypestr, optional (default=’float32’)

Dtype of the generated samples

Returns
Xdevice array of shape [n_samples, n_features]

The generated samples.

ydevice array of shape [n_samples]

The integer labels for cluster membership of each sample.

centersdevice array, shape [n_centers, n_features]

The centers of each cluster. Only returned if return_centers=True.

See also

make_classification

a more intricate variant

Examples

>>> from sklearn.datasets import make_blobs
>>> X, y = make_blobs(n_samples=10, centers=3, n_features=2,
...                   random_state=0)
>>> print(X.shape)
(10, 2)
>>> y
array([0, 0, 1, 0, 2, 2, 2, 1, 1, 0])
>>> X, y = make_blobs(n_samples=[3, 3, 4], centers=None, n_features=2,
...                   random_state=0)
>>> print(X.shape)
(10, 2)
>>> y
array([0, 1, 2, 0, 2, 2, 2, 1, 1, 0])
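For intuition about what "isotropic Gaussian blobs" means, here is a CPU-only sketch using the standard library. It is not the cuML implementation (which generates device arrays on the GPU); the helper blobs_sketch, its round-robin label assignment, and the chosen centers are illustrative assumptions.

```python
import random


# Pure-Python illustration; NOT cuML's implementation.
def blobs_sketch(n_samples, centers, cluster_std=1.0, seed=None):
    """Return (X, y): points drawn around each center with the same std on every axis."""
    rng = random.Random(seed)
    n_centers = len(centers)
    X, y = [], []
    for i in range(n_samples):
        label = i % n_centers  # spread the samples evenly across the centers
        # Isotropic: every coordinate uses the same standard deviation.
        X.append([rng.gauss(c, cluster_std) for c in centers[label]])
        y.append(label)
    # Shuffle samples and labels together (mirrors shuffle=True).
    order = list(range(n_samples))
    rng.shuffle(order)
    return [X[i] for i in order], [y[i] for i in order]


X, y = blobs_sketch(10, centers=[(-5.0, 0.0), (0.0, 5.0), (5.0, 0.0)], seed=0)
```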
cuml.datasets.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None, order='F', dtype='float32', _centroids=None, _informative_covariance=None, _redundant_covariance=None, _repeated_indices=None)[source]

Generate a random n-class classification problem. This initially creates clusters of points normally distributed (std=1) about vertices of an n_informative-dimensional hypercube with sides of length 2*class_sep and assigns an equal number of clusters to each class. It introduces interdependence between these features and adds various types of further noise to the data. Without shuffling, X horizontally stacks features in the following order: the primary n_informative features, followed by n_redundant linear combinations of the informative features, followed by n_repeated duplicates, drawn randomly with replacement from the informative and redundant features. The remaining features are filled with random noise. Thus, without shuffling, all useful features are contained in the columns X[:, :n_informative + n_redundant + n_repeated].
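The column ordering described above (when shuffle is False) can be written down as half-open column ranges. The following sketch only restates that layout in code; the helper feature_layout is hypothetical and is not part of cuML.

```python
# Illustration of the unshuffled column layout; NOT cuML's implementation.
def feature_layout(n_features=20, n_informative=2, n_redundant=2, n_repeated=0):
    """Return half-open (start, stop) column ranges for each feature group."""
    n_useful = n_informative + n_redundant + n_repeated
    if n_useful > n_features:
        raise ValueError("informative + redundant + repeated exceeds n_features")
    return {
        "informative": (0, n_informative),
        "redundant": (n_informative, n_informative + n_redundant),
        "repeated": (n_informative + n_redundant, n_useful),
        "noise": (n_useful, n_features),
    }


layout = feature_layout()
```

With the default arguments, all useful features sit in columns 0 to 3 and the remaining 16 columns are noise.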

Parameters
n_samplesint, optional (default=100)

The number of samples.

n_featuresint, optional (default=20)

The total number of features. These comprise n_informative informative features, n_redundant redundant features, n_repeated duplicated features and n_features-n_informative-n_redundant-n_repeated useless features drawn at random.

n_informativeint, optional (default=2)

The number of informative features. Each class is composed of a number of gaussian clusters each located around the vertices of a hypercube in a subspace of dimension n_informative. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined within each cluster in order to add covariance. The clusters are then placed on the vertices of the hypercube.

n_redundantint, optional (default=2)

The number of redundant features. These features are generated as random linear combinations of the informative features.

n_repeatedint, optional (default=0)

The number of duplicated features, drawn randomly from the informative and the redundant features.

n_classesint, optional (default=2)

The number of classes (or labels) of the classification problem.

n_clusters_per_classint, optional (default=2)

The number of clusters per class.

weightsarray-like of shape (n_classes,) or (n_classes - 1,), (default=None)

The proportions of samples assigned to each class. If None, then classes are balanced. Note that if len(weights) == n_classes - 1, then the last class weight is automatically inferred. More than n_samples samples may be returned if the sum of weights exceeds 1.

flip_yfloat, optional (default=0.01)

The fraction of samples whose class is assigned randomly. Larger values introduce noise in the labels and make the classification task harder.

class_sepfloat, optional (default=1.0)

The factor multiplying the hypercube size. Larger values spread out the clusters/classes and make the classification task easier.

hypercubeboolean, optional (default=True)

If True, the clusters are put on the vertices of a hypercube. If False, the clusters are put on the vertices of a random polytope.

shiftfloat, array of shape [n_features] or None, optional (default=0.0)

Shift features by the specified value. If None, then features are shifted by a random value drawn in [-class_sep, class_sep].

scalefloat, array of shape [n_features] or None, optional (default=1.0)

Multiply features by the specified value. If None, then features are scaled by a random value drawn in [1, 100]. Note that scaling happens after shifting.

shuffleboolean, optional (default=True)

Shuffle the samples and the features.

random_stateint, RandomState instance or None (default)

Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See Glossary.

order: str, optional (default=’F’)

The order of the generated samples

dtypestr, optional (default=’float32’)

Dtype of the generated samples

_centroids: array of centroids of shape (n_clusters, n_informative)

_informative_covariance: array for covariance between informative features of shape (n_clusters, n_informative, n_informative)

_redundant_covariance: array for covariance between redundant features of shape (n_informative, n_redundant)

_repeated_indices: array of indices for the repeated features of shape (n_repeated,)

Returns
Xdevice array of shape [n_samples, n_features]

The generated samples.

ydevice array of shape [n_samples]

The integer labels for class membership of each sample.

Notes

The algorithm is adapted from Guyon [1] and was designed to generate the “Madelon” dataset. How we optimized for GPUs:

  1. First, we generate X from a standard univariate normal instead of zeros. This saves memory, as we don’t need to generate univariates separately for each feature class (informative, repeated, etc.), while also providing the added speedup of generating one big matrix on the GPU.

  2. We generate order=F by construction. We exploit the fact that X is generated from a univariate normal and that covariance is introduced with matrix multiplications. This means we can generate X as a 1D array and simply reshape it to the desired order, which only updates the metadata and eliminates copies.

  3. Lastly, we also shuffle by construction. Centroid indices are permuted for each sample, and then we construct the data for each centroid. This shuffle works for both order=C and order=F and eliminates any need for secondary copies.

References

[1] I. Guyon, “Design of experiments for the NIPS 2003 variable selection benchmark”, 2003.

Examples

>>> from cuml.datasets.classification import make_classification

>>> X, y = make_classification(n_samples=10, n_features=4,
...                            n_informative=2, n_classes=2,
...                            random_state=10)

>>> print(X) 
[[-1.7974224   0.24425316  0.39062843 -0.38293394]
[ 0.6358963   1.4161923   0.06970507 -0.16085647]
[-0.22802866 -1.1827322   0.3525861   0.276615  ]
[ 1.7308872   0.43080002  0.05048406  0.29837844]
[-1.9465544   0.5704457  -0.8997551  -0.27898186]
[ 1.0575483  -0.9171263   0.09529338  0.01173469]
[ 0.7917619  -1.0638094  -0.17599393 -0.06420116]
[-0.6686142  -0.13951421 -0.6074711   0.21645583]
[-0.88968956 -0.914443    0.1302423   0.02924336]
[-0.8817671  -0.84549576  0.1845096   0.02556021]]

>>> print(y)
[0 1 0 1 1 0 0 1 0 0]
cuml.datasets.make_regression(n_samples=100, n_features=2, n_informative=2, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None, dtype='single', handle=None) -> Union[Tuple[CumlArray, CumlArray], Tuple[CumlArray, CumlArray, CumlArray]][source]

Generate a random regression problem.

See https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html

Parameters
n_samplesint, optional (default=100)

The number of samples.

n_featuresint, optional (default=2)

The number of features.

n_informativeint, optional (default=2)

The number of informative features, i.e., the number of features used to build the linear model used to generate the output.

n_targetsint, optional (default=1)

The number of regression targets, i.e., the dimension of the y output vector associated with a sample. By default, the output is a scalar.

biasfloat, optional (default=0.0)

The bias term in the underlying linear model.

effective_rankint or None, optional (default=None)
if not None:

The approximate number of singular vectors required to explain most of the input data by linear combinations. Using this kind of singular spectrum in the input allows the generator to reproduce the correlations often observed in practice.

if None:

The input set is well conditioned, centered and gaussian with unit variance.

tail_strengthfloat between 0.0 and 1.0, optional (default=0.5)

The relative importance of the fat noisy tail of the singular values profile if effective_rank is not None.

noisefloat, optional (default=0.0)

The standard deviation of the gaussian noise applied to the output.

shuffleboolean, optional (default=True)

Shuffle the samples and the features.

coefboolean, optional (default=False)

If True, the coefficients of the underlying linear model are returned.

random_stateint, RandomState instance or None (default)

Seed for the random number generator for dataset creation.

dtype: string or numpy dtype (default: ‘single’)

Type of the data. Possible values: float32, float64, ‘single’, ‘float’ or ‘double’.

handle: cuml.Handle

If it is None, a new one is created just for this function call

Returns
outdevice array of shape [n_samples, n_features]

The input samples.

valuesdevice array of shape [n_samples, n_targets]

The output values.

coefdevice array of shape [n_features, n_targets], optional

The coefficient of the underlying linear model. It is returned only if coef is True.

Examples

>>> from cuml.datasets.regression import make_regression
>>> from cuml.linear_model import LinearRegression

>>> # Create regression problem
>>> data, values = make_regression(n_samples=200, n_features=12,
...                                n_informative=7, bias=-4.2,
...                                noise=0.3, random_state=10)

>>> # Perform a linear regression on this problem
>>> lr = LinearRegression(fit_intercept = True, normalize = False,
...                       algorithm = "eig")
>>> reg = lr.fit(data, values)
>>> print(reg.coef_) 
[-2.6980877e-02  7.7027252e+01  1.1498465e+01  8.5468025e+00
5.8548538e+01  6.0772545e+01  3.6876743e+01  4.0023815e+01
4.3908358e-03 -2.0275116e-02  3.5066366e-02 -3.4512520e-02]
cuml.datasets.make_arima(batch_size=1000, n_obs=100, order=(1, 1, 1), seasonal_order=(0, 0, 0, 0), intercept=False, random_state=None, dtype='double', handle=None)[source]

Generates a dataset of time series by simulating an ARIMA process of a given order.

Parameters
batch_size: int

Number of time series to generate

n_obs: int

Number of observations per series

orderTuple[int, int, int]

Order (p, d, q) of the simulated ARIMA process

seasonal_order: Tuple[int, int, int, int]

Seasonal ARIMA order (P, D, Q, s) of the simulated ARIMA process

intercept: bool or int

Whether to include a constant trend mu in the simulated ARIMA process

random_state: int, RandomState instance or None (default)

Seed for the random number generator for dataset creation.

dtype: string or numpy dtype (default: ‘double’)

Type of the data. Possible values: float32, float64, ‘single’, ‘float’ or ‘double’

handle: cuml.Handle

If it is None, a new one is created just for this function call

Returns
out: array-like, shape (n_obs, batch_size)

Array of the requested type containing the generated dataset

Examples

from cuml.datasets import make_arima
y = make_arima(1000, 100, (2,1,2), (0,1,2,12), 0)

Dataset Generation (Dask-based Multi-GPU)

cuml.dask.datasets.blobs.make_blobs(n_samples=100, n_features=2, centers=None, cluster_std=1.0, n_parts=None, center_box=(- 10, 10), shuffle=True, random_state=None, return_centers=False, verbose=False, order='F', dtype='float32', client=None)[source]

Makes labeled Dask-Cupy arrays containing blobs for a randomly generated set of centroids.

This function calls make_blobs from cuml.datasets on each Dask worker and aggregates the results into a single Dask array.

For more information, see Scikit-learn’s make_blobs documentation.

Parameters
n_samplesint

number of rows

n_featuresint

number of features

centersint or array of shape [n_centers, n_features], optional (default=None)

The number of centers to generate, or the fixed center locations. If n_samples is an int and centers is None, 3 centers are generated. If n_samples is array-like, centers must be either None or an array of length equal to the length of n_samples.

cluster_stdfloat (default = 1.0)

standard deviation of points around centroid

n_partsint (default = None)

number of partitions to generate (this can be greater than the number of workers)

center_boxtuple (int, int) (default = (-10, 10))

the bounding box which constrains all the centroids

random_stateint (default = None)

sets random seed (or use None to reinitialize each time)

return_centersbool, optional (default=False)

If True, then return the centers of each cluster

verboseint or boolean (default = False)

Logging level.

shufflebool (default=True)

Shuffles the samples on each worker.

order: str, optional (default=’F’)

The order of the generated samples

dtypestr, optional (default=’float32’)

Dtype of the generated samples

clientdask.distributed.Client (optional)

Dask client to use

Returns
Xdask.array backed by CuPy array of shape [n_samples, n_features]

The input samples.

ydask.array backed by CuPy array of shape [n_samples]

The output values.

centersdask.array backed by CuPy array of shape [n_centers, n_features], optional

The centers of the underlying blobs. It is returned only if return_centers is True.

cuml.dask.datasets.classification.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None, order='F', dtype='float32', n_parts=None, client=None)[source]

Generate a random n-class classification problem.

This initially creates clusters of points normally distributed (std=1) about vertices of an n_informative-dimensional hypercube with sides of length 2 * class_sep and assigns an equal number of clusters to each class. It introduces interdependence between these features and adds various types of further noise to the data.

Without shuffling, X horizontally stacks features in the following order: the primary n_informative features, followed by n_redundant linear combinations of the informative features, followed by n_repeated duplicates, drawn randomly with replacement from the informative and redundant features. The remaining features are filled with random noise. Thus, without shuffling, all useful features are contained in the columns X[:, :n_informative + n_redundant + n_repeated].

Parameters
n_samplesint, optional (default=100)

The number of samples.

n_featuresint, optional (default=20)

The total number of features. These comprise n_informative informative features, n_redundant redundant features, n_repeated duplicated features and n_features-n_informative-n_redundant-n_repeated useless features drawn at random.

n_informativeint, optional (default=2)

The number of informative features. Each class is composed of a number of gaussian clusters each located around the vertices of a hypercube in a subspace of dimension n_informative. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined within each cluster in order to add covariance. The clusters are then placed on the vertices of the hypercube.

n_redundantint, optional (default=2)

The number of redundant features. These features are generated as random linear combinations of the informative features.

n_repeatedint, optional (default=0)

The number of duplicated features, drawn randomly from the informative and the redundant features.

n_classesint, optional (default=2)

The number of classes (or labels) of the classification problem.

n_clusters_per_classint, optional (default=2)

The number of clusters per class.

weightsarray-like of shape (n_classes,) or (n_classes - 1,) , (default=None)

The proportions of samples assigned to each class. If None, then classes are balanced. Note that if len(weights) == n_classes - 1, then the last class weight is automatically inferred. More than n_samples samples may be returned if the sum of weights exceeds 1.

flip_yfloat, optional (default=0.01)

The fraction of samples whose class is assigned randomly. Larger values introduce noise in the labels and make the classification task harder.

class_sepfloat, optional (default=1.0)

The factor multiplying the hypercube size. Larger values spread out the clusters/classes and make the classification task easier.

hypercubeboolean, optional (default=True)

If True, the clusters are put on the vertices of a hypercube. If False, the clusters are put on the vertices of a random polytope.

shiftfloat, array of shape [n_features] or None, optional (default=0.0)

Shift features by the specified value. If None, then features are shifted by a random value drawn in [-class_sep, class_sep].

scalefloat, array of shape [n_features] or None, optional (default=1.0)

Multiply features by the specified value. If None, then features are scaled by a random value drawn in [1, 100]. Note that scaling happens after shifting.

shuffleboolean, optional (default=True)

Shuffle the samples and the features.

random_stateint, RandomState instance or None (default)

Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See Glossary.

order: str, optional (default=’F’)

The order of the generated samples

dtypestr, optional (default=’float32’)

Dtype of the generated samples

n_partsint (default = None)

number of partitions to generate (this can be greater than the number of workers)

Returns
Xdask.array backed by CuPy array of shape [n_samples, n_features]

The generated samples.

ydask.array backed by CuPy array of shape [n_samples]

The integer labels for class membership of each sample.

Notes

How we extended the dask MNMG version from the single GPU version:

  1. We generate centroids of shape (n_centroids, n_informative)

  2. We generate an informative covariance of shape (n_centroids, n_informative, n_informative)

  3. We generate a redundant covariance of shape (n_informative, n_redundant)

  4. We generate the indices for the repeated features

We pass along the references to the futures of the above arrays with each part to the single GPU cuml.datasets.classification.make_classification so that each part (and worker) has access to the correct values to generate data from the same covariances.

Examples

>>> from dask.distributed import Client
>>> from dask_cuda import LocalCUDACluster
>>> from cuml.dask.datasets.classification import make_classification
>>> cluster = LocalCUDACluster()
>>> client = Client(cluster)
>>> X, y = make_classification(n_samples=10, n_features=4,
...                            random_state=1, n_informative=2,
...                            n_classes=2)
>>> print(X.compute()) 
[[-1.1273878   1.2844919  -0.32349187  0.1595734 ]
[ 0.80521786 -0.65946865 -0.40753683  0.15538901]
[ 1.0404129  -1.481386    1.4241115   1.2664981 ]
[-0.92821544 -0.6805706  -0.26001272  0.36004275]
[-1.0392245  -1.1977317   0.16345565 -0.21848428]
[ 1.2273135  -0.529214    2.4799604   0.44108105]
[-1.9163864  -0.39505136 -1.9588828  -1.8881643 ]
[-0.9788184  -0.89851004 -0.08339313  0.1130247 ]
[-1.0549078  -0.8993015  -0.11921967  0.04821599]
[-1.8388828  -1.4063598  -0.02838472 -1.0874642 ]]
>>> print(y.compute()) 
[1 0 0 0 0 1 0 0 0 0]
>>> client.close()
>>> cluster.close()
cuml.dask.datasets.regression.make_low_rank_matrix(n_samples=100, n_features=100, effective_rank=10, tail_strength=0.5, random_state=None, n_parts=1, n_samples_per_part=None, dtype='float32')[source]

Generate a mostly low rank matrix with bell-shaped singular values

Parameters
n_samplesint, optional (default=100)

The number of samples.

n_featuresint, optional (default=100)

The number of features.

effective_rankint, optional (default=10)

The approximate number of singular vectors required to explain most of the data by linear combinations.

tail_strengthfloat between 0.0 and 1.0, optional (default=0.5)

The relative importance of the fat noisy tail of the singular values profile.

random_stateint, CuPy RandomState instance, Dask RandomState instance or None (default)

Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.

n_partsint, optional (default=1)

The number of parts of work.

dtype: str, optional (default=’float32’)

dtype of generated data

Returns
XDask-CuPy array of shape [n_samples, n_features]

The matrix.
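The "bell-shaped" profile can be illustrated with the singular-value formula that scikit-learn's make_low_rank_matrix uses. The sketch below is plain Python and assumes the Dask version follows the same profile; singular_value_profile is a hypothetical helper, not part of cuML.

```python
import math


def singular_value_profile(n, effective_rank=10, tail_strength=0.5):
    """Singular values: a Gaussian-shaped head plus a slowly decaying tail."""
    profile = []
    for i in range(n):
        # Head: dominates the first ~effective_rank singular values.
        low_rank = (1.0 - tail_strength) * math.exp(-((i / effective_rank) ** 2))
        # Tail: the "fat noisy tail" whose weight is set by tail_strength.
        tail = tail_strength * math.exp(-0.1 * i / effective_rank)
        profile.append(low_rank + tail)
    return profile


profile = singular_value_profile(30, effective_rank=10, tail_strength=0.5)
```

With tail_strength close to 0.0 the values drop off quickly after effective_rank; values close to 1.0 keep a long tail of non-negligible singular values.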

cuml.dask.datasets.regression.make_regression(n_samples=100, n_features=100, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=False, coef=False, random_state=None, n_parts=1, n_samples_per_part=None, order='F', dtype='float32', client=None, use_full_low_rank=True)[source]

Generate a random regression problem.

The input set can either be well conditioned (by default) or have a low rank-fat tail singular profile.

The output is generated by applying a (potentially biased) random linear regression model with “n_informative” nonzero regressors to the previously generated input and some gaussian centered noise with some adjustable scale.
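The generation step described above can be sketched in plain Python for a single target. This is only an illustration of the linear model, not the cuML implementation: regression_sketch is a hypothetical helper, the coefficient scale of 100 is an assumption borrowed from scikit-learn, and no shuffling is applied.

```python
import random


def regression_sketch(n_samples=5, n_features=4, n_informative=2,
                      bias=0.0, noise=0.0, seed=0):
    rng = random.Random(seed)
    # Well-conditioned input: independent standard normal features.
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    # Only the first n_informative coefficients are nonzero.
    coef = [100.0 * rng.random() if j < n_informative else 0.0
            for j in range(n_features)]
    # y = X . coef + bias + centered Gaussian noise with std `noise`.
    y = [sum(x * w for x, w in zip(row, coef)) + bias + rng.gauss(0.0, noise)
         for row in X]
    return X, y, coef


X, y, coef = regression_sketch(bias=-4.2, noise=0.0, seed=0)
```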

Parameters
n_samplesint, optional (default=100)

The number of samples.

n_featuresint, optional (default=100)

The number of features.

n_informativeint, optional (default=10)

The number of informative features, i.e., the number of features used to build the linear model used to generate the output.

n_targetsint, optional (default=1)

The number of regression targets, i.e., the dimension of the y output vector associated with a sample. By default, the output is a scalar.

biasfloat, optional (default=0.0)

The bias term in the underlying linear model.

effective_rankint or None, optional (default=None)
if not None:

The approximate number of singular vectors required to explain most of the input data by linear combinations. Using this kind of singular spectrum in the input allows the generator to reproduce the correlations often observed in practice.

if None:

The input set is well conditioned, centered and gaussian with unit variance.

tail_strengthfloat between 0.0 and 1.0, optional (default=0.5)

The relative importance of the fat noisy tail of the singular values profile if “effective_rank” is not None.

noisefloat, optional (default=0.0)

The standard deviation of the gaussian noise applied to the output.

shuffleboolean, optional (default=False)

Shuffle the samples and the features.

coefboolean, optional (default=False)

If True, the coefficients of the underlying linear model are returned.

random_stateint, CuPy RandomState instance, Dask RandomState instance or None (default)

Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.

n_partsint, optional (default=1)

The number of parts of work.

orderstr, optional (default=’F’)

Row-major or Col-major

dtype: str, optional (default=’float32’)

dtype of generated data

use_full_low_rankboolean (default=True)

Whether to use the entire dataset to generate the low rank matrix. If False, it creates a low rank covariance and uses the corresponding covariance to generate a multivariate normal distribution on the remaining chunks

Returns
XDask-CuPy array of shape [n_samples, n_features]

The input samples.

yDask-CuPy array of shape [n_samples] or [n_samples, n_targets]

The output values.

coefDask-CuPy array of shape [n_features] or [n_features, n_targets], optional

The coefficient of the underlying linear model. It is returned only if coef is True.

Notes

Known Performance Limitations:
  1. When effective_rank is set and use_full_low_rank is True, we cannot generate order F by construction, and an explicit transpose is performed on each part. This may cause memory to spike (other parameters make order F by construction)

  2. When n_targets > 1 and order = 'F' as above, we have to explicitly transpose the y array. If coef = True, then we also explicitly transpose the ground_truth array

  3. When shuffle = True and order = F, there are memory spikes to shuffle the F order arrays

Note

If out-of-memory errors are encountered in any of the above configurations, try increasing the n_parts parameter.

Array Wrappers (Internal API)

class cuml.common.CumlArray(data=None, index=None, owner=None, dtype=None, shape=None, order=None)[source]

Array represents an abstracted array allocation. It can be instantiated by itself, creating an rmm.DeviceBuffer underneath, or from __cuda_array_interface__ or __array_interface__ compliant arrays, in which case it keeps a reference to that data underneath. It can also be created from a pointer by specifying the characteristics of the array; in that case, the owner of the data referred to by the pointer should be specified explicitly.

Parameters
datarmm.DeviceBuffer, cudf.Buffer, array_like, int, bytes, bytearray or memoryview

An array-like object or integer representing a device or host pointer to pre-allocated memory.

ownerobject, optional

Python object to which the lifetime of the memory allocation is tied. If provided, a reference to this object is kept in this Buffer.

dtypedata-type, optional

Any object that can be interpreted as a numpy or cupy data type.

shapeint or tuple of ints, optional

Shape of created array.

order: string, optional

Whether to create a F-major or C-major array.

Notes

cuml Array is not meant as an end-user array library. It is meant for cuML/RAPIDS developer consumption. Therefore it contains the minimum functionality. Its functionality is hidden by base.pyx to provide automatic output format conversion so that the users see the important attributes in whatever format they prefer.

Todo: support cuda streams in the constructor. See: https://github.com/rapidsai/cuml/issues/1712 https://github.com/rapidsai/cuml/pull/1396

Attributes
ptrint

Pointer to the data

sizeint

Size of the array data in bytes

_ownerPython Object

Object that owns the data of the array

shapetuple of ints

Shape of the array

order{‘F’, ‘C’}

‘F’ or ‘C’ to indicate Fortran-major or C-major order of the array

stridestuple of ints

Strides of the data

__cuda_array_interface__dictionary

__cuda_array_interface__ to interop with other libraries.

Methods

empty(shape, dtype[, order, index])

Create an empty Array with an allocated but uninitialized DeviceBuffer

full(shape, value, dtype[, order, index])

Create an Array with an allocated DeviceBuffer initialized to value.

ones(shape[, dtype, order, index])

Create an Array with an allocated DeviceBuffer initialized to ones.

serialize()

Generate an equivalent serializable representation of an object.

to_output([output_type, output_dtype])

Convert array to output format

zeros(shape[, dtype, order, index])

Create an Array with an allocated DeviceBuffer initialized to zeros.

item

classmethod empty(shape, dtype, order='F', index=None)[source]

Create an empty Array with an allocated but uninitialized DeviceBuffer

Parameters
dtypedata-type, optional

Any object that can be interpreted as a numpy or cupy data type.

shapeint or tuple of ints, optional

Shape of created array.

order: string, optional

Whether to create a F-major or C-major array.

classmethod full(shape, value, dtype, order='F', index=None)[source]

Create an Array with an allocated DeviceBuffer initialized to value.

Parameters
dtypedata-type, optional

Any object that can be interpreted as a numpy or cupy data type.

shapeint or tuple of ints, optional

Shape of created array.

order: string, optional

Whether to create a F-major or C-major array.

classmethod ones(shape, dtype='float32', order='F', index=None)[source]

Create an Array with an allocated DeviceBuffer initialized to ones.

Parameters
dtypedata-type, optional

Any object that can be interpreted as a numpy or cupy data type.

shapeint or tuple of ints, optional

Shape of created array.

order: string, optional

Whether to create a F-major or C-major array.

to_output(output_type='cupy', output_dtype=None)[source]

Convert array to output format

Parameters
output_typestring

Format to convert the array to. Acceptable formats are:

  • ‘cupy’ - to cupy array

  • ‘numpy’ - to numpy (host) array

  • ‘numba’ - to numba device array

  • ‘dataframe’ - to cuDF DataFrame

  • ‘series’ - to cuDF Series

  • ‘cudf’ - to cuDF Series if array is single dimensional, to

    DataFrame otherwise

output_dtypestring, optional

Optionally cast the array to a specified dtype, creating a copy if necessary.

classmethod zeros(shape, dtype='float32', order='F', index=None)[source]

Create an Array with an allocated DeviceBuffer initialized to zeros.

Parameters
dtypedata-type, optional

Any object that can be interpreted as a numpy or cupy data type.

shapeint or tuple of ints, optional

Shape of created array.

order: string, optional

Whether to create a F-major or C-major array.

Metrics (regression, classification, and distance)

cuml.metrics.regression.mean_absolute_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')[source]

Mean absolute error regression loss

Be careful when using this metric with float32 inputs as the result can be slightly incorrect because of floating point precision if the input is large enough. float64 will have lower numerical error.

Parameters
y_truearray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)

Ground truth (correct) target values.

y_predarray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)

Estimated target values.

sample_weightarray-like (device or host) shape = (n_samples,), optional

Sample weights.

multioutputstring in [‘raw_values’, ‘uniform_average’] or array-like of shape (n_outputs)

Defines aggregating of multiple output values. Array-like value defines weights used to average errors.

  • ‘raw_values’: Returns a full set of errors in case of multioutput input.

  • ‘uniform_average’: Errors of all outputs are averaged with uniform weight.

Returns
lossfloat or ndarray of floats

If multioutput is ‘raw_values’, then mean absolute error is returned for each output separately. If multioutput is ‘uniform_average’ or an ndarray of weights, then the weighted average of all output errors is returned.

MAE output is non-negative floating point. The best value is 0.0.
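The effect of the multioutput options can be checked against a small plain-Python reference. This is a sketch of the arithmetic only, assuming 2-D inputs given as lists of rows; mae_reference is a hypothetical helper and not how cuML computes the metric on the GPU.

```python
def mae_reference(y_true, y_pred, sample_weight=None,
                  multioutput="uniform_average"):
    n_samples = len(y_true)
    n_outputs = len(y_true[0])
    weights = sample_weight if sample_weight is not None else [1.0] * n_samples
    total_weight = sum(weights)

    # Weighted mean of |y_true - y_pred| for each output column.
    per_output = [
        sum(w * abs(t[j] - p[j]) for w, t, p in zip(weights, y_true, y_pred))
        / total_weight
        for j in range(n_outputs)
    ]
    if multioutput == "raw_values":
        return per_output
    if multioutput == "uniform_average":
        return sum(per_output) / n_outputs
    # Otherwise multioutput is a sequence of per-output weights.
    return sum(m * e for m, e in zip(multioutput, per_output)) / sum(multioutput)


y_true = [[0.5, 1.0], [-1.0, 1.0], [7.0, -6.0]]
y_pred = [[0.0, 2.0], [-1.0, 2.0], [8.0, -5.0]]
raw = mae_reference(y_true, y_pred, multioutput="raw_values")  # one value per output
avg = mae_reference(y_true, y_pred)                            # uniform average
weighted = mae_reference(y_true, y_pred, multioutput=[0.3, 0.7])
```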

cuml.metrics.regression.mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average', squared=True)[source]

Mean squared error regression loss

Be careful when using this metric with float32 inputs as the result can be slightly incorrect because of floating point precision if the input is large enough. float64 will have lower numerical error.

Parameters
y_truearray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)

Ground truth (correct) target values.

y_predarray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)

Estimated target values.

sample_weightarray-like (device or host) shape = (n_samples,), optional

Sample weights.

multioutputstring in [‘raw_values’, ‘uniform_average’] or array-like of shape (n_outputs) (default=’uniform_average’)

Defines aggregating of multiple output values. Array-like value defines weights used to average errors.

  • ‘raw_values’: Returns a full set of errors in case of multioutput input.

  • ‘uniform_average’: Errors of all outputs are averaged with uniform weight.

squaredboolean value, optional (default = True)

If True returns MSE value, if False returns RMSE value.

Returns
lossfloat or ndarray of floats

A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
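The squared flag only decides whether the square root is taken at the end. A minimal plain-Python sketch for the unweighted, single-output case (mse_reference is a hypothetical helper, not the cuML code path):

```python
import math


def mse_reference(y_true, y_pred, squared=True):
    # Mean of squared residuals; squared=False returns the RMSE instead.
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return mse if squared else math.sqrt(mse)


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
mse = mse_reference(y_true, y_pred)                   # 0.375
rmse = mse_reference(y_true, y_pred, squared=False)   # sqrt(0.375)
```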

cuml.metrics.regression.mean_squared_log_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average', squared=True)[source]

Mean squared log error regression loss

Be careful when using this metric with float32 inputs as the result can be slightly incorrect because of floating point precision if the input is large enough. float64 will have lower numerical error.

Parameters
y_truearray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)

Ground truth (correct) target values.

y_predarray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)

Estimated target values.

sample_weightarray-like (device or host) shape = (n_samples,), optional

Sample weights.

multioutputstring in [‘raw_values’, ‘uniform_average’] or array-like of shape (n_outputs)

Defines aggregating of multiple output values. Array-like value defines weights used to average errors.

  • ‘raw_values’: Returns a full set of errors in case of multioutput input.

  • ‘uniform_average’: Errors of all outputs are averaged with uniform weight.

squaredboolean value, optional (default = True)

If True returns MSLE (mean squared log error) value, if False returns RMSLE (root mean squared log error) value.

Returns
lossfloat or ndarray of floats

A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
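The mean squared log error is the mean squared error computed on log(1 + y), so it is only defined for non-negative targets. The following plain-Python sketch covers the unweighted, single-output case and assumes the scikit-learn definition; msle_reference is a hypothetical helper.

```python
import math


def msle_reference(y_true, y_pred, squared=True):
    if any(v < 0 for v in list(y_true) + list(y_pred)):
        raise ValueError("MSLE requires non-negative targets and predictions")
    # Squared difference of log1p-transformed values, averaged over samples.
    msle = sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)
    return msle if squared else math.sqrt(msle)


y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
msle = msle_reference(y_true, y_pred)
rmsle = msle_reference(y_true, y_pred, squared=False)
```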

cuml.metrics.regression.r2_score(y, y_hat, convert_dtype=True, handle=None) -> double[source]

Calculates r2 score between y and y_hat

Parameters
yarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

y_hatarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = True)

When set to True, the method will, when necessary, convert y_hat to be the same data type as y if they differ. This will increase memory used for the method.

Returns
r2_scoredouble

The R² score (coefficient of determination) of y_hat with respect to y
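For reference, the R² score is one minus the ratio of the residual sum of squares to the total sum of squares. A plain-Python sketch of that formula (r2_reference is a hypothetical helper; the input must not be constant, otherwise the denominator is zero):

```python
def r2_reference(y, y_hat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # residual sum of squares
    ss_tot = sum((a - mean_y) ** 2 for a in y)            # total sum of squares
    return 1.0 - ss_res / ss_tot


y = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]
score = r2_reference(y, y_hat)
perfect = r2_reference(y, y)
```

A perfect prediction gives 1.0, and a model that always predicts the mean of y gives 0.0.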

cuml.metrics.accuracy.accuracy_score(ground_truth, predictions, handle=None, convert_dtype=True)[source]

Calculates the accuracy score of a classification model.

Parameters
handlecuml.Handle
predictionsNumPy ndarray or Numba device array

The labels predicted by the model for the test dataset

ground_truthNumPy ndarray or Numba device array

The ground truth labels of the test dataset

Returns
float

The accuracy of the model used for prediction
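Accuracy is the fraction of predictions that match the ground truth exactly. A minimal plain-Python sketch, with accuracy_reference as a hypothetical helper:

```python
def accuracy_reference(ground_truth, predictions):
    # Count positions where the predicted label equals the true label.
    matches = sum(1 for t, p in zip(ground_truth, predictions) if t == p)
    return matches / len(ground_truth)


ground_truth = [0, 1, 2, 2, 1]
predictions = [0, 1, 1, 2, 1]
acc = accuracy_reference(ground_truth, predictions)  # 4 of 5 labels match
```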

cuml.metrics.confusion_matrix(y_true, y_pred, labels=None, sample_weight=None, normalize=None, convert_dtype=False) -> cuml.common.array.CumlArray[source]

Compute confusion matrix to evaluate the accuracy of a classification.

Parameters
y_truearray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)

Ground truth (correct) target values.

y_predarray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)

Estimated target values.

labelsarray-like (device or host) shape = (n_classes,), optional

List of labels to index the matrix. This may be used to reorder or select a subset of labels. If None is given, those that appear at least once in y_true or y_pred are used in sorted order.

sample_weightarray-like (device or host) shape = (n_samples,), optional

Sample weights.

normalizestring in [‘true’, ‘pred’, ‘all’] or None (default=None)

Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population. If None, confusion matrix will not be normalized.

convert_dtypebool, optional (default=False)

When set to True, the confusion matrix method will automatically convert the predictions, ground truth, and labels arrays to np.int32.

Returns
Carray-like (device or host) shape = (n_classes, n_classes)

Confusion matrix.
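The layout of the result (rows are true labels, columns are predicted labels) and the three normalize modes can be sketched in plain Python. This is an illustration without sample weights; confusion_matrix_reference is a hypothetical helper, not the cuML implementation.

```python
def confusion_matrix_reference(y_true, y_pred, labels=None, normalize=None):
    if labels is None:
        # Labels that appear at least once, in sorted order.
        labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)
    cm = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        if t in index and p in index:
            cm[index[t]][index[p]] += 1  # row = true label, column = prediction

    def safe_div(a, b):
        return a / b if b else 0.0

    if normalize == "true":   # each row sums to 1
        return [[safe_div(v, sum(row)) for v in row] for row in cm]
    if normalize == "pred":   # each column sums to 1
        col = [sum(row[j] for row in cm) for j in range(n)]
        return [[safe_div(row[j], col[j]) for j in range(n)] for row in cm]
    if normalize == "all":    # the whole matrix sums to 1
        total = sum(sum(row) for row in cm)
        return [[safe_div(v, total) for v in row] for row in cm]
    return cm


y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
counts = confusion_matrix_reference(y_true, y_pred)
by_true = confusion_matrix_reference(y_true, y_pred, normalize="true")
```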

cuml.metrics.kl_divergence(P, Q, handle=None, convert_dtype=True)[source]

Calculates the “Kullback-Leibler” divergence. The KL divergence tells us how well the probability distribution Q approximates the probability distribution P. It is also often used as a ‘distance metric’ between two probability distributions, although it is not symmetric.

Parameters
PDense array of probabilities corresponding to distribution P

shape = (n_samples, 1) Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy.

QDense array of probabilities corresponding to distribution Q

shape = (n_samples, 1) Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy.

handlecuml.Handle
convert_dtypebool, optional (default = True)

When set to True, the method will convert P and Q to be the same data type: float32. This will increase memory used for the method.

Returns
float

The KL Divergence value
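The asymmetry is easy to see with a small plain-Python sketch of the discrete KL divergence, the sum of P(i) * log(P(i) / Q(i)). It assumes the natural logarithm and strictly positive Q; kl_reference is a hypothetical helper.

```python
import math


def kl_reference(p, q):
    # Terms with p_i == 0 contribute nothing to the sum.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.9, 0.1]
q = [0.5, 0.5]
kl_pq = kl_reference(p, q)  # how well Q approximates P
kl_qp = kl_reference(q, p)  # how well P approximates Q; a different value
```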

cuml.metrics.log_loss(y_true, y_pred, eps=1e-15, normalize=True, sample_weight=None) -> float[source]

Log loss, aka logistic loss or cross-entropy loss. This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of a logistic model that returns y_pred probabilities for its training data y_true. The log loss is only defined for two or more labels.

Parameters
y_truearray-like, shape = (n_samples,)
y_predarray-like of float, shape = (n_samples, n_classes) or (n_samples,)

epsfloat (default=1e-15)

Log loss is undefined for p=0 or p=1, so probabilities are clipped to max(eps, min(1 - eps, p)).

normalizebool, optional (default=True)

If true, return the mean loss per sample. Otherwise, return the sum of the per-sample losses.

sample_weightarray-like of shape (n_samples,), default=None

Sample weights.

Returns
lossfloat

Notes

The logarithm used is the natural logarithm (base-e).

References

C.M. Bishop (2006). Pattern Recognition and Machine Learning. Springer, p. 209.

Examples

>>> from cuml.metrics import log_loss
>>> import cupy as cp
>>> log_loss(cp.array([1, 0, 0, 1]),
...          cp.array([[.1, .9], [.9, .1], [.8, .2], [.35, .65]]))
0.21616...
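The value in the example above can be reproduced by hand: for each sample, take the clipped probability assigned to the true class and average the negative natural logarithms. The sketch below is plain Python for the (n_samples, n_classes) case without sample weights; log_loss_reference is a hypothetical helper.

```python
import math


def log_loss_reference(y_true, y_pred, eps=1e-15, normalize=True):
    losses = []
    for label, probs in zip(y_true, y_pred):
        # Clip so that log(0) never occurs.
        p = max(eps, min(1 - eps, probs[label]))
        losses.append(-math.log(p))
    total = sum(losses)
    return total / len(losses) if normalize else total


y_true = [1, 0, 0, 1]
y_pred = [[0.1, 0.9], [0.9, 0.1], [0.8, 0.2], [0.35, 0.65]]
mean_loss = log_loss_reference(y_true, y_pred)
sum_loss = log_loss_reference(y_true, y_pred, normalize=False)
```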
cuml.metrics.roc_auc_score(y_true, y_score)[source]

Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.

Note

this implementation can only be used with binary classification.

Parameters
y_truearray-like of shape (n_samples,)

True labels. The binary cases expect labels with shape (n_samples,)

y_scorearray-like of shape (n_samples,)

Target scores. In the binary cases, these can be either probability estimates or non-thresholded decision values (as returned by decision_function on some classifiers). The binary case expects a shape (n_samples,), and the scores must be the scores of the class with the greater label.

Returns
aucfloat

Examples

>>> import numpy as np
>>> from cuml.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> print(roc_auc_score(y_true, y_scores))
0.75
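For binary labels, the ROC AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The plain-Python sketch below uses that pairwise definition, which is quadratic and meant only for checking small cases; roc_auc_reference is a hypothetical helper.

```python
def roc_auc_reference(y_true, y_score):
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0   # positive ranked above negative
            elif sp == sn:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_reference(y_true, y_scores)  # 3 of 4 positive/negative pairs are ordered correctly
```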
cuml.metrics.precision_recall_curve(y_true, probs_pred) -> Tuple[cuml.common.array.CumlArray, cuml.common.array.CumlArray, cuml.common.array.CumlArray][source]

Compute precision-recall pairs for different probability thresholds

Note

This implementation is restricted to the binary classification task.

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples. The last precision and recall values are 1. and 0. respectively and do not have a corresponding threshold. This ensures that the graph starts on the y axis.

Read more in the scikit-learn’s User Guide.

Parameters
y_truearray, shape = [n_samples]

True binary labels, {0, 1}.

probs_predarray, shape = [n_samples]

Estimated probabilities or decision function.

Returns
precisionarray, shape = [n_thresholds + 1]

Precision values such that element i is the precision of predictions with score >= thresholds[i] and the last element is 1.

recallarray, shape = [n_thresholds + 1]

Decreasing recall values such that element i is the recall of predictions with score >= thresholds[i] and the last element is 0.

thresholdsarray, shape = [n_thresholds <= len(np.unique(probs_pred))]

Increasing thresholds on the decision function used to compute precision and recall.

Examples

>>> import cupy as cp
>>> from cuml.metrics import precision_recall_curve
>>> y_true = cp.array([0, 0, 1, 1])
>>> y_scores = cp.array([0.1, 0.4, 0.35, 0.8])
>>> precision, recall, thresholds = precision_recall_curve(
...     y_true, y_scores)
>>> print(precision)
[0.666... 0.5  1.  1. ]
>>> print(recall)
[1. 0.5 0.5 0. ]
>>> print(thresholds)
[0.35 0.4 0.8 ]
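The three arrays above can be reproduced by counting true and false positives at each distinct score. The following pure-Python sketch does that. The helper name pr_curve_reference is hypothetical, and dropping thresholds below the point where recall first reaches 1 is an assumption chosen to match the example output, not a statement about cuML's internals.

```python
def pr_curve_reference(y_true, scores):
    """Precision/recall at each distinct score threshold (hypothetical helper)."""
    n_pos = sum(y_true)
    precision, recall, thresholds = [], [], []
    for t in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score is >= the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        precision.append(tp / (tp + fp))
        recall.append(tp / n_pos)
        thresholds.append(t)
        if tp == n_pos:  # full recall reached; lower thresholds add nothing new
            break
    precision.reverse()
    recall.reverse()
    thresholds.reverse()
    # Final point (precision 1, recall 0) has no corresponding threshold.
    return precision + [1.0], recall + [0.0], thresholds

precision, recall, thresholds = pr_curve_reference([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(thresholds)  # [0.35, 0.4, 0.8]
```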
cuml.metrics.pairwise_distances.pairwise_distances(X, Y=None, metric='euclidean', handle=None, convert_dtype=True, metric_arg=2, **kwds)[source]

Compute the distance matrix from a vector array X and optional Y.

This method takes either one or two vector arrays, and returns a distance matrix.

If Y is given (default is None), then the returned matrix is the pairwise distance between the arrays from both X and Y.

Valid values for metric are:

  • From scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’].

    Sparse matrices are supported, see ‘sparse_pairwise_distances’.

  • From scipy.spatial.distance: [‘sqeuclidean’]

    See the documentation for scipy.spatial.distance for details on this metric. Sparse matrices are supported.

Parameters
XDense or sparse matrix (device or host) of shape (n_samples_x, n_features)

Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy, or cupyx.scipy.sparse for sparse input

Yarray-like (device or host) of shape (n_samples_y, n_features), optional

Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

metric{“cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, “sqeuclidean”}

The metric to use when calculating distance between instances in a feature array.

convert_dtypebool, optional (default = True)

When set to True, the method will, when necessary, convert Y to be the same data type as X if they differ. This will increase memory used for the method.

Returns
Darray [n_samples_x, n_samples_x] or [n_samples_x, n_samples_y]

A distance matrix D such that D_{i, j} is the distance between the ith and jth vectors of the given matrix X, if Y is None. If Y is not None, then D_{i, j} is the distance between the ith array from X and the jth array from Y.

Examples

>>> import cupy as cp
>>> from cuml.metrics import pairwise_distances
>>>
>>> X = cp.array([[2.0, 3.0], [3.0, 5.0], [5.0, 8.0]])
>>> Y = cp.array([[1.0, 0.0], [2.0, 1.0]])
>>>
>>> # Euclidean Pairwise Distance, Single Input:
>>> pairwise_distances(X, metric='euclidean')
array([[0.        , 2.236..., 5.830...],
    [2.236..., 0.        , 3.605...],
    [5.830..., 3.605..., 0.        ]])
>>>
>>> # Cosine Pairwise Distance, Multi-Input:
>>> pairwise_distances(X, Y, metric='cosine')
array([[0.445... , 0.131...],
    [0.485..., 0.156...],
    [0.470..., 0.146...]])
>>>
>>> # Manhattan Pairwise Distance, Multi-Input:
>>> pairwise_distances(X, Y, metric='manhattan')
array([[ 4.,  2.],
    [ 7.,  5.],
    [12., 10.]])
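To make the three metrics in the example concrete, the sketch below recomputes them on the CPU with plain Python lists. All function names are hypothetical; the code mirrors the textbook definitions rather than cuML's GPU kernels.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine(x, y):
    # Cosine distance is 1 minus the cosine similarity.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def pairwise_reference(X, Y, dist):
    """Return D with D[i][j] = dist(X[i], Y[j])."""
    return [[dist(x, y) for y in Y] for x in X]

X = [[2.0, 3.0], [3.0, 5.0], [5.0, 8.0]]
Y = [[1.0, 0.0], [2.0, 1.0]]

D_euclidean = pairwise_reference(X, X, euclidean)  # single input: X against itself
D_cosine = pairwise_reference(X, Y, cosine)
D_manhattan = pairwise_reference(X, Y, manhattan)
print(D_manhattan)  # [[4.0, 2.0], [7.0, 5.0], [12.0, 10.0]]
```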
cuml.metrics.pairwise_distances.sparse_pairwise_distances(X, Y=None, metric='euclidean', handle=None, convert_dtype=True, metric_arg=2, **kwds)[source]

Compute the distance matrix from a vector array X and optional Y.

This method takes either one or two sparse vector arrays, and returns a dense distance matrix.

If Y is given (default is None), then the returned matrix is the pairwise distance between the arrays from both X and Y.

Valid values for metric are:

  • From scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’].

  • From scipy.spatial.distance: [‘sqeuclidean’, ‘canberra’, ‘minkowski’, ‘jaccard’, ‘chebyshev’, ‘dice’]

    See the documentation for scipy.spatial.distance for details on these metrics.

  • [‘inner_product’, ‘hellinger’]

Parameters
Xarray-like (device or host) of shape (n_samples_x, n_features)

Acceptable formats: SciPy or CuPy sparse array

Yarray-like (device or host) of shape (n_samples_y, n_features), optional

Acceptable formats: SciPy or CuPy sparse array

metric{“cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, “sqeuclidean”, “canberra”, “lp”, “inner_product”, “minkowski”, “jaccard”, “hellinger”, “chebyshev”, “linf”, “dice”}

The metric to use when calculating distance between instances in a feature array.

convert_dtypebool, optional (default = True)

When set to True, the method will, when necessary, convert Y to be the same data type as X if they differ. This will increase memory used for the method.

metric_argfloat, optional (default = 2)

Additional metric-specific argument. For Minkowski it’s the p-norm to apply.

Returns
Darray [n_samples_x, n_samples_x] or [n_samples_x, n_samples_y]

A dense distance matrix D such that D_{i, j} is the distance between the ith and jth vectors of the given matrix X, if Y is None. If Y is not None, then D_{i, j} is the distance between the ith array from X and the jth array from Y.

Examples

>>> import cupyx
>>> from cuml.metrics import sparse_pairwise_distances

>>> X = cupyx.scipy.sparse.random(2, 3, density=0.5, random_state=9)
>>> Y = cupyx.scipy.sparse.random(1, 3, density=0.5, random_state=9)
>>> X.todense()
array([[0.8098..., 0.537..., 0. ],
    [0.        , 0.856..., 0. ]])
>>> Y.todense()
array([[0.        , 0.        , 0.993...]])
>>> # Cosine Pairwise Distance, Single Input:
>>> sparse_pairwise_distances(X, metric='cosine')
array([[0.      , 0.447...],
    [0.447..., 0.        ]])

>>> # Squared euclidean Pairwise Distance, Multi-Input:
>>> sparse_pairwise_distances(X, Y, metric='sqeuclidean')
array([[1.931...],
    [1.720...]])

>>> # Canberra Pairwise Distance, Multi-Input:
>>> sparse_pairwise_distances(X, Y, metric='canberra')
array([[3.],
    [2.]])
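The Canberra results above (3 and 2) follow from the sparsity pattern alone: every coordinate where exactly one vector is non-zero contributes 1, and coordinates where both are zero contribute nothing. The sketch below shows this with dense Python lists, together with a Minkowski distance whose exponent p plays the role of metric_arg. Function names and input values are hypothetical and only share the sparsity pattern of the example.

```python
def canberra(x, y):
    """Sum of |a - b| / (|a| + |b|); terms with a zero denominator are skipped."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total

def minkowski(x, y, p=2):
    """p-norm of the difference vector; p corresponds to metric_arg."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

row0 = [0.8, 0.5, 0.0]
row1 = [0.0, 0.9, 0.0]
other = [0.0, 0.0, 1.0]

print(canberra(row0, other))  # 3.0
print(canberra(row1, other))  # 2.0
print(minkowski([3.0, 0.0], [0.0, 4.0], p=2))  # 5.0
```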
cuml.metrics.pairwise_kernels.pairwise_kernels(X, Y=None, metric='linear', *, filter_params=False, convert_dtype=True, **kwds)[source]

Compute the kernel between arrays X and optional array Y.

This method takes either a vector array or a kernel matrix, and returns a kernel matrix. If the input is a vector array, the kernels are computed. If the input is a kernel matrix, it is returned instead.

This method provides a safe way to take a kernel matrix as input, while preserving compatibility with many other algorithms that take a vector array.

If Y is given (default is None), then the returned matrix is the pairwise kernel between the arrays from both X and Y.

Valid values for metric are: [‘additive_chi2’, ‘chi2’, ‘linear’, ‘poly’, ‘polynomial’, ‘rbf’, ‘laplacian’, ‘sigmoid’, ‘cosine’]

Parameters
XDense matrix (device or host) of shape (n_samples_X, n_samples_X) or (n_samples_X, n_features)

Array of pairwise kernels between samples, or a feature array. The shape of the array should be (n_samples_X, n_samples_X) if metric == “precomputed” and (n_samples_X, n_features) otherwise. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

YDense matrix (device or host) of shape (n_samples_Y, n_features), default=None

A second feature array only if X has shape (n_samples_X, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

metricstr or callable (numba device function), default=”linear”

The metric to use when calculating kernel between instances in a feature array. If metric is “precomputed”, X is assumed to be a kernel matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two rows from X as input and return the corresponding kernel value as a single number.

filter_paramsbool, default=False

Whether to filter invalid parameters or not.

convert_dtypebool, optional (default = True)

When set to True, the method will, when necessary, convert Y to be the same data type as X if they differ. This will increase memory used for the method.

**kwdsoptional keyword parameters

Any further parameters are passed directly to the kernel function.

Returns
Kndarray of shape (n_samples_X, n_samples_X) or (n_samples_X, n_samples_Y)

A kernel matrix K such that K_{i, j} is the kernel between the ith and jth vectors of the given matrix X, if Y is None. If Y is not None, then K_{i, j} is the kernel between the ith array from X and the jth array from Y.

Notes

If metric is ‘precomputed’, Y is ignored and X is returned.

Examples

>>> import cupy as cp
>>> from cuml.metrics import pairwise_kernels
>>> from numba import cuda
>>> import math

>>> X = cp.array([[2, 3], [3, 5], [5, 8]])
>>> Y = cp.array([[1, 0], [2, 1]])

>>> pairwise_kernels(X, Y, metric='linear')
array([[ 2,  7],
    [ 3, 11],
    [ 5, 18]])
>>> @cuda.jit(device=True)
... def custom_rbf_kernel(x, y, gamma=None):
...     if gamma is None:
...         gamma = 1.0 / len(x)
...     sum = 0.0
...     for i in range(len(x)):
...         sum += (x[i] - y[i]) ** 2
...     return math.exp(-gamma * sum)

>>> pairwise_kernels(X, Y, metric=custom_rbf_kernel) 
array([[6.73794700e-03, 1.35335283e-01],
    [5.04347663e-07, 2.03468369e-04],
    [4.24835426e-18, 2.54366565e-13]])
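As a CPU cross-check of the two kernel matrices above, the sketch below evaluates the linear kernel and the same RBF formula as custom_rbf_kernel on plain Python lists. The function names are hypothetical and nothing here runs on the GPU.

```python
import math

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=None):
    # Same default as the device function above: gamma = 1 / n_features.
    if gamma is None:
        gamma = 1.0 / len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

X = [[2, 3], [3, 5], [5, 8]]
Y = [[1, 0], [2, 1]]

# K[i][j] is the kernel between row i of X and row j of Y.
K_linear = [[linear_kernel(x, y) for y in Y] for x in X]
K_rbf = [[rbf_kernel(x, y) for y in Y] for x in X]
print(K_linear)  # [[2, 7], [3, 11], [5, 18]]
```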

Metrics (clustering and manifold learning)

cuml.metrics.trustworthiness.trustworthiness(X, X_embedded, handle=None, n_neighbors=5, metric='euclidean', convert_dtype=True, batch_size=512) double[source]

Expresses to what extent the local structure is retained in embedding. The score is defined in the range [0, 1].

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

X_embeddedarray-like (device or host) shape = (n_samples, n_features)

Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

n_neighborsint, optional (default=5)

Number of neighbors considered

metricstr in [‘euclidean’] (default=’euclidean’)

Metric used to compute the trustworthiness. For the moment only ‘euclidean’ is supported.

convert_dtypebool, optional (default=True)

When set to True, the trustworthiness method will automatically convert the inputs to np.float32.

batch_sizeint (default=512)

The number of samples to use for each batch.

Returns
trustworthiness scoredouble

Trustworthiness of the low-dimensional embedding
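For intuition, the score can be computed by brute force on a small dataset. The sketch below uses the commonly cited definition, T(k) = 1 - 2 / (n k (2n - 3k - 1)) * sum over samples i and their k nearest embedded neighbors j of max(0, r(i, j) - k), where r(i, j) is the rank of j among the neighbors of i in the input space. That this matches cuML's batched computation exactly is an assumption; the helper names are hypothetical and math.dist requires Python 3.8 or later.

```python
import math

def trustworthiness_reference(X, X_embedded, n_neighbors=5):
    n = len(X)

    def ranked_neighbors(points, i):
        # All other sample indices, nearest first (Euclidean distance).
        others = [j for j in range(n) if j != i]
        return sorted(others, key=lambda j: math.dist(points[i], points[j]))

    penalty = 0
    for i in range(n):
        # 1-based rank of every other sample in the original space.
        rank = {j: r + 1 for r, j in enumerate(ranked_neighbors(X, i))}
        # Penalize embedded neighbors that were not among the k nearest originally.
        for j in ranked_neighbors(X_embedded, i)[:n_neighbors]:
            penalty += max(0, rank[j] - n_neighbors)
    norm = n * n_neighbors * (2 * n - 3 * n_neighbors - 1)
    return 1.0 - 2.0 * penalty / norm

X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
scrambled = [[0.0], [5.0], [1.0], [4.0], [2.0], [3.0]]

same = trustworthiness_reference(X, X, n_neighbors=2)
worse = trustworthiness_reference(X, scrambled, n_neighbors=2)
print(same)  # 1.0
```

An embedding that preserves every neighborhood scores 1.0, while the scrambled embedding scores lower because distant points have become neighbors.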

cuml.metrics.cluster.adjusted_rand_index.adjusted_rand_score(labels_true, labels_pred, handle=None, convert_dtype=True) float[source]

Adjusted_rand_score is a clustering similarity metric based on the Rand index and is corrected for chance.

Parameters
labels_true

Ground truth labels to be used as a reference.

labels_pred

Array of predicted labels used to evaluate the model.

handlecuml.Handle
Returns
float

The adjusted rand index value between -1.0 and 1.0
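The chance correction can be seen in a small pair-counting sketch. It uses the standard contingency-table formula for the adjusted Rand index. The helper name is hypothetical, math.comb requires Python 3.8 or later, and returning 1.0 in the degenerate case is a convention assumed here.

```python
from collections import Counter
from math import comb

def adjusted_rand_reference(labels_true, labels_pred):
    n = len(labels_true)
    # Pairs of samples that share a cell, a true class, or a predicted cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate labelings, assumed to score 1.0
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

print(adjusted_rand_reference([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Renaming the clusters does not change the score, which is why the permuted labeling above still yields 1.0.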

cuml.metrics.cluster.entropy.cython_entropy(clustering, base=None, handle=None) float[source]

Computes the entropy of a distribution for given probability values.

Parameters
clusteringarray-like (device or host) shape = (n_samples,)

Clustering of labels. Probabilities are computed based on occurrences of labels. For instance, to represent a fair coin (2 equally possible outcomes), the clustering could be [0,1]. For a biased coin with 2/3 probability for tail, the clustering could be [0, 0, 1].

base: float, optional

The logarithmic base to use, defaults to e (natural logarithm).

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

Returns
Sfloat

The calculated entropy.
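The coin examples from the parameter description can be worked through directly: label frequencies become probabilities, and the entropy is the negative sum of p log p. The sketch below is a plain-Python illustration with a hypothetical helper name, not the cuML routine.

```python
import math
from collections import Counter

def entropy_reference(clustering, base=None):
    n = len(clustering)
    h = 0.0
    for count in Counter(clustering).values():
        p = count / n  # probability estimated from label occurrences
        h -= p * math.log(p)
    # Natural logarithm by default; otherwise change of base.
    return h if base is None else h / math.log(base)

fair = entropy_reference([0, 1], base=2)
biased = entropy_reference([0, 0, 1], base=2)
print(round(fair, 4), round(biased, 4))  # 1.0 0.9183
```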

cuml.metrics.cluster.homogeneity_score.cython_homogeneity_score(labels_true, labels_pred, handle=None) float[source]

Computes the homogeneity metric of a cluster labeling given a ground truth.

A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class.

This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.

This metric is not symmetric: switching labels_true with labels_pred will return the completeness_score which will be different in general.

The labels in labels_pred and labels_true are assumed to be drawn from a contiguous set (Ex: drawn from {2, 3, 4}, but not from {2, 4}). If your set of labels looks like {2, 4}, convert them to something like {0, 1}.
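One way to satisfy the contiguity requirement is to remap the distinct label values to 0, 1, 2, ... before calling the metric. The helper below is a hypothetical pure-Python sketch for host-side lists; for device arrays an equivalent remapping would be needed.

```python
def make_contiguous(labels):
    """Map the sorted distinct labels to 0..k-1 (hypothetical helper)."""
    mapping = {old: new for new, old in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels]

print(make_contiguous([2, 4, 4, 2]))  # [0, 1, 1, 0]
```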

Parameters
labels_predarray-like (device or host) shape = (n_samples,)

The labels predicted by the model for the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

labels_truearray-like (device or host) shape = (n_samples,)

The ground truth labels (ints) of the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

Returns
float

The homogeneity of the predicted labeling given the ground truth. Score between 0.0 and 1.0. 1.0 stands for perfectly homogeneous labeling.
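Homogeneity is commonly defined as 1 - H(C | K) / H(C), where C denotes the true classes and K the predicted clusters. The sketch below evaluates that definition in plain Python. The helper name is hypothetical, and treating a single-class ground truth as perfectly homogeneous is a convention assumed here.

```python
import math
from collections import Counter

def homogeneity_reference(labels_true, labels_pred):
    n = len(labels_true)
    class_counts = Counter(labels_true)
    cluster_counts = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    # Entropy of the true classes, H(C).
    h_c = -sum(c / n * math.log(c / n) for c in class_counts.values())
    if h_c == 0.0:
        return 1.0
    # Conditional entropy of the classes given the clusters, H(C | K).
    h_c_given_k = -sum(
        c / n * math.log(c / cluster_counts[k]) for (_, k), c in joint.items()
    )
    return 1.0 - h_c_given_k / h_c

pure = homogeneity_reference([0, 0, 1, 1], [0, 1, 2, 3])
mixed = homogeneity_reference([0, 0, 1, 1], [0, 0, 0, 0])
print(pure)  # 1.0
```

Every cluster in the first call contains a single class, so the score is 1.0. Merging everything into one cluster, as in the second call, drives it to 0.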

cuml.metrics.cluster.silhouette_score.cython_silhouette_samples(X, labels, metric='euclidean', chunksize=None, handle=None)[source]

Calculate the silhouette coefficient for each sample in the provided data.

Given a set of cluster labels for every sample in the provided data, compute the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The silhouette coefficient for a sample is then (b - a) / max(a, b).

Parameters
Xarray-like, shape = (n_samples, n_features)

The feature vectors for all samples.

labelsarray-like, shape = (n_samples,)

The assigned cluster labels for each sample.

metricstring

A string representation of the distance metric to use for evaluating the silhouette score. Available options are “cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, and “sqeuclidean”.

chunksizeinteger (default = None)

An integer, 1 <= chunksize <= n_samples to tile the pairwise distance matrix computations, so as to reduce the quadratic memory usage of having the entire pairwise distance matrix in GPU memory. If None, chunksize will automatically be set to 40000, which through experiments has proved to be a safe number for the computation to run on a GPU with 16 GB VRAM.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
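The per-sample formula (b - a) / max(a, b) is easy to trace on four one-dimensional points. The sketch below is a brute-force illustration using Euclidean distance. The helper name is hypothetical, scoring singleton clusters as 0 is an assumed convention, and math.dist requires Python 3.8 or later.

```python
import math

def silhouette_samples_reference(X, labels):
    scores = []
    for i, (xi, li) in enumerate(zip(X, labels)):
        # Distances from sample i to every other sample, grouped by cluster.
        by_cluster = {}
        for j, (xj, lj) in enumerate(zip(X, labels)):
            if i != j:
                by_cluster.setdefault(lj, []).append(math.dist(xi, xj))
        own = by_cluster.pop(li, [])
        if not own or not by_cluster:
            scores.append(0.0)  # assumed convention for singleton or single clusters
            continue
        a = sum(own) / len(own)  # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())  # mean nearest-cluster distance
        scores.append((b - a) / max(a, b))
    return scores

X = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
scores = silhouette_samples_reference(X, labels)
mean_score = sum(scores) / len(scores)
print([round(s, 4) for s in scores])  # [0.9048, 0.8947, 0.8947, 0.9048]
```

Averaging the per-sample values gives the mean silhouette coefficient for the whole dataset.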

cuml.metrics.cluster.silhouette_score.cython_silhouette_score(X, labels, metric='euclidean', chunksize=None, handle=None)[source]

Calculate the mean silhouette coefficient for the provided data.

Given a set of cluster labels for every sample in the provided data, compute the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The silhouette coefficient for a sample is then (b - a) / max(a, b).

Parameters
Xarray-like, shape = (n_samples, n_features)

The feature vectors for all samples.

labelsarray-like, shape = (n_samples,)

The assigned cluster labels for each sample.

metricstring

A string representation of the distance metric to use for evaluating the silhouette score. Available options are “cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, and “sqeuclidean”.

chunksizeinteger (default = None)

An integer, 1 <= chunksize <= n_samples to tile the pairwise distance matrix computations, so as to reduce the quadratic memory usage of having the entire pairwise distance matrix in GPU memory. If None, chunksize will automatically be set to 40000, which through experiments has proved to be a safe number for the computation to run on a GPU with 16 GB VRAM.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
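The effect of chunksize on memory can be estimated with simple arithmetic. The sketch below assumes that each tile holds chunksize × chunksize distances stored as 4-byte floats. Both the tile shape and the element size are assumptions for illustration, not details confirmed by this reference.

```python
def distance_matrix_bytes(n_rows, n_cols, itemsize=4):
    """Bytes needed for an n_rows x n_cols matrix (itemsize=4 assumes float32)."""
    return n_rows * n_cols * itemsize

n_samples = 1_000_000
chunksize = 40_000

full_gb = distance_matrix_bytes(n_samples, n_samples) / 1e9
tile_gb = distance_matrix_bytes(chunksize, chunksize) / 1e9
print(full_gb)  # 4000.0
print(tile_gb)  # 6.4
```

Under these assumptions the full matrix for one million samples would need about 4 TB, while a single tile at the default chunksize needs about 6.4 GB, which fits on a 16 GB GPU.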

cuml.metrics.cluster.completeness_score.cython_completeness_score(labels_true, labels_pred, handle=None) float[source]

Completeness metric of a cluster labeling given a ground truth.

A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.

This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.

This metric is not symmetric: switching labels_true with labels_pred will return the homogeneity_score which will be different in general.

The labels in labels_pred and labels_true are assumed to be drawn from a contiguous set (Ex: drawn from {2, 3, 4}, but not from {2, 4}). If your set of labels looks like {2, 4}, convert them to something like {0, 1}.

Parameters
labels_predarray-like (device or host) shape = (n_samples,)

The labels predicted by the model for the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

labels_truearray-like (device or host) shape = (n_samples,)

The ground truth labels (ints) of the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

Returns
float

The completeness of the predicted labeling given the ground truth. Score between 0.0 and 1.0. 1.0 stands for perfectly complete labeling.
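The asymmetry described above can be made concrete: completeness is 1 - H(K | C) / H(K), which is the homogeneity formula with the roles of classes C and clusters K exchanged. The sketch below is a plain-Python illustration with hypothetical helper names, not the cuML implementation.

```python
import math
from collections import Counter

def one_minus_conditional_entropy_ratio(a, b):
    """Return 1 - H(a | b) / H(a), or 1.0 when H(a) is zero."""
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    h_a = -sum(c / n * math.log(c / n) for c in counts_a.values())
    if h_a == 0.0:
        return 1.0
    h_a_given_b = -sum(c / n * math.log(c / counts_b[y]) for (_, y), c in joint.items())
    return 1.0 - h_a_given_b / h_a

def homogeneity_reference(labels_true, labels_pred):
    return one_minus_conditional_entropy_ratio(labels_true, labels_pred)

def completeness_reference(labels_true, labels_pred):
    # Same formula with the two labelings exchanged.
    return one_minus_conditional_entropy_ratio(labels_pred, labels_true)

labels_true = [0, 0, 1, 1]
labels_pred = [0, 1, 2, 3]
completeness = completeness_reference(labels_true, labels_pred)
homogeneity = homogeneity_reference(labels_true, labels_pred)
print(round(completeness, 4), round(homogeneity, 4))  # 0.5 1.0
```

Splitting each class across two clusters keeps every cluster pure (homogeneity 1.0) but halves the completeness.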

cuml.metrics.cluster.mutual_info_score.cython_mutual_info_score(labels_true, labels_pred, handle=None) float[source]

Computes the Mutual Information between two clusterings.

The Mutual Information is a measure of the similarity between two labels of the same data.

This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.

This metric is furthermore symmetric: switching labels_true with labels_pred will return the same score value. This can be useful to measure the agreement of two independent label assignment strategies on the same dataset when the real ground truth is not known.

The labels in labels_pred and labels_true are assumed to be drawn from a contiguous set (Ex: drawn from {2, 3, 4}, but not from {2, 4}). If your set of labels looks like {2, 4}, convert them to something like {0, 1}.

Parameters
handlecuml.Handle
labels_predarray-like (device or host) shape = (n_samples,)

A clustering of the data (ints) into disjoint subsets. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

labels_truearray-like (device or host) shape = (n_samples,)

A clustering of the data (ints) into disjoint subsets. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
float

Mutual information, a non-negative value
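The symmetry and label-permutation invariance described above can be checked with a direct evaluation of the definition, MI = sum over (i, j) of p(i, j) log(p(i, j) / (p(i) p(j))), using natural logarithms. The sketch below is a plain-Python illustration with a hypothetical helper name.

```python
import math
from collections import Counter

def mutual_info_reference(labels_a, labels_b):
    n = len(labels_a)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # c / n is p(i, j); c * n / (count_i * count_j) is p(i, j) / (p(i) * p(j)).
    return sum(
        c / n * math.log(c * n / (counts_a[x] * counts_b[y]))
        for (x, y), c in joint.items()
    )

same_partition = mutual_info_reference([0, 0, 1, 1], [1, 1, 0, 0])
independent = mutual_info_reference([0, 0, 1, 1], [0, 1, 0, 1])
print(round(same_partition, 4), independent)  # 0.6931 0.0
```

Two labelings that describe the same partition reach the entropy of that partition (log 2 here), while independent labelings score 0.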

Benchmarking

class cuml.benchmark.algorithms.AlgorithmPair(cpu_class, cuml_class, shared_args, cuml_args={}, cpu_args={}, name=None, accepts_labels=True, cpu_data_prep_hook=None, cuml_data_prep_hook=None, accuracy_function=None, bench_func=<function fit>, setup_cpu_func=None, setup_cuml_func=None)[source]

Wraps a cuML algorithm and (optionally) a CPU-based algorithm (typically scikit-learn, but it does not need to be, as long as it offers fit and predict or transform methods). Provides mechanisms to run each version with default arguments. If no CPU-based version of the algorithm is available, pass None for the cpu_class when instantiating.

Parameters
cpu_classclass

Class for CPU version of algorithm. Set to None if not available.

cuml_classclass

Class for cuML algorithm

shared_argsdict

Arguments passed to both implementations’ initializers

cuml_argsdict

Arguments only passed to cuml’s initializer

cpu_argsdict

Arguments only passed to sklearn’s initializer

accepts_labelsboolean

If True, the fit methods expect both X and y inputs. Otherwise, they expect only an X input.

data_prep_hookfunction (data -> data)

Optional function to run on input data before passing to fit

accuracy_functionfunction (y_test, y_pred)

Function that returns a scalar representing accuracy

bench_funcfunction

Custom function to perform fit/predict/transform calls.

Methods

run_cpu(data[, bench_args])

Runs the cpu-based algorithm's fit method on specified data

run_cuml(data[, bench_args])

Runs the cuml-based algorithm's fit method on specified data

setup_cpu

setup_cuml

run_cpu(data, bench_args={}, **override_setup_args)[source]

Runs the cpu-based algorithm’s fit method on specified data

run_cuml(data, bench_args={}, **override_setup_args)[source]

Runs the cuml-based algorithm’s fit method on specified data

cuml.benchmark.algorithms.algorithm_by_name(name)[source]

Returns the algorithm pair with the name ‘name’ (case-insensitive)

cuml.benchmark.algorithms.all_algorithms()[source]

Returns all defined AlgorithmPair objects

Wrappers to run ML benchmarks

class cuml.benchmark.runners.AccuracyComparisonRunner(bench_rows, bench_dims, dataset_name='blobs', input_type='numpy', test_fraction=0.1, n_reps=1)[source]

Wrapper to run an algorithm with multiple dataset sizes and compute accuracy and speedup of cuml relative to sklearn baseline.

class cuml.benchmark.runners.BenchmarkTimer(reps=1)[source]

Provides a context manager that runs a code block reps times and records results to the instance variable timings. Use like:

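import numpy as np
from cuml.benchmark.runners import BenchmarkTimer

timer = BenchmarkTimer(reps=5)  # the keyword is reps, matching the signature
for _ in timer.benchmark_runs():
    ...  # do something
print(np.min(timer.timings))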

Methods

benchmark_runs
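The usage pattern above, a for loop over benchmark_runs() whose body is timed once per repetition, can be mimicked with a generator. The class below is a hypothetical stand-in built on the standard library to show the mechanism. It is not the source of BenchmarkTimer and may differ from it in details.

```python
import time

class SketchTimer:
    """Hypothetical stand-in that mimics the BenchmarkTimer usage pattern."""

    def __init__(self, reps=1):
        self.reps = reps
        self.timings = []

    def benchmark_runs(self):
        for run in range(self.reps):
            start = time.perf_counter()
            yield run  # the caller's loop body runs while the generator is paused here
            self.timings.append(time.perf_counter() - start)

timer = SketchTimer(reps=5)
for _ in timer.benchmark_runs():
    sum(range(10_000))  # placeholder workload
print(len(timer.timings))  # 5
```

Because the elapsed time is recorded when the generator resumes, a loop body that exits early with break would leave its final repetition unrecorded in this sketch.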

class cuml.benchmark.runners.SpeedupComparisonRunner(bench_rows, bench_dims, dataset_name='blobs', input_type='numpy', n_reps=1)[source]

Wrapper to run an algorithm with multiple dataset sizes and compute speedup of cuml relative to sklearn baseline.

Methods

run

cuml.benchmark.runners.run_variations(algos, dataset_name, bench_rows, bench_dims, param_override_list=[{}], cuml_param_override_list=[{}], cpu_param_override_list=[{}], dataset_param_override_list=[{}], dtype=<class 'numpy.float32'>, input_type='numpy', test_fraction=0.1, run_cpu=True, raise_on_error=False, n_reps=1)[source]

Runs each algo in algos once per bench_rows X bench_dims X param_override_list X cuml_param_override_list combination and returns a dataframe containing timing and accuracy data.

Parameters
algosstr or list

Name of algorithms to run and evaluate

dataset_namestr

Name of dataset to use

bench_rowslist of int

Dataset row counts to test

bench_dimslist of int

Dataset column counts to test

param_override_listlist of dict

Dicts containing parameters to pass to __init__. Each dict specifies parameters to override in one run of the algorithm.

cuml_param_override_listlist of dict

Dicts containing parameters to pass to __init__ of the cuml algo only.

cpu_param_override_listlist of dict

Dicts containing parameters to pass to __init__ of the cpu algo only.

dataset_param_override_listlist of dict

Dicts containing parameters to pass to the dataset generator function.

dtype: [np.float32|np.float64]

Specifies the dataset precision to be used for benchmarking.

test_fractionfloat

The fraction of data to use for testing.

run_cpuboolean

If True, run the cpu-based algorithm for comparison
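The number of runs per algorithm grows as the product of the list lengths named in the description. The sketch below enumerates the combinations with itertools.product. The values are hypothetical (including the n_clusters override), and whether the remaining override lists multiply the count in the same way is not stated here.

```python
from itertools import product

bench_rows = [1_000, 10_000]
bench_dims = [16, 64]
param_override_list = [{}, {"n_clusters": 8}]  # hypothetical override
cuml_param_override_list = [{}]

# One run per (rows, dims, shared overrides, cuML-only overrides) combination.
combos = list(
    product(bench_rows, bench_dims, param_override_list, cuml_param_override_list)
)
print(len(combos))  # 8
```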

Data generators for cuML benchmarks

The main entry point for consumers is gen_data, which wraps the underlying data generators.

Notes when writing new generators:

Each generator is a function that accepts:
  • n_samples (set to 0 for ‘default’)

  • n_features (set to 0 for ‘default’)

  • random_state

  • (and optional generator-specific parameters)

The function should return a 2-tuple (X, y), where X is a Pandas dataframe and y is a Pandas series. If the generator does not produce labels, it can return (X, None).

A set of helper functions (convert_*) can convert these to alternative formats. Future revisions may support generating cudf dataframes or GPU arrays directly instead.

cuml.benchmark.datagen.gen_data(dataset_name, dataset_format, n_samples=0, n_features=0, test_fraction=0.0, **kwargs)[source]

Returns a tuple of data from the specified generator.

Parameters
dataset_namestr

Dataset to use. Can be a synthetic generator (blobs or regression) or a specified dataset (higgs currently, others coming soon)

dataset_formatstr

Type of data to return. (One of cudf, numpy, pandas, gpuarray)

n_samplesint

Number of samples to include in training set (regardless of test split)

test_fractionfloat

Fraction of the dataset to partition randomly into the test set. If this is 0.0, no test set will be created.

cuml.benchmark.datagen.load_higgs()[source]

Returns the Higgs Boson dataset as an X, y tuple of dataframes.

Regression and Classification

Linear Regression

class cuml.LinearRegression(*, algorithm='eig', fit_intercept=True, normalize=False, handle=None, verbose=False, output_type=None)

LinearRegression is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.

cuML’s LinearRegression expects either a cuDF DataFrame or a NumPy matrix and provides several algorithms, including SVD and Eig, to fit a linear model. SVD is more stable, but Eig (default) is much faster.

Parameters
algorithm{‘svd’, ‘eig’, ‘qr’, ‘svd-qr’, ‘svd-jacobi’}, (default = ‘eig’)

Choose an algorithm:

  • ‘svd’ - alias for svd-jacobi;

  • ‘eig’ - use an eigendecomposition of the covariance matrix;

  • ‘qr’ - use QR decomposition algorithm and solve Rx = Q^T y

  • ‘svd-qr’ - compute SVD decomposition using QR algorithm

  • ‘svd-jacobi’ - compute SVD decomposition using Jacobi iterations.

Among these algorithms, only ‘svd-jacobi’ supports the case when the number of features is larger than the sample size; this algorithm is force-selected automatically in such a case.

For a broad range of inputs, ‘eig’ and ‘qr’ are usually the fastest, followed by ‘svd-jacobi’ and then ‘svd-qr’. In theory, SVD-based algorithms are more stable.

fit_interceptboolean (default = True)

If True, LinearRegression tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalizeboolean (default = False)

This parameter is ignored when fit_intercept is set to False. If True, the predictors in X will be normalized by dividing by the column-wise standard deviation. If False, no scaling will be done. Note: this is in contrast to sklearn’s deprecated normalize flag, which divides by the column-wise L2 norm; but this is the same as if using sklearn’s StandardScaler.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Notes

LinearRegression suffers from multicollinearity (when columns are correlated with each other) and from variance explosions caused by outliers. Consider using Ridge Regression to address multicollinearity, and consider first using DBSCAN or statistical analysis to filter out possible outliers.

Applications of LinearRegression

LinearRegression is used in regression tasks where one wants to predict, say, sales or house prices. It is also used in extrapolation or time series tasks, dynamic systems modelling and many other machine learning tasks. This model should be tried first if the machine learning problem is a regression task (predicting a continuous variable).

For additional information, see scikit-learn’s OLS documentation.

For an additional example see the OLS notebook.

Examples

>>> import cupy as cp
>>> import cudf

>>> # Both import methods supported
>>> from cuml import LinearRegression
>>> from cuml.linear_model import LinearRegression
>>> lr = LinearRegression(fit_intercept = True, normalize = False,
...                       algorithm = "eig")
>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([1,1,2,2], dtype=cp.float32)
>>> X['col2'] = cp.array([1,2,2,3], dtype=cp.float32)
>>> y = cudf.Series(cp.array([6.0, 8.0, 9.0, 11.0], dtype=cp.float32))
>>> reg = lr.fit(X,y)
>>> print(reg.coef_)
0   1.0
1   2.0
dtype: float32
>>> print(reg.intercept_)
3.0...

>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = cp.array([3,2], dtype=cp.float32)
>>> X_new['col2'] = cp.array([5,5], dtype=cp.float32)
>>> preds = lr.predict(X_new)
>>> print(preds)
0   15.999...
1   14.999...
dtype: float32
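The coefficients and predictions above can be verified by hand. The sketch below solves the centered normal equations for the two-feature example with Cramer's rule in plain Python. This is a swapped-in technique for checking the numbers, not the ‘eig’ algorithm or any other cuML code path, and the helper name is hypothetical.

```python
def fit_two_feature_ols(X, y):
    """Ordinary least squares with intercept for exactly two features."""
    n = len(y)
    m1 = sum(row[0] for row in X) / n
    m2 = sum(row[1] for row in X) / n
    my = sum(y) / n
    # Entries of the centered X^T X matrix and X^T y vector.
    s11 = sum((row[0] - m1) ** 2 for row in X)
    s22 = sum((row[1] - m2) ** 2 for row in X)
    s12 = sum((row[0] - m1) * (row[1] - m2) for row in X)
    s1y = sum((row[0] - m1) * (t - my) for row, t in zip(X, y))
    s2y = sum((row[1] - m2) * (t - my) for row, t in zip(X, y))
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    intercept = my - b1 * m1 - b2 * m2  # correct for the means of X and y
    return [b1, b2], intercept

X = [[1.0, 1.0], [1.0, 2.0], [2.0, 2.0], [2.0, 3.0]]
y = [6.0, 8.0, 9.0, 11.0]
coef, intercept = fit_two_feature_ols(X, y)
preds = [intercept + coef[0] * a + coef[1] * b for a, b in [[3.0, 5.0], [2.0, 5.0]]]
print(coef, intercept, preds)  # [1.0, 2.0] 3.0 [16.0, 15.0]
```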
Attributes
coef_array, shape (n_features)

The estimated coefficients for the linear regression model.

intercept_array

The independent term. If fit_intercept is False, will be 0.

Methods

fit(self, X, y[, convert_dtype, sample_weight])

Fit the model with X and y.

get_param_names(self)

fit(self, X, y, convert_dtype=True, sample_weight=None) 'LinearRegression'[source]

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

sample_weightarray-like (device or host) shape = (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

get_param_names(self)[source]

Logistic Regression

class cuml.LogisticRegression(*, penalty='l2', tol=0.0001, C=1.0, fit_intercept=True, class_weight=None, max_iter=1000, linesearch_max_iter=50, verbose=False, l1_ratio=None, solver='qn', handle=None, output_type=None)

LogisticRegression is a linear model that is used to model the probability of occurrence of certain events, for example the probability of success or failure of an event.

cuML’s LogisticRegression can take array-like objects, either in host as NumPy arrays or in device (as Numba or __cuda_array_interface__ compliant), in addition to cuDF objects. It provides both binary (using sigmoid loss) and multiclass (using softmax loss) variants, chosen according to the number of classes in the target.

Only one solver option is currently available: Quasi-Newton (QN) algorithms. Even though it is presented as a single option, this solver resolves to two different algorithms underneath:

  • Orthant-Wise Limited Memory Quasi-Newton (OWL-QN) if there is l1 regularization

  • Limited Memory BFGS (L-BFGS) otherwise.

Note that, just like in Scikit-learn, the bias will not be regularized.
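The selection rule can be sketched in plain Python. The helper name resolve_qn_algorithm is hypothetical and is not a cuML function; it only restates the rule given here and in the description of the penalty parameter. Treating a missing l1_ratio as an error is an assumption of this sketch.

```python
def resolve_qn_algorithm(penalty="l2", l1_ratio=None):
    """Return which quasi-Newton variant the documented rule would pick.

    Hypothetical helper for illustration only; it is not part of cuML.
    """
    if penalty in ("none", "l2"):
        return "L-BFGS"
    if penalty == "l1":
        return "OWL-QN"
    if penalty == "elasticnet":
        # Assumption: this sketch requires an explicit l1_ratio.
        if l1_ratio is None:
            raise ValueError("l1_ratio is required for penalty='elasticnet'")
        return "OWL-QN" if l1_ratio > 0 else "L-BFGS"
    raise ValueError(f"unknown penalty: {penalty!r}")


print(resolve_qn_algorithm("l2"))               # L-BFGS
print(resolve_qn_algorithm("l1"))               # OWL-QN
print(resolve_qn_algorithm("elasticnet", 0.0))  # L-BFGS
print(resolve_qn_algorithm("elasticnet", 0.3))  # OWL-QN
```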

Parameters
penalty‘none’, ‘l1’, ‘l2’, ‘elasticnet’ (default = ‘l2’)

Used to specify the norm used in the penalization. If ‘none’ or ‘l2’ are selected, then L-BFGS solver will be used. If ‘l1’ is selected, solver OWL-QN will be used. If ‘elasticnet’ is selected, OWL-QN will be used if l1_ratio > 0, otherwise L-BFGS will be used.

tolfloat (default = 1e-4)

Tolerance for stopping criteria. The exact stopping conditions depend on the chosen solver. Check the solver’s documentation for more details.

Cfloat (default = 1.0)

Inverse of regularization strength; must be a positive float.

fit_interceptboolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

class_weightdict or ‘balanced’, default=None

By default all classes have a weight of one. However, a dictionary can be provided with weights associated with classes in the form {class_label: weight}. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.

max_iterint (default = 1000)

Maximum number of iterations taken for the solvers to converge.

linesearch_max_iterint (default = 50)

Maximum number of line search iterations per outer iteration, used by the L-BFGS and OWL-QN solvers.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

l1_ratiofloat or None, optional (default=None)

The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1.

solver‘qn’, ‘lbfgs’, ‘owl’ (default=’qn’).

Algorithm to use in the optimization problem. Currently only qn is supported, which automatically selects either L-BFGS or OWL-QN depending on the conditions of the l1 regularization described above. Options ‘lbfgs’ and ‘owl’ are just convenience values that end up using the same solver following the same rules.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
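The ‘balanced’ formula given for class_weight, n_samples / (n_classes * np.bincount(y)), can be illustrated without a GPU. The sketch below uses only the standard library and assumes every class label occurs at least once; the helper name balanced_class_weights is hypothetical.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight n_samples / (n_classes * count) for each class label in y."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


y = [0, 0, 0, 1]
weights = balanced_class_weights(y)
print(weights)  # class 0 gets 4 / (2 * 3), class 1 gets 4 / (2 * 1)

# A per-sample weight is multiplied by the weight of that sample's class.
sample_weight = [1.0, 1.0, 2.0, 1.0]
effective = [sw * weights[label] for sw, label in zip(sample_weight, y)]
print(effective)
```

The rarer class receives the larger weight, so both classes contribute equally to the loss in total.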

Notes

cuML’s LogisticRegression uses a different solver than the equivalent Scikit-learn estimator, except when there is no penalty and solver=lbfgs is used in Scikit-learn. This can cause small differences in the coefficients and predictions of the model, similar to using different solvers in Scikit-learn.

For additional information, see Scikit-learn’s LogisticRegression.

Examples

>>> import cudf
>>> import numpy as np

>>> # Both import methods supported
>>> # from cuml import LogisticRegression
>>> from cuml.linear_model import LogisticRegression

>>> X = cudf.DataFrame()
>>> X['col1'] = np.array([1,1,2,2], dtype = np.float32)
>>> X['col2'] = np.array([1,2,2,3], dtype = np.float32)
>>> y = cudf.Series(np.array([0.0, 0.0, 1.0, 1.0], dtype=np.float32))

>>> reg = LogisticRegression()
>>> reg.fit(X,y)
LogisticRegression()
>>> print(reg.coef_)
0    0.698...
1    0.570...
dtype: float32
>>> print(reg.intercept_)
0   -2.188...
dtype: float32

>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = np.array([1,5], dtype = np.float32)
>>> X_new['col2'] = np.array([2,5], dtype = np.float32)

>>> preds = reg.predict(X_new)

>>> print(preds)
0    0.0
1    1.0
dtype: float32
Attributes
coef_: dev array, dim (n_classes, n_features) or (n_classes, n_features+1)

The estimated coefficients for the logistic regression model.

intercept_: device array (n_classes, 1)

The independent term. If fit_intercept is False, will be 0.

Methods

decision_function(self, X[, convert_dtype])

Gives confidence score for X

fit(self, X, y[, sample_weight, convert_dtype])

Fit the model with X and y.

get_param_names(self)

predict(self, X[, convert_dtype])

Predicts the y for X.

predict_log_proba(self, X[, convert_dtype])

Predicts the log class probabilities for each class in X

predict_proba(self, X[, convert_dtype])

Predicts the class probabilities for each class in X

set_params(self, **params)

decision_function(self, X, convert_dtype=False) CumlArray[source]

Gives confidence score for X

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = False)

When set to True, the decision_function method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
scorecuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)

Confidence score

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

fit(self, X, y, sample_weight=None, convert_dtype=True) 'LogisticRegression'[source]

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

sample_weightarray-like (device or host) shape = (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_param_names(self)[source]

predict(self, X, convert_dtype=True) CumlArray[source]

Predicts the y for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

predict_log_proba(self, X, convert_dtype=True) CumlArray[source]

Predicts the log class probabilities for each class in X

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the predict_log_proba method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)

Logarithm of predicted class probabilities

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

predict_proba(self, X, convert_dtype=True) CumlArray[source]

Predicts the class probabilities for each class in X

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the predict_proba method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)

Predicted class probabilities

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
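The relationship between confidence scores, class probabilities and log-probabilities can be sketched with the standard definitions of the sigmoid and softmax functions named in the class description. This is a plain-Python illustration of the math only; how cuML lays out or computes these arrays internally is not specified here.

```python
import math


def sigmoid(score):
    """Map one confidence score to the probability of the positive class."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)  # avoids overflow for large negative scores
    return e / (1.0 + e)


def softmax(scores):
    """Map one row of per-class scores to probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


binary_proba = sigmoid(0.0)  # a score of 0 sits on the decision boundary
multi_proba = softmax([2.0, 1.0, 0.1])
multi_log_proba = [math.log(p) for p in multi_proba]
predicted_class = multi_proba.index(max(multi_proba))

print(binary_proba)     # 0.5
print(predicted_class)  # 0
```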

set_params(self, **params)[source]

Ridge Regression

class cuml.Ridge(*, alpha=1.0, solver='eig', fit_intercept=True, normalize=False, handle=None, output_type=None, verbose=False)

Ridge extends LinearRegression by providing L2 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors, and improves the conditioning of the problem.

cuML’s Ridge can take array-like objects, either in host as NumPy arrays or in device (as Numba or __cuda_array_interface__ compliant), in addition to cuDF objects. It provides 3 algorithms: SVD, Eig and CD to fit a linear model. In general SVD uses significantly more memory and is slower than Eig. If using CUDA 10.1, the memory difference is even bigger than in the other supported CUDA versions. However, SVD is more stable than Eig (default). CD uses Coordinate Descent and can be faster when data is large.

Parameters
alphafloat (default = 1.0)

Regularization strength - must be a positive float. Larger values specify stronger regularization. Array input will be supported later.

solver{‘eig’, ‘svd’, ‘cd’} (default = ‘eig’)

Eig uses an eigendecomposition of the covariance matrix, and is much faster. SVD is slower, but guaranteed to be stable. CD or Coordinate Descent is very fast and is suitable for large problems.

fit_interceptboolean (default = True)

If True, Ridge tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalizeboolean (default = False)

If True, the predictors in X will be normalized by dividing by the column-wise standard deviation. If False, no scaling will be done. Note: this is in contrast to sklearn’s deprecated normalize flag, which divides by the column-wise L2 norm; but this is the same as if using sklearn’s StandardScaler.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Notes

Ridge provides L2 regularization. This means that the coefficients can shrink to become very small, but not zero. This can cause issues of interpretability on the coefficients. Consider using Lasso, or thresholding small coefficients to zero.
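The shrinkage behaviour can be illustrated with the textbook closed form for a single feature and no intercept, w = sum(x*y) / (sum(x*x) + alpha). This is a sketch of the effect of the L2 penalty under that simplifying assumption, not of cuML’s solvers; the helper name ridge_1d is hypothetical.

```python
def ridge_1d(x, y, alpha):
    """Closed-form ridge coefficient for one feature without an intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]  # exactly y = 2 * x

for alpha in (0.0, 1.0, 14.0, 1000.0):
    print(alpha, ridge_1d(x, y, alpha))
```

With alpha = 0 the coefficient is the least-squares value 2.0; as alpha grows the coefficient shrinks toward zero but stays strictly positive.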

Applications of Ridge

Ridge Regression is used in the same way as LinearRegression, but does not suffer from multicollinearity issues. Ridge is used in insurance premium prediction, stock market analysis and much more.

For additional docs, see Scikit-learn’s Ridge Regression.

Examples

>>> import cupy as cp
>>> import cudf

>>> # Both import methods supported
>>> from cuml import Ridge
>>> from cuml.linear_model import Ridge

>>> alpha = 1e-5
>>> ridge = Ridge(alpha=alpha, fit_intercept=True, normalize=False,
...               solver="eig")

>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([1,1,2,2], dtype = cp.float32)
>>> X['col2'] = cp.array([1,2,2,3], dtype = cp.float32)

>>> y = cudf.Series(cp.array([6.0, 8.0, 9.0, 11.0], dtype=cp.float32))

>>> result_ridge = ridge.fit(X, y)
>>> print(result_ridge.coef_) 
0 1.000...
1 1.999...
>>> print(result_ridge.intercept_)
3.0...
>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = cp.array([3,2], dtype=cp.float32)
>>> X_new['col2'] = cp.array([5,5], dtype=cp.float32)
>>> preds = result_ridge.predict(X_new)
>>> print(preds) 
0 15.999...
1 14.999...
Attributes
coef_array, shape (n_features)

The estimated coefficients for the linear regression model.

intercept_array

The independent term. If fit_intercept is False, will be 0.

Methods

fit(self, X, y[, convert_dtype])

Fit the model with X and y.

get_param_names(self)

set_params(self, **params)

fit(self, X, y, convert_dtype=True) 'Ridge'[source]

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_param_names(self)[source]

set_params(self, **params)[source]

Lasso Regression

class cuml.Lasso(*, alpha=1.0, fit_intercept=True, normalize=False, max_iter=1000, tol=0.001, solver='cd', selection='cyclic', handle=None, output_type=None, verbose=False)[source]

Lasso extends LinearRegression by providing L1 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can zero some of the coefficients for feature selection and improves the conditioning of the problem.

cuML’s Lasso can take array-like objects, either in host as NumPy arrays or in device (as Numba or __cuda_array_interface__ compliant), in addition to cuDF objects. It uses coordinate descent to fit a linear model.
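Coordinate descent for an L1-penalised model can be sketched in plain Python. The sketch assumes the objective used by scikit-learn’s Lasso, (1 / (2 * n_samples)) * ||y - Xw - b||^2 + alpha * ||w||_1; this reference does not state the objective, so treat that as an assumption. It is a CPU illustration of the technique, not cuML’s implementation, and the helper names are hypothetical.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values in [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_cd(X, y, alpha, n_sweeps=100):
    """Cyclic coordinate descent on centered data (illustrative only)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual that leaves out feature j's contribution.
            r = [yc[i] - sum(Xc[i][k] * w[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(Xc[i][j] * r[i] for i in range(n)) / n
            z = sum(Xc[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    intercept = y_mean - sum(w[j] * x_mean[j] for j in range(p))
    return w, intercept


X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
y = [0.0, 1.0, 2.0]
coef, intercept = lasso_cd(X, y, alpha=0.1)
print(coef, intercept)
```

On this small dataset, which is the one used in the Examples section, the sketch arrives at approximately the same coefficients (0.85 and 0.0) and intercept (0.15). The soft-threshold step is what sets the second, redundant coefficient to zero.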

Parameters
alphafloat (default = 1.0)

Constant that multiplies the L1 term. alpha = 0 is equivalent to ordinary least squares, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised. Given this, you should use the LinearRegression object.

fit_interceptboolean (default = True)

If True, Lasso tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalizeboolean (default = False)

If True, the predictors in X will be normalized by dividing by the column-wise standard deviation. If False, no scaling will be done. Note: this is in contrast to sklearn’s deprecated normalize flag, which divides by the column-wise L2 norm; but this is the same as if using sklearn’s StandardScaler.

max_iterint (default = 1000)

The maximum number of iterations

tolfloat (default = 1e-3)

The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.

solver{‘cd’, ‘qn’} (default=’cd’)

Choose an algorithm:

  • ‘cd’ - coordinate descent

  • ‘qn’ - quasi-newton

You may find the alternative ‘qn’ algorithm is faster when the number of features is sufficiently large, but the sample size is small.

selection{‘cyclic’, ‘random’} (default=’cyclic’)

If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Notes

For additional docs, see Scikit-learn’s Lasso.

Examples

>>> import numpy as np
>>> import cudf
>>> from cuml.linear_model import Lasso
>>> ls = Lasso(alpha = 0.1)
>>> X = cudf.DataFrame()
>>> X['col1'] = np.array([0, 1, 2], dtype = np.float32)
>>> X['col2'] = np.array([0, 1, 2], dtype = np.float32)
>>> y = cudf.Series( np.array([0.0, 1.0, 2.0], dtype = np.float32) )
>>> result_lasso = ls.fit(X, y)
>>> print(result_lasso.coef_)
0   0.85
1   0.00
dtype: float32
>>> print(result_lasso.intercept_)
0.149999...

>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = np.array([3,2], dtype = np.float32)
>>> X_new['col2'] = np.array([5,5], dtype = np.float32)
>>> preds = result_lasso.predict(X_new)
>>> print(preds)
0   2.70
1   1.85
dtype: float32
Attributes
coef_array, shape (n_features)

The estimated coefficients for the linear regression model.

intercept_array

The independent term. If fit_intercept is False, will be 0.

Methods

get_param_names(self)

get_param_names(self)[source]

ElasticNet Regression

class cuml.ElasticNet(*, alpha=1.0, l1_ratio=0.5, fit_intercept=True, normalize=False, max_iter=1000, tol=0.001, solver='cd', selection='cyclic', handle=None, output_type=None, verbose=False)

ElasticNet extends LinearRegression with combined L1 and L2 regularizations on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors, force some coefficients to be small, and improve the conditioning of the problem.

cuML’s ElasticNet takes an array-like object or cuDF DataFrame and uses coordinate descent to fit a linear model.

Parameters
alphafloat (default = 1.0)

Constant that multiplies the penalty terms. alpha = 0 is equivalent to ordinary least squares, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the ElasticNet object is not advised. Given this, you should use the LinearRegression object.

l1_ratiofloat (default = 0.5)

The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.

fit_interceptboolean (default = True)

If True, ElasticNet tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalizeboolean (default = False)

If True, the predictors in X will be normalized by dividing by the column-wise standard deviation. If False, no scaling will be done. Note: this is in contrast to sklearn’s deprecated normalize flag, which divides by the column-wise L2 norm; but this is the same as if using sklearn’s StandardScaler.

max_iterint (default = 1000)

The maximum number of iterations

tolfloat (default = 1e-3)

The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.

solver{‘cd’, ‘qn’} (default=’cd’)

Choose an algorithm:

  • ‘cd’ - coordinate descent

  • ‘qn’ - quasi-newton

You may find the alternative ‘qn’ algorithm is faster when the number of features is sufficiently large, but the sample size is small.

selection{‘cyclic’, ‘random’} (default=’cyclic’)

If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Notes

For additional docs, see Scikit-learn’s ElasticNet.
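How alpha and l1_ratio combine can be sketched as a small penalty function. The weighting below follows scikit-learn’s convention, alpha * (l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||_2^2), which is an assumption here because this reference does not spell out the formula. The helper name elastic_net_penalty is hypothetical.

```python
def elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5):
    """Penalty added to the loss for coefficients w (assumed convention)."""
    if not 0.0 <= l1_ratio <= 1.0:
        raise ValueError("l1_ratio must satisfy 0 <= l1_ratio <= 1")
    l1 = sum(abs(c) for c in w)
    l2_squared = sum(c * c for c in w)
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2_squared)


w = [1.0, -2.0]
print(elastic_net_penalty(w, alpha=0.1, l1_ratio=1.0))  # pure L1 term
print(elastic_net_penalty(w, alpha=0.1, l1_ratio=0.0))  # pure L2 term
print(elastic_net_penalty(w, alpha=0.1, l1_ratio=0.5))  # mix of both
```

Setting l1_ratio to 1 leaves only the L1 term and setting it to 0 leaves only the L2 term, matching the description of the l1_ratio parameter.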

Examples

>>> import cupy as cp
>>> import cudf
>>> from cuml.linear_model import ElasticNet
>>> enet = ElasticNet(alpha = 0.1, l1_ratio=0.5)
>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([0, 1, 2], dtype = cp.float32)
>>> X['col2'] = cp.array([0, 1, 2], dtype = cp.float32)
>>> y = cudf.Series(cp.array([0.0, 1.0, 2.0], dtype = cp.float32) )
>>> result_enet = enet.fit(X, y)
>>> print(result_enet.coef_)
0    0.448...
1    0.443...
dtype: float32
>>> print(result_enet.intercept_)
0.1082506...
>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = cp.array([3,2], dtype = cp.float32)
>>> X_new['col2'] = cp.array([5,5], dtype = cp.float32)
>>> preds = result_enet.predict(X_new)
>>> print(preds)
0    3.670...
1    3.221...
dtype: float32
Attributes
coef_array, shape (n_features)

The estimated coefficients for the linear regression model.

intercept_array

The independent term. If fit_intercept is False, will be 0.

Methods

fit(self, X, y[, convert_dtype])

Fit the model with X and y.

get_param_names(self)

set_params(self, **params)

fit(self, X, y, convert_dtype=True) 'ElasticNet'[source]

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_param_names(self)[source]

set_params(self, **params)[source]

Mini Batch SGD Classifier

class cuml.MBSGDClassifier(*, loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, epochs=1000, tol=0.001, shuffle=True, learning_rate='constant', eta0=0.001, power_t=0.5, batch_size=32, n_iter_no_change=5, handle=None, verbose=False, output_type=None)

Linear models (linear SVM, logistic regression, or linear regression) fitted by minimizing a regularized empirical loss with mini-batch SGD. The MBSGD Classifier implementation is experimental and uses a different algorithm than sklearn’s SGDClassifier. To improve the results obtained from cuML’s MBSGDClassifier:

  • Reduce the batch size

  • Increase the eta0

  • Increase the number of iterations

Because cuML analyzes the data in batches, a small eta0 might not let the model learn as much as Scikit-learn does. Furthermore, decreasing the batch size might increase the time required to fit the model.
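The way batch_size, epochs and shuffle interact can be sketched with a small index generator. This is a plain-Python illustration of mini-batching in general, not cuML’s implementation; the helper name iterate_minibatches and its seed argument are hypothetical, and for simplicity the sketch reshuffles at the start of each epoch.

```python
import random


def iterate_minibatches(n_samples, batch_size=32, epochs=1, shuffle=True, seed=0):
    """Yield (epoch, indices) pairs covering every sample once per epoch."""
    rng = random.Random(seed)
    order = list(range(n_samples))
    for epoch in range(epochs):
        if shuffle:
            rng.shuffle(order)
        for start in range(0, n_samples, batch_size):
            # The last batch of an epoch may be smaller than batch_size.
            yield epoch, order[start:start + batch_size]


batches = list(iterate_minibatches(n_samples=10, batch_size=4, epochs=2))
for epoch, indices in batches:
    print(epoch, indices)
```

With 10 samples and a batch size of 4, each epoch produces three batches (4, 4 and 2 samples). A smaller batch size means more parameter updates per epoch, which is why it can improve results while increasing fit time.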

Parameters
loss{‘hinge’, ‘log’, ‘squared_loss’} (default = ‘hinge’)

‘hinge’ uses linear SVM

‘log’ uses logistic regression

‘squared_loss’ uses linear regression

penalty{‘none’, ‘l1’, ‘l2’, ‘elasticnet’} (default = ‘l2’)

‘none’ does not perform any regularization

‘l1’ performs L1 norm (Lasso) which minimizes the sum of the abs value of coefficients

‘l2’ performs L2 norm (Ridge) which minimizes the sum of the square of the coefficients

‘elasticnet’ performs Elastic Net regularization which is a weighted average of L1 and L2 norms

alphafloat (default = 0.0001)

The constant value which decides the degree of regularization

l1_ratiofloat (default=0.15)

The l1_ratio is used only when penalty = elasticnet. The value for l1_ratio should be 0 <= l1_ratio <= 1. When l1_ratio = 0 then the penalty = 'l2' and if l1_ratio = 1 then penalty = 'l1'

batch_sizeint (default = 32)

It sets the number of samples that will be included in each batch.

fit_interceptboolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

epochsint (default = 1000)

The number of times the model should iterate through the entire dataset during training (default = 1000)

tolfloat (default = 1e-3)

The training process will stop if current_loss > previous_loss - tol

shuffleboolean (default = True)

If True, shuffles the training data after each epoch. If False, does not shuffle the training data after each epoch.

eta0float (default = 0.001)

Initial learning rate

power_tfloat (default = 0.5)

The exponent used for calculating the invscaling learning rate

learning_rate{‘optimal’, ‘constant’, ‘invscaling’, ‘adaptive’} (default = ‘constant’)

optimal option will be supported in a future version

constant keeps the learning rate constant

adaptive changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs. The old learning rate is generally divided by 5

n_iter_no_changeint (default = 5)

The number of epochs to train without any improvement in the model.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Notes

For additional docs, see Scikit-learn’s SGDClassifier.
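The learning_rate options can be sketched in plain Python. The ‘invscaling’ form eta0 / t**power_t is scikit-learn’s definition and is an assumption here, since this reference only says that power_t is the exponent. The ‘adaptive’ rule below is simplified to track the training loss only and to divide the rate by 5 after n_iter_no_change epochs without an improvement larger than tol. Both helper names are hypothetical and this is not cuML’s implementation.

```python
def scheduled_rate(schedule, eta0, t, power_t=0.5):
    """Learning rate at update step t (starting at 1) for two schedules."""
    if schedule == "constant":
        return eta0
    if schedule == "invscaling":
        # Assumption: scikit-learn's form, eta0 / t**power_t.
        return eta0 / (t ** power_t)
    raise ValueError(f"unsupported schedule: {schedule!r}")


def adaptive_rates(losses, eta0, n_iter_no_change=5, tol=1e-3):
    """Learning rate used in each epoch under a simplified 'adaptive' rule."""
    eta, best, stale, rates = eta0, float("inf"), 0, []
    for loss in losses:
        rates.append(eta)
        # An epoch counts as stale when the loss did not beat best - tol.
        stale = stale + 1 if loss > best - tol else 0
        best = min(best, loss)
        if stale >= n_iter_no_change:
            eta /= 5.0  # the old learning rate is divided by 5
            stale = 0
    return rates


print(scheduled_rate("invscaling", eta0=0.1, t=4))  # 0.1 / 4**0.5
rates = adaptive_rates([1.0, 0.5, 0.5, 0.5, 0.4], eta0=0.05, n_iter_no_change=2)
print(rates)
```

In the second call the loss stalls for two consecutive epochs, so the rate drops from 0.05 to about 0.01 for the final epoch.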

Examples

>>> import cupy as cp
>>> import cudf
>>> from cuml.linear_model import MBSGDClassifier
>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([1,1,2,2], dtype = cp.float32)
>>> X['col2'] = cp.array([1,2,2,3], dtype = cp.float32)
>>> y = cudf.Series(cp.array([1, 1, 2, 2], dtype=cp.float32))
>>> pred_data = cudf.DataFrame()
>>> pred_data['col1'] = cp.asarray([3, 2], dtype=cp.float32)
>>> pred_data['col2'] = cp.asarray([5, 5], dtype=cp.float32)
>>> cu_mbsgd_classifier = MBSGDClassifier(learning_rate='constant',
...                                       eta0=0.05, epochs=2000,
...                                       fit_intercept=True,
...                                       batch_size=1, tol=0.0,
...                                       penalty='l2',
...                                       loss='squared_loss',
...                                       alpha=0.5)
>>> cu_mbsgd_classifier.fit(X, y)
MBSGDClassifier()
>>> print("cuML intercept : ", cu_mbsgd_classifier.intercept_)
cuML intercept :  0.725...
>>> print("cuML coef : ", cu_mbsgd_classifier.coef_)
cuML coef :  0    0.273...
1    0.182...
dtype: float32
>>> cu_pred = cu_mbsgd_classifier.predict(pred_data)
>>> print("cuML predictions : ", cu_pred)
cuML predictions :  0   1.0
1    1.0
dtype: float32

Methods

fit(self, X, y[, convert_dtype])

Fit the model with X and y.

get_param_names(self)

predict(self, X[, convert_dtype])

Predicts the y for X.

set_params(self, **params)

fit(self, X, y, convert_dtype=True) 'MBSGDClassifier'[source]

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_param_names(self)[source]

predict(self, X, convert_dtype=False) CumlArray[source]

Predicts the y for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = False)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

set_params(self, **params)[source]

Mini Batch SGD Regressor

class cuml.MBSGDRegressor(*, loss='squared_loss', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, epochs=1000, tol=0.001, shuffle=True, learning_rate='constant', eta0=0.001, power_t=0.5, batch_size=32, n_iter_no_change=5, handle=None, verbose=False, output_type=None)

Linear regression model fitted by minimizing a regularized empirical loss with mini-batch SGD. The MBSGD Regressor implementation is experimental and uses a different algorithm than sklearn’s SGDRegressor. To improve the results obtained from cuML’s MBSGD Regressor:

  • Reduce the batch size

  • Increase eta0

  • Increase the number of iterations

Because cuML processes the data in batches, a small eta0 might not let the model learn as much as scikit-learn’s does. Furthermore, decreasing the batch size might increase the time required to fit the model.

Parameters
loss‘squared_loss’ (default = ‘squared_loss’)

‘squared_loss’ uses linear regression

penalty‘none’, ‘l1’, ‘l2’, ‘elasticnet’ (default = ‘l2’)

  • ‘none’ does not perform any regularization.

  • ‘l1’ performs L1 norm (Lasso) regularization, which minimizes the sum of the absolute values of the coefficients.

  • ‘l2’ performs L2 norm (Ridge) regularization, which minimizes the sum of the squares of the coefficients.

  • ‘elasticnet’ performs Elastic Net regularization, which is a weighted average of the L1 and L2 norms.

alphafloat (default = 0.0001)

The constant value which decides the degree of regularization

fit_interceptboolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

l1_ratiofloat (default=0.15)

The l1_ratio is used only when penalty = elasticnet. The value for l1_ratio should be 0 <= l1_ratio <= 1. When l1_ratio = 0 then the penalty = 'l2' and if l1_ratio = 1 then penalty = 'l1'

batch_sizeint (default = 32)

It sets the number of samples that will be included in each batch.

epochsint (default = 1000)

The number of times the model should iterate through the entire dataset during training.

tolfloat (default = 1e-3)

The training process will stop if current_loss > previous_loss - tol

shuffleboolean (default = True)

If True, shuffles the training data after each epoch. If False, does not shuffle the training data after each epoch.

eta0float (default = 0.001)

Initial learning rate

power_tfloat (default = 0.5)

The exponent used for calculating the invscaling learning rate

learning_rate{‘optimal’, ‘constant’, ‘invscaling’, ‘adaptive’} (default = ‘constant’)

optimal option will be supported in a future version

constant keeps the learning rate constant

adaptive changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs. The old learning rate is generally divided by 5

n_iter_no_changeint (default = 5)

The number of epochs to train without any improvement in the model.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
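
The penalty, alpha and l1_ratio parameters can be illustrated with a small pure-Python sketch. It assumes scikit-learn's documented convention for the regularization term (alpha times the L1 norm, alpha times half the squared L2 norm, and their l1_ratio-weighted mix for elastic net). cuML's internal scaling may differ, and penalty_value is a hypothetical helper, not part of cuML.

```python
def penalty_value(weights, penalty="l2", alpha=0.0001, l1_ratio=0.15):
    """Return the regularization term added to the loss (illustrative only)."""
    l1 = sum(abs(w) for w in weights)        # L1 norm: sum of absolute values
    l2 = 0.5 * sum(w * w for w in weights)   # half the squared L2 norm
    if penalty == "none":
        return 0.0
    if penalty == "l1":
        return alpha * l1
    if penalty == "l2":
        return alpha * l2
    if penalty == "elasticnet":
        # Weighted average of the L1 and L2 terms
        return alpha * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)
    raise ValueError(f"unknown penalty: {penalty!r}")


w = [0.5, -2.0, 1.0]
pure_l1 = penalty_value(w, "l1", alpha=0.1)
pure_l2 = penalty_value(w, "l2", alpha=0.1)
# l1_ratio = 1 reproduces 'l1' and l1_ratio = 0 reproduces 'l2'
as_l1 = penalty_value(w, "elasticnet", alpha=0.1, l1_ratio=1.0)
as_l2 = penalty_value(w, "elasticnet", alpha=0.1, l1_ratio=0.0)
```

Under this convention the two extreme values of l1_ratio recover the pure penalties, which matches the description of l1_ratio in the parameter list.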

Notes

For additional docs, see scikit-learn’s SGDRegressor.
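
As a rough illustration of the learning_rate options, the sketch below computes the rate used in each epoch from a list of per-epoch losses. It is an assumption-laden sketch, not cuML's implementation: the invscaling rule eta0 / t^power_t is taken from scikit-learn's documentation and applied per epoch here, the adaptive rule divides the rate by 5 after n_iter_no_change epochs without improvement as described in the parameter list, and learning_rate_schedule is a hypothetical helper.

```python
def learning_rate_schedule(losses, learning_rate="constant", eta0=0.001,
                           power_t=0.5, n_iter_no_change=5):
    """Return the learning rate used in each epoch, given per-epoch losses."""
    rates = []
    eta = eta0
    best_loss = float("inf")
    stale_epochs = 0
    for epoch, loss in enumerate(losses, start=1):
        if learning_rate == "invscaling":
            eta = eta0 / (epoch ** power_t)   # decays with the epoch index
        rates.append(eta)
        if learning_rate == "adaptive":
            if loss < best_loss:
                best_loss = loss
                stale_epochs = 0
            else:
                stale_epochs += 1
            if stale_epochs >= n_iter_no_change:
                eta /= 5.0                    # shrink the rate after a plateau
                stale_epochs = 0
    return rates


losses = [1.0, 0.9, 0.9, 0.9, 0.5]
constant = learning_rate_schedule(losses, "constant", eta0=1.0)
invscaling = learning_rate_schedule(losses, "invscaling", eta0=1.0)
adaptive = learning_rate_schedule(losses, "adaptive", eta0=1.0,
                                  n_iter_no_change=2)
```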

Examples

>>> import cupy as cp
>>> import cudf
>>> from cuml.linear_model import MBSGDRegressor as cumlMBSGDRegressor
>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([1,1,2,2], dtype = cp.float32)
>>> X['col2'] = cp.array([1,2,2,3], dtype = cp.float32)
>>> y = cudf.Series(cp.array([1, 1, 2, 2], dtype=cp.float32))
>>> pred_data = cudf.DataFrame()
>>> pred_data['col1'] = cp.asarray([3, 2], dtype=cp.float32)
>>> pred_data['col2'] = cp.asarray([5, 5], dtype=cp.float32)
>>> cu_mbsgd_regressor = cumlMBSGDRegressor(learning_rate='constant',
...                                         eta0=0.05, epochs=2000,
...                                         fit_intercept=True,
...                                         batch_size=1, tol=0.0,
...                                         penalty='l2',
...                                         loss='squared_loss',
...                                         alpha=0.5)
>>> cu_mbsgd_regressor.fit(X, y)
MBSGDRegressor()
>>> print("cuML intercept : ", cu_mbsgd_regressor.intercept_)
cuML intercept :  0.725...
>>> print("cuML coef : ", cu_mbsgd_regressor.coef_)
cuML coef :  0    0.273...
1     0.182...
dtype: float32
>>> cu_pred = cu_mbsgd_regressor.predict(pred_data)
>>> print("cuML predictions : ", cu_pred)
cuML predictions :  0    2.456...
1    2.183...
dtype: float32
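
To make the roles of batch_size, epochs, eta0, alpha, shuffle and tol concrete, here is a minimal pure-Python sketch of mini-batch SGD on the squared loss with an L2 penalty. It is not cuML's algorithm (which runs on the GPU and, as noted above, differs from scikit-learn's). fit_minibatch_sgd and predict_one are hypothetical helpers, and tol=None disabling the stopping check is a convenience of this sketch only.

```python
import random


def predict_one(coef, intercept, row):
    return sum(c * x for c, x in zip(coef, row)) + intercept


def fit_minibatch_sgd(X, y, eta0=0.1, epochs=2000, batch_size=2,
                      alpha=0.0, tol=None, shuffle=True, seed=0):
    """Fit y ~ X . coef + intercept with mini-batch SGD (illustrative only)."""
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    coef, intercept = [0.0] * n_features, 0.0
    order = list(range(n_samples))
    previous_loss = float("inf")
    for _ in range(epochs):
        if shuffle:
            rng.shuffle(order)  # new sample order for each pass over the data
        for start in range(0, n_samples, batch_size):
            batch = order[start:start + batch_size]
            # Residuals are computed once, before any weight is updated
            errors = [predict_one(coef, intercept, X[i]) - y[i] for i in batch]
            for j in range(n_features):
                grad = sum(e * X[i][j] for e, i in zip(errors, batch)) / len(batch)
                coef[j] -= eta0 * (grad + alpha * coef[j])  # L2 penalty gradient
            intercept -= eta0 * sum(errors) / len(batch)
        loss = sum((predict_one(coef, intercept, row) - t) ** 2
                   for row, t in zip(X, y)) / (2 * n_samples)
        # Stopping rule from the tol description: current > previous - tol
        if tol is not None and loss > previous_loss - tol:
            break
        previous_loss = loss
    return coef, intercept


X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2 * x + 1
coef, intercept = fit_minibatch_sgd(X, y)
```

On this noise-free data the sketch recovers a slope close to 2 and an intercept close to 1. A smaller batch_size means more weight updates per epoch, which is one reason reducing it can improve the fit at the cost of run time.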

Methods

fit(self, X, y[, convert_dtype])

Fit the model with X and y.

get_param_names(self)

predict(self, X[, convert_dtype])

Predicts the y for X.

set_params(self, **params)

fit(self, X, y, convert_dtype=True) 'MBSGDRegressor'[source]

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_param_names(self)[source]
predict(self, X, convert_dtype=False) CumlArray[source]

Predicts the y for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = False)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

set_params(self, **params)[source]

Multiclass Classification

class cuml.multiclass.MulticlassClassifier(estimator, *, handle=None, verbose=False, output_type=None, strategy='ovr')[source]

Wrapper around scikit-learn multiclass classifiers that allows choosing between different multiclass strategies.

The input can be any kind of cuML compatible array, and the output type follows cuML’s output type configuration rules.

Before passing the data to scikit-learn, it is converted to a host (NumPy) array. Under the hood, the data is partitioned for binary classification, and it is transferred back to the device by the cuML estimator. These copies back and forth between the device and the host have some overhead. For more details, see issue https://github.com/rapidsai/cuml/issues/2876.

Parameters
estimatorcuML estimator
handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

strategy: string {‘ovr’, ‘ovo’}, default=‘ovr’

Multiclass classification strategy: ‘ovr’: one vs. rest or ‘ovo’: one vs. one
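
The difference between the two strategies is how the multiclass problem is split into binary ones. The sketch below enumerates those binary sub-problems in plain Python. binary_problems is a hypothetical helper used only for illustration and says nothing about how scikit-learn or cuML organize the work internally.

```python
from itertools import combinations


def binary_problems(classes, strategy="ovr"):
    """List binary sub-problems as (positive_labels, negative_labels) pairs."""
    classes = sorted(set(classes))
    if strategy == "ovr":
        # One classifier per class: that class against all the others
        return [([c], [k for k in classes if k != c]) for c in classes]
    if strategy == "ovo":
        # One classifier per unordered pair of classes
        return [([a], [b]) for a, b in combinations(classes, 2)]
    raise ValueError(f"unknown strategy: {strategy!r}")


labels = [0, 1, 2, 3]
ovr = binary_problems(labels, "ovr")
ovo = binary_problems(labels, "ovo")
```

With n classes, 'ovr' yields n binary problems and 'ovo' yields n * (n - 1) / 2, so 'ovo' trains more estimators, each on a smaller subset of the data.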

Examples

>>> from cuml.linear_model import LogisticRegression
>>> from cuml.multiclass import MulticlassClassifier
>>> from cuml.datasets.classification import make_classification

>>> X, y = make_classification(n_samples=10, n_features=6,
...                            n_informative=4, n_classes=3,
...                            random_state=137)

>>> cls = MulticlassClassifier(LogisticRegression(), strategy='ovo')
>>> cls.fit(X,y)
MulticlassClassifier()
>>> cls.predict(X)
array([2, 0, 2, 2, 2, 1, 1, 0, 1, 1])

Attributes
classes_float, shape (n_classes_)

Array of class labels.

n_classes_int

Number of classes.

Methods

decision_function(X)

Calculate the decision function.

fit(X, y)

Fit a multiclass classifier.

get_param_names(self)

Returns a list of hyperparameter names owned by this class.

predict(X)

Predict using the multiclass classifier.

decision_function(X) cuml.common.array.CumlArray[source]

Calculate the decision function.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

Returns
resultscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Decision function values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

fit(X, y) cuml.multiclass.multiclass.MulticlassClassifier[source]

Fit a multiclass classifier.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix of any dtype. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

get_param_names(self)[source]

Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.

predict(X) cuml.common.array.CumlArray[source]

Predict using the multiclass classifier.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

Returns
predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

class cuml.multiclass.OneVsOneClassifier(estimator, *args, handle=None, verbose=False, output_type=None)[source]

Wrapper around scikit-learn’s class with the same name. The input can be any kind of cuML compatible array, and the output type follows cuML’s output type configuration rules.

Before passing the data to scikit-learn, it is converted to a host (NumPy) array. Under the hood, the data is partitioned for binary classification, and it is transferred back to the device by the cuML estimator. These copies back and forth between the device and the host have some overhead. For more details, see issue https://github.com/rapidsai/cuml/issues/2876.

For documentation see scikit-learn’s OneVsOneClassifier.

Parameters
estimatorcuML estimator
handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Examples

>>> from cuml.linear_model import LogisticRegression
>>> from cuml.multiclass import OneVsOneClassifier
>>> from cuml.datasets.classification import make_classification

>>> X, y = make_classification(n_samples=10, n_features=6,
...                            n_informative=4, n_classes=3,
...                            random_state=137)

>>> cls = OneVsOneClassifier(LogisticRegression())
>>> cls.fit(X,y)
OneVsOneClassifier()
>>> cls.predict(X)
array([2, 0, 2, 2, 2, 1, 1, 0, 1, 1])

Methods

get_param_names(self)

Returns a list of hyperparameter names owned by this class.

get_param_names(self)[source]

Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.

class cuml.multiclass.OneVsRestClassifier(estimator, *args, handle=None, verbose=False, output_type=None)[source]

Wrapper around scikit-learn’s class with the same name. The input can be any kind of cuML compatible array, and the output type follows cuML’s output type configuration rules.

Before passing the data to scikit-learn, it is converted to a host (NumPy) array. Under the hood, the data is partitioned for binary classification, and it is transferred back to the device by the cuML estimator. These copies back and forth between the device and the host have some overhead. For more details, see issue https://github.com/rapidsai/cuml/issues/2876.

For documentation see scikit-learn’s OneVsRestClassifier.

Parameters
estimatorcuML estimator
handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Examples

>>> from cuml.linear_model import LogisticRegression
>>> from cuml.multiclass import OneVsRestClassifier
>>> from cuml.datasets.classification import make_classification

>>> X, y = make_classification(n_samples=10, n_features=6,
...                            n_informative=4, n_classes=3,
...                            random_state=137)

>>> cls = OneVsRestClassifier(LogisticRegression())
>>> cls.fit(X,y)
OneVsRestClassifier()
>>> cls.predict(X)
array([2, 0, 2, 2, 2, 1, 1, 0, 1, 1])

Methods

get_param_names(self)

Returns a list of hyperparameter names owned by this class.

get_param_names(self)[source]

Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.

Naive Bayes

class cuml.naive_bayes.MultinomialNB(*, alpha=1.0, fit_prior=True, class_prior=None, output_type=None, handle=None, verbose=False)[source]

Naive Bayes classifier for multinomial models

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification).

The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

Parameters
alphafloat (default=1.0)

Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).

fit_priorboolean (default=True)

Whether to learn class prior probabilities or not. If False, a uniform prior will be used.

class_priorarray-like, size (n_classes) (default=None)

Prior probabilities of the classes. If specified, the priors are not adjusted according to the data.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Examples

Load the 20 newsgroups dataset from Scikit-learn and train a Naive Bayes classifier.

>>> import cupy as cp
>>> import cupyx
>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from cuml.naive_bayes import MultinomialNB

>>> # Load corpus
>>> twenty_train = fetch_20newsgroups(subset='train', shuffle=True,
...                                   random_state=42)

>>> # Turn documents into term frequency vectors

>>> count_vect = CountVectorizer()
>>> features = count_vect.fit_transform(twenty_train.data)

>>> # Put feature vectors and labels on the GPU

>>> X = cupyx.scipy.sparse.csr_matrix(features.tocsr(),
...                                   dtype=cp.float32)
>>> y = cp.asarray(twenty_train.target, dtype=cp.int32)

>>> # Train model

>>> model = MultinomialNB()
>>> model.fit(X, y)
MultinomialNB()

>>> # Compute accuracy on training set

>>> model.score(X, y)
0.9245...

Attributes
class_count_ndarray of shape (n_classes)

Number of samples encountered for each class during fitting.

class_log_prior_ndarray of shape (n_classes)

Log probability of each class (smoothed).

classes_ndarray of shape (n_classes,)

Class labels known to the classifier

feature_count_ndarray of shape (n_classes, n_features)

Number of samples encountered for each (class, feature) during fitting.

feature_log_prob_ndarray of shape (n_classes, n_features)

Empirical log probability of features given a class, P(x_i|y).

n_features_int

Number of features of each sample.
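
The relationship between the count attributes and the log-probability attributes can be sketched in plain Python. This assumes the standard additive (Laplace/Lidstone) smoothing formula for multinomial Naive Bayes and a prior learned from class frequencies. multinomial_nb_log_probs is a hypothetical helper, and cuML's numerical details may differ.

```python
import math


def multinomial_nb_log_probs(feature_count, class_count, alpha=1.0):
    """Derive log priors and smoothed feature log probabilities from counts."""
    n_features = len(feature_count[0])
    total = sum(class_count)
    class_log_prior = [math.log(c / total) for c in class_count]
    feature_log_prob = []
    for row in feature_count:
        # alpha is added to every feature count, so the denominator
        # grows by alpha * n_features
        denom = sum(row) + alpha * n_features
        feature_log_prob.append([math.log((c + alpha) / denom) for c in row])
    return class_log_prior, feature_log_prob


feature_count = [[2, 0], [1, 3]]   # rows: classes, columns: features
class_count = [1, 3]
log_prior, log_prob = multinomial_nb_log_probs(feature_count, class_count)
```

With alpha=1.0, the feature that never occurs in the first class still receives probability 1/4 rather than zero, which is the purpose of the smoothing parameter.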

class cuml.naive_bayes.BernoulliNB(*, alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None, output_type=None, handle=None, verbose=False)[source]

Naive Bayes classifier for multivariate Bernoulli models. Like MultinomialNB, this classifier is suitable for discrete data. The difference is that while MultinomialNB works with occurrence counts, BernoulliNB is designed for binary/boolean features.

Parameters
alphafloat, default=1.0

Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).

binarizefloat or None, default=0.0

Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.

fit_priorbool, default=True

Whether to learn class prior probabilities or not. If false, a uniform prior will be used.

class_priorarray-like of shape (n_classes,), default=None

Prior probabilities of the classes. If specified the priors are not adjusted according to the data.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
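
The effect of the binarize threshold can be sketched as follows. The sketch assumes scikit-learn's convention that values strictly greater than the threshold map to 1 and all others map to 0. binarize_features is a hypothetical helper, not a cuML function.

```python
def binarize_features(X, threshold=0.0):
    """Map each feature to 0/1; with threshold=None the input is returned as is."""
    if threshold is None:
        return X  # input is presumed to already be binary
    return [[1 if value > threshold else 0 for value in row] for row in X]


X = [[0, 3, 1], [2, 0, 0]]
binary_default = binarize_features(X)               # threshold 0.0
binary_high = binarize_features(X, threshold=1.5)   # only values above 1.5
```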

References

C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to Information Retrieval. Cambridge University Press, pp. 234-265. https://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html

A. McCallum and K. Nigam (1998). A comparison of event models for naive Bayes text classification. Proc. AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.

V. Metsis, I. Androutsopoulos and G. Paliouras (2006). Spam filtering with naive Bayes – Which naive Bayes? 3rd Conf. on Email and Anti-Spam (CEAS).

Examples

>>> import cupy as cp
>>> rng = cp.random.RandomState(1)
>>> X = rng.randint(5, size=(6, 100), dtype=cp.int32)
>>> Y = cp.array([1, 2, 3, 4, 4, 5])
>>> from cuml.naive_bayes import BernoulliNB
>>> clf = BernoulliNB()
>>> clf.fit(X, Y)
BernoulliNB()
>>> print(clf.predict(X[2:3]))
[3]

Attributes
class_count_ndarray of shape (n_classes)

Number of samples encountered for each class during fitting.

class_log_prior_ndarray of shape (n_classes)

Log probability of each class (smoothed).

classes_ndarray of shape (n_classes,)

Class labels known to the classifier

feature_count_ndarray of shape (n_classes, n_features)

Number of samples encountered for each (class, feature) during fitting.

feature_log_prob_ndarray of shape (n_classes, n_features)

Empirical log probability of features given a class, P(x_i|y).

n_features_int

Number of features of each sample.

Methods

get_param_names(self)

Returns a list of hyperparameter names owned by this class.

get_param_names(self)[source]

Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.

class cuml.naive_bayes.GaussianNB(*, priors=None, var_smoothing=1e-09, output_type=None, handle=None, verbose=False)[source]

Gaussian Naive Bayes (GaussianNB).

Can perform online updates to model parameters via partial_fit(). For details on the algorithm used to update feature means and variances online, see Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque.

Parameters
priorsarray-like of shape (n_classes,)

Prior probabilities of the classes. If specified the priors are not adjusted according to the data.

var_smoothingfloat, default=1e-9

Portion of the largest variance of all features that is added to variances for calculation stability.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Examples

>>> import cupy as cp
>>> X = cp.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1],
...                 [3, 2]], cp.float32)
>>> Y = cp.array([1, 1, 1, 2, 2, 2], cp.float32)
>>> from cuml.naive_bayes import GaussianNB
>>> clf = GaussianNB()
>>> clf.fit(X, Y)
GaussianNB()
>>> print(clf.predict(cp.array([[-0.8, -1]], cp.float32)))
[1]
>>> clf_pf = GaussianNB()
>>> clf_pf.partial_fit(X, Y, cp.unique(Y))
GaussianNB()
>>> print(clf_pf.predict(cp.array([[-0.8, -1]], cp.float32)))
[1]

Methods

fit(X, y[, sample_weight])

Fit Gaussian Naive Bayes classifier according to X, y

get_param_names(self)

Returns a list of hyperparameter names owned by this class.

partial_fit(X, y[, classes, sample_weight])

Incremental fit on a batch of samples.

fit(X, y, sample_weight=None) cuml.naive_bayes.naive_bayes.GaussianNB[source]

Fit Gaussian Naive Bayes classifier according to X, y

Parameters
X{array-like, cupy sparse matrix} of shape (n_samples, n_features)

Training vectors, where n_samples is the number of samples and n_features is the number of features.

yarray-like of shape (n_samples)

Target values.

sample_weightarray-like of shape (n_samples)

Weights applied to individual samples (1. for unweighted). Currently sample weight is ignored.

get_param_names(self)[source]

Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.

partial_fit(X, y, classes=None, sample_weight=None) cuml.naive_bayes.naive_bayes.GaussianNB[source]

Incremental fit on a batch of samples. This method is expected to be called several times consecutively on different chunks of a dataset so as to implement out-of-core or online learning. This is especially useful when the whole dataset is too big to fit in memory at once. This method has some performance overhead, so it is better to call partial_fit on chunks of data that are as large as possible (as long as they fit in the memory budget) to hide the overhead.

Parameters
X{array-like, cupy sparse matrix} of shape (n_samples, n_features)

Training vectors, where n_samples is the number of samples and n_features is the number of features. A sparse matrix in COO format is preferred, other formats will go through a conversion to COO.

yarray-like of shape (n_samples)

Target values.

classesarray-like of shape (n_classes)

List of all the classes that can possibly appear in the y vector. Must be provided at the first call to partial_fit, can be omitted in subsequent calls.

sample_weightarray-like of shape (n_samples)

Weights applied to individual samples (1. for unweighted). Currently sample weight is ignored.

Returns
selfobject
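
The idea behind updating feature means and variances chunk by chunk can be sketched with a pairwise merge of summary statistics, in the spirit of the Chan, Golub, and LeVeque report mentioned in the class description. This is a single-feature, pure-Python sketch using the population variance (dividing by n). merge_mean_var is a hypothetical helper and is not claimed to match cuML's code.

```python
def merge_mean_var(n_a, mean_a, var_a, chunk):
    """Update a running (count, mean, variance) with a new chunk of values."""
    n_b = len(chunk)
    mean_b = sum(chunk) / n_b
    var_b = sum((x - mean_b) ** 2 for x in chunk) / n_b
    if n_a == 0:
        return n_b, mean_b, var_b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # Combine the sums of squared deviations of both parts, plus a
    # correction term for the distance between their means
    m2 = var_a * n_a + var_b * n_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2 / n


state = (0, 0.0, 0.0)
for chunk in ([1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0]):
    state = merge_mean_var(*state, chunk)
count, mean, var = state
```

After the three chunks, the running statistics equal those of the full dataset 1..8 computed in one pass, which is what allows partial_fit to process data that does not fit in memory at once.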

class cuml.naive_bayes.CategoricalNB(*, alpha=1.0, fit_prior=True, class_prior=None, output_type=None, handle=None, verbose=False)[source]

Naive Bayes classifier for categorical features.

The categorical Naive Bayes classifier is suitable for classification with discrete features that are categorically distributed. The categories of each feature are drawn from a categorical distribution.

Parameters
alphafloat, default=1.0

Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).

fit_priorbool, default=True

Whether to learn class prior probabilities or not. If false, a uniform prior will be used.

class_priorarray-like of shape (n_classes,), default=None

Prior probabilities of the classes. If specified the priors are not adjusted according to the data.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Examples

>>> import cupy as cp
>>> rng = cp.random.RandomState(1)
>>> X = rng.randint(5, size=(6, 100), dtype=cp.int32)
>>> y = cp.array([1, 2, 3, 4, 5, 6])
>>> from cuml.naive_bayes import CategoricalNB
>>> clf = CategoricalNB()
>>> clf.fit(X, y)
CategoricalNB()
>>> print(clf.predict(X[2:3]))
[3]

Attributes
category_count_ndarray of shape (n_features, n_classes, n_categories)

With n_categories being the highest category of all the features. This array provides the number of samples encountered for each feature, class and category of the specific feature.

class_count_ndarray of shape (n_classes,)

Number of samples encountered for each class during fitting.

class_log_prior_ndarray of shape (n_classes,)

Smoothed empirical log probability for each class.

classes_ndarray of shape (n_classes,)

Class labels known to the classifier

feature_log_prob_ndarray of shape (n_features, n_classes, n_categories)

With n_categories being the highest category of all the features. Each array of shape (n_classes, n_categories) provides the empirical log probability of categories given the respective feature and class, P(x_i|y). This attribute is not available when the model has been trained with sparse data.

n_features_int

Number of features of each sample.
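
The layout of category_count_ can be illustrated with nested Python lists indexed as [feature][class][category]. The sketch assumes categories are encoded as integers 0, …, n - 1, as the fit documentation requires. category_count is a hypothetical helper, not cuML's implementation.

```python
def category_count(X, y):
    """Count samples per (feature, class, category) in nested lists."""
    classes = sorted(set(y))
    n_features = len(X[0])
    # One slot per category up to the highest category over all features
    n_categories = max(max(row) for row in X) + 1
    counts = [[[0] * n_categories for _ in classes] for _ in range(n_features)]
    for row, label in zip(X, y):
        k = classes.index(label)
        for j, category in enumerate(row):
            counts[j][k][category] += 1
    return classes, counts


X = [[0, 1], [1, 1], [0, 2]]
y = [0, 0, 1]
classes, counts = category_count(X, y)
```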

Methods

fit(X, y[, sample_weight])

Fit Naive Bayes classifier according to X, y

partial_fit(X, y[, classes, sample_weight])

Incremental fit on a batch of samples.

fit(X, y, sample_weight=None) cuml.naive_bayes.naive_bayes.CategoricalNB[source]

Fit Naive Bayes classifier according to X, y

Parameters
Xarray-like of shape (n_samples, n_features)

Training vectors, where n_samples is the number of samples and n_features is the number of features. Here, each feature of X is assumed to be from a different categorical distribution. It is further assumed that all categories of each feature are represented by the numbers 0, …, n - 1, where n refers to the total number of categories for the given feature. This can, for instance, be achieved with the help of OrdinalEncoder.

yarray-like of shape (n_samples,)

Target values.

sample_weightarray-like of shape (n_samples), default=None

Weights applied to individual samples (1. for unweighted). Currently sample weight is ignored.

Returns
selfobject

partial_fit(X, y, classes=None, sample_weight=None) cuml.naive_bayes.naive_bayes.CategoricalNB[source]

Incremental fit on a batch of samples. This method is expected to be called several times consecutively on different chunks of a dataset so as to implement out-of-core or online learning. This is especially useful when the whole dataset is too big to fit in memory at once. This method has some performance overhead, so it is better to call partial_fit on chunks of data that are as large as possible (as long as they fit in the memory budget) to hide the overhead.

Parameters
Xarray-like of shape (n_samples, n_features)

Training vectors, where n_samples is the number of samples and n_features is the number of features. Here, each feature of X is assumed to be from a different categorical distribution. It is further assumed that all categories of each feature are represented by the numbers 0, …, n - 1, where n refers to the total number of categories for the given feature. This can, for instance, be achieved with the help of OrdinalEncoder.

yarray-like of shape (n_samples)

Target values.

classesarray-like of shape (n_classes), default=None

List of all the classes that can possibly appear in the y vector. Must be provided at the first call to partial_fit, can be omitted in subsequent calls.

sample_weightarray-like of shape (n_samples), default=None

Weights applied to individual samples (1. for unweighted). Currently sample weight is ignored.

Returns
selfobject
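A minimal sketch of how a dataset might be split into chunks for repeated partial_fit calls. The helper `chunk_slices` is hypothetical, and the commented loop at the end only indicates how the slices could be used; it assumes `model`, `X`, `y` and `all_classes` exist.

```python
def chunk_slices(n_samples, chunk_size):
    """Yield consecutive slices that together cover range(n_samples).

    Hypothetical helper for illustration only; not a cuML function.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_samples, chunk_size):
        yield slice(start, min(start + chunk_size, n_samples))


slices = list(chunk_slices(10, 4))
print(slices)  # [slice(0, 4, None), slice(4, 8, None), slice(8, 10, None)]

# Possible usage (not executed here; names are assumptions):
# for i, s in enumerate(chunk_slices(len(X), chunk_size)):
#     # classes must be given on the first call and may be omitted afterwards
#     model.partial_fit(X[s], y[s], classes=all_classes if i == 0 else None)
```

Choosing the largest chunk_size that fits in memory keeps the number of calls, and therefore the per-call overhead, low.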

Stochastic Gradient Descent

class cuml.SGD(*, loss='squared_loss', penalty='none', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, epochs=1000, tol=0.001, shuffle=True, learning_rate='constant', eta0=0.001, power_t=0.5, batch_size=32, n_iter_no_change=5, handle=None, output_type=None, verbose=False)

Stochastic Gradient Descent is a very common machine learning algorithm where one optimizes some cost function via gradient steps. This makes SGD very attractive for large problems when the exact solution is hard or even impossible to find.

cuML’s SGD algorithm accepts a numpy matrix or a cuDF DataFrame as the input dataset. The SGD algorithm currently works with linear regression, ridge regression and SVM models.

Parameters
loss‘hinge’, ‘log’, ‘squared_loss’ (default = ‘squared_loss’)

  • ‘hinge’ uses linear SVM

  • ‘log’ uses logistic regression

  • ‘squared_loss’ uses linear regression

penalty‘none’, ‘l1’, ‘l2’, ‘elasticnet’ (default = ‘none’)

  • ‘none’ does not perform any regularization

  • ‘l1’ performs L1 norm (Lasso) regularization, which minimizes the sum of the absolute values of the coefficients

  • ‘l2’ performs L2 norm (Ridge) regularization, which minimizes the sum of the squares of the coefficients

  • ‘elasticnet’ performs Elastic Net regularization, which is a weighted average of the L1 and L2 norms

alphafloat (default = 0.0001)

The constant value which decides the degree of regularization
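The penalty options can be read as different terms added to the loss. The sketch below is an illustration under assumptions: the function `penalty_term` is hypothetical, and the exact scaling that cuML applies internally (for example any constant factor on the L2 term) may differ. It uses `l1_ratio` from the class signature as the weight of the L1 part.

```python
def penalty_term(coef, penalty="none", alpha=0.0001, l1_ratio=0.15):
    """Regularization term for a coefficient vector (illustrative only)."""
    l1 = sum(abs(w) for w in coef)   # sum of absolute values
    l2 = sum(w * w for w in coef)    # sum of squares
    if penalty == "none":
        return 0.0
    if penalty == "l1":
        return alpha * l1
    if penalty == "l2":
        return alpha * l2
    if penalty == "elasticnet":
        # Weighted average of the L1 and L2 terms.
        return alpha * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)
    raise ValueError(f"unknown penalty: {penalty!r}")


coef = [1.0, -2.0]
print(penalty_term(coef, "l1", alpha=0.5))  # 1.5
print(penalty_term(coef, "l2", alpha=0.5))  # 2.5
```

A larger alpha makes the penalty dominate the loss, which pushes the coefficients towards zero.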

fit_interceptboolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

epochsint (default = 1000)

The number of times the model should iterate through the entire dataset during training.

tolfloat (default = 1e-3)

The training process will stop if current_loss > previous_loss - tol

shuffleboolean (default = True)

  • True shuffles the training data after each epoch

  • False does not shuffle the training data after each epoch

eta0float (default = 0.001)

Initial learning rate

power_tfloat (default = 0.5)

The exponent used for calculating the invscaling learning rate

batch_sizeint (default=32)

The number of samples to use for each batch.

learning_rate‘optimal’, ‘constant’, ‘invscaling’, ‘adaptive’ (default = ‘constant’)

  • ‘optimal’ will be supported in the next version

  • ‘constant’ keeps the learning rate constant

  • ‘adaptive’ changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs. The old learning rate is generally divided by 5

n_iter_no_changeint (default = 5)

The number of epochs to train without any improvement in the model

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
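How eta0, power_t and n_iter_no_change interact with learning_rate can be sketched as follows. This is an assumption-laden illustration, not cuML's implementation: the inverse-scaling formula eta0 / t**power_t follows the common scikit-learn convention, and both function names are hypothetical.

```python
def scheduled_rate(step, learning_rate="constant", eta0=0.001, power_t=0.5):
    """Learning rate at a given 1-based step (illustrative only)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    if learning_rate == "constant":
        return eta0
    if learning_rate == "invscaling":
        # Assumed convention: the rate decays as eta0 / step**power_t.
        return eta0 / (step ** power_t)
    raise ValueError(f"schedule not covered by this sketch: {learning_rate!r}")


def adaptive_rate(current_rate, epochs_without_improvement, n_iter_no_change=5):
    """Divide the rate by 5 once training has stalled for n_iter_no_change epochs."""
    if epochs_without_improvement >= n_iter_no_change:
        return current_rate / 5
    return current_rate


print(scheduled_rate(4, "invscaling", eta0=0.1, power_t=0.5))  # 0.05
```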

Examples

>>> import numpy as np
>>> import cudf
>>> from cuml.solvers import SGD as cumlSGD
>>> X = cudf.DataFrame()
>>> X['col1'] = np.array([1,1,2,2], dtype=np.float32)
>>> X['col2'] = np.array([1,2,2,3], dtype=np.float32)
>>> y = cudf.Series(np.array([1, 1, 2, 2], dtype=np.float32))
>>> pred_data = cudf.DataFrame()
>>> pred_data['col1'] = np.asarray([3, 2], dtype=np.float32)
>>> pred_data['col2'] = np.asarray([5, 5], dtype=np.float32)
>>> cu_sgd = cumlSGD(learning_rate='constant', eta0=0.005, epochs=2000,
...                  fit_intercept=True, batch_size=2,
...                  tol=0.0, penalty='none', loss='squared_loss')
>>> cu_sgd.fit(X, y)
SGD()
>>> cu_pred = cu_sgd.predict(pred_data).to_numpy()
>>> print(" cuML intercept : ", cu_sgd.intercept_) 
cuML intercept :  0.00418...
>>> print(" cuML coef : ", cu_sgd.coef_) 
cuML coef :  0      0.9841...
1      0.0097...
dtype: float32
>>> print("cuML predictions : ", cu_pred) 
cuML predictions :  [3.0055...  2.0214...]
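For the 'squared_loss' (linear regression) case, the predictions above can be reproduced approximately by hand from the printed intercept and coefficients. The sketch below uses the truncated values shown in the output, so it only matches to a few decimal places; `predict_row` is a hypothetical helper.

```python
# Truncated values copied from the example output above.
intercept = 0.00418
coef = [0.9841, 0.0097]


def predict_row(row, coef, intercept):
    """Linear prediction: intercept plus the dot product of coef and row."""
    return intercept + sum(w * x for w, x in zip(coef, row))


pred_rows = [[3.0, 5.0], [2.0, 5.0]]  # same rows as pred_data
preds = [predict_row(row, coef, intercept) for row in pred_rows]
print([round(p, 2) for p in preds])  # [3.0, 2.02]
```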
Attributes
classes_
coef_

Methods

fit(self, X, y[, convert_dtype])

Fit the model with X and y.

get_param_names(self)

predict(self, X[, convert_dtype])

Predicts the y for X.

predictClass(self, X[, convert_dtype])

Predicts the y for X.

fit(self, X, y, convert_dtype=False) 'SGD'[source]

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = False)

When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_param_names(self)[source]
predict(self, X, convert_dtype=False) CumlArray[source]

Predicts the y for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = False)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

predictClass(self, X, convert_dtype=False) CumlArray[source]

Predicts the y for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = False)

When set to True, the predictClass method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

Random Forest

class cuml.ensemble.RandomForestClassifier(*, split_criterion=0, handle=None, verbose=False, output_type=None, **kwargs)

Implements a Random Forest classifier model which fits multiple decision tree classifiers in an ensemble.

Note

Note that the underlying algorithm for tree node splits differs from that used in scikit-learn. By default, the cuML Random Forest uses a quantile-based algorithm to determine splits, rather than an exact count. You can tune the size of the quantiles with the n_bins parameter.
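The general idea of quantile-based split candidates can be sketched in plain Python. This is only a rough illustration of binning a feature into at most n_bins quantile buckets; the function `candidate_thresholds` is hypothetical and cuML's actual split algorithm may differ in its details.

```python
def candidate_thresholds(values, n_bins):
    """Return up to n_bins - 1 quantile cut points for one feature (illustrative)."""
    if n_bins < 2:
        raise ValueError("n_bins must be at least 2")
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for i in range(1, n_bins):
        # Value found at the i-th of n_bins equally populated positions.
        edge = ordered[min(n - 1, (i * n) // n_bins)]
        # Skewed data can repeat the same value; keep each cut point once.
        if not edges or edge != edges[-1]:
            edges.append(edge)
    return edges


print(candidate_thresholds([1, 2, 3, 4, 5, 6, 7, 8], n_bins=4))      # [3, 5, 7]
print(candidate_thresholds([0, 0, 0, 0, 0, 0, 1, 100], n_bins=4))    # [0, 1]
```

Only these cut points are tried as split thresholds, instead of every distinct value, which is why a larger n_bins can help on highly skewed features.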

Note

You can export cuML Random Forest models and run predictions with them on machines without an NVIDIA GPU. See https://docs.rapids.ai/api/cuml/nightly/pickling_cuml_models.html for more details.

Parameters
n_estimatorsint (default = 100)

Number of trees in the forest. (Default changed to 100 in cuML 0.11)

split_criterionint or string (default = 0 ('gini'))

The criterion used to split nodes.

  • 0 or 'gini' for gini impurity

  • 1 or 'entropy' for information gain (entropy)

  • 2 or 'mse' for mean squared error

  • 4 or 'poisson' for poisson half deviance

  • 5 or 'gamma' for gamma half deviance

  • 6 or 'inverse_gaussian' for inverse gaussian deviance

Only 0/'gini' and 1/'entropy' are valid for classification.

bootstrapboolean (default = True)

Control bootstrapping.

  • If True, each tree in the forest is built on a bootstrapped sample with replacement.

  • If False, the whole dataset is used to build each tree.

max_samplesfloat (default = 1.0)

Ratio of dataset rows used while fitting each tree.

max_depthint (default = 16)

Maximum tree depth. Must be greater than 0. Unlimited depth (i.e., until leaves are pure) is not supported.

Note

This default differs from scikit-learn’s random forest, which defaults to unlimited depth.

max_leavesint (default = -1)

Maximum leaf nodes per tree. Soft constraint. Unlimited if -1.

max_featuresint, float, or string (default = ‘auto’)

Ratio of number of features (columns) to consider per node split.

  • If type int then max_features is the absolute count of features to be used

  • If type float then max_features is used as a fraction.

  • If 'auto' then max_features=1/sqrt(n_features).

  • If 'sqrt' then max_features=1/sqrt(n_features).

  • If 'log2' then max_features=log2(n_features)/n_features.
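The rules above for the classifier can be summarized as a small function that turns any accepted max_features value into a fraction of the columns. The name `resolve_max_features` is hypothetical, and converting an integer count into a fraction is a choice made here only so that every branch returns the same kind of value.

```python
import math


def resolve_max_features(max_features, n_features):
    """Fraction of features considered per split, following the list above."""
    if isinstance(max_features, str):
        if max_features in ("auto", "sqrt"):
            return 1.0 / math.sqrt(n_features)
        if max_features == "log2":
            return math.log2(n_features) / n_features
        raise ValueError(f"unknown max_features: {max_features!r}")
    if isinstance(max_features, int):
        # An int is an absolute count of features; express it as a fraction.
        return max_features / n_features
    # A float is already a fraction.
    return float(max_features)


print(resolve_max_features("auto", 16))  # 0.25
print(resolve_max_features("log2", 16))  # 0.25
print(resolve_max_features(4, 16))       # 0.25
```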

n_binsint (default = 128)

Maximum number of bins used by the split algorithm per feature. For large problems, particularly those with highly-skewed input data, increasing the number of bins may improve accuracy.

n_streamsint (default = 4)

Number of parallel streams used for forest building.

min_samples_leafint or float (default = 1)

The minimum number of samples (rows) in each leaf node.

  • If type int, then min_samples_leaf represents the minimum number.

  • If float, then min_samples_leaf represents a fraction and ceil(min_samples_leaf * n_rows) is the minimum number of samples for each leaf node.

min_samples_splitint or float (default = 2)

The minimum number of samples required to split an internal node.

  • If type int, then min_samples_split represents the minimum number.

  • If type float, then min_samples_split represents a fraction and max(2, ceil(min_samples_split * n_rows)) is the minimum number of samples for each split.
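The int and float forms of min_samples_leaf and min_samples_split translate into row counts as shown in this sketch. The two function names are hypothetical; the formulas are the ones stated in the parameter descriptions.

```python
import math


def resolve_min_samples_leaf(min_samples_leaf, n_rows):
    """Minimum rows per leaf: int is used as-is, float is a fraction of n_rows."""
    if isinstance(min_samples_leaf, int):
        return min_samples_leaf
    return math.ceil(min_samples_leaf * n_rows)


def resolve_min_samples_split(min_samples_split, n_rows):
    """Minimum rows to split a node: a float fraction never resolves below 2."""
    if isinstance(min_samples_split, int):
        return min_samples_split
    return max(2, math.ceil(min_samples_split * n_rows))


print(resolve_min_samples_leaf(0.25, 10))   # 3
print(resolve_min_samples_split(0.05, 10))  # 2
```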

min_impurity_decreasefloat (default = 0.0)

Minimum decrease in impurity required for a node to be split.

max_batch_sizeint (default = 4096)

Maximum number of nodes that can be processed in a given batch.

random_stateint (default = None)

Seed for the random number generator. Unseeded by default. Does not currently fully guarantee the exact same results.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Notes

Known Limitations

This is an early release of the cuML Random Forest code. It contains a few known limitations:

  • GPU-based inference is only supported with 32-bit (float32) datatypes. Alternatives are to use CPU-based inference for 64-bit (float64) datatypes, or let the default automatic datatype conversion occur during GPU inference.

  • While training the model for multi class classification problems, using deep trees or max_features=1.0 provides better performance.

For additional docs, see scikit-learn’s RandomForestClassifier.

Examples

>>> import cupy as cp
>>> from cuml.ensemble import RandomForestClassifier as cuRFC

>>> X = cp.random.normal(size=(10,4)).astype(cp.float32)
>>> y = cp.asarray([0,1]*5, dtype=cp.int32)

>>> cuml_model = cuRFC(max_features=1.0,
...                    n_bins=8,
...                    n_estimators=40)
>>> cuml_model.fit(X,y)
RandomForestClassifier()
>>> cuml_predict = cuml_model.predict(X)

>>> print("Predicted labels : ", cuml_predict)
Predicted labels :  [0. 1. 0. 1. 0. 1. 0. 1. 0. 1.]

Methods

convert_to_fil_model(self[, output_class, ...])

Create a Forest Inference (FIL) model from the trained cuML Random Forest model.

convert_to_treelite_model(self)

Converts the cuML RF model to a Treelite model

fit(self, X, y[, convert_dtype])

Perform Random Forest Classification on the input data

get_detailed_text(self)

Obtain the detailed information for the random forest model, as text

get_json(self)

Export the Random Forest model as a JSON string

get_summary_text(self)

Obtain the text summary of the random forest model

predict(self, X[, predict_model, threshold, ...])

Predicts the labels for X.

predict_proba(self, X[, algo, ...])

Predicts class probabilities for X.

score(self, X, y[, threshold, algo, ...])

Calculates the accuracy metric score of the model for X.

convert_to_fil_model(self, output_class=True, threshold=0.5, algo='auto', fil_sparse_format='auto')[source]

Create a Forest Inference (FIL) model from the trained cuML Random Forest model.

Parameters
output_classboolean (default = True)

This is optional and required only while performing the predict operation on the GPU. If true, return a 1 or 0 depending on whether the raw prediction exceeds the threshold. If False, just return the raw prediction.

algostring (default = ‘auto’)

This is optional and required only while performing the predict operation on the GPU.

  • 'naive' - simple inference using shared memory

  • 'tree_reorg' - similar to naive but trees rearranged to be more coalescing-friendly

  • 'batch_tree_reorg' - similar to tree_reorg but predicting multiple rows per thread block

  • 'auto' - choose the algorithm automatically. Currently 'batch_tree_reorg' is used for dense storage and 'naive' for sparse storage.

thresholdfloat (default = 0.5)

Threshold used for classification. Optional and required only while performing the predict operation on the GPU. It is applied if output_class == True, else it is ignored

fil_sparse_formatboolean or string (default = auto)

This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.

  • 'auto' - choose the storage type automatically (currently True is chosen by auto)

  • False - create a dense forest

  • True - create a sparse forest, requires algo=’naive’ or algo=’auto’

Returns
fil_model

A Forest Inference model which can be used to perform inferencing on the random forest model.
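The interaction between output_class and threshold described above can be sketched as follows. `apply_threshold` is a hypothetical helper that mimics the documented behaviour on a list of raw predictions; it does not call the Forest Inference Library.

```python
def apply_threshold(raw_predictions, threshold=0.5, output_class=True):
    """Return 1/0 labels if output_class is True, else the raw predictions."""
    if not output_class:
        # The threshold is ignored when class output is not requested.
        return list(raw_predictions)
    # A prediction must exceed the threshold to be labelled 1.
    return [1 if p > threshold else 0 for p in raw_predictions]


raw = [0.2, 0.5, 0.7]
print(apply_threshold(raw))                      # [0, 0, 1]
print(apply_threshold(raw, output_class=False))  # [0.2, 0.5, 0.7]
```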

convert_to_treelite_model(self)[source]

Converts the cuML RF model to a Treelite model

Returns
tl_to_fil_modelTreelite version of this model
fit(self, X, y, convert_dtype=True)[source]

Perform Random Forest Classification on the input data

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix of type np.int32. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the fit method will automatically convert the inputs to np.float32 and, when necessary, convert y to be of dtype int32. This will increase memory used for the method.

get_detailed_text(self)[source]

Obtain the detailed information for the random forest model, as text

get_json(self)[source]

Export the Random Forest model as a JSON string

get_summary_text(self)[source]

Obtain the text summary of the random forest model

predict(self, X, predict_model='GPU', threshold=0.5, algo='auto', convert_dtype=True, fil_sparse_format='auto') CumlArray[source]

Predicts the labels for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

predict_modelString (default = ‘GPU’)

‘GPU’ to predict using the GPU, ‘CPU’ otherwise. The ‘GPU’ can only be used if the model was trained on float32 data and X is float32 or convert_dtype is set to True. Also the ‘GPU’ should only be used for classification problems.

algostring (default = 'auto')

This is optional and required only while performing the predict operation on the GPU.

  • 'naive' - simple inference using shared memory

  • 'tree_reorg' - similar to naive but trees rearranged to be more coalescing-friendly

  • 'batch_tree_reorg' - similar to tree_reorg but predicting multiple rows per thread block

  • 'auto' - choose the algorithm automatically. Currently 'batch_tree_reorg' is used for dense storage and 'naive' for sparse storage.

thresholdfloat (default = 0.5)

Threshold used for classification. Optional and required only while performing the predict operation on the GPU.

convert_dtypebool, optional (default = True)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

fil_sparse_formatboolean or string (default = 'auto')

This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.

  • 'auto' - choose the storage type automatically (currently True is chosen by auto)

  • False - create a dense forest

  • True - create a sparse forest, requires algo=’naive’ or algo=’auto’

Returns
ycuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
predict_proba(self, X, algo='auto', convert_dtype=True, fil_sparse_format='auto') CumlArray[source]

Predicts class probabilities for X. This function uses the GPU implementation of predict. Therefore, data with ‘dtype = np.float32’ should be used with this function.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

algostring (default = ‘auto’)

This is optional and required only while performing the predict operation on the GPU.

  • 'naive' - simple inference using shared memory

  • 'tree_reorg' - similar to naive but trees rearranged to be more coalescing-friendly

  • 'batch_tree_reorg' - similar to tree_reorg but predicting multiple rows per thread block

  • 'auto' - choose the algorithm automatically. Currently 'batch_tree_reorg' is used for dense storage and 'naive' for sparse storage.

convert_dtypebool, optional (default = True)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

fil_sparse_formatboolean or string (default = auto)

This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.

  • 'auto' - choose the storage type automatically (currently True is chosen by auto)

  • False - create a dense forest

  • True - create a sparse forest, requires algo=’naive’ or algo=’auto’

Returns
ycuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
score(self, X, y, threshold=0.5, algo='auto', predict_model='GPU', convert_dtype=True, fil_sparse_format='auto')[source]

Calculates the accuracy metric score of the model for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix of type np.int32. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

algostring (default = ‘auto’)

This is optional and required only while performing the predict operation on the GPU.

  • 'naive' - simple inference using shared memory

  • 'tree_reorg' - similar to naive but trees rearranged to be more coalescing-friendly

  • 'batch_tree_reorg' - similar to tree_reorg but predicting multiple rows per thread block

  • 'auto' - choose the algorithm automatically. Currently 'batch_tree_reorg' is used for dense storage and 'naive' for sparse storage.

thresholdfloat

Threshold used for classification. This is optional and required only while performing the predict operation on the GPU.

convert_dtypeboolean, default=True

Whether to convert input data to the correct dtype automatically.

predict_modelString (default = ‘GPU’)

‘GPU’ to predict using the GPU, ‘CPU’ otherwise. The ‘GPU’ can only be used if the model was trained on float32 data and X is float32 or convert_dtype is set to True. Also the ‘GPU’ should only be used for classification problems.

fil_sparse_formatboolean or string (default = auto)

This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.

  • 'auto' - choose the storage type automatically (currently True is chosen by auto)

  • False - create a dense forest

  • True - create a sparse forest, requires algo=’naive’ or algo=’auto’

Returns
accuracyfloat

Accuracy of the model [0.0 - 1.0]
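The accuracy value is the fraction of samples whose predicted label equals the true label. A minimal sketch, assuming plain Python lists of labels and a hypothetical helper name:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


print(accuracy([0, 1, 0, 1], [0, 1, 1, 1]))  # 0.75
```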

class cuml.ensemble.RandomForestRegressor(*, split_criterion=2, accuracy_metric='r2', handle=None, verbose=False, output_type=None, **kwargs)

Implements a Random Forest regressor model which fits multiple decision trees in an ensemble.

Note

Note that the underlying algorithm for tree node splits differs from that used in scikit-learn. By default, the cuML Random Forest uses a quantile-based algorithm to determine splits, rather than an exact count. You can tune the size of the quantiles with the n_bins parameter.

Note

You can export cuML Random Forest models and run predictions with them on machines without an NVIDIA GPU. See https://docs.rapids.ai/api/cuml/nightly/pickling_cuml_models.html for more details.

Parameters
n_estimatorsint (default = 100)

Number of trees in the forest. (Default changed to 100 in cuML 0.11)

split_criterionint or string (default = 2 ('mse'))

The criterion used to split nodes.

  • 0 or 'gini' for gini impurity

  • 1 or 'entropy' for information gain (entropy)

  • 2 or 'mse' for mean squared error

  • 4 or 'poisson' for poisson half deviance

  • 5 or 'gamma' for gamma half deviance

  • 6 or 'inverse_gaussian' for inverse gaussian deviance

0/'gini' and 1/'entropy' are not valid for regression.

bootstrapboolean (default = True)

Control bootstrapping.

  • If True, each tree in the forest is built on a bootstrapped sample with replacement.

  • If False, the whole dataset is used to build each tree.

max_samplesfloat (default = 1.0)

Ratio of dataset rows used while fitting each tree.

max_depthint (default = 16)

Maximum tree depth. Must be greater than 0. Unlimited depth (i.e., until leaves are pure) is not supported.

Note

This default differs from scikit-learn’s random forest, which defaults to unlimited depth.

max_leavesint (default = -1)

Maximum leaf nodes per tree. Soft constraint. Unlimited if -1.

max_featuresint, float, or string (default = ‘auto’)

Ratio of number of features (columns) to consider per node split.

  • If type int then max_features is the absolute count of features to be used.

  • If type float then max_features is used as a fraction.

  • If 'auto' then max_features=1.0.

  • If 'sqrt' then max_features=1/sqrt(n_features).

  • If 'log2' then max_features=log2(n_features)/n_features.

n_binsint (default = 128)

Maximum number of bins used by the split algorithm per feature. For large problems, particularly those with highly-skewed input data, increasing the number of bins may improve accuracy.

n_streamsint (default = 4 )

Number of parallel streams used for forest building.

min_samples_leafint or float (default = 1)

The minimum number of samples (rows) in each leaf node.

  • If type int, then min_samples_leaf represents the minimum number.

  • If float, then min_samples_leaf represents a fraction and ceil(min_samples_leaf * n_rows) is the minimum number of samples for each leaf node.

min_samples_splitint or float (default = 2)

The minimum number of samples required to split an internal node.

  • If type int, then min_samples_split represents the minimum number.

  • If type float, then min_samples_split represents a fraction and max(2, ceil(min_samples_split * n_rows)) is the minimum number of samples for each split.

min_impurity_decreasefloat (default = 0.0)

The minimum decrease in impurity required for a node to be split.

accuracy_metricstring (default = ‘r2’)

Decides the metric used to evaluate the performance of the model. In the 0.16 release, the default scoring metric was changed from mean squared error to r-squared.

  • for r-squared : 'r2'

  • for median of abs error : 'median_ae'

  • for mean of abs error : 'mean_ae'

  • for mean square error : 'mse'
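The four metric names correspond to standard regression scores. The sketch below computes them with the Python standard library using their usual textbook definitions; `regression_score` is a hypothetical helper, and it is an assumption that cuML's values agree with these definitions in every detail.

```python
from statistics import mean, median


def regression_score(y_true, y_pred, accuracy_metric="r2"):
    """Compute one of the listed metrics from two equal-length lists."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    if accuracy_metric == "mse":
        return mean([e * e for e in errors])
    if accuracy_metric == "mean_ae":
        return mean([abs(e) for e in errors])
    if accuracy_metric == "median_ae":
        return median([abs(e) for e in errors])
    if accuracy_metric == "r2":
        y_bar = mean(y_true)
        ss_res = sum(e * e for e in errors)           # residual sum of squares
        ss_tot = sum((t - y_bar) ** 2 for t in y_true)  # total sum of squares
        return 1.0 - ss_res / ss_tot
    raise ValueError(f"unknown accuracy_metric: {accuracy_metric!r}")


y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.0, 1.0, 2.0, 2.0]
print(regression_score(y_true, y_pred, "mse"))  # 0.25
```

Note that for 'r2' a higher value is better, while for the three error metrics a lower value is better.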

max_batch_sizeint (default = 4096)

Maximum number of nodes that can be processed in a given batch.

random_stateint (default = None)

Seed for the random number generator. Unseeded by default. Does not currently fully guarantee the exact same results.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Notes

Known Limitations

This is an early release of the cuML Random Forest code. It contains a few known limitations:

  • GPU-based inference is only supported with 32-bit (float32) datatypes. Alternatives are to use CPU-based inference for 64-bit (float64) datatypes, or let the default automatic datatype conversion occur during GPU inference.

For additional docs, see scikit-learn’s RandomForestRegressor.

Examples

>>> import cupy as cp
>>> from cuml.ensemble import RandomForestRegressor as curfr
>>> X = cp.asarray([[0,10],[0,20],[0,30],[0,40]], dtype=cp.float32)
>>> y = cp.asarray([0.0,1.0,2.0,3.0], dtype=cp.float32)
>>> cuml_model = curfr(max_features=1.0, n_bins=128,
...                    min_samples_leaf=1,
...                    min_samples_split=2,
...                    n_estimators=40, accuracy_metric='r2')
>>> cuml_model.fit(X,y)
RandomForestRegressor()
>>> cuml_score = cuml_model.score(X,y)
>>> print("R2 score of cuml : ", cuml_score) 
R2 score of cuml :  0.9076250195503235

Methods

convert_to_fil_model(self[, output_class, ...])

Create a Forest Inference (FIL) model from the trained cuML Random Forest model.

convert_to_treelite_model(self)

Converts the cuML RF model to a Treelite model

fit(self, X, y[, convert_dtype])

Perform Random Forest Regression on the input data

get_detailed_text(self)

Obtain the detailed information for the random forest model, as text

get_json(self)

Export the Random Forest model as a JSON string

get_summary_text(self)

Obtain the text summary of the random forest model

predict(self, X[, predict_model, algo, ...])

Predicts the labels for X.

score(self, X, y[, algo, convert_dtype, ...])

Calculates the accuracy metric score of the model for X.

convert_to_fil_model(self, output_class=False, algo='auto', fil_sparse_format='auto')[source]

Create a Forest Inference (FIL) model from the trained cuML Random Forest model.

Parameters
output_classboolean (default = False)

This is optional and required only while performing the predict operation on the GPU. If true, return a 1 or 0 depending on whether the raw prediction exceeds the threshold. If False, just return the raw prediction.

algostring (default = ‘auto’)

This is optional and required only while performing the predict operation on the GPU.

  • 'naive' - simple inference using shared memory

  • 'tree_reorg' - similar to naive but trees rearranged to be more coalescing-friendly

  • 'batch_tree_reorg' - similar to tree_reorg but predicting multiple rows per thread block

  • 'auto' - choose the algorithm automatically. Currently 'batch_tree_reorg' is used for dense storage and 'naive' for sparse storage.

fil_sparse_formatboolean or string (default = ‘auto’)

This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.

  • 'auto' - choose the storage type automatically (currently True is chosen by auto)

  • False - create a dense forest

  • True - create a sparse forest, requires algo=’naive’ or algo=’auto’

Returns
fil_model

A Forest Inference model which can be used to perform inferencing on the random forest model.

convert_to_treelite_model(self)[source]

Converts the cuML RF model to a Treelite model

Returns
tl_to_fil_modelTreelite version of this model
fit(self, X, y, convert_dtype=True)[source]

Perform Random Forest Regression on the input data

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_detailed_text(self)[source]

Obtain the detailed information for the random forest model, as text

get_json(self)[source]

Export the Random Forest model as a JSON string

get_summary_text(self)[source]

Obtain the text summary of the random forest model

predict(self, X, predict_model='GPU', algo='auto', convert_dtype=True, fil_sparse_format='auto') CumlArray[source]

Predicts the labels for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

predict_modelString (default = ‘GPU’)

‘GPU’ to predict using the GPU, ‘CPU’ otherwise. The GPU can only be used if the model was trained on float32 data and X is float32 or convert_dtype is set to True.

algostring (default = ‘auto’)

This is optional and required only while performing the predict operation on the GPU.

  • 'naive' - simple inference using shared memory

  • 'tree_reorg' - similar to naive but trees rearranged to be more coalescing-friendly

  • 'batch_tree_reorg' - similar to tree_reorg but predicting multiple rows per thread block

  • 'auto' - choose the algorithm automatically. Currently 'batch_tree_reorg' is used for dense storage and ‘naive’ for sparse storage

convert_dtypebool, optional (default = True)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

fil_sparse_formatboolean or string (default = auto)

This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.

  • 'auto' - choose the storage type automatically (currently True is chosen by auto)

  • False - create a dense forest

  • True - create a sparse forest, requires algo=’naive’ or algo=’auto’
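
The way algo and fil_sparse_format interact, as described in the parameters above, can be sketched in plain Python. The helper name resolve_fil_options and the use of ValueError are assumptions made for illustration. This is not cuML's own code, and the 'auto' choices only mirror what the parameter descriptions state.

```python
def resolve_fil_options(algo="auto", fil_sparse_format="auto"):
    """Hypothetical helper mirroring the documented rules (not cuML code)."""
    valid_algos = ("auto", "naive", "tree_reorg", "batch_tree_reorg")
    if algo not in valid_algos:
        raise ValueError(f"unknown algo: {algo!r}")

    # Documented behaviour: 'auto' storage currently resolves to True (sparse).
    sparse = True if fil_sparse_format == "auto" else bool(fil_sparse_format)

    # Documented constraint: a sparse forest needs algo='naive' or algo='auto'.
    # Raising ValueError here is a choice of this sketch.
    if sparse and algo not in ("naive", "auto"):
        raise ValueError("sparse storage requires algo='naive' or algo='auto'")

    # Documented behaviour: 'auto' picks 'naive' for sparse storage and
    # 'batch_tree_reorg' for dense storage.
    if algo == "auto":
        algo = "naive" if sparse else "batch_tree_reorg"
    return algo, sparse


print(resolve_fil_options())                         # all defaults
print(resolve_fil_options(fil_sparse_format=False))  # dense forest requested
```

Under these assumptions, a request such as algo='tree_reorg' together with fil_sparse_format=True is rejected before any inference is attempted.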

Returns
ycuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
score(self, X, y, algo='auto', convert_dtype=True, fil_sparse_format='auto', predict_model='GPU')[source]

Calculates the accuracy metric score of the model for X. In the 0.16 release, the default scoring metric was changed from mean squared error to r-squared.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

algostring (default = ‘auto’)

This is optional and required only while performing the predict operation on the GPU.

  • 'naive' - simple inference using shared memory

  • 'tree_reorg' - similar to naive but trees rearranged to be more coalescing-friendly

  • 'batch_tree_reorg' - similar to tree_reorg but predicting multiple rows per thread block

  • 'auto' - choose the algorithm automatically. Currently 'batch_tree_reorg' is used for dense storage and ‘naive’ for sparse storage

convert_dtypeboolean, default=True

whether to convert input data to correct dtype automatically

predict_modelString (default = ‘GPU’)

‘GPU’ to predict using the GPU, ‘CPU’ otherwise. The GPU can only be used if the model was trained on float32 data and X is float32 or convert_dtype is set to True.

fil_sparse_formatboolean or string (default = auto)

This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.

  • 'auto' - choose the storage type automatically (currently True is chosen by auto)

  • False - create a dense forest

  • True - create a sparse forest, requires algo=’naive’ or algo=’auto’

Returns
mean_square_errorfloat or
median_abs_errorfloat or
mean_abs_errorfloat
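
The score description above mentions a change of default metric from mean squared error to r-squared. The sketch below contrasts the two on plain Python lists, assuming the usual textbook definitions. The function names are illustrative and are not part of cuML.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared residuals (lower is better)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (higher is better)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]
constant_guess = [2.0, 2.0, 2.0]   # always predicts the mean of y_true

print(mean_squared_error(y_true, constant_guess))
print(r2_score(y_true, constant_guess))
```

A model that always predicts the mean of y scores 0 under r-squared, while a perfect model scores 1, so the two metrics point in opposite directions: a smaller MSE is better, a larger r-squared is better.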

Forest Inferencing

class cuml.ForestInference(*, handle=None, output_type=None, verbose=False)

ForestInference provides GPU-accelerated inference (prediction) for random forest and boosted decision tree models.

This module does not support training models. Rather, users should train a model in another package and save it in a treelite-compatible format. (See https://github.com/dmlc/treelite) Currently, LightGBM, XGBoost and SKLearn GBDT and random forest models are supported.

Users typically create a ForestInference object by loading a saved model file with ForestInference.load. It is also possible to create it from an SKLearn model using ForestInference.load_from_sklearn. The resulting object provides a predict method for carrying out inference.

Known limitations:
  • A single row of data should fit into the shared memory of a thread block; otherwise (starting from 5000-12288 features) FIL inference may be slower.

  • From sklearn.ensemble, only {RandomForest,GradientBoosting,ExtraTrees}{Classifier,Regressor} models are supported. Other sklearn.ensemble models are currently not supported.

  • Importing large SKLearn models can be slow, as it is done in Python.

  • LightGBM categorical features are not supported.

  • Inference uses a dense matrix format, which is efficient for many problems but can be suboptimal for sparse datasets.

  • Only classification and regression are supported.

  • Many other random forest implementations, including LightGBM and SKLearn GBDTs, use 64-bit floating point parameters, but the underlying library for ForestInference uses only 32-bit parameters. Because of the truncation that occurs when loading such models into ForestInference, you may observe a slight degradation in accuracy.
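
The last limitation above concerns the loss of precision when 64-bit parameters are stored as 32-bit values. The sketch below uses only the Python standard library to round a 64-bit float to 32 bits. The helper to_float32 is hypothetical and only illustrates the rounding effect, not how FIL converts models.

```python
import struct


def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


split_a = 0.1            # a split threshold stored in 64 bits
split_b = 0.1000000001   # a second, slightly larger threshold

# The 32-bit copy is close to, but not identical with, the 64-bit value.
print(to_float32(split_a) == split_a)

# Two thresholds that differ in 64 bits can collapse to one 32-bit value.
print(split_a == split_b)
print(to_float32(split_a) == to_float32(split_b))
```

Rows whose feature values fall between two such thresholds may be routed differently after conversion, which is one way a small accuracy difference can arise.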

Parameters
handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Notes

For additional usage examples, see the sample notebook at https://github.com/rapidsai/cuml/blob/branch-0.15/notebooks/forest_inference_demo.ipynb

Examples

In the example below, synthetic data is copied to the device (GPU) before inference. ForestInference can also accept a NumPy array directly, at the cost of a slight performance overhead.

>>> # Assume that the file 'xgb.model' contains a classifier model
>>> # that was previously saved by XGBoost's save_model function.

>>> import sklearn, sklearn.datasets, sklearn.metrics
>>> import numpy as np
>>> from numba import cuda
>>> from cuml import ForestInference

>>> model_path = 'xgb.model'
>>> X_test, y_test = sklearn.datasets.make_classification()
>>> X_gpu = cuda.to_device(
...     np.ascontiguousarray(X_test.astype(np.float32)))
>>> fm = ForestInference.load(
...     model_path, output_class=True) 
>>> fil_preds_gpu = fm.predict(X_gpu) 
>>> accuracy_score = sklearn.metrics.accuracy_score(y_test,
...     np.asarray(fil_preds_gpu)) 

Methods

common_load_params_docstring(func)

common_predict_params_docstring(func)

load(filename[, output_class, threshold, ...])

Returns a FIL instance containing the forest saved in filename. This uses Treelite to load the saved model.

load_from_sklearn(skl_model[, output_class, ...])

Creates a FIL model using the scikit-learn model passed to the function.

load_from_treelite_model(self, model[, ...])

Creates a FIL model using the treelite model

load_using_treelite_handle(self, model_handle)

Returns a FIL instance by converting a treelite model to FIL model by using the treelite ModelHandle passed.

predict(self, X[, preds, safe_dtype_conversion])

Predicts the labels for X with the loaded forest model.

predict_proba(self, X[, preds, ...])

Predicts the class probabilities for X with the loaded forest model.

common_load_params_docstring(func)[source]
common_predict_params_docstring(func)[source]
static load(filename, output_class=False, threshold=0.50, algo='auto', storage_type='auto', blocks_per_sm=0, threads_per_tree=1, n_items=0, compute_shape_str=False, model_type='xgboost', handle=None)[source]

Returns a FIL instance containing the forest saved in filename. This uses Treelite to load the saved model.

Parameters
filenamestring

Path to saved model file in a treelite-compatible format (See https://treelite.readthedocs.io/en/latest/treelite-api.html for more information)

output_class: boolean (default=False)

For a Classification model output_class must be True. For a Regression model output_class must be False.

algostring (default=’auto’)

Name of the algo (from the algo_t enum):

  • 'AUTO' or 'auto': Choose the algorithm automatically. Currently ‘BATCH_TREE_REORG’ is used for dense storage, and ‘NAIVE’ for sparse storage

  • 'NAIVE' or 'naive': Simple inference using shared memory

  • 'TREE_REORG' or 'tree_reorg': Similar to naive but trees rearranged to be more coalescing-friendly

  • 'BATCH_TREE_REORG' or 'batch_tree_reorg': Similar to TREE_REORG but predicting multiple rows per thread block

thresholdfloat (default=0.5)

Threshold used for classification. It is applied only if output_class == True; otherwise it is ignored.

storage_typestring or boolean (default=’auto’)

In-memory storage format to be used for the FIL model:

  • 'auto': Choose the storage type automatically (currently DENSE is always used)

  • False: Create a dense forest

  • True: Create a sparse forest. Requires algo=’NAIVE’ or algo=’AUTO’

blocks_per_sminteger (default=0)

(experimental) Indicates how the number of thread blocks to launch for the inference kernel is determined.

  • 0 (default): Launches a number of blocks proportional to the number of data rows

  • >= 1: Attempts to launch blocks_per_sm blocks per SM. This will fail if blocks_per_sm blocks result in more threads than the maximum supported number of threads per GPU. Even if successful, it is not guaranteed that blocks_per_sm blocks will run on an SM concurrently.

compute_shape_strboolean (default=False)

If True or equivalent, creates ForestInference.shape_str, a human-readable description of the forest shape as a multiline ASCII string.

model_typestring (default=”xgboost”)

Format of the saved treelite model to be loaded. It can be ‘xgboost’, ‘xgboost_json’, or ‘lightgbm’.

Returns
fil_model

A Forest Inference model which can be used to perform inferencing on the model read from the file.

static load_from_sklearn(skl_model, output_class=False, threshold=0.50, algo='auto', storage_type='auto', blocks_per_sm=0, threads_per_tree=1, n_items=0, compute_shape_str=False, handle=None)[source]

Creates a FIL model using the scikit-learn model passed to the function. This function requires Treelite 1.0.0+ to be installed.

Parameters
skl_model

The scikit-learn model from which to build the FIL version.

output_class: boolean (default=False)

For a Classification model output_class must be True. For a Regression model output_class must be False.

algostring (default=’auto’)

Name of the algo (from the algo_t enum):

  • 'AUTO' or 'auto': Choose the algorithm automatically. Currently ‘BATCH_TREE_REORG’ is used for dense storage, and ‘NAIVE’ for sparse storage

  • 'NAIVE' or 'naive': Simple inference using shared memory

  • 'TREE_REORG' or 'tree_reorg': Similar to naive but trees rearranged to be more coalescing-friendly

  • 'BATCH_TREE_REORG' or 'batch_tree_reorg': Similar to TREE_REORG but predicting multiple rows per thread block

thresholdfloat (default=0.5)

Threshold used for classification. It is applied only if output_class == True; otherwise it is ignored.

storage_typestring or boolean (default=’auto’)

In-memory storage format to be used for the FIL model:

  • 'auto': Choose the storage type automatically (currently DENSE is always used)

  • False: Create a dense forest

  • True: Create a sparse forest. Requires algo=’NAIVE’ or algo=’AUTO’

blocks_per_sminteger (default=0)

(experimental) Indicates how the number of thread blocks to launch for the inference kernel is determined.

  • 0 (default): Launches a number of blocks proportional to the number of data rows

  • >= 1: Attempts to launch blocks_per_sm blocks per SM. This will fail if blocks_per_sm blocks result in more threads than the maximum supported number of threads per GPU. Even if successful, it is not guaranteed that blocks_per_sm blocks will run on an SM concurrently.

compute_shape_strboolean (default=False)

If True or equivalent, creates ForestInference.shape_str, a human-readable description of the forest shape as a multiline ASCII string.

Returns
fil_model

A Forest Inference model created from the scikit-learn model passed.

load_from_treelite_model(self, model, output_class=False, algo='auto', threshold=0.5, storage_type='auto', blocks_per_sm=0, threads_per_tree=1, n_items=0, compute_shape_str=False)[source]
Creates a FIL model using the treelite model passed to the function.

Parameters
model

The trained model information in the treelite format, loaded from a saved model using the treelite API (see https://treelite.readthedocs.io/en/latest/treelite-api.html).

output_class: boolean (default=False)

For a Classification model output_class must be True. For a Regression model output_class must be False.

algostring (default=’auto’)

Name of the algo (from the algo_t enum):

  • 'AUTO' or 'auto': Choose the algorithm automatically. Currently ‘BATCH_TREE_REORG’ is used for dense storage, and ‘NAIVE’ for sparse storage

  • 'NAIVE' or 'naive': Simple inference using shared memory

  • 'TREE_REORG' or 'tree_reorg': Similar to naive but trees rearranged to be more coalescing-friendly

  • 'BATCH_TREE_REORG' or 'batch_tree_reorg': Similar to TREE_REORG but predicting multiple rows per thread block

thresholdfloat (default=0.5)

Threshold used for classification. It is applied only if output_class == True; otherwise it is ignored.

storage_typestring or boolean (default=’auto’)

In-memory storage format to be used for the FIL model:

  • 'auto': Choose the storage type automatically (currently DENSE is always used)

  • False: Create a dense forest

  • True: Create a sparse forest. Requires algo=’NAIVE’ or algo=’AUTO’

blocks_per_sminteger (default=0)

(experimental) Indicates how the number of thread blocks to launch for the inference kernel is determined.

  • 0 (default): Launches a number of blocks proportional to the number of data rows

  • >= 1: Attempts to launch blocks_per_sm blocks per SM. This will fail if blocks_per_sm blocks result in more threads than the maximum supported number of threads per GPU. Even if successful, it is not guaranteed that blocks_per_sm blocks will run on an SM concurrently.

compute_shape_strboolean (default=False)

If True or equivalent, creates ForestInference.shape_str, a human-readable description of the forest shape as a multiline ASCII string.

Returns
fil_model

A Forest Inference model which can be used to perform inferencing on the random forest or XGBoost model.

load_using_treelite_handle(self, model_handle, output_class=False, algo='auto', storage_type='auto', threshold=0.50, blocks_per_sm=0, threads_per_tree=1, n_items=0, compute_shape_str=False)[source]

Returns a FIL instance by converting a treelite model to FIL model by using the treelite ModelHandle passed.

Parameters
model_handleModelhandle to the treelite forest model

(See https://treelite.readthedocs.io/en/latest/treelite-api.html for more information)

output_class: boolean (default=False)

For a Classification model output_class must be True. For a Regression model output_class must be False.

algostring (default=’auto’)

Name of the algo (from the algo_t enum):

  • 'AUTO' or 'auto': Choose the algorithm automatically. Currently ‘BATCH_TREE_REORG’ is used for dense storage, and ‘NAIVE’ for sparse storage

  • 'NAIVE' or 'naive': Simple inference using shared memory

  • 'TREE_REORG' or 'tree_reorg': Similar to naive but trees rearranged to be more coalescing-friendly

  • 'BATCH_TREE_REORG' or 'batch_tree_reorg': Similar to TREE_REORG but predicting multiple rows per thread block

thresholdfloat (default=0.5)

Threshold used for classification. It is applied only if output_class == True; otherwise it is ignored.

storage_typestring or boolean (default=’auto’)

In-memory storage format to be used for the FIL model:

  • 'auto': Choose the storage type automatically (currently DENSE is always used)

  • False: Create a dense forest

  • True: Create a sparse forest. Requires algo=’NAIVE’ or algo=’AUTO’

blocks_per_sminteger (default=0)

(experimental) Indicates how the number of thread blocks to launch for the inference kernel is determined.

  • 0 (default): Launches a number of blocks proportional to the number of data rows

  • >= 1: Attempts to launch blocks_per_sm blocks per SM. This will fail if blocks_per_sm blocks result in more threads than the maximum supported number of threads per GPU. Even if successful, it is not guaranteed that blocks_per_sm blocks will run on an SM concurrently.

compute_shape_strboolean (default=False)

If True or equivalent, creates ForestInference.shape_str, a human-readable description of the forest shape as a multiline ASCII string.

Returns
fil_model

A Forest Inference model which can be used to perform inferencing on the random forest model.

predict(self, X, preds=None, safe_dtype_conversion=False) CumlArray[source]

Predicts the labels for X with the loaded forest model. By default, the result is the raw floating point output from the model, unless output_class was set to True during model loading.

See the documentation of ForestInference.load for details.

Parameters
predsgpuarray or cudf.Series, shape = (n_samples,)

Optional ‘out’ location to store inference results

safe_dtype_conversionbool (default = False)

FIL converts data to np.float32 when needed. Set this parameter to True to enable checking for information loss during that conversion, but note that this check can have a significant performance penalty. Parameter will be dropped in a future version.

Returns
GPU array of length n_samples with inference results
(or ‘preds’ filled with inference results if preds was specified)
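
The relationship between the raw floating point output and the class labels returned when output_class is True can be pictured with a short sketch. The function to_class_labels is hypothetical, it covers only the binary case, and the strict comparison against the threshold is an assumption of this sketch rather than a statement about FIL's behaviour at the boundary.

```python
def to_class_labels(raw_outputs, threshold=0.5):
    """Map raw binary-classifier outputs to 0/1 labels.

    Assumption: values strictly above the threshold become class 1.
    """
    return [1 if value > threshold else 0 for value in raw_outputs]


raw = [0.10, 0.45, 0.80, 0.97]   # illustrative raw model outputs

print(to_class_labels(raw))                   # default threshold of 0.5
print(to_class_labels(raw, threshold=0.9))    # stricter threshold
```

Raising the threshold trades recall for precision on the positive class; the raw outputs themselves are unchanged.
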
predict_proba(self, X, preds=None, safe_dtype_conversion=False) CumlArray[source]

Predicts the class probabilities for X with the loaded forest model. The result is the raw floating point output from the model.

Parameters
predsgpuarray or cudf.Series, shape = (n_samples,2)

Binary probability output. Optional ‘out’ location to store inference results.

safe_dtype_conversionbool (default = False)

FIL converts data to np.float32 when needed. Set this parameter to True to enable checking for information loss during that conversion, but note that this check can have a significant performance penalty. Parameter will be dropped in a future version.

Returns
GPU array of shape (n_samples,2) with inference results
(or ‘preds’ filled with inference results if preds was specified)
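
The (n_samples, 2) shape of the binary probability output can be illustrated as follows. The column order used here (negative class first, positive class second) is an assumption of this sketch, and binary_proba_rows is a hypothetical helper, not a cuML function.

```python
def binary_proba_rows(positive_probabilities):
    """Expand positive-class probabilities into two-column rows.

    Assumed layout: column 0 holds P(class 0), column 1 holds P(class 1).
    """
    return [[1.0 - p, p] for p in positive_probabilities]


rows = binary_proba_rows([0.25, 0.5, 1.0])
for row in rows:
    print(row)
```

Each row sums to 1, so either column determines the other in the binary case.
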

Coordinate Descent

class cuml.CD(*, loss='squared_loss', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, normalize=False, max_iter=1000, tol=0.001, shuffle=True, handle=None, output_type=None, verbose=False)

Coordinate Descent (CD) is a very common optimization algorithm that minimizes along coordinate directions to find the minimum of a function.

cuML’s CD algorithm accepts a NumPy matrix or a cuDF DataFrame as the input dataset. The CD algorithm currently works with linear regression and ridge, lasso, and elastic-net penalties.
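
As a rough illustration of the idea, the following pure-Python sketch runs cyclic coordinate descent on a squared loss with an elastic-net penalty. It assumes the scikit-learn style objective (1 / (2n)) * ||y - Xw - b||^2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2, handles the intercept by centering the data, and ignores the shuffle option. It is not cuML's GPU implementation, and the function names are hypothetical.

```python
def soft_threshold(value, limit):
    """Shrink value towards zero by limit (the proximal step for L1)."""
    if value > limit:
        return value - limit
    if value < -limit:
        return value + limit
    return 0.0


def coordinate_descent(X, y, alpha=0.0001, l1_ratio=0.15,
                       max_iter=1000, tol=1e-3):
    n, d = len(X), len(X[0])
    # Center X and y so the intercept can be recovered afterwards.
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    cols = [[row[j] - x_mean[j] for row in X] for j in range(d)]
    residual = [v - y_mean for v in y]   # equals y_c - X_c @ w for w = 0
    w = [0.0] * d

    for _ in range(max_iter):
        largest_update = 0.0
        for j, col in enumerate(cols):
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue  # constant feature: nothing to fit
            # Correlation of feature j with the residual that excludes it.
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            new_wj = (soft_threshold(rho, alpha * l1_ratio)
                      / (z + alpha * (1.0 - l1_ratio)))
            step = new_wj - w[j]
            if step != 0.0:
                residual = [r - c * step for c, r in zip(col, residual)]
                w[j] = new_wj
            largest_update = max(largest_update, abs(step))
        if largest_update < tol:   # stop once every update is below tol
            break

    intercept = y_mean - sum(wj * m for wj, m in zip(w, x_mean))
    return w, intercept


X = [[1.0, 1.0], [1.0, 2.0], [2.0, 2.0], [2.0, 3.0]]
y = [6.0, 8.0, 9.0, 11.0]
coef, intercept = coordinate_descent(X, y, alpha=0.0, tol=1e-10)
print(coef, intercept)
```

With alpha=0 the penalty vanishes and the sketch reduces to ordinary least squares on this small dataset, whose exact solution is y = 1 * col1 + 2 * col2 + 3.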

Parameters
loss‘squared_loss’

Only ‘squared_loss’ is supported right now. ‘squared_loss’ uses linear regression in its predict step.

alpha: float (default = 0.0001)

The constant value which decides the degree of regularization. ‘alpha = 0’ is equivalent to ordinary least squares, solved by the LinearRegression object.

l1_ratio: float (default = 0.15)

The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.

fit_interceptboolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalizeboolean (default = False)

Whether to normalize the data or not.

max_iterint (default = 1000)

The number of times the model should iterate through the entire dataset during training

tolfloat (default = 1e-3)

The tolerance for the optimization: if the updates are smaller than tol, the solver stops.

shuffleboolean (default = True)

If set to True, a random coefficient is updated at every iteration instead of looping over the features sequentially. Setting it to True often leads to significantly faster convergence, especially when tol is higher than 1e-4.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Examples

>>> import cupy as cp
>>> import cudf
>>> from cuml.solvers import CD as cumlCD

>>> cd = cumlCD(alpha=0.0)

>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([1,1,2,2], dtype=cp.float32)
>>> X['col2'] = cp.array([1,2,2,3], dtype=cp.float32)

>>> y = cudf.Series(cp.array([6.0, 8.0, 9.0, 11.0], dtype=cp.float32))

>>> cd.fit(X,y)
CD()
>>> print(cd.coef_) 
0 1.001...
1 1.998...
dtype: float32
>>> print(cd.intercept_) 
3.00...
>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = cp.array([3,2], dtype=cp.float32)
>>> X_new['col2'] = cp.array([5,5], dtype=cp.float32)

>>> preds = cd.predict(X_new)
>>> print(preds) 
0 15.997...
1 14.995...
dtype: float32
Attributes
coef_

Methods

fit(self, X, y[, convert_dtype])

Fit the model with X and y.

get_param_names(self)

predict(self, X[, convert_dtype])

Predicts the y for X.

fit(self, X, y, convert_dtype=False) 'CD'[source]

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = False)

When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_param_names(self)[source]
predict(self, X, convert_dtype=False) CumlArray[source]

Predicts the y for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = False)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

Quasi-Newton

class cuml.QN(*, loss='sigmoid', fit_intercept=True, l1_strength=0.0, l2_strength=0.0, max_iter=1000, tol=0.0001, delta=None, linesearch_max_iter=50, lbfgs_memory=5, verbose=False, handle=None, output_type=None, warm_start=False, penalty_normalized=True)

Quasi-Newton methods are used to find either zeroes or local maxima and minima of functions. This class uses them to optimize a cost function.

Two algorithms are implemented underneath cuML’s QN class, and which one is executed depends on the following rule:

  • Orthant-Wise Limited Memory Quasi-Newton (OWL-QN) if there is l1 regularization

  • Limited Memory BFGS (L-BFGS) otherwise.
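
The selection rule above can be written down in a few lines. The sketch also applies the penalty_normalized flag from the class signature, which divides both strengths by the sample size. plan_qn_solver is a hypothetical helper for illustration only and does not reflect how cuML dispatches internally.

```python
def plan_qn_solver(l1_strength, l2_strength, n_samples,
                   penalty_normalized=True):
    """Return (solver name, effective l1, effective l2) for a QN fit."""
    divisor = n_samples if penalty_normalized else 1
    l1 = l1_strength / divisor
    l2 = l2_strength / divisor
    # Any non-zero l1 term selects OWL-QN; otherwise plain L-BFGS is used.
    solver = "OWL-QN" if l1 != 0 else "L-BFGS"
    return solver, l1, l2


print(plan_qn_solver(0.0, 1.0, n_samples=100))
print(plan_qn_solver(0.5, 0.0, n_samples=100, penalty_normalized=False))
```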

cuML’s QN class can take array-like objects, either on the host (as NumPy arrays) or on the device (as Numba or other __cuda_array_interface__ compliant arrays).

Parameters
loss: ‘sigmoid’, ‘softmax’, ‘l1’, ‘l2’, ‘svc_l1’, ‘svc_l2’, ‘svr_l1’, ‘svr_l2’ (default = ‘sigmoid’).

‘sigmoid’ loss used for single class logistic regression; ‘softmax’ loss used for multiclass logistic regression; ‘l1’/’l2’ loss used for regression.

fit_intercept: boolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

l1_strength: float (default = 0.0)

l1 regularization strength (if non-zero, will run OWL-QN, else L-BFGS). Use penalty_normalized to control whether the solver divides this by the sample size.

l2_strength: float (default = 0.0)

l2 regularization strength. Use penalty_normalized to control whether the solver divides this by the sample size.

max_iter: int (default = 1000)

Maximum number of iterations taken for the solvers to converge.

tol: float (default = 1e-4)

The training process will stop if

norm(current_loss_grad) <= tol * max(current_loss, tol).

This differs slightly from the gtol-controlled stopping condition in scipy.optimize.minimize(method=’L-BFGS-B’):

norm(current_loss_projected_grad) <= gtol.

Note, sklearn.LogisticRegression() uses the sum of softmax/logistic loss over the input data, whereas cuML uses the average. As a result, Scikit-learn’s loss is usually sample_size times larger than cuML’s. To account for the differences you may divide the tol by the sample size; this would ensure that the cuML solver does not stop earlier than the Scikit-learn solver.

delta: Optional[float] (default = None)

The training process will stop if

abs(current_loss - previous_loss) <= delta * max(current_loss, tol).

When None, it’s set to tol * 0.01; when 0, the check is disabled. Given the current step k, parameter previous_loss here is the loss at the step k - p, where p is a small positive integer set internally.

Note, this parameter corresponds to ftol in scipy.optimize.minimize(method=’L-BFGS-B’), which is set by default to a minuscule 2.2e-9 and is not exposed in sklearn.LogisticRegression(). This condition is meant to protect the solver against doing vanishingly small linesearch steps or zigzagging. You may choose to set delta = 0 to make sure the cuML solver does not stop earlier than the Scikit-learn solver.

linesearch_max_iter: int (default = 50)

Max number of linesearch iterations per outer iteration of the algorithm.

lbfgs_memory: int (default = 5)

Rank of the lbfgs inverse-Hessian approximation. Method will use O(lbfgs_memory * D) memory.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

warm_startbool, default=False

When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.

penalty_normalizedbool, default=True

When set to True, l1 and l2 parameters are divided by the sample size. This flag can be used to achieve a behavior compatible with other implementations, such as sklearn’s.
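
The two stopping conditions given for tol and delta above can be combined into one check, sketched below. The Euclidean norm for the gradient and the function name should_stop are assumptions of this sketch; the documentation does not state which norm the solver uses.

```python
import math


def should_stop(grad, current_loss, previous_loss, tol=1e-4, delta=None):
    """Evaluate the documented tol and delta stopping conditions."""
    if delta is None:
        delta = tol * 0.01          # documented default when delta is None
    scale = max(current_loss, tol)

    # Condition 1: norm(current_loss_grad) <= tol * max(current_loss, tol)
    grad_norm = math.sqrt(sum(g * g for g in grad))   # assumed Euclidean norm
    gradient_small = grad_norm <= tol * scale

    # Condition 2: abs(current_loss - previous_loss) <= delta * max(...)
    # A delta of 0 disables this check, as documented.
    loss_flat = delta > 0 and abs(current_loss - previous_loss) <= delta * scale

    return gradient_small or loss_flat


print(should_stop([1.0, 0.0], current_loss=1.0, previous_loss=1.0))
print(should_stop([1.0, 0.0], current_loss=1.0, previous_loss=1.0, delta=0))
```

The second call shows why setting delta = 0 can keep the solver running longer: with the loss-change check disabled, only the gradient condition can end the optimization.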

Notes

This class contains implementations of two popular Quasi-Newton methods: Limited Memory BFGS (L-BFGS) and Orthant-Wise Limited Memory Quasi-Newton (OWL-QN).

Examples

>>> import cudf
>>> import cupy as cp

>>> # Both import methods supported
>>> # from cuml import QN
>>> from cuml.solvers import QN

>>> X = cudf.DataFrame()
>>> X['col1'] = cp.array([1,1,2,2], dtype=cp.float32)
>>> X['col2'] = cp.array([1,2,2,3], dtype=cp.float32)
>>> y = cudf.Series(cp.array([0.0, 0.0, 1.0, 1.0], dtype=cp.float32) )

>>> solver = QN()
>>> solver.fit(X,y)
QN()

>>> # Note: for now, the coefficients also include the intercept in the
>>> # last position if fit_intercept=True
>>> print(solver.coef_) 
0   37.371...
1   0.949...
dtype: float32
>>> print(solver.intercept_) 
0   -57.738...
>>> X_new = cudf.DataFrame()
>>> X_new['col1'] = cp.array([1,5], dtype=cp.float32)
>>> X_new['col2'] = cp.array([2,5], dtype=cp.float32)
>>> preds = solver.predict(X_new)
>>> print(preds)
0    0.0
1    1.0
dtype: float32
Attributes
coef_array, shape (n_classes, n_features)

QN.coef_(self)

intercept_array (n_classes, 1)

The independent term. If fit_intercept is False, will be 0.

Methods

fit(self, X, y[, sample_weight, convert_dtype])

Fit the model with X and y.

get_param_names(self)

predict(self, X[, convert_dtype])

Predicts the y for X.

score(self, X, y)

property coef_
fit(self, X, y, sample_weight=None, convert_dtype=False) 'QN'[source]

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

sample_weightarray-like (device or host) shape = (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = False)

When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_param_names(self)[source]
predict(self, X, convert_dtype=False) CumlArray[source]

Predicts the y for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = False)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

score(self, X, y)[source]

Support Vector Machines

class cuml.svm.SVC(C-Support Vector Classification)

Construct an SVC classifier for training and predictions.

Parameters
handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

Cfloat (default = 1.0)

Penalty parameter C

kernelstring (default=’rbf’)

Specifies the kernel function. Possible options: ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’. Currently precomputed kernels are not supported.

degreeint (default=3)

Degree of polynomial kernel function.

gammafloat or string (default = ‘scale’)

Coefficient for rbf, poly, and sigmoid kernels. You can specify the numeric value, or use one of the following options:

  • ‘auto’: gamma will be set to 1 / n_features

  • ‘scale’: gamma will be set to 1 / (n_features * X.var())
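The two string options above can be illustrated with a small, GPU-free sketch. It assumes X.var() means the population variance over all entries of X (as in NumPy and CuPy); resolve_gamma is a hypothetical helper for illustration, not part of cuML.

```python
from statistics import pvariance


def resolve_gamma(gamma, X):
    """Return a numeric kernel coefficient for X given as a list of rows.

    Hypothetical helper mirroring the documented 'auto' and 'scale' rules.
    """
    n_features = len(X[0])
    if gamma == "auto":
        return 1.0 / n_features
    if gamma == "scale":
        # Assumption: X.var() is the population variance of all entries.
        flat = [value for row in X for value in row]
        return 1.0 / (n_features * pvariance(flat))
    return float(gamma)  # a numeric value is used as given


X = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0], [1.0, 3.0], [2.0, 3.0]]
gamma_auto = resolve_gamma("auto", X)
gamma_scale = resolve_gamma("scale", X)
print(gamma_auto, gamma_scale)
```

With 'scale', the coefficient shrinks as the spread of the data grows, so the kernel width follows the scale of the features.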

coef0float (default = 0.0)

Independent term in the kernel function; only significant for the poly and sigmoid kernels.

tolfloat (default = 1e-3)

Tolerance for stopping criterion.

cache_sizefloat (default = 1024.0)

Size of the kernel cache during training in MiB. Increase it to improve the training time, at the cost of a higher memory footprint. After training, the kernel cache is deallocated. During prediction, we also need temporary space to store kernel matrix elements (this can be significant if n_support is large). The cache_size variable sets an upper limit on the prediction buffer as well.

class_weightdict or string (default=None)

Weights to modify the parameter C for class i to class_weight[i]*C. The string ‘balanced’ is also accepted, in which case class_weight[i] = n_samples / (n_classes * n_samples_of_class[i])
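As a rough illustration of the 'balanced' formula above, the following stdlib-only sketch computes the per-class weights and the resulting per-class penalty. The helper name balanced_class_weights and the sample labels are made up for this example.

```python
from collections import Counter


def balanced_class_weights(y):
    """class_weight[i] = n_samples / (n_classes * n_samples_of_class[i])."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


y = [-1, -1, 1, -1, 1, 1, 1, 1]  # 3 samples of class -1, 5 of class 1
weights = balanced_class_weights(y)

C = 1.0
# The penalty actually applied to class i is class_weight[i] * C.
effective_C = {label: w * C for label, w in weights.items()}
print(effective_C)
```

The rarer class receives the larger weight, so misclassifying its samples is penalized more strongly.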

max_iterint (default = -1)

Limit the number of outer iterations in the solver. If -1 (default) then max_iter=100*n_samples

multiclass_strategystr (‘ovo’ or ‘ovr’, default ‘ovo’)

Multiclass classification strategy. 'ovo' uses OneVsOneClassifier while 'ovr' selects OneVsRestClassifier

nochange_stepsint (default = 1000)

We monitor how much our stopping criterion changes during outer iterations. If it does not change (changes less than 1e-3*tol) for nochange_steps consecutive steps, then we stop training.
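The early-stopping rule described above might be sketched as follows. This is only an illustration under the assumption that the solver records one criterion value per outer iteration; should_stop and criterion_history are hypothetical names, not cuML internals.

```python
def should_stop(criterion_history, tol=1e-3, nochange_steps=1000):
    """Return True once the criterion has changed by less than 1e-3 * tol
    for nochange_steps consecutive outer iterations."""
    if len(criterion_history) <= nochange_steps:
        return False  # not enough iterations observed yet
    recent = criterion_history[-(nochange_steps + 1):]
    return all(abs(later - earlier) < 1e-3 * tol
               for earlier, later in zip(recent, recent[1:]))


stalled = should_stop([1.0, 0.5, 0.2, 0.2, 0.2, 0.2], nochange_steps=3)
still_moving = should_stop([1.0, 0.5, 0.2, 0.2], nochange_steps=3)
print(stalled, still_moving)
```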

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

probability: bool (default = False)

Enable or disable probability estimates.

random_state: int (default = None)

Seed for the random number generator (used only when probability = True). Currently this argument is not used and a warning will be printed if the user provides it.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Notes

The solver uses the SMO method to fit the classifier. We use the Optimized Hierarchical Decomposition [1] variant of the SMO algorithm, similar to [2].

For additional docs, see scikit-learn’s SVC.

References

1

J. Vanek et al. A GPU-Architecture Optimized Hierarchical Decomposition Algorithm for Support Vector Machine Training, IEEE Transactions on Parallel and Distributed Systems, vol 28, no 12, 3330, (2017)

2

Z. Wen et al. ThunderSVM: A Fast SVM Library on GPUs and CPUs, Journal of Machine Learning Research, 19, 1-5 (2018)

Examples

>>> import cupy as cp
>>> from cuml.svm import SVC
>>> X = cp.array([[1,1], [2,1], [1,2], [2,2], [1,3], [2,3]],
...              dtype=cp.float32)
>>> y = cp.array([-1, -1, 1, -1, 1, 1], dtype=cp.float32)
>>> clf = SVC(kernel='poly', degree=2, gamma='auto', C=1)
>>> clf.fit(X, y)
SVC()
>>> print("Predicted labels:", clf.predict(X))
Predicted labels: [-1. -1.  1. -1.  1.  1.]
Attributes
n_support_int

The total number of support vectors. Note: this will change in the future to represent the number of support vectors for each class (like in Sklearn, see https://github.com/rapidsai/cuml/issues/956 )

support_int, shape = (n_support)

Device array of support vector indices

support_vectors_float, shape (n_support, n_cols)

Device array of support vectors

dual_coef_float, shape = (1, n_support)

Device array of coefficients for support vectors

intercept_float

SVC.intercept_(self)

fit_status_int

0 if SVM is correctly fitted

coef_float, shape (1, n_cols)

SVMBase.coef_(self)

classes_shape (n_classes_,)

SVC.classes_(self)

n_classes_int

Number of classes

Methods

decision_function(self, X)

Calculates the decision function values for X.

fit(self, X, y[, sample_weight, convert_dtype])

Fit the model with X and y.

get_param_names(self)

predict(self, X[, convert_dtype])

Predicts the class labels for X. The returned y values are the class labels associated to sign(decision_function(X)).

predict_log_proba(self, X)

Predicts the log probabilities for X (returns log(predict_proba(X))).

predict_proba(self, X[, log])

Predicts the class probabilities for X.

property classes_
decision_function(self, X) CumlArray[source]

Calculates the decision function values for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

Returns
resultscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Decision function values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

fit(self, X, y, sample_weight=None, convert_dtype=True) 'SVC'[source]

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix of any dtype. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

sample_weightarray-like (device or host) shape = (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_param_names(self)[source]
property intercept_
predict(self, X, convert_dtype=True) CumlArray[source]

Predicts the class labels for X. The returned y values are the class labels associated to sign(decision_function(X)).

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
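The relationship between predict and decision_function stated above can be sketched for the binary case. This assumes two classes, with a positive decision value mapped to the second entry of classes_; labels_from_decision is a hypothetical helper, and the decision values below are made up.

```python
def labels_from_decision(decision_values, classes):
    """Map sign(decision value) to a class label (binary case only).

    Assumption: positive values select classes[1], all others classes[0].
    """
    negative_label, positive_label = classes
    return [positive_label if value > 0 else negative_label
            for value in decision_values]


decision_values = [-1.3, -0.2, 0.7, -0.9, 1.1, 2.4]  # illustrative values
predicted = labels_from_decision(decision_values, classes=(-1.0, 1.0))
print(predicted)
```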

predict_log_proba(self, X) CumlArray[source]

Predicts the log probabilities for X (returns log(predict_proba(X))).

The model has to be trained with probability=True to use this method.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

Returns
predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)

Log of predicted probabilities

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
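As a small illustration of the relation log(predict_proba(X)), the sketch below applies an element-wise natural logarithm to a made-up probability matrix. The helper log_proba and the input rows are hypothetical; mapping a zero probability to negative infinity is an assumption of this sketch.

```python
import math


def log_proba(proba_rows):
    """Element-wise natural log of a (n_samples, n_classes) probability table."""
    return [[math.log(p) if p > 0 else float("-inf") for p in row]
            for row in proba_rows]


proba = [[0.25, 0.75], [0.9, 0.1]]  # illustrative class probabilities
log_rows = log_proba(proba)
# Exponentiating recovers the original probabilities.
recovered = [[math.exp(v) for v in row] for row in log_rows]
print(log_rows)
```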

predict_proba(self, X, log=False) CumlArray[source]

Predicts the class probabilities for X.

The model has to be trained with probability=True to use this method.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

log: boolean (default = False)

Whether to return log probabilities.

Returns
predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)

Predicted probabilities

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

property support_
class cuml.svm.SVR(Epsilon Support Vector Regression)

Construct an SVR regressor for training and predictions.

Parameters
handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

Cfloat (default = 1.0)

Penalty parameter C

kernelstring (default=’rbf’)

Specifies the kernel function. Possible options: ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’. Currently precomputed kernels are not supported.

degreeint (default=3)

Degree of polynomial kernel function.

gammafloat or string (default = ‘scale’)

Coefficient for rbf, poly, and sigmoid kernels. You can specify the numeric value, or use one of the following options:

  • ‘auto’: gamma will be set to 1 / n_features

  • ‘scale’: gamma will be set to 1 / (n_features * X.var())

coef0float (default = 0.0)

Independent term in the kernel function; only significant for the poly and sigmoid kernels.

tolfloat (default = 1e-3)

Tolerance for stopping criterion.

epsilon: float (default = 0.1)

epsilon parameter of the epsilon-SVR model. There is no penalty associated with points that are predicted within the epsilon-tube around the target values.
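The epsilon-tube can be made concrete with a short sketch of the standard epsilon-insensitive loss. The function name and the sample values are illustrative only.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: zero inside the tube, linear outside it."""
    return [max(0.0, abs(target - pred) - epsilon)
            for target, pred in zip(y_true, y_pred)]


y_true = [1.0, 2.0, 3.0]
y_pred = [1.05, 2.5, 2.0]
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)
```

The first prediction lies within 0.1 of its target and therefore contributes no loss; the other two are penalized only for the part of the error that exceeds epsilon.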

cache_sizefloat (default = 1024.0)

Size of the kernel cache during training in MiB. Increase it to improve the training time, at the cost of a higher memory footprint. After training, the kernel cache is deallocated. During prediction, we also need temporary space to store kernel matrix elements (this can be significant if n_support is large). The cache_size variable sets an upper limit on the prediction buffer as well.

max_iterint (default = -1)

Limit the number of outer iterations in the solver. If -1 (default) then max_iter=100*n_samples

nochange_stepsint (default = 1000)

We monitor how much our stopping criterion changes during outer iterations. If it does not change (changes less than 1e-3*tol) for nochange_steps consecutive steps, then we stop training.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Notes

For additional docs, see Scikit-learn’s SVR.

The solver uses the SMO method to fit the regressor. We use the Optimized Hierarchical Decomposition [1] variant of the SMO algorithm, similar to [2].

References

1

J. Vanek et al. A GPU-Architecture Optimized Hierarchical Decomposition Algorithm for Support Vector Machine Training, IEEE Transactions on Parallel and Distributed Systems, vol 28, no 12, 3330, (2017)

2

Z. Wen et al. ThunderSVM: A Fast SVM Library on GPUs and CPUs, Journal of Machine Learning Research, 19, 1-5 (2018)

Examples

>>> import cupy as cp
>>> from cuml.svm import SVR
>>> X = cp.array([[1], [2], [3], [4], [5]], dtype=cp.float32)
>>> y = cp.array([1.1, 4, 5, 3.9, 1.], dtype = cp.float32)
>>> reg = SVR(kernel='rbf', gamma='scale', C=10, epsilon=0.1)
>>> reg.fit(X, y)
SVR()
>>> print("Predicted values:", reg.predict(X)) 
Predicted values: [1.200474 3.8999617 5.100488 3.7995374 1.0995375]
Attributes
n_support_int

The total number of support vectors. Note: this will change in the future to represent the number of support vectors for each class (like in Sklearn, see Issue #956)

support_int, shape = [n_support]

Device array of support vector indices

support_vectors_float, shape [n_support, n_cols]

Device array of support vectors

dual_coef_float, shape = [1, n_support]

Device array of coefficients for support vectors

intercept_float

SVMBase.intercept_(self)

fit_status_int

0 if SVM is correctly fitted

coef_float, shape [1, n_cols]

SVMBase.coef_(self)

Methods

fit(self, X, y[, sample_weight, convert_dtype])

Fit the model with X and y.

predict(self, X[, convert_dtype])

Predicts the values for X.

fit(self, X, y, sample_weight=None, convert_dtype=True) 'SVR'[source]

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

sample_weightarray-like (device or host) shape = (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

predict(self, X, convert_dtype=True) CumlArray[source]

Predicts the values for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

class cuml.svm.LinearSVC(Support Vector Classification with the linear kernel)[source]

Construct a linear SVM classifier for training and predictions.

Parameters
handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

penalty{‘l1’, ‘l2’} (default = ‘l2’)

The regularization term of the target function.

loss{‘squared_hinge’, ‘hinge’} (default = ‘squared_hinge’)

The loss term of the target function.

fit_interceptbool (default = True)

Whether to fit the bias term. Set to False if you expect that the data is already centered.

penalized_interceptbool (default = False)

When true, the bias term is treated the same way as other features; i.e. it’s penalized by the regularization term of the target function. Enabling this feature forces an extra copy of the input data X.

max_iterint (default = 1000)

Maximum number of iterations for the underlying solver.

linesearch_max_iterint (default = 100)

Maximum number of linesearch (inner loop) iterations for the underlying (QN) solver.

lbfgs_memoryint (default = 5)

Number of vectors approximating the Hessian for the underlying QN solver (L-BFGS).

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Cfloat (default = 1.0)
The constant scaling factor of the loss term in the target formula

F(X, y) = penalty(X) + C * loss(X, y).
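To make the role of C in the target formula concrete, here is a minimal sketch of the objective for one possible configuration. It assumes labels in {-1, +1}, an l2 penalty written as 0.5 * sum(w^2), and a squared hinge loss summed over samples; the exact scaling used inside cuML may differ, and linear_svc_objective is a hypothetical name.

```python
def linear_svc_objective(w, b, X, y, C=1.0):
    """F = penalty + C * loss for an l2 penalty and squared hinge loss.

    Assumptions: y holds -1/+1 labels, penalty = 0.5 * ||w||^2,
    loss = sum over samples of max(0, 1 - y * (w . x + b))^2.
    """
    penalty = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for row, label in zip(X, y):
        margin = label * (sum(wj * xj for wj, xj in zip(w, row)) + b)
        loss += max(0.0, 1.0 - margin) ** 2
    return penalty + C * loss


X = [[2.0, 0.0], [-0.5, 0.0]]
y = [1, -1]
w, b = [1.0, 0.0], 0.0
objective_c1 = linear_svc_objective(w, b, X, y, C=1.0)
objective_c4 = linear_svc_objective(w, b, X, y, C=4.0)
print(objective_c1, objective_c4)
```

Raising C increases the weight of the data-fit term relative to the regularization term, so margin violations cost more.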

grad_tolfloat (default = 0.0001)

The threshold on the gradient for the underlying QN solver.

change_tolfloat (default = 1e-05)

The threshold on the function change for the underlying QN solver.

tolOptional[float] (default = None)

Tolerance for the stopping criterion. This is a helper transient parameter that, when present, sets both grad_tol and change_tol to the same value. When either of the two ***_tol parameters is passed as well, it takes precedence.
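The precedence rule between tol, grad_tol and change_tol might be sketched as below. The sketch uses None to stand for "not passed by the user" and the documented defaults of 0.0001 and 1e-05; resolve_tolerances is a hypothetical helper, not the cuML implementation.

```python
def resolve_tolerances(tol=None, grad_tol=None, change_tol=None):
    """Return (grad_tol, change_tol) after applying the precedence rule.

    None means "not passed"; explicit *_tol values win over tol,
    and tol wins over the documented defaults.
    """
    if tol is not None:
        fallback_grad, fallback_change = tol, tol
    else:
        fallback_grad, fallback_change = 1e-4, 1e-5
    resolved_grad = grad_tol if grad_tol is not None else fallback_grad
    resolved_change = change_tol if change_tol is not None else fallback_change
    return resolved_grad, resolved_change


print(resolve_tolerances())                          # defaults
print(resolve_tolerances(tol=1e-3))                  # tol sets both
print(resolve_tolerances(tol=1e-3, grad_tol=1e-6))   # grad_tol wins
```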

probability: bool (default = False)

Enable or disable probability estimates.

multi_class{currently, only ‘ovr’} (default = ‘ovr’)

Multiclass classification strategy. Currently, only 'ovr' (which selects OneVsRestClassifier) is supported.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Notes

The model uses the quasi-Newton (QN) solver to find the solution in the primal space. Thus, in contrast to the generic SVC model, it does not compute the support coefficients/vectors.

For more details, check the solver’s documentation: Quasi-Newton (L-BFGS/OWL-QN).

For additional docs, see scikit-learn’s LinearSVC.

Examples

>>> import cupy as cp
>>> from cuml.svm import LinearSVC
>>> X = cp.array([[1,1], [2,1], [1,2], [2,2], [1,3], [2,3]],
...              dtype=cp.float32)
>>> y = cp.array([0, 0, 1, 0, 1, 1], dtype=cp.float32)
>>> clf = LinearSVC(loss='squared_hinge', penalty='l1', C=1)
>>> clf.fit(X, y)
LinearSVC()
>>> print("Predicted labels:", clf.predict(X))
Predicted labels: [0. 0. 1. 0. 1. 1.]
Attributes
intercept_float, shape (n_classes_,)

The constant in the decision function

coef_float, shape (n_classes_, n_cols)

The vectors defining the hyperplanes that separate the classes.

classes_float, shape (n_classes_,)

Array of class labels.

probScale_float, shape (n_classes_, 2)

Probability calibration constants (for the probabilistic output).

n_classes_int

Number of classes

Methods

get_param_names(self)

Returns a list of hyperparameter names owned by this class.

get_param_names(self)[source]

Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.

class cuml.svm.LinearSVR(Support Vector Regression with the linear kernel)[source]

Construct a linear SVM regressor for training and predictions.

Parameters
handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

penalty{‘l1’, ‘l2’} (default = ‘l2’)

The regularization term of the target function.

loss{‘squared_epsilon_insensitive’, ‘epsilon_insensitive’} (default = ‘epsilon_insensitive’)

The loss term of the target function.

fit_interceptbool (default = True)

Whether to fit the bias term. Set to False if you expect that the data is already centered.

penalized_interceptbool (default = False)

When true, the bias term is treated the same way as other features; i.e. it’s penalized by the regularization term of the target function. Enabling this feature forces an extra copy of the input data X.

max_iterint (default = 1000)

Maximum number of iterations for the underlying solver.

linesearch_max_iterint (default = 100)

Maximum number of linesearch (inner loop) iterations for the underlying (QN) solver.

lbfgs_memoryint (default = 5)

Number of vectors approximating the Hessian for the underlying QN solver (L-BFGS).

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Cfloat (default = 1.0)
The constant scaling factor of the loss term in the target formula

F(X, y) = penalty(X) + C * loss(X, y).

grad_tolfloat (default = 0.0001)

The threshold on the gradient for the underlying QN solver.

change_tolfloat (default = 1e-05)

The threshold on the function change for the underlying QN solver.

tolOptional[float] (default = None)

Tolerance for the stopping criterion. This is a helper transient parameter that, when present, sets both grad_tol and change_tol to the same value. When either of the two ***_tol parameters is passed as well, it takes precedence.

epsilonfloat (default = 0.0)

The epsilon-sensitivity parameter for the SVR loss function.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Notes

The model uses the quasi-Newton (QN) solver to find the solution in the primal space. Thus, in contrast to the generic SVR model, it does not compute the support coefficients/vectors.

For more details, check the solver’s documentation: Quasi-Newton (L-BFGS/OWL-QN).

For additional docs, see scikit-learn’s LinearSVR.

Examples

>>> import cupy as cp
>>> from cuml.svm import LinearSVR
>>> X = cp.array([[1], [2], [3], [4], [5]], dtype=cp.float32)
>>> y = cp.array([1.1, 4, 5, 3.9, 8.], dtype=cp.float32)
>>> reg = LinearSVR(loss='epsilon_insensitive', C=10,
...                 epsilon=0.1, verbose=0)
>>> reg.fit(X, y)
LinearSVR()
>>> print("Predicted values:", reg.predict(X)) 
Predicted values: [1.8993504 3.3995128 4.899675  6.399837  7.899999]
Attributes
intercept_float, shape (1,)

The constant in the decision function

coef_float, shape (1, n_cols)

The coefficients of the linear decision function.

Methods

get_param_names(self)

Returns a list of hyperparameter names owned by this class.

get_param_names(self)[source]

Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in-turn owns. This is to simplify the implementation of get_params and set_params methods.

Nearest Neighbors Classification

class cuml.neighbors.KNeighborsClassifier(*, weights='uniform', handle=None, verbose=False, output_type=None, **kwargs)

K-Nearest Neighbors Classifier is an instance-based learning technique that keeps training samples around for prediction, rather than trying to learn a generalizable set of model parameters.

Parameters
n_neighborsint (default=5)

Default number of neighbors to query

algorithmstring (default=’auto’)

The query algorithm to use. Currently, only ‘brute’ is supported.

metricstring (default=’euclidean’).

Distance metric to use.

weightsstring (default=’uniform’)

Sample weights to use. Currently, only the uniform strategy is supported.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Notes

For additional docs, see scikit-learn’s KNeighborsClassifier.

Examples

>>> from cuml.neighbors import KNeighborsClassifier
>>> from cuml.datasets import make_blobs
>>> from cuml.model_selection import train_test_split

>>> X, y = make_blobs(n_samples=100, centers=5,
...                   n_features=10, random_state=5)
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, train_size=0.80, random_state=5)

>>> knn = KNeighborsClassifier(n_neighbors=10)

>>> knn.fit(X_train, y_train)
KNeighborsClassifier()
>>> knn.predict(X_test) 
array([1., 2., 2., 3., 4., 2., 4., 4., 2., 3., 1., 4., 3., 1., 3., 4., 3.,
       4., 1., 3.], dtype=float32)
Attributes
classes_
y

Methods

fit(self, X, y[, convert_dtype])

Fit a GPU index for k-nearest neighbors classifier model.

get_param_names(self)

predict(self, X[, convert_dtype])

Use the trained k-nearest neighbors classifier to predict the labels for X.

predict_proba(self, X[, convert_dtype])

Use the trained k-nearest neighbors classifier to predict the label probabilities for X.

fit(self, X, y, convert_dtype=True) 'KNeighborsClassifier'[source]

Fit a GPU index for k-nearest neighbors classifier model.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the method will automatically convert the inputs to np.float32.

get_param_names(self)[source]
predict(self, X, convert_dtype=True) CumlArray[source]

Use the trained k-nearest neighbors classifier to predict the labels for X

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the method will automatically convert the inputs to np.float32.

Returns
X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Labels predicted

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

predict_proba(self, X, convert_dtype=True) Union[CumlArray, Tuple][source]

Use the trained k-nearest neighbors classifier to predict the label probabilities for X

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the method will automatically convert the inputs to np.float32.

Returns
X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Label probabilities

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
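One way to picture how label probabilities can arise from neighbors is the brute-force sketch below. It assumes uniform weights and takes each class probability to be the fraction of the k nearest training samples carrying that label; this is an illustrative assumption rather than a description of the GPU implementation, and knn_predict_proba is a hypothetical helper.

```python
import math
from collections import Counter


def knn_predict_proba(X_train, y_train, query, k, classes):
    """Fraction of the k nearest neighbors (Euclidean) per class label."""
    order = sorted(range(len(X_train)),
                   key=lambda i: math.dist(X_train[i], query))
    votes = Counter(y_train[i] for i in order[:k])
    return [votes[label] / k for label in classes]


X_train = [[0.0], [0.1], [0.2], [5.0]]
y_train = [0, 0, 1, 1]
classes = [0, 1]
proba = knn_predict_proba(X_train, y_train, query=[0.05], k=3, classes=classes)
predicted_label = classes[proba.index(max(proba))]
print(proba, predicted_label)
```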

Nearest Neighbors Regression

class cuml.neighbors.KNeighborsRegressor(*, weights='uniform', handle=None, verbose=False, output_type=None, **kwargs)

K-Nearest Neighbors Regressor is an instance-based learning technique that keeps training samples around for prediction, rather than trying to learn a generalizable set of model parameters.

The K-Nearest Neighbors Regressor will compute the average of the labels for the k closest neighbors and use it as the label.
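The averaging rule can be illustrated with a brute-force, CPU-only sketch. It uses the Euclidean metric and uniform weights; knn_regress and the toy data are made up for this example and do not reflect how cuML performs the search on the GPU.

```python
import math


def knn_regress(X_train, y_train, query, k):
    """Average the targets of the k training samples closest to query."""
    order = sorted(range(len(X_train)),
                   key=lambda i: math.dist(X_train[i], query))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / k


X_train = [[0.0], [1.0], [2.0], [10.0]]
y_train = [0.0, 1.0, 2.0, 10.0]
prediction = knn_regress(X_train, y_train, query=[0.9], k=2)
print(prediction)
```

Here the two closest training points to 0.9 are 1.0 and 0.0, so the prediction is the mean of their targets.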

Parameters
n_neighborsint (default=5)

Default number of neighbors to query

algorithmstring (default=’auto’)

The query algorithm to use. Valid options are:

  • 'auto': to automatically select brute-force or random ball cover based on data shape and metric

  • 'rbc': for the random ball algorithm, which partitions the data space and uses the triangle inequality to lower the number of potential distances. Currently, this algorithm supports 2d Euclidean and Haversine.

  • 'brute': for brute-force, slow but produces exact results

  • 'ivfflat': for inverted file, divide the dataset in partitions and perform search on relevant partitions only

  • 'ivfpq': for inverted file and product quantization, same as inverted list, in addition the vectors are broken into n_features/M sub-vectors that will be encoded thanks to intermediary k-means clusterings. This encoding provides partial information allowing faster distance calculations

  • 'ivfsq': for inverted file and scalar quantization, same as inverted list, in addition vector components are quantized into a reduced binary representation allowing faster distance calculations
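To give an intuition for the scalar quantization mentioned in the last option, the sketch below maps each vector component to an 8-bit integer code over a fixed value range. This is a generic uniform quantizer written for illustration; the actual encoding used by the 'ivfsq' index may differ, and quantize and dequantize are hypothetical names.

```python
def quantize(vector, lo, hi, bits=8):
    """Map each component in [lo, hi] to an integer code of `bits` bits."""
    levels = (1 << bits) - 1
    return [round((value - lo) / (hi - lo) * levels) for value in vector]


def dequantize(codes, lo, hi, bits=8):
    """Approximate reconstruction of the original components."""
    levels = (1 << bits) - 1
    return [lo + code / levels * (hi - lo) for code in codes]


vector = [0.0, 0.25, 1.0]
codes = quantize(vector, lo=0.0, hi=1.0)
approx = dequantize(codes, lo=0.0, hi=1.0)
max_error = max(abs(a - v) for a, v in zip(approx, vector))
print(codes, max_error)
```

Each component is stored in one byte instead of four, at the price of a small reconstruction error bounded by half a quantization step.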

metricstring (default=’euclidean’).

Distance metric to use.

weightsstring (default=’uniform’)

Sample weights to use. Currently, only the uniform strategy is supported.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Notes

For additional docs, see scikit-learn’s KNeighborsRegressor.

Examples

>>> from cuml.neighbors import KNeighborsRegressor
>>> from cuml.datasets import make_regression
>>> from cuml.model_selection import train_test_split

>>> X, y = make_regression(n_samples=100, n_features=10,
...                        random_state=5)
>>> X_train, X_test, y_train, y_test = train_test_split(
...   X, y, train_size=0.80, random_state=5)

>>> knn = KNeighborsRegressor(n_neighbors=10)
>>> knn.fit(X_train, y_train)
KNeighborsRegressor()
>>> knn.predict(X_test) 
array([ 14.770798  ,  51.8834    ,  66.15657   ,  46.978275  ,
    21.589611  , -14.519918  , -60.25534   , -20.856869  ,
    29.869623  , -34.83317   ,   0.45447388, 120.39675   ,
    109.94834   ,  63.57794   , -17.956171  ,  78.77663   ,
    30.412262  ,  32.575233  ,  74.72834   , 122.276855  ],
dtype=float32)
Attributes
y

Methods

fit(self, X, y[, convert_dtype])

Fit a GPU index for k-nearest neighbors regression model.

get_param_names(self)

predict(self, X[, convert_dtype])

Use the trained k-nearest neighbors regression model to predict the labels for X.

fit(self, X, y, convert_dtype=True) 'KNeighborsRegressor'[source]

Fit a GPU index for k-nearest neighbors regression model.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the method will automatically convert the inputs to np.float32.

get_param_names(self)[source]
predict(self, X, convert_dtype=True) CumlArray[source]

Use the trained k-nearest neighbors regression model to predict the labels for X

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the method will automatically convert the inputs to np.float32.

Returns
X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_features)

Predicted values

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
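As a rough illustration of what k-nearest neighbors regression computes, the sketch below predicts each query value as the plain average of the targets of its closest training rows. It is written in pure Python, assumes Euclidean distance and uniform neighbor weighting, and the helper name `knn_regress` is hypothetical; it is not how cuML implements prediction on the GPU.

```python
import math

def knn_regress(X_train, y_train, X_query, n_neighbors=3):
    """Hypothetical helper: mean target of the n_neighbors closest training rows."""
    predictions = []
    for q in X_query:
        # Euclidean distance from the query row to every training row (assumption).
        dists = [math.dist(q, row) for row in X_train]
        # Indices of the n_neighbors closest training rows.
        nearest = sorted(range(len(X_train)), key=dists.__getitem__)[:n_neighbors]
        # Uniform weighting (assumption): every neighbor counts equally.
        predictions.append(sum(y_train[i] for i in nearest) / n_neighbors)
    return predictions

X_train = [[0.0], [1.0], [2.0], [10.0]]
y_train = [0.0, 1.0, 2.0, 10.0]
preds = knn_regress(X_train, y_train, [[1.2]], n_neighbors=3)
```

For the query 1.2, the three closest training rows are 1.0, 2.0 and 0.0, so the prediction is the mean of their targets.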

Kernel Ridge Regression

class cuml.KernelRidge(*, alpha=1, kernel='linear', gamma=None, degree=3, coef0=1, kernel_params=None, output_type=None, handle=None, verbose=False)

Kernel ridge regression (KRR) performs l2 regularised ridge regression using the kernel trick. The kernel trick allows the estimator to learn a linear function in the space induced by the kernel. This may be a non-linear function in the original feature space (when a non-linear kernel is used). This estimator supports multi-output regression (when y is 2 dimensional). See the sklearn user guide for more information.

Parameters
alphafloat or array-like of shape (n_targets,), default=1.0

Regularization strength; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. If an array is passed, penalties are assumed to be specific to the targets.

kernelstr or callable, default=”linear”

Kernel mapping used internally. This parameter is directly passed to pairwise_kernel. If kernel is a string, it must be one of the metrics in cuml.metrics.PAIRWISE_KERNEL_FUNCTIONS or “precomputed”. If kernel is “precomputed”, X is assumed to be a kernel matrix. kernel may also be a callable Numba device function. If so, it is called on each pair of instances (rows) and the resulting value is recorded.

gammafloat, default=None

Gamma parameter for the RBF, laplacian, polynomial, exponential chi2 and sigmoid kernels. Interpretation of the default value is left to the kernel; see the documentation for sklearn.metrics.pairwise. Ignored by other kernels.

degreefloat, default=3

Degree of the polynomial kernel. Ignored by other kernels.

coef0float, default=1

Zero coefficient for polynomial and sigmoid kernels. Ignored by other kernels.

kernel_paramsmapping of str to any, default=None

Additional parameters (keyword arguments) for kernel function passed as callable object.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

Examples

>>> import cupy as cp
>>> from cuml.kernel_ridge import KernelRidge
>>> from numba import cuda
>>> import math

>>> n_samples, n_features = 10, 5
>>> rng = cp.random.RandomState(0)
>>> y = rng.randn(n_samples)
>>> X = rng.randn(n_samples, n_features)

>>> model = KernelRidge(kernel="poly").fit(X, y)
>>> pred = model.predict(X)

>>> @cuda.jit(device=True)
... def custom_rbf_kernel(x, y, gamma=None):
...     if gamma is None:
...         gamma = 1.0 / len(x)
...     sum = 0.0
...     for i in range(len(x)):
...         sum += (x[i] - y[i]) ** 2
...     return math.exp(-gamma * sum)

>>> model = KernelRidge(kernel=custom_rbf_kernel,
...                     kernel_params={"gamma": 2.0}).fit(X, y)
>>> pred = model.predict(X)
Attributes
dual_coef_ndarray of shape (n_samples,) or (n_samples, n_targets)

Representation of weight vector(s) in kernel space

X_fit_ndarray of shape (n_samples, n_features)

Training data, which is also required for prediction. If kernel == “precomputed” this is instead the precomputed training matrix, of shape (n_samples, n_samples).
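The roles of dual_coef_ and X_fit_ can be illustrated with the textbook closed form of kernel ridge regression: the dual coefficients solve (K + alpha * I) a = y, and a prediction is the kernel between the new row and each training row, weighted by a. The sketch below is pure Python for a single target with a linear kernel; the helper names (`linear_kernel`, `solve`, `kernel_ridge_fit`, `kernel_ridge_predict`) are hypothetical, and this is not necessarily how cuML computes the solution.

```python
def linear_kernel(a, b):
    """Dot product of two rows."""
    return sum(ai * bi for ai, bi in zip(a, b))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def kernel_ridge_fit(X, y, alpha=1.0, kernel=linear_kernel):
    """Return dual coefficients a with (K + alpha * I) a = y."""
    n = len(X)
    K_reg = [[kernel(X[i], X[j]) + (alpha if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    return solve(K_reg, y)

def kernel_ridge_predict(X_new, X_fit, dual_coef, kernel=linear_kernel):
    """Prediction needs the training rows: sum_i k(x, X_fit[i]) * a[i]."""
    return [sum(kernel(x, xf) * a for xf, a in zip(X_fit, dual_coef))
            for x in X_new]

X = [[0.0], [1.0], [2.0]]
y = [0.0, 1.0, 2.0]
dual_coef = kernel_ridge_fit(X, y, alpha=1.0)
pred = kernel_ridge_predict([[1.0]], X, dual_coef)
```

With a linear kernel this agrees with ordinary ridge regression: the primal weight is 5 / (5 + 1), so the prediction at x = 1 is 5/6.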

Methods

fit(self, X, y[, sample_weight, convert_dtype])

Parameters

get_param_names(self)

predict(self, X)

Predict using the kernel ridge model.

fit(self, X, y, sample_weight=None, convert_dtype=True) 'KernelRidge'[source]
Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

sample_weightarray-like (device or host) shape = (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_param_names(self)[source]
predict(self, X)[source]

Predict using the kernel ridge model.

Parameters
Xarray-like of shape (n_samples, n_features)

Samples. If kernel == “precomputed” this is instead a precomputed kernel matrix, shape = [n_samples, n_samples_fitted], where n_samples_fitted is the number of samples used in the fitting for this estimator.

Returns
Carray of shape (n_samples,) or (n_samples, n_targets)

Returns predicted values.

Clustering

K-Means Clustering

class cuml.KMeans(*, handle=None, n_clusters=8, max_iter=300, tol=0.0001, verbose=False, random_state=1, init='scalable-k-means++', n_init=1, oversampling_factor=2.0, max_samples_per_batch=32768, output_type=None)

KMeans is a basic but powerful clustering method which is optimized via Expectation Maximization. It randomly selects K data points in X, and computes which samples are close to these points. For every cluster of points, a mean is computed (hence the name), and this becomes the new centroid.

cuML’s KMeans expects an array-like object or cuDF DataFrame, and supports the scalable KMeans++ initialization method. This method is more stable than randomly selecting K points.
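The assign-then-average loop described above can be sketched in a few lines of pure Python. This is a minimal illustration, not cuML's GPU implementation: the initial centers are picked by hand rather than by scalable k-means++, and the stopping rule (total squared centroid shift at most tol) is an assumption made for the example. The names `sq_dist` and `kmeans_lloyd` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans_lloyd(X, centers, max_iter=300, tol=1e-4):
    labels = [0] * len(X)
    for _ in range(max_iter):
        # Assignment step: each sample goes to its closest centroid.
        labels = [min(range(len(centers)), key=lambda k: sq_dist(x, centers[k]))
                  for x in X]
        # Update step: each centroid becomes the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [x for x, lab in zip(X, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(list(old))  # keep an empty cluster's centroid
        # Assumed stopping rule: stop once the centroids barely move.
        shift = sum(sq_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, labels

X = [[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]]
centers, labels = kmeans_lloyd(X, centers=[[1.0, 1.0], [4.0, 3.0]])
```

On this four-point dataset the loop settles after two passes, with the first two points in one cluster and the last two in the other.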

Parameters
handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

n_clustersint (default = 8)

The number of centroids or clusters you want.

max_iterint (default = 300)

Maximum number of EM iterations. More iterations can give a more accurate result but take longer.

tolfloat64 (default = 1e-4)

Stopping criterion when centroid means do not change much.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

random_stateint (default = 1)

If you want results to be the same when you restart Python, select a state.

init{‘scalable-k-means++’, ‘k-means||’, ‘random’} or an ndarray (default = ‘scalable-k-means++’)
  • 'scalable-k-means++' or 'k-means||': Uses fast and stable scalable kmeans++ initialization.

  • 'random': Choose n_clusters observations (rows) at random from data for the initial centroids.

  • If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

n_init: int (default = 1)

Number of instances the k-means algorithm will be called with different seeds. The final results will be from the instance that produces lowest inertia out of n_init instances.

oversampling_factorfloat64 (default = 2.0)

The number of points to sample in scalable k-means++ initialization for potential centroids. Increasing this value can lead to better initial centroids at the cost of memory. The total number of centroids sampled in scalable k-means++ is oversampling_factor * n_clusters * 8.

max_samples_per_batchint (default = 32768)

The number of data samples to use for batches of the pairwise distance computation. This computation is done throughout both fit and predict. The default should suit most cases. The total number of elements in the batched pairwise distance computation is max_samples_per_batch * n_clusters. It might become necessary to lower this number when n_clusters becomes prohibitively large.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
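The two sizing formulas given for oversampling_factor and max_samples_per_batch are easy to evaluate by hand. The sketch below plugs in the default values; the conversion to MiB assumes 4 bytes per element (float32), which is an assumption of this example rather than something stated in the parameter descriptions, and `batch_mib` is a hypothetical helper.

```python
n_clusters = 8
oversampling_factor = 2.0
max_samples_per_batch = 32768

# Centroids sampled during scalable k-means++ initialization:
# oversampling_factor * n_clusters * 8
candidate_centroids = oversampling_factor * n_clusters * 8

def batch_mib(n_clusters, max_samples_per_batch=32768, bytes_per_element=4):
    """Approximate size of one distance batch, assuming 4-byte elements."""
    elements = max_samples_per_batch * n_clusters
    return elements * bytes_per_element / 2**20

default_mib = batch_mib(n_clusters)   # batch size with the default 8 clusters
large_mib = batch_mib(10_000)         # batch size with 10,000 clusters
```

Under these assumptions a batch grows linearly with n_clusters, which is why a smaller max_samples_per_batch may be needed for very large cluster counts.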

Notes

KMeans requires n_clusters to be specified. This means one needs to approximately guess or know how many clusters a dataset has. If one is not sure, one can start with a small number of clusters, and visualize the resulting clusters with PCA, UMAP or T-SNE, and verify that they look appropriate.

Applications of KMeans

The biggest advantage of KMeans is its speed and simplicity. That is why KMeans is many practitioners’ first choice of a clustering algorithm. KMeans has been extensively used when the number of clusters is approximately known, such as in big data clustering tasks, image segmentation and medical clustering.

For additional docs, see scikit-learn’s KMeans.

Examples

>>> # Both import methods supported
>>> from cuml import KMeans
>>> from cuml.cluster import KMeans
>>> import cudf
>>> import numpy as np
>>> import pandas as pd
>>>
>>> a = np.asarray([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]],
...                dtype=np.float32)
>>> b = cudf.DataFrame(a)
>>> # Input:
>>> b
    0    1
0  1.0  1.0
1  1.0  2.0
2  3.0  2.0
3  4.0  3.0
>>>
>>> # Calling fit
>>> kmeans_float = KMeans(n_clusters=2)
>>> kmeans_float.fit(b)
KMeans()
>>>
>>> # Labels:
>>> kmeans_float.labels_
0    0
1    0
2    1
3    1
dtype: int32
>>> # cluster_centers:
>>> kmeans_float.cluster_centers_
    0    1
0  1.0  1.5
1  3.5  2.5
Attributes
cluster_centers_array

The coordinates of the final clusters. This represents the “mean” of each data cluster.

labels_array

Which cluster each datapoint belongs to.

Methods

fit(self, X[, sample_weight])

Compute k-means clustering with X.

fit_predict(self, X[, sample_weight])

Compute cluster centers and predict cluster index for each sample.

fit_transform(self, X[, convert_dtype, ...])

Compute clustering and transform X to cluster-distance space.

get_param_names(self)

predict(self, X[, convert_dtype, ...])

Predict the closest cluster each sample in X belongs to.

score(self, X[, y, sample_weight, convert_dtype])

Opposite of the value of X on the K-means objective.

transform(self, X[, convert_dtype])

Transform X to a cluster-distance space.

fit(self, X, sample_weight=None) 'KMeans'[source]

Compute k-means clustering with X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

sample_weightarray-like (device or host) shape = (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

fit_predict(self, X, sample_weight=None) CumlArray[source]

Compute cluster centers and predict cluster index for each sample.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

sample_weightarray-like (device or host) shape = (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

Returns
predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Cluster indexes

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

fit_transform(self, X, convert_dtype=False, sample_weight=None) CumlArray[source]

Compute clustering and transform X to cluster-distance space.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = False)

When set to True, the fit_transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

sample_weightarray-like (device or host) shape = (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

Returns
X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_clusters)

Transformed data

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

get_param_names(self)[source]
predict(self, X, convert_dtype=False, sample_weight=None, normalize_weights=True) CumlArray[source]

Predict the closest cluster each sample in X belongs to.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = False)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

sample_weightarray-like (device or host) shape = (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

Returns
predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

Cluster indexes

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

score(self, X, y=None, sample_weight=None, convert_dtype=True)[source]

Opposite of the value of X on the K-means objective.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

sample_weightarray-like (device or host) shape = (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the score method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
scorefloat

Opposite of the value of X on the K-means objective.

transform(self, X, convert_dtype=False) CumlArray[source]

Transform X to a cluster-distance space.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = False)

When set to True, the transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
X_newcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_clusters)

Transformed data

For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
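The relationship between the KMeans transform, predict and score methods can be sketched in pure Python. This illustration assumes Euclidean distance for the cluster-distance space and takes the K-means objective to be the sum of squared distances to the nearest centroid; the function names mirror the methods but are standalone helpers, not cuML calls.

```python
import math

def transform(X, centers):
    """Distance from every sample to every centroid: shape (n_samples, n_clusters)."""
    return [[math.dist(x, c) for c in centers] for x in X]

def predict(X, centers):
    """Index of the closest centroid for each sample."""
    return [row.index(min(row)) for row in transform(X, centers)]

def score(X, centers):
    """Negative of the objective: minus the summed squared nearest distances."""
    return -sum(min(row) ** 2 for row in transform(X, centers))

X = [[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]]
centers = [[1.0, 1.5], [3.5, 2.5]]

distances = transform(X, centers)
labels = predict(X, centers)
objective = score(X, centers)
```

A score closer to zero means the samples sit nearer to their assigned centroids.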

DBSCAN

class cuml.DBSCAN(*, eps=0.5, handle=None, min_samples=5, metric='euclidean', verbose=False, max_mbytes_per_batch=None, output_type=None, calc_core_sample_indices=True)

DBSCAN is a very powerful yet fast clustering technique that finds clusters where data is concentrated. This allows DBSCAN to generalize to many problems if the datapoints tend to congregate in larger groups.

cuML’s DBSCAN expects an array-like object or cuDF DataFrame, and constructs an adjacency graph to compute the distances between close neighbours.

Parameters
epsfloat (default = 0.5)

The maximum distance between 2 points such that they reside in the same neighborhood.

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

min_samplesint (default = 5)

The number of samples in a neighborhood such that this group can be considered as an important core point (including the point itself).

metric: {‘euclidean’, ‘precomputed’}, default = ‘euclidean’

The metric to use when calculating distances between points. If metric is ‘precomputed’, X is assumed to be a distance matrix and must be square.

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

max_mbytes_per_batch(optional) int64

Calculate batch size using no more than this number of megabytes for the pairwise distance computation. This enables the trade-off between runtime and memory usage for making the N^2 pairwise distance computations more tractable for large numbers of samples. If you are experiencing out of memory errors when running DBSCAN, you can set this value based on the memory size of your device. Note: this option does not cap the total memory used by the DBSCAN computation, so it cannot be set to the total memory available on the device.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

calc_core_sample_indices(optional) boolean (default = True)

Indicates whether the indices of the core samples should be calculated. If the attribute core_sample_indices_ will not be used, setting this to False will avoid unnecessary kernel launches.
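To get a feel for why max_mbytes_per_batch matters, the following back-of-the-envelope sketch estimates the size of a full N^2 pairwise result and how many rows would fit in a given budget. Both helpers are hypothetical: the 4 bytes per entry and the 1 MB = 1,000,000 bytes convention are assumptions of this example, and the formula is not taken from cuML's internal batch-size calculation.

```python
def full_pairwise_mbytes(n_samples, bytes_per_entry=4):
    """Memory for all N^2 pairwise entries, assuming 4-byte entries."""
    return n_samples * n_samples * bytes_per_entry / 1e6

def rows_per_batch(n_samples, max_mbytes_per_batch, bytes_per_entry=4):
    """Hypothetical estimate: rows whose pairwise entries fit in the budget."""
    budget_bytes = max_mbytes_per_batch * 1_000_000
    rows = budget_bytes // (n_samples * bytes_per_entry)
    # At least one row per batch, and never more rows than samples.
    return max(1, min(n_samples, rows))

full_mb = full_pairwise_mbytes(100_000)
batch_rows = rows_per_batch(100_000, max_mbytes_per_batch=1000)
```

Under these assumptions, 100,000 samples would need tens of gigabytes for an unbatched computation, while a 1000 MB budget processes a few thousand rows at a time.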

Notes

DBSCAN is very sensitive to the distance metric it is used with, and a large assumption is that datapoints need to be concentrated in groups for clusters to be constructed.

Applications of DBSCAN

DBSCAN’s main benefit is that the number of clusters is not a hyperparameter, and that it can find non-linearly shaped clusters. This also allows DBSCAN to be robust to noise. DBSCAN has been applied to analyzing particle collisions in the Large Hadron Collider, customer segmentation in marketing analyses, and much more.

For additional docs, see scikit-learn’s DBSCAN.

Examples

>>> # Both import methods supported
>>> from cuml import DBSCAN
>>> from cuml.cluster import DBSCAN
>>>
>>> import cudf
>>> import numpy as np
>>>
>>> gdf_float = cudf.DataFrame()
>>> gdf_float['0'] = np.asarray([1.0,2.0,5.0], dtype = np.float32)
>>> gdf_float['1'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
>>> gdf_float['2'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
>>>
>>> dbscan_float = DBSCAN(eps = 1.0, min_samples = 1)
>>> dbscan_float.fit(gdf_float)
DBSCAN()
>>> dbscan_float.labels_
0    0
1    1
2    2
dtype: int32
Attributes
labels_array-like or cuDF series

Which cluster each datapoint belongs to. Noisy samples are labeled as -1. Format depends on cuml global output type and estimator output_type.

core_sample_indices_array-like or cuDF series

The indices of the core samples. Only calculated if calc_core_sample_indices==True
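How eps and min_samples give rise to labels_ and core_sample_indices_ can be sketched with a small pure-Python DBSCAN. This is a brute-force illustration using Euclidean distance, with the hypothetical function name `dbscan`; it does not reflect cuML's batched GPU implementation, and cluster numbering may differ between implementations.

```python
import math
from collections import deque

def dbscan(X, eps=0.5, min_samples=5):
    n = len(X)
    # Neighborhood of each point: all points within eps, including itself.
    neighbors = [[j for j in range(n) if math.dist(X[i], X[j]) <= eps]
                 for i in range(n)]
    # A core point has at least min_samples points in its neighborhood.
    core = [len(nb) >= min_samples for nb in neighbors]
    labels = [-1] * n  # -1 marks noise until a cluster claims the point
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow a new cluster outward from this unvisited core point.
        labels[i] = cluster
        queue = deque([i])
        while queue:
            p = queue.popleft()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if core[q]:  # only core points keep expanding the cluster
                        queue.append(q)
        cluster += 1
    core_sample_indices = [i for i in range(n) if core[i]]
    return labels, core_sample_indices

# Same three points as the example above: every point is its own cluster.
labels_a, core_a = dbscan([[1.0, 4.0, 4.0], [2.0, 2.0, 2.0], [5.0, 1.0, 1.0]],
                          eps=1.0, min_samples=1)

# One dense group plus an isolated point, which stays labeled as noise.
labels_b, core_b = dbscan([[0.0], [0.5], [1.0], [10.0]], eps=0.6, min_samples=2)
```

In the second call the isolated point at 10.0 has only itself in its neighborhood, so it is neither a core sample nor reachable from one and keeps the label -1.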

Methods

fit(self, X[, out_dtype])

Perform DBSCAN clustering from features.

fit_predict(self, X[, out_dtype])

Performs clustering on X and returns cluster labels.

get_param_names(self)

fit(self, X, out_dtype='int32') 'DBSCAN'[source]