cuML API Reference

Module Configuration

Output Data Type Configuration

memory_utils.set_global_output_type(output_type)

Method to set cuML’s single-GPU estimators’ global output type. It will be used by all estimators unless overridden in their initialization with their own output_type parameter. It can also be overridden by the context manager method using_output_type.

Parameters
output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’} (default = ‘input’)

Desired output type of results and attributes of the estimators.

‘input’ will mean that the parameters and methods will mirror the format of the data sent to the estimators/methods as much as possible. Specifically:

Input type -> Output type

cuDF DataFrame or Series -> cuDF DataFrame or Series

NumPy arrays -> NumPy arrays

Pandas DataFrame or Series -> NumPy arrays

Numba device arrays -> Numba device arrays

CuPy arrays -> CuPy arrays

Other __cuda_array_interface__ objects -> CuPy arrays

‘cudf’ will return cuDF Series for single dimensional results and DataFrames for the rest.

‘cupy’ will return CuPy arrays.

‘numpy’ will return NumPy arrays.

Notes

‘cupy’ and ‘numba’ options (as well as ‘input’ when using Numba and CuPy ndarrays for input) have the least overhead. ‘cudf’ adds the memory consumption and processing time needed to build the Series and DataFrames. ‘numpy’ has the biggest overhead due to the need to transfer data to CPU memory.

Examples

import cuml
import cupy as cp

ary = [[1.0, 4.0, 4.0], [2.0, 2.0, 2.0], [5.0, 1.0, 1.0]]
ary = cp.asarray(ary)

cuml.set_global_output_type('cudf')
dbscan_float = cuml.DBSCAN(eps=1.0, min_samples=1)
dbscan_float.fit(ary)

print("cuML output type")
print(dbscan_float.labels_)
print(type(dbscan_float.labels_))

Output:

cuML output type
0    0
1    1
2    2
dtype: int32
<class 'cudf.core.series.Series'>

memory_utils.using_output_type(output_type)

Context manager method to set cuML’s global output type inside a with statement. The output type is reset to its prior value once the with block exits.

Parameters
output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’} (default = ‘input’)

Desired output type of results and attributes of the estimators.

‘input’ will mean that the parameters and methods will mirror the format of the data sent to the estimators/methods as much as possible. Specifically:

Input type -> Output type

cuDF DataFrame or Series -> cuDF DataFrame or Series

NumPy arrays -> NumPy arrays

Pandas DataFrame or Series -> NumPy arrays

Numba device arrays -> Numba device arrays

CuPy arrays -> CuPy arrays

Other __cuda_array_interface__ objects -> CuPy arrays

‘cudf’ will return cuDF Series for single dimensional results and DataFrames for the rest.

‘cupy’ will return CuPy arrays.

‘numpy’ will return NumPy arrays.

Examples

import cuml
import cupy as cp

ary = [[1.0, 4.0, 4.0], [2.0, 2.0, 2.0], [5.0, 1.0, 1.0]]
ary = cp.asarray(ary)

with cuml.using_output_type('cudf'):
    dbscan_float = cuml.DBSCAN(eps=1.0, min_samples=1)
    dbscan_float.fit(ary)

    print("cuML output inside `with` context")
    print(dbscan_float.labels_)
    print(type(dbscan_float.labels_))

# use cuml again outside the context manager
dbscan_float2 = cuml.DBSCAN(eps=1.0, min_samples=1)
dbscan_float2.fit(ary)

print("cuML default output")
print(dbscan_float2.labels_)
print(type(dbscan_float2.labels_))

Output:

cuML output inside `with` context
0    0
1    1
2    2
dtype: int32
<class 'cudf.core.series.Series'>

cuML default output
[0 1 2]
<class 'cupy.core.core.ndarray'>

Verbosity Levels

cuML follows a verbosity model similar to Scikit-learn’s: The verbose parameter can be a boolean, or a numeric value, and higher numeric values mean more verbosity. The exact values can be set directly, or through the cuml.common.logger module, and they are:

Numeric value | cuml.common.logger value | Verbosity level
0 | cuml.common.logger.level_off | Disables all log messages
1 | cuml.common.logger.level_critical | Enables only critical messages
2 | cuml.common.logger.level_error | Enables all messages up to and including errors
3 | cuml.common.logger.level_warn | Enables all messages up to and including warnings
4 or False | cuml.common.logger.level_info | Enables all messages up to and including information messages
5 or True | cuml.common.logger.level_debug | Enables all messages up to and including debug messages
6 | cuml.common.logger.level_trace | Enables all messages up to and including trace messages
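A minimal sketch of both ways to set verbosity (assuming the logger module exposes a set_level function, as recent cuML releases do):

import cuml
from cuml.common import logger

# Per-estimator: numeric levels follow the table above (3 == level_warn)
dbscan = cuml.DBSCAN(eps=1.0, min_samples=1, verbose=3)

# Process-wide: set the logger level directly
logger.set_level(logger.level_warn)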

Preprocessing, Metrics, and Utilities

Model Selection and Data Splitting

model_selection.train_test_split(X, y, test_size: Union[float, int] = None, train_size: Union[float, int] = None, shuffle: bool = True, random_state: Union[int, cupy.random.generator.RandomState, numpy.random.mtrand.RandomState] = None, seed: Union[int, cupy.random.generator.RandomState, numpy.random.mtrand.RandomState] = None)

Partitions device data into four collated objects, mimicking Scikit-learn’s train_test_split

Parameters
Xcudf.DataFrame or cuda_array_interface compliant device array

Data to split, has shape (n_samples, n_features)

ystr, cudf.Series or cuda_array_interface compliant device array

Set of labels for the data, either a series of shape (n_samples) or the string label of a column in X (if it is a cuDF DataFrame) containing the labels

test_sizefloat or int, optional

If float, represents the proportion [0, 1] of the data to be assigned to the test set. If an int, represents the number of instances to be assigned to the test set. Defaults to 0.2

train_sizefloat or int, optional

If float, represents the proportion [0, 1] of the data to be assigned to the training set. If an int, represents the number of instances to be assigned to the training set. Defaults to 0.8

shufflebool, optional

Whether or not to shuffle inputs before splitting

random_stateint, CuPy RandomState or NumPy RandomState, optional

If shuffle is True, seeds the generator. Unseeded by default

seedint, CuPy RandomState or NumPy RandomState, optional

Deprecated in favor of random_state. If shuffle is True, seeds the generator. Unseeded by default

Returns
X_train, X_test, y_train, y_testcudf.DataFrame or array-like objects

Partitioned dataframes if X and y were cuDF objects; if y was provided as a column name, that column is dropped from the X partitions. Partitioned Numba device arrays if X and y were Numba device arrays. Partitioned CuPy arrays for any other input.

Examples

import cudf
from cuml.preprocessing.model_selection import train_test_split

# Generate some sample data
df = cudf.DataFrame({'x': range(10),
                     'y': [0, 1] * 5})
print(f'Original data: {df.shape[0]} elements')

# Suppose we want an 80/20 split
X_train, X_test, y_train, y_test = train_test_split(df, 'y',
                                                    train_size=0.8)
print(f'X_train: {X_train.shape[0]} elements')
print(f'X_test: {X_test.shape[0]} elements')
print(f'y_train: {y_train.shape[0]} elements')
print(f'y_test: {y_test.shape[0]} elements')

# Alternatively, if our labels are stored separately
labels = df['y']
df = df.drop(['y'])

# we can also do
X_train, X_test, y_train, y_test = train_test_split(df, labels,
                                                    train_size=0.8)

Output:

Original data: 10 elements
X_train: 8 elements
X_test: 2 elements
y_train: 8 elements
y_test: 2 elements

Feature and Label Encoding (Single-GPU)

class cuml.preprocessing.LabelEncoder(handle_unknown='error')

An nvcategory based implementation of ordinal label encoding

Parameters
handle_unknown{‘error’, ‘ignore’}, default=’error’

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform or inverse transform, the resulting encoding will be null.

Examples

Converting a categorical implementation to a numerical one

from cudf import DataFrame, Series

data = DataFrame({'category': ['a', 'b', 'c', 'd']})

# There are two functionally equivalent ways to do this
le = LabelEncoder()
le.fit(data.category)  # le = le.fit(data.category) also works
encoded = le.transform(data.category)

print(encoded)

# This method is preferred
le = LabelEncoder()
encoded = le.fit_transform(data.category)

print(encoded)

# We can assign this to a new column
data = data.assign(encoded=encoded)
print(data.head())

# We can also encode more data
test_data = Series(['c', 'a'])
encoded = le.transform(test_data)
print(encoded)

# After fitting, ordinal labels can be inverse_transform()ed back
# to string labels
ord_label = Series([0, 0, 1, 2, 1])
str_label = le.inverse_transform(ord_label)
print(str_label)

Output:

0    0
1    1
2    2
3    3
dtype: int64

0    0
1    1
2    2
3    3
dtype: int32

category  encoded
0         a        0
1         b        1
2         c        2
3         d        3

0    2
1    0
dtype: int64

0    a
1    a
2    b
3    c
4    b
dtype: object

Methods

fit

fit_transform

inverse_transform

transform

fit(self, y)

Fit a LabelEncoder (nvcategory) instance to a set of categories

Parameters
ycudf.Series

Series containing the categories to be encoded. Its elements may or may not be unique

Returns
selfLabelEncoder

A fitted instance of itself to allow method chaining

fit_transform(self, y: cudf.core.series.Series) → cudf.core.series.Series

Simultaneously fit and transform an input

This is functionally equivalent to (but faster than) LabelEncoder().fit(y).transform(y)

inverse_transform(self, y: cudf.core.series.Series) → cudf.core.series.Series

Revert ordinal label to original label

Parameters
ycudf.Series, dtype=int32

Ordinal labels to be reverted

Returns
revertedcudf.Series

Reverted labels

transform(self, y: cudf.core.series.Series) → cudf.core.series.Series

Transform an input into its categorical keys.

This is intended for use with small inputs relative to the size of the dataset. For fitting and transforming an entire dataset, prefer fit_transform.

Parameters
ycudf.Series

Input keys to be transformed. Its values should match the categories given to fit

Returns
encodedcudf.Series

The ordinally encoded input series

Raises
KeyError

if a category appears that was not seen in fit

class cuml.preprocessing.LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)

A multi-class dummy encoder for labels.

Examples

Create an array with labels and dummy encode them

import cupy as cp
from cuml.preprocessing import LabelBinarizer

labels = cp.asarray([0, 5, 10, 7, 2, 4, 1, 0, 0, 4, 3, 2, 1],
                    dtype=cp.int32)

lb = LabelBinarizer()

encoded = lb.fit_transform(labels)

print(str(encoded))

decoded = lb.inverse_transform(encoded)

print(str(decoded))

Output:

[[1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 1 0]
 [0 0 1 0 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 1 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 0 0 1 0 0 0 0]
 [0 0 1 0 0 0 0 0]
 [0 1 0 0 0 0 0 0]]

 [ 0  5 10  7  2  4  1  0  0  4  3  2  1]

Methods

fit(self, y)

Fit label binarizer

fit_transform(self, y)

Fit label binarizer and transform multi-class labels to their dummy-encoded representation.

inverse_transform(self, y[, threshold])

Transform binary labels back to original multi-class labels

transform(self, y)

Transform multi-class labels to their dummy-encoded representation labels.

fit(self, y)

Fit label binarizer

Parameters
yarray of shape [n_samples,] or [n_samples, n_classes]

Target values. A 2-d matrix should contain only 0 and 1 and represents multilabel classification.

Returns
selfreturns an instance of self.
fit_transform(self, y)

Fit label binarizer and transform multi-class labels to their dummy-encoded representation.

Parameters
yarray of shape [n_samples,] or [n_samples, n_classes]
Returns
arrarray with encoded labels
inverse_transform(self, y, threshold=None)

Transform binary labels back to original multi-class labels

Parameters
yarray of shape [n_samples, n_classes]
thresholdfloat this value is currently ignored
Returns
arrarray with original labels
transform(self, y)

Transform multi-class labels to their dummy-encoded representation labels.

Parameters
yarray of shape [n_samples,] or [n_samples, n_classes]
Returns
arrarray with encoded labels
preprocessing.label_binarize(y, classes, neg_label=0, pos_label=1, sparse_output=False)

A stateless helper function to dummy encode multi-class labels.

Parameters
yarray-like of size [n_samples,] or [n_samples, n_classes]
classesthe set of unique classes in the input
neg_labelinteger the negative value for transformed output
pos_labelinteger the positive value for transformed output
sparse_outputbool whether to return sparse array
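
Examples

A minimal usage sketch (assuming label_binarize is importable from cuml.preprocessing, matching the path above):

import cupy as cp
from cuml.preprocessing import label_binarize

y = cp.asarray([1, 0, 2, 1], dtype=cp.int32)
classes = cp.asarray([0, 1, 2], dtype=cp.int32)

# One dummy-encoded column per class; row i encodes y[i]
encoded = label_binarize(y, classes)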
class cuml.preprocessing.OneHotEncoder(categories='auto', drop=None, sparse=True, dtype=<class 'float'>, handle_unknown='error')

Encode categorical features as a one-hot numeric array. The input to this estimator should be a cuDF.DataFrame or a cupy.ndarray, denoting the unique values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter). By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually. Note: a one-hot encoding of y labels should use a LabelBinarizer instead.

Parameters
categories‘auto’, a cupy.ndarray or a cudf.DataFrame, default=’auto’

Categories (unique values) per feature:

  • ‘auto’ : Determine categories automatically from the training data.

  • DataFrame/ndarray : categories[col] holds the categories expected in the feature col.

drop‘first’, None, a dict or a list, default=None

Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.

  • None : retain all features (the default).

  • ‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.

  • dict/list : drop[col] is the category in feature col that should be dropped.

sparsebool, default=False

This feature was deactivated and will raise an exception when True, because sparse matrices are not yet fully supported by CuPy, which can cause incorrect values when computing one-hot encodings. See https://github.com/cupy/cupy/issues/3223

dtypenumber type, default=np.float

Desired datatype of transform’s output.

handle_unknown{‘error’, ‘ignore’}, default=’error’

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

Attributes
drop_idx_array of shape (n_features,)

drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature. None if all the transformed features will be retained.
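
Examples

A minimal sketch encoding a two-column cuDF DataFrame (dense output; the column order follows categories_):

import cudf
from cuml.preprocessing import OneHotEncoder

X = cudf.DataFrame({'color': ['red', 'blue', 'red'],
                    'size': ['S', 'L', 'S']})

enc = OneHotEncoder(sparse=False, handle_unknown='ignore')

# One binary column per category of each feature -> shape (3, 4)
encoded = enc.fit_transform(X)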

Methods

fit(self, X)

Fit OneHotEncoder to X.

fit_transform(self, X)

Fit OneHotEncoder to X, then transform X.

inverse_transform(self, X)

Convert the data back to the original representation.

transform(self, X)

Transform X using one-hot encoding.

property categories_

Returns categories used for the one hot encoding in the correct order.

fit(self, X)

Fit OneHotEncoder to X.

Parameters
XcuDF.DataFrame or cupy.ndarray, shape = (n_samples, n_features)

The data to determine the categories of each feature.

Returns
self
fit_transform(self, X)

Fit OneHotEncoder to X, then transform X. Equivalent to fit(X).transform(X).

Parameters
Xcudf.DataFrame or cupy.ndarray, shape = (n_samples, n_features)

The data to encode.

Returns
X_outsparse matrix if sparse=True else a 2-d array

Transformed input.

inverse_transform(self, X)

Convert the data back to the original representation. In case unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category.

The return type is the same as the type of the input used by the first call to fit on this estimator instance.

Parameters
Xarray-like or sparse matrix, shape [n_samples, n_encoded_features]

The transformed data.

Returns
X_trcudf.DataFrame or cupy.ndarray

Inverse transformed array.

transform(self, X)

Transform X using one-hot encoding.

Parameters
Xcudf.DataFrame or cupy.ndarray

The data to encode.

Returns
X_outsparse matrix if sparse=True else a 2-d array

Transformed input.

Feature and Label Encoding (Dask-based Multi-GPU)

class cuml.dask.preprocessing.LabelBinarizer(client=None, **kwargs)

A distributed version of LabelBinarizer for one-hot encoding a collection of labels.

Examples

Create an array with labels and dummy encode them

import cupy as cp
from cuml.dask.preprocessing import LabelBinarizer

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask.array

cluster = LocalCUDACluster()
client = Client(cluster)

labels = cp.asarray([0, 5, 10, 7, 2, 4, 1, 0, 0, 4, 3, 2, 1],
                    dtype=cp.int32)
labels = dask.array.from_array(labels)

lb = LabelBinarizer()

encoded = lb.fit_transform(labels)

print(str(encoded.compute()))

decoded = lb.inverse_transform(encoded)

print(str(decoded.compute()))

Output:

[[1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 1 0]
 [0 0 1 0 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 1 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 0 0 1 0 0 0 0]
 [0 0 1 0 0 0 0 0]
 [0 1 0 0 0 0 0 0]]

 [ 0  5 10  7  2  4  1  0  0  4  3  2  1]

Methods

fit(self, y)

Fit label binarizer

fit_transform(self, y)

Fit the label encoder and return transformed labels

inverse_transform(self, y[, threshold])

Invert a set of encoded labels back to original labels

transform(self, y)

Transform and return encoded labels

fit(self, y)

Fit label binarizer

Parameters
yDask.Array of shape [n_samples,] or [n_samples, n_classes]

Chunked by row. Target values. A 2-d matrix should contain only 0 and 1 and represents multilabel classification.

Returns
selfreturns an instance of self.
fit_transform(self, y)

Fit the label encoder and return transformed labels

Parameters
yDask.Array of shape [n_samples,] or [n_samples, n_classes]

Target values. A 2-d matrix should contain only 0 and 1 and represents multilabel classification.

Returns
arrDask.Array backed by CuPy arrays containing encoded labels
inverse_transform(self, y, threshold=None)

Invert a set of encoded labels back to original labels

Parameters
yDask.Array of shape [n_samples, n_classes] containing encoded labels

thresholdfloat This value is currently ignored
Returns
arrDask.Array backed by CuPy arrays containing original labels
transform(self, y)

Transform and return encoded labels

Parameters
yDask.Array of shape [n_samples,] or [n_samples, n_classes]
Returns
arrDask.Array backed by CuPy arrays containing encoded labels
class cuml.dask.preprocessing.OneHotEncoder(client=None, verbose=False, **kwargs)

Encode categorical features as a one-hot numeric array. The input to this transformer should be a dask_cuDF.DataFrame or cupy dask.Array, denoting the values taken on by categorical features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter). By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.

Parameters
categories‘auto’, cupy.ndarray or cudf.DataFrame, default=’auto’

Categories (unique values) per feature. All categories are expected to fit on one GPU.

  • ‘auto’ : Determine categories automatically from the training data.

  • DataFrame/ndarray : categories[col] holds the categories expected in the feature col.

drop‘first’, None or a dict, default=None

Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.

  • None : retain all features (the default).

  • ‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.

  • Dict : drop[col] is the category in feature col that should be dropped.

sparsebool, default=False

This feature was deactivated and will raise an exception when True, because sparse matrices are not yet fully supported by CuPy, which can cause incorrect values when computing one-hot encodings. See https://github.com/cupy/cupy/issues/3223

dtypenumber type, default=np.float

Desired datatype of transform’s output.

handle_unknown{‘error’, ‘ignore’}, default=’error’

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
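
Examples

A minimal sketch on a Dask-cuDF DataFrame. It assumes a Dask CUDA cluster as in the LabelBinarizer example above, and that extra keyword arguments such as sparse are forwarded to the underlying single-GPU encoder:

import cudf
import dask_cudf
from cuml.dask.preprocessing import OneHotEncoder

df = cudf.DataFrame({'color': ['red', 'blue', 'red']})
ddf = dask_cudf.from_cudf(df, npartitions=2)

enc = OneHotEncoder(sparse=False)
encoded = enc.fit_transform(ddf)  # lazy when delayed=True (the default)
result = encoded.compute()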

Methods

fit(self, X)

Fit a multi-node multi-gpu OneHotEncoder to X.

fit_transform(self, X[, delayed])

Fit OneHotEncoder to X, then transform X.

inverse_transform(self, X[, delayed])

Convert the data back to the original representation.

transform(self, X[, delayed])

Transform X using one-hot encoding.

fit(self, X)

Fit a multi-node multi-gpu OneHotEncoder to X.

Parameters
XDask cuDF DataFrame or CuPy backed Dask Array

The data to determine the categories of each feature.

Returns
self
fit_transform(self, X, delayed=True)

Fit OneHotEncoder to X, then transform X. Equivalent to fit(X).transform(X).

Parameters
XDask cuDF DataFrame or CuPy backed Dask Array

The data to encode.

delayedbool (default = True)

Whether to execute as a delayed task or eager.

Returns
outDask cuDF DataFrame or CuPy backed Dask Array

Distributed object containing the transformed data

inverse_transform(self, X, delayed=True)

Convert the data back to the original representation. In case unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category.

Parameters
XCuPy backed Dask Array, shape [n_samples, n_encoded_features]

The transformed data.

delayedbool (default = True)

Whether to execute as a delayed task or eager.

Returns
X_trDask cuDF DataFrame or CuPy backed Dask Array

Distributed object containing the inverse transformed array.

transform(self, X, delayed=True)

Transform X using one-hot encoding.

Parameters
XDask cuDF DataFrame or CuPy backed Dask Array

The data to encode.

delayedbool (default = True)

Whether to execute as a delayed task or eager.

Returns
outDask cuDF DataFrame or CuPy backed Dask Array

Distributed object containing the transformed input.

Dataset Generation (Single-GPU)

datasets.make_blobs(n_samples=100, n_features=2, centers=None, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None, return_centers=False, order='F', dtype='float32')

Generate isotropic Gaussian blobs for clustering.

Parameters
n_samplesint or array-like, optional (default=100)

If int, it is the total number of points equally divided among clusters. If array-like, each element of the sequence indicates the number of samples per cluster.

n_featuresint, optional (default=2)

The number of features for each sample.

centersint or array of shape [n_centers, n_features], optional

(default=None) The number of centers to generate, or the fixed center locations. If n_samples is an int and centers is None, 3 centers are generated. If n_samples is array-like, centers must be either None or an array of length equal to the length of n_samples.

cluster_stdfloat or sequence of floats, optional (default=1.0)

The standard deviation of the clusters.

center_boxpair of floats (min, max), optional (default=(-10.0, 10.0))

The bounding box for each cluster center when centers are generated at random.

shuffleboolean, optional (default=True)

Shuffle the samples.

random_stateint, RandomState instance, default=None

Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.

return_centersbool, optional (default=False)

If True, then return the centers of each cluster

order: str, optional (default=’F’)

The order of the generated samples

dtypestr, optional (default=’float32’)

Dtype of the generated samples

Returns
Xdevice array of shape [n_samples, n_features]

The generated samples.

ydevice array of shape [n_samples]

The integer labels for cluster membership of each sample.

centersdevice array, shape [n_centers, n_features]

The centers of each cluster. Only returned if return_centers=True.

See also

make_classification

a more intricate variant

Examples

>>> from sklearn.datasets import make_blobs
>>> X, y = make_blobs(n_samples=10, centers=3, n_features=2,
...                   random_state=0)
>>> print(X.shape)
(10, 2)
>>> y
array([0, 0, 1, 0, 2, 2, 2, 1, 1, 0])
>>> X, y = make_blobs(n_samples=[3, 3, 4], centers=None, n_features=2,
...                   random_state=0)
>>> print(X.shape)
(10, 2)
>>> y
array([0, 1, 2, 0, 2, 2, 2, 1, 1, 0])
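
The cuML version has the same interface but returns device arrays; a minimal sketch:

from cuml.datasets import make_blobs

X, y = make_blobs(n_samples=10, centers=3, n_features=2,
                  random_state=0)

# X and y are device arrays (CuPy by default) with the same shapes as above
print(X.shape)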
datasets.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None, order='F', dtype='float32', _centroids=None, _informative_covariance=None, _redundant_covariance=None, _repeated_indices=None)

Generate a random n-class classification problem. This initially creates clusters of points normally distributed (std=1) about vertices of an n_informative-dimensional hypercube with sides of length 2*class_sep and assigns an equal number of clusters to each class. It introduces interdependence between these features and adds various types of further noise to the data. Without shuffling, X horizontally stacks features in the following order: the primary n_informative features, followed by n_redundant linear combinations of the informative features, followed by n_repeated duplicates, drawn randomly with replacement from the informative and redundant features. The remaining features are filled with random noise. Thus, without shuffling, all useful features are contained in the columns X[:, :n_informative + n_redundant + n_repeated].

Parameters
n_samplesint, optional (default=100)

The number of samples.

n_featuresint, optional (default=20)

The total number of features. These comprise n_informative informative features, n_redundant redundant features, n_repeated duplicated features and n_features-n_informative-n_redundant-n_repeated useless features drawn at random.

n_informativeint, optional (default=2)

The number of informative features. Each class is composed of a number of gaussian clusters each located around the vertices of a hypercube in a subspace of dimension n_informative. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined within each cluster in order to add covariance. The clusters are then placed on the vertices of the hypercube.

n_redundantint, optional (default=2)

The number of redundant features. These features are generated as random linear combinations of the informative features.

n_repeatedint, optional (default=0)

The number of duplicated features, drawn randomly from the informative and the redundant features.

n_classesint, optional (default=2)

The number of classes (or labels) of the classification problem.

n_clusters_per_classint, optional (default=2)

The number of clusters per class.

weightsarray-like of shape (n_classes,) or (n_classes - 1,), (default=None)

The proportions of samples assigned to each class. If None, then classes are balanced. Note that if len(weights) == n_classes - 1, then the last class weight is automatically inferred. More than n_samples samples may be returned if the sum of weights exceeds 1.

flip_yfloat, optional (default=0.01)

The fraction of samples whose class is assigned randomly. Larger values introduce noise in the labels and make the classification task harder.

class_sepfloat, optional (default=1.0)

The factor multiplying the hypercube size. Larger values spread out the clusters/classes and make the classification task easier.

hypercubeboolean, optional (default=True)

If True, the clusters are put on the vertices of a hypercube. If False, the clusters are put on the vertices of a random polytope.

shiftfloat, array of shape [n_features] or None, optional (default=0.0)

Shift features by the specified value. If None, then features are shifted by a random value drawn in [-class_sep, class_sep].

scalefloat, array of shape [n_features] or None, optional (default=1.0)

Multiply features by the specified value. If None, then features are scaled by a random value drawn in [1, 100]. Note that scaling happens after shifting.

shuffleboolean, optional (default=True)

Shuffle the samples and the features.

random_stateint, RandomState instance or None (default)

Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See Glossary.

order: str, optional (default=’F’)

The order of the generated samples

dtypestr, optional (default=’float32’)

Dtype of the generated samples

_centroids: array of centroids of shape (n_clusters, n_informative)
_informative_covariance: array for covariance between informative features

of shape (n_clusters, n_informative, n_informative)

_redundant_covariance: array for covariance between redundant features

of shape (n_informative, n_redundant)

_repeated_indices: array of indices for the repeated features

of shape (n_repeated, )

Returns
Xdevice array of shape [n_samples, n_features]

The generated samples.

ydevice array of shape [n_samples]

The integer labels for class membership of each sample.

Notes

The algorithm is adapted from Guyon [1] and was designed to generate the “Madelon” dataset. How we optimized for GPUs:

  1. Firstly, we generate X from a standard univariate instead of zeros. This saves memory as we don’t need to generate univariates each time for each feature class (informative, repeated, etc.) while also providing the added speedup of generating a big matrix on GPU

  2. We generate X in order=F by construction. We exploit the fact that X is generated from a univariate normal and that covariance is introduced with matrix multiplications, which means we can generate X as a 1D array and simply reshape it to the desired order; this only updates the metadata and eliminates copies

  3. Lastly, we also shuffle by construction. Centroid indices are permuted for each sample, and then we construct the data for each centroid. This shuffle works for both order=C and order=F and eliminates any need for secondary copies

References

1

I. Guyon, “Design of experiments for the NIPS 2003 variable selection benchmark”, 2003.

Examples

from cuml.datasets.classification import make_classification

X, y = make_classification(n_samples=10, n_features=4,
                           n_informative=2, n_classes=2)

print("X:")
print(X)

print("y:")
print(y)

Output:

X:
[[-2.3249989  -0.8679415  -1.1511791   1.3525577 ]
[ 2.2933831   1.3743551   0.63128835 -0.84648645]
[ 1.6361488  -1.3233329   0.807027   -0.894092  ]
[-1.0093077  -0.9990691  -0.00808992  0.00950443]
[ 0.99803793  2.068382    0.49570698 -0.8462848 ]
[-1.2750955  -0.9725835  -0.2390058   0.28081596]
[-1.3635055  -0.9637669  -0.31582272  0.37106958]
[ 1.1893625   2.227583    0.48750278 -0.8737561 ]
[-0.05753583 -1.0939395   0.8188342  -0.9620734 ]
[ 0.47910076  0.7648213  -0.17165393  0.26144698]]

y:
[0 1 0 0 1 0 0 1 0 1]

datasets.make_regression()

Generate a random regression problem.

See https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html

Parameters
n_samplesint, optional (default=100)

The number of samples.

n_featuresint, optional (default=2)

The number of features.

n_informativeint, optional (default=2)

The number of informative features, i.e., the number of features used to build the linear model used to generate the output.

n_targetsint, optional (default=1)

The number of regression targets, i.e., the dimension of the y output vector associated with a sample. By default, the output is a scalar.

biasfloat, optional (default=0.0)

The bias term in the underlying linear model.

effective_rankint or None, optional (default=None)
if not None:

The approximate number of singular vectors required to explain most of the input data by linear combinations. Using this kind of singular spectrum in the input allows the generator to reproduce the correlations often observed in practice.

if None:

The input set is well conditioned, centered and gaussian with unit variance.

tail_strengthfloat between 0.0 and 1.0, optional (default=0.5)

The relative importance of the fat noisy tail of the singular values profile if effective_rank is not None.

noisefloat, optional (default=0.0)

The standard deviation of the gaussian noise applied to the output.

shuffleboolean, optional (default=True)

Shuffle the samples and the features.

coefboolean, optional (default=False)

If True, the coefficients of the underlying linear model are returned.

random_stateint, RandomState instance or None (default)

Seed for the random number generator for dataset creation.

dtype: string or numpy dtype (default: ‘single’)

Type of the data. Possible values: float32, float64, ‘single’, ‘float’ or ‘double’.

handle: cuml.Handle

If it is None, a new one is created just for this function call

Returns
outdevice array of shape [n_samples, n_features]

The input samples.

valuesdevice array of shape [n_samples, n_targets]

The output values.

coefdevice array of shape [n_features, n_targets], optional

The coefficient of the underlying linear model. It is returned only if coef is True.
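
Examples

A minimal sketch (the generator is importable from cuml.datasets like the others in this section):

from cuml.datasets import make_regression

X, y, coef = make_regression(n_samples=10, n_features=4,
                             n_informative=2, coef=True,
                             random_state=0)

# coef has nonzero entries only for the n_informative features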

datasets.make_arima()

Generates a dataset of time series by simulating an ARIMA process of a given order.

Parameters
batch_size: int

Number of time series to generate

n_obs: int

Number of observations per series

orderTuple[int, int, int]

Order (p, d, q) of the simulated ARIMA process

seasonal_order: Tuple[int, int, int, int]

Seasonal ARIMA order (P, D, Q, s) of the simulated ARIMA process

intercept: bool or int

Whether to include a constant trend mu in the simulated ARIMA process

random_state: int, RandomState instance or None (default)

Seed for the random number generator for dataset creation.

dtype: string or numpy dtype (default: ‘single’)

Type of the data. Possible values: float32, float64, ‘single’, ‘float’ or ‘double’

output_type: {‘cudf’, ‘cupy’, ‘numpy’}

Type of the returned dataset

handle: cuml.Handle

If it is None, a new one is created just for this function call
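
Examples

A minimal sketch (assuming make_arima is importable from cuml.datasets like the other generators here):

from cuml.datasets import make_arima

# Two series of 20 observations each from an ARIMA(1,1,1) process
series = make_arima(batch_size=2, n_obs=20, order=(1, 1, 1),
                    random_state=0)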

Dataset Generation (Dask-based Multi-GPU)

cuml.dask.datasets.blobs.make_blobs(n_samples=100, n_features=2, centers=None, cluster_std=1.0, n_parts=None, center_box=(-10, 10), shuffle=True, random_state=None, return_centers=False, verbose=False, order='F', dtype='float32', client=None)

Makes labeled Dask-Cupy arrays containing blobs for a randomly generated set of centroids.

This function calls make_blobs from cuml.datasets on each Dask worker and aggregates the results into a single Dask array.

For more information on Scikit-learn’s make_blobs, see https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html

Parameters
n_samplesint

number of rows

n_featuresint

number of features

centersint or array of shape [n_centers, n_features],

optional (default=None) The number of centers to generate, or the fixed center locations. If n_samples is an int and centers is None, 3 centers are generated. If n_samples is array-like, centers must be either None or an array of length equal to the length of n_samples.

cluster_stdfloat (default = 1.0)

standard deviation of points around centroid

n_partsint (default = None)

number of partitions to generate (this can be greater than the number of workers)

center_boxtuple (int, int) (default = (-10, 10))

the bounding box which constrains all the centroids

random_stateint (default = None)

sets random seed (or use None to reinitialize each time)

return_centersbool, optional (default=False)

If True, then return the centers of each cluster

verboseint or boolean (default = False)

Logging level.

shufflebool (default = True)

Shuffles the samples on each worker.

order: str, optional (default=’F’)

The order of the generated samples

dtypestr, optional (default=’float32’)

Dtype of the generated samples

clientdask.distributed.Client (optional)

Dask client to use

Returns
Xdask.array backed by CuPy array of shape [n_samples, n_features]

The input samples.

ydask.array backed by CuPy array of shape [n_samples]

The output values.

centersdask.array backed by CuPy array of shape [n_centers, n_features], optional

The centers of the underlying blobs. It is returned only if return_centers is True.
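
Examples

A minimal sketch (cluster setup mirrors the make_classification example below):

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.datasets.blobs import make_blobs

cluster = LocalCUDACluster()
client = Client(cluster)

X, y = make_blobs(n_samples=100, n_features=2, centers=3)

# X and y are Dask arrays backed by CuPy device arrays
print(X.compute().shape)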

cuml.dask.datasets.classification.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None, order='F', dtype='float32', n_parts=None, client=None)

Generate a random n-class classification problem.

This initially creates clusters of points normally distributed (std=1) about vertices of an n_informative-dimensional hypercube with sides of length 2*class_sep and assigns an equal number of clusters to each class. It introduces interdependence between these features and adds various types of further noise to the data. Without shuffling, X horizontally stacks features in the following order: the primary n_informative features, followed by n_redundant linear combinations of the informative features, followed by n_repeated duplicates, drawn randomly with replacement from the informative and redundant features. The remaining features are filled with random noise. Thus, without shuffling, all useful features are contained in the columns X[:, :n_informative + n_redundant + n_repeated].

Parameters
n_samplesint, optional (default=100)

The number of samples.

n_featuresint, optional (default=20)

The total number of features. These comprise n_informative informative features, n_redundant redundant features, n_repeated duplicated features and n_features-n_informative-n_redundant-n_repeated useless features drawn at random.

n_informativeint, optional (default=2)

The number of informative features. Each class is composed of a number of gaussian clusters each located around the vertices of a hypercube in a subspace of dimension n_informative. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined within each cluster in order to add covariance. The clusters are then placed on the vertices of the hypercube.

n_redundantint, optional (default=2)

The number of redundant features. These features are generated as random linear combinations of the informative features.

n_repeatedint, optional (default=0)

The number of duplicated features, drawn randomly from the informative and the redundant features.

n_classesint, optional (default=2)

The number of classes (or labels) of the classification problem.

n_clusters_per_classint, optional (default=2)

The number of clusters per class.

weightsarray-like of shape (n_classes,) or (n_classes - 1,), (default=None)

The proportions of samples assigned to each class. If None, then classes are balanced. Note that if len(weights) == n_classes - 1, then the last class weight is automatically inferred. More than n_samples samples may be returned if the sum of weights exceeds 1.

flip_yfloat, optional (default=0.01)

The fraction of samples whose class is assigned randomly. Larger values introduce noise in the labels and make the classification task harder.

class_sepfloat, optional (default=1.0)

The factor multiplying the hypercube size. Larger values spread out the clusters/classes and make the classification task easier.

hypercubeboolean, optional (default=True)

If True, the clusters are put on the vertices of a hypercube. If False, the clusters are put on the vertices of a random polytope.

shiftfloat, array of shape [n_features] or None, optional (default=0.0)

Shift features by the specified value. If None, then features are shifted by a random value drawn in [-class_sep, class_sep].

scalefloat, array of shape [n_features] or None, optional (default=1.0)

Multiply features by the specified value. If None, then features are scaled by a random value drawn in [1, 100]. Note that scaling happens after shifting.

shuffleboolean, optional (default=True)

Shuffle the samples and the features.

random_stateint, RandomState instance or None (default)

Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See Glossary.

order: str, optional (default=’F’)

The order of the generated samples

dtypestr, optional (default=’float32’)

Dtype of the generated samples

n_partsint (default = None)

number of partitions to generate (this can be greater than the number of workers)

Returns
Xdask.array backed by CuPy array of shape [n_samples, n_features]

The generated samples.

ydask.array backed by CuPy array of shape [n_samples]

The integer labels for class membership of each sample.

Notes

How we extended the dask MNMG version from the single GPU version:

  1. We generate centroids of shape (n_centroids, n_informative)

  2. We generate an informative covariance of shape (n_centroids, n_informative, n_informative)

  3. We generate a redundant covariance of shape (n_informative, n_redundant)

  4. We generate the indices for the repeated features

We pass along the references to the futures of the above arrays with each part to the single GPU cuml.datasets.classification.make_classification so that each part (and worker) has access to the correct values to generate data from the same covariances

Examples

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.datasets.classification import make_classification
cluster = LocalCUDACluster()
client = Client(cluster)
X, y = make_classification(n_samples=10, n_features=4,
                           n_informative=2, n_classes=2)

print("X:")
print(X.compute())

print("y:")
print(y.compute())

Output:

X:
[[-1.6990056  -0.8241044  -0.06997631  0.45107925]
[-1.8105277   1.7829906   0.492909    0.05390119]
[-0.18290454 -0.6155432   0.6667889  -1.0053712 ]
[-2.7530136  -0.888528   -0.5023055   1.3983376 ]
[-0.9788184  -0.89851004  0.10802134 -0.10021686]
[-0.76883423 -1.0689086   0.01249526 -0.1404741 ]
[-1.5676656  -0.83082974 -0.03072987  0.34499463]
[-0.9381793  -1.0971068  -0.07465998  0.02618019]
[-1.3021476  -0.87076336  0.02249984  0.15187258]
[ 1.1820307   1.7524253   1.5087451  -2.4626074 ]]

y:
[0 1 0 0 0 0 0 0 0 1]

cuml.dask.datasets.regression.make_low_rank_matrix(n_samples=100, n_features=100, effective_rank=10, tail_strength=0.5, random_state=None, n_parts=1, n_samples_per_part=None, dtype='float32')

Generate a mostly low rank matrix with bell-shaped singular values

Parameters
n_samplesint, optional (default=100)

The number of samples.

n_featuresint, optional (default=100)

The number of features.

effective_rankint, optional (default=10)

The approximate number of singular vectors required to explain most of the data by linear combinations.

tail_strengthfloat between 0.0 and 1.0, optional (default=0.5)

The relative importance of the fat noisy tail of the singular values profile.

random_stateint, CuPy RandomState instance, Dask RandomState instance or None (default)

Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.

n_partsint, optional (default=1)

The number of parts of work.

dtype: str, optional (default=’float32’)

dtype of generated data

Returns
XDask-CuPy array of shape [n_samples, n_features]

The matrix.
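
A minimal sketch (assumes a Dask CUDA cluster and client are already running, as in the earlier Dask examples):

from cuml.dask.datasets.regression import make_low_rank_matrix

X = make_low_rank_matrix(n_samples=100, n_features=20,
                         effective_rank=5, n_parts=2)

# X is a Dask-CuPy array of shape (100, 20)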

cuml.dask.datasets.regression.make_regression(n_samples=100, n_features=100, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=False, coef=False, random_state=None, n_parts=1, n_samples_per_part=None, order='F', dtype='float32', client=None, use_full_low_rank=True)

Generate a random regression problem. The input set can either be well conditioned (by default) or have a low rank-fat tail singular profile.

The output is generated by applying a (potentially biased) random linear regression model with “n_informative” nonzero regressors to the previously generated input and some gaussian centered noise with some adjustable scale.

Parameters
n_samplesint, optional (default=100)

The number of samples.

n_featuresint, optional (default=100)

The number of features.

n_informativeint, optional (default=10)

The number of informative features, i.e., the number of features used to build the linear model used to generate the output.

n_targetsint, optional (default=1)

The number of regression targets, i.e., the dimension of the y output vector associated with a sample. By default, the output is a scalar.

biasfloat, optional (default=0.0)

The bias term in the underlying linear model.

effective_rankint or None, optional (default=None)
if not None:

The approximate number of singular vectors required to explain most of the input data by linear combinations. Using this kind of singular spectrum in the input allows the generator to reproduce the correlations often observed in practice.

if None:

The input set is well conditioned, centered and gaussian with unit variance.

tail_strengthfloat between 0.0 and 1.0, optional (default=0.5)

The relative importance of the fat noisy tail of the singular values profile if “effective_rank” is not None.

noisefloat, optional (default=0.0)

The standard deviation of the gaussian noise applied to the output.

shuffleboolean, optional (default=False)

Shuffle the samples and the features.

coefboolean, optional (default=False)

If True, the coefficients of the underlying linear model are returned.

random_stateint, CuPy RandomState instance, Dask RandomState instance or None (default)

Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.

n_partsint, optional (default=1)

The number of parts of work.

orderstr, optional (default=’F’)

Row-major or Col-major

dtype: str, optional (default=’float32’)

dtype of generated data

use_full_low_rankboolean (default=True)

Whether to use the entire dataset to generate the low rank matrix. If False, it creates a low rank covariance and uses the corresponding covariance to generate a multivariate normal distribution on the remaining chunks

Returns
XDask-CuPy array of shape [n_samples, n_features]

The input samples.

yDask-CuPy array of shape [n_samples] or [n_samples, n_targets]

The output values.

coefDask-CuPy array of shape [n_features] or [n_features, n_targets], optional

The coefficient of the underlying linear model. It is returned only if coef is True.

Notes

  • Known Performance Limitations:
    1. When effective_rank is set and use_full_low_rank is True, we cannot generate order F by construction, and an explicit transpose is performed on each part. This may cause memory to spike (other parameters make order F by construction)

    2. When n_targets > 1 and order = ‘F’ as above, we have to explicitly transpose the y array. If coef = True, then we also explicitly transpose the ground_truth array

    3. When shuffle = True and order = F, there are memory spikes to shuffle the F order arrays

  • NOTE: If out-of-memory errors are encountered in any of the above configurations, try increasing the n_parts parameter.
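
Examples

A minimal sketch (assumes a Dask CUDA cluster and client are already running):

from cuml.dask.datasets.regression import make_regression

X, y = make_regression(n_samples=1000, n_features=20,
                       n_informative=5, n_parts=2)

# X: Dask-CuPy array of shape (1000, 20); y: shape (1000,) since n_targets=1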

Metrics

cuml.metrics.regression.mean_absolute_error(*args, **kwargs)

Mean absolute error regression loss

Be careful when using this metric with float32 inputs as the result can be slightly incorrect because of floating point precision if the input is large enough. float64 will have lower numerical error.

Parameters
y_truearray-like (device or host) shape = (n_samples,)

or (n_samples, n_outputs) Ground truth (correct) target values.

y_predarray-like (device or host) shape = (n_samples,)

or (n_samples, n_outputs) Estimated target values.

sample_weightarray-like (device or host) shape = (n_samples,), optional

Sample weights.

multioutputstring in [‘raw_values’, ‘uniform_average’]

or array-like of shape (n_outputs) Defines aggregating of multiple output values. Array-like value defines weights used to average errors. ‘raw_values’ : Returns a full set of errors in case of multioutput input. ‘uniform_average’ : Errors of all outputs are averaged with uniform weight.

Returns
lossfloat or ndarray of floats

If multioutput is ‘raw_values’, then mean absolute error is returned for each output separately. If multioutput is ‘uniform_average’ or an ndarray of weights, then the weighted average of all output errors is returned.

MAE output is non-negative floating point. The best value is 0.0.
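
A minimal sketch (mean_squared_error and mean_squared_log_error follow the same call pattern):

import cupy as cp
from cuml.metrics.regression import mean_absolute_error

y_true = cp.asarray([3.0, -0.5, 2.0, 7.0])
y_pred = cp.asarray([2.5, 0.0, 2.0, 8.0])

# (0.5 + 0.5 + 0.0 + 1.0) / 4 = 0.5
print(mean_absolute_error(y_true, y_pred))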

cuml.metrics.regression.mean_squared_error(*args, **kwargs)

Mean squared error regression loss

Be careful when using this metric with float32 inputs as the result can be slightly incorrect because of floating point precision if the input is large enough. float64 will have lower numerical error.

Parameters
y_truearray-like (device or host) shape = (n_samples,)

or (n_samples, n_outputs) Ground truth (correct) target values.

y_predarray-like (device or host) shape = (n_samples,)

or (n_samples, n_outputs) Estimated target values.

sample_weightarray-like (device or host) shape = (n_samples,), optional

Sample weights.

multioutputstring in [‘raw_values’, ‘uniform_average’]

or array-like of shape (n_outputs) Defines aggregating of multiple output values. Array-like value defines weights used to average errors. ‘raw_values’ : Returns a full set of errors in case of multioutput input. ‘uniform_average’ : Errors of all outputs are averaged with uniform weight.

squaredboolean value, optional (default = True)

If True returns MSE value, if False returns RMSE value.

Returns
lossfloat or ndarray of floats

A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

cuml.metrics.regression.mean_squared_log_error(*args, **kwargs)

Mean squared log error regression loss

Be careful when using this metric with float32 inputs as the result can be slightly incorrect because of floating point precision if the input is large enough. float64 will have lower numerical error.

Parameters
y_truearray-like (device or host) shape = (n_samples,)

or (n_samples, n_outputs) Ground truth (correct) target values.

y_predarray-like (device or host) shape = (n_samples,)

or (n_samples, n_outputs) Estimated target values.

sample_weightarray-like (device or host) shape = (n_samples,), optional

Sample weights.

multioutputstring in [‘raw_values’, ‘uniform_average’]

or array-like of shape (n_outputs) Defines aggregating of multiple output values. Array-like value defines weights used to average errors. ‘raw_values’ : Returns a full set of errors in case of multioutput input. ‘uniform_average’ : Errors of all outputs are averaged with uniform weight.

squaredboolean value, optional (default = True)

If True returns MSLE (mean squared log error) value, if False returns RMSLE value.

Returns
lossfloat or ndarray of floats

A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

cuml.metrics.regression.r2_score(y, y_hat, convert_dtype=False, handle=None)

Calculates r2 score between y and y_hat

Parameters
yarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

y_hatarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the method will, when necessary, convert y_hat to be the same data type as y if they differ. This will increase memory used for the method.

Returns
r2 scoredouble

The r2 score (coefficient of determination) between y and y_hat
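
A minimal sketch:

import cupy as cp
from cuml.metrics.regression import r2_score

y = cp.asarray([1.0, 2.0, 3.0, 4.0])
y_hat = cp.asarray([1.1, 1.9, 3.2, 3.8])

# Close to 1.0 for a good fit
print(r2_score(y, y_hat))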

cuml.metrics.accuracy.accuracy_score(ground_truth, predictions, handle=None, convert_dtype=True)

Calculates the accuracy score of a classification model.

Parameters
handlecuml.Handle
predictionsNumPy ndarray or Numba device ndarray

The labels predicted by the model for the test dataset

ground_truthNumPy ndarray, Numba device

The ground truth labels of the test dataset

Returns
float

The accuracy of the model used for prediction
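
A minimal sketch (convert_dtype=True handles mismatched input dtypes):

import numpy as np
from cuml.metrics.accuracy import accuracy_score

ground_truth = np.asarray([0, 1, 1, 0], dtype=np.int32)
predictions = np.asarray([0, 1, 0, 0], dtype=np.int32)

# 3 of 4 predictions match -> 0.75
print(accuracy_score(ground_truth, predictions))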

cuml.metrics.trustworthiness.trustworthiness(X, X_embedded, handle=None, n_neighbors=5, metric='euclidean', should_downcast=True, convert_dtype=False, batch_size=512)

Expresses to what extent the local structure is retained in the embedding. The score is defined in the range [0, 1].

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

X_embeddedarray-like (device or host) shape= (n_samples, n_features)

Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

n_neighborsint, optional (default: 5)

Number of neighbors considered

convert_dtypebool, optional (default = False)

When set to True, the trustworthiness method will automatically convert the inputs to np.float32.

Returns
trustworthiness scoredouble

Trustworthiness of the low-dimensional embedding
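
A minimal sketch that scores a (deliberately trivial) embedding made from the first two input columns:

import cupy as cp
from cuml.metrics.trustworthiness import trustworthiness

X = cp.random.rand(100, 10).astype(cp.float32)
X_embedded = cp.ascontiguousarray(X[:, :2])

score = trustworthiness(X, X_embedded, n_neighbors=5)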

cuml.metrics.cluster.adjustedrandindex.adjusted_rand_score(labels_true, labels_pred, handle=None, convert_dtype=True)

Adjusted_rand_score is a clustering similarity metric based on the Rand index and is corrected for chance.

Parameters
labels_trueGround truth labels to be used as a reference

labels_predArray of predicted labels used to evaluate the model

handlecuml.Handle

Returns
float

The adjusted rand index value between -1.0 and 1.0
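
A minimal sketch; the score is invariant to permutations of the label values:

import numpy as np
from cuml.metrics.cluster.adjustedrandindex import adjusted_rand_score

labels_true = np.asarray([0, 0, 1, 1], dtype=np.int32)
labels_pred = np.asarray([1, 1, 0, 0], dtype=np.int32)

# Identical partitions up to relabeling -> 1.0
print(adjusted_rand_score(labels_true, labels_pred))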

cuml.metrics.cluster.entropy.cython_entropy(clustering, base=None, handle=None)

Computes the entropy of a distribution for given probability values.

Parameters
clusteringarray-like (device or host) shape = (n_samples,)

Clustering of labels. Probabilities are computed based on occurrences of labels. For instance, to represent a fair coin (2 equally possible outcomes), the clustering could be [0,1]. For a biased coin with 2/3 probability for tail, the clustering could be [0, 0, 1].

base: float, optional

The logarithmic base to use, defaults to e (natural logarithm).

handlecuml.Handle

Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.

Returns
Sfloat

The calculated entropy.
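
A minimal sketch using the biased-coin clustering from the description above:

import numpy as np
from cuml.metrics.cluster.entropy import cython_entropy

# Two labels with probabilities 2/3 and 1/3
clustering = np.asarray([0, 0, 1], dtype=np.int32)

# Natural-log entropy, approximately 0.64
S = cython_entropy(clustering)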

Benchmarking

class cuml.benchmark.algorithms.AlgorithmPair(cpu_class, cuml_class, shared_args, cuml_args={}, cpu_args={}, name=None, accepts_labels=True, cpu_data_prep_hook=None, cuml_data_prep_hook=None, accuracy_function=None, bench_func=<function fit>, setup_cpu_func=None, setup_cuml_func=None)

Wraps a cuML algorithm and (optionally) a cpu-based algorithm (typically scikit-learn, but does not need to be as long as it offers fit and predict or transform methods). Provides mechanisms to run each version with default arguments. If no CPU-based version of the algorithm is available, pass None for the cpu_class when instantiating

Parameters
cpu_classclass

Class for CPU version of algorithm. Set to None if not available.

cuml_classclass

Class for cuML algorithm

shared_argsdict

Arguments passed to both implementations’ initializers

cuml_argsdict

Arguments only passed to cuml’s initializer

cpu_args dict

Arguments only passed to sklearn’s initializer

accepts_labelsboolean

If True, the fit method expects both X and y inputs. Otherwise, it expects only an X input.

cpu_data_prep_hook, cuml_data_prep_hookfunction (data -> data)

Optional functions to run on input data before passing it to fit

accuracy_functionfunction (y_test, y_pred)

Function that returns a scalar representing accuracy

bench_funccustom function to perform fit/predict/transform calls

Methods

run_cpu(self, data, **override_args)

Runs the cpu-based algorithm’s fit method on specified data

run_cuml(self, data, **override_args)

Runs the cuml-based algorithm’s fit method on specified data

setup_cpu

setup_cuml

run_cpu(self, data, **override_args)

Runs the cpu-based algorithm’s fit method on specified data

run_cuml(self, data, **override_args)

Runs the cuml-based algorithm’s fit method on specified data

cuml.benchmark.algorithms.algorithm_by_name(name)

Returns the algorithm pair with the name ‘name’ (case-insensitive)

cuml.benchmark.algorithms.all_algorithms()

Returns all defined AlgorithmPair objects
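A hedged sketch of running one pair end to end (the 'LinearRegression' pair name is an assumption about which pairs are defined; gen_data is documented below):

from cuml.benchmark.algorithms import algorithm_by_name
from cuml.benchmark.datagen import gen_data

pair = algorithm_by_name('LinearRegression')  # look up a defined AlgorithmPair
data = gen_data('regression', 'numpy', n_samples=10000, n_features=16)

pair.run_cuml(data)  # fit the cuML implementation with default arguments
pair.run_cpu(data)   # fit the CPU (scikit-learn) baseline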

Wrappers to run ML benchmarks

class cuml.benchmark.runners.AccuracyComparisonRunner(bench_rows, bench_dims, dataset_name='blobs', input_type='numpy', test_fraction=0.1, n_reps=1)

Wrapper to run an algorithm with multiple dataset sizes and compute accuracy and speedup of cuml relative to sklearn baseline.

class cuml.benchmark.runners.BenchmarkTimer(reps=1)

Provides a context manager that runs a code block reps times and records results to the instance variable timings. Use like:

timer = BenchmarkTimer(reps=5)
for _ in timer.benchmark_runs():
    ... do something ...
print(np.min(timer.timings))

Methods

benchmark_runs

class cuml.benchmark.runners.SpeedupComparisonRunner(bench_rows, bench_dims, dataset_name='blobs', input_type='numpy', n_reps=1)

Wrapper to run an algorithm with multiple dataset sizes and compute speedup of cuml relative to sklearn baseline.

Methods

run

cuml.benchmark.runners.run_variations(algos, dataset_name, bench_rows, bench_dims, param_override_list=[{}], cuml_param_override_list=[{}], cpu_param_override_list=[{}], dataset_param_override_list=[{}], input_type='numpy', test_fraction=0.1, run_cpu=True, raise_on_error=False, n_reps=1)

Runs each algo in algos once per bench_rows X bench_dims X param_override_list X cuml_param_override_list combination and returns a dataframe containing timing and accuracy data.

Parameters
algosstr or list

Name of algorithms to run and evaluate

dataset_namestr

Name of dataset to use

bench_rowslist of int

Dataset row counts to test

bench_dimslist of int

Dataset column counts to test

param_override_listlist of dict

Dicts containing parameters to pass to __init__. Each dict specifies parameters to override in one run of the algorithm.

cuml_param_override_listlist of dict

Dicts containing parameters to pass to __init__ of the cuml algo only.

cpu_param_override_listlist of dict

Dicts containing parameters to pass to __init__ of the cpu algo only.

dataset_param_override_listlist of dict

Dicts containing parameters to pass to dataset generator function

test_fractionfloat

The fraction of data to use for testing.

run_cpuboolean

If True, run the cpu-based algorithm for comparison
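A hedged sketch of a small sweep. The parameter description above says algos takes algorithm names; passing the AlgorithmPair objects returned by algorithm_by_name is an assumption:

from cuml.benchmark import algorithms, runners

results = runners.run_variations(
    [algorithms.algorithm_by_name('LinearRegression')],
    dataset_name='regression',
    bench_rows=[10000, 100000],  # row counts to test
    bench_dims=[16],             # column counts to test
)
print(results)  # dataframe of timing and accuracy data per combination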

Data generators for cuML benchmarks

The main entry point for consumers is gen_data, which wraps the underlying data generators.

Notes when writing new generators:

Each generator is a function that accepts:
  • n_samples (set to 0 for ‘default’)

  • n_features (set to 0 for ‘default’)

  • random_state

  • (and optional generator-specific parameters)

The function should return a 2-tuple (X, y), where X is a Pandas DataFrame and y is a Pandas Series. If the generator does not produce labels, it can return (X, None). A sketch of this contract follows the next paragraph.

A set of helper functions (convert_*) can convert these to alternative formats. Future revisions may support generating cudf dataframes or GPU arrays directly instead.
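As an illustration of this contract, a minimal sketch of a new generator (the name and extra parameters are hypothetical, not part of the documented API):

import numpy as np
import pandas as pd

def gen_data_uniform(n_samples, n_features, random_state=42, low=0.0, high=1.0):
    # Hypothetical generator: uniform features, no labels
    n_samples = n_samples or 1000    # 0 means 'default', per the convention above
    n_features = n_features or 16
    rng = np.random.RandomState(random_state)
    X = pd.DataFrame(rng.uniform(low, high, size=(n_samples, n_features)))
    return X, None                   # no labels produced, so y is None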

cuml.benchmark.datagen.gen_data(dataset_name, dataset_format, n_samples=0, n_features=0, random_state=42, test_fraction=0.0, **kwargs)

Returns a tuple of data from the specified generator.

Parameters
dataset_namestr

Dataset to use. Can be a synthetic generator (blobs or regression) or a specified dataset (higgs currently, others coming soon)

dataset_formatstr

Type of data to return. (One of cudf, numpy, pandas, gpuarray)

n_samplesint

Number of samples to include in training set (regardless of test split)

test_fractionfloat

Fraction of the dataset to partition randomly into the test set. If this is 0.0, no test set will be created.
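A minimal sketch (with a nonzero test_fraction the returned tuple also carries the test split; its exact layout is not spelled out above, so it is left unpacked here):

from cuml.benchmark.datagen import gen_data

# Synthetic blobs as NumPy arrays, with 10% of rows held out for testing
data = gen_data('blobs', 'numpy', n_samples=10000, n_features=16,
                test_fraction=0.1)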

cuml.benchmark.datagen.load_higgs()

Returns the Higgs Boson dataset as an X, y tuple of dataframes.

Regression and Classification

Linear Regression

class cuml.LinearRegression

LinearRegression is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.

cuML's LinearRegression expects either a cuDF DataFrame or a NumPy matrix and provides two algorithms, SVD and Eig, to fit a linear model. SVD is more stable, but Eig (default) is much faster.

Parameters
algorithm‘eig’ or ‘svd’ (default = ‘eig’)

Eig uses an eigendecomposition of the covariance matrix and is much faster. SVD is slower, but guaranteed to be stable.

fit_interceptboolean (default = True)

If True, LinearRegression tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalizeboolean (default = False)

This parameter is ignored when fit_intercept is set to False. If True, the predictors in X will be normalized by dividing by their L2 norms. If False, no scaling will be done.

Notes

LinearRegression suffers from multicollinearity (when columns are correlated with each other) and from variance explosions caused by outliers. Consider using Ridge Regression to fix the multicollinearity problem, and consider first applying DBSCAN to remove outliers, or statistical analysis to filter possible outliers.

Applications of LinearRegression

LinearRegression is used in regression tasks where one wants to predict, say, sales or house prices. It is also used in extrapolation, time series tasks, dynamic systems modelling and many other machine learning tasks. This model should be tried first if the machine learning problem is a regression task (predicting a continuous variable).

For additional information, see scikit-learn's OLS documentation.

For an additional example see the OLS notebook.

Examples

import numpy as np
import cudf

# Both import methods supported
from cuml import LinearRegression
from cuml.linear_model import LinearRegression

lr = LinearRegression(fit_intercept = True, normalize = False,
                      algorithm = "eig")

X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype = np.float32)
X['col2'] = np.array([1,2,2,3], dtype = np.float32)

y = cudf.Series( np.array([6.0, 8.0, 9.0, 11.0], dtype = np.float32) )

reg = lr.fit(X,y)
print("Coefficients:")
print(reg.coef_)
print("Intercept:")
print(reg.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3,2], dtype = np.float32)
X_new['col2'] = np.array([5,5], dtype = np.float32)
preds = lr.predict(X_new)

print("Predictions:")
print(preds)

Output:

Coefficients:

            0 1.0000001
            1 1.9999998

Intercept:
            3.0

Predictions:

            0 15.999999
            1 14.999999
Attributes
coef_array, shape (n_features)

The estimated coefficients for the linear regression model.

intercept_array

The independent term. If fit_intercept is False, it will be 0.

Methods

fit(self, X, y[, convert_dtype])

Fit the model with X and y.

get_param_names(self)

predict(self, X[, convert_dtype])

Predicts y values for X.

fit(self, X, y, convert_dtype=False)

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_param_names(self)
predict(self, X, convert_dtype=False)

Predicts y values for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

Logistic Regression

class cuml.LogisticRegression

LogisticRegression is a linear model that is used to model the probability of occurrence of certain events, for example the probability of success or failure of an event.

cuML's LogisticRegression can take array-like objects, either in host as NumPy arrays or in device (as Numba or __cuda_array_interface__ compliant), in addition to cuDF objects. It provides both single-class (using sigmoid loss) and multiple-class (using softmax loss) variants, depending on the input variables.

Only one solver option is currently available: Quasi-Newton (QN) algorithms. Even though it is presented as a single option, this solver resolves to two different algorithms underneath:

  • Orthant-Wise Limited Memory Quasi-Newton (OWL-QN) if there is l1 regularization

  • Limited Memory BFGS (L-BFGS) otherwise.

Note that, just like in Scikit-learn, the bias will not be regularized.

Parameters
penalty: ‘none’, ‘l1’, ‘l2’, ‘elasticnet’ (default = ‘l2’)

Used to specify the norm used in the penalization. If ‘none’ or ‘l2’ are selected, then L-BFGS solver will be used. If ‘l1’ is selected, solver OWL-QN will be used. If ‘elasticnet’ is selected, OWL-QN will be used if l1_ratio > 0, otherwise L-BFGS will be used.

tol: float (default = 1e-4)

The training process will stop if current_loss > previous_loss - tol

C: float (default = 1.0)

Inverse of regularization strength; must be a positive float.

fit_intercept: boolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

class_weight: None

Custom class weights are currently not supported.

max_iter: int (default = 1000)

Maximum number of iterations taken for the solvers to converge.

linesearch_max_iter: int (default = 50)

Max number of linesearch iterations per outer iteration used in the L-BFGS and OWL-QN solvers.

verboseint or boolean (default = False)

Controls verbose level of logging.

l1_ratio: float or None, optional (default=None)

The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1

solver: ‘qn’, ‘lbfgs’, ‘owl’ (default=’qn’).

Algorithm to use in the optimization problem. Currently only qn is supported, which automatically selects either L-BFGS or OWL-QN depending on the conditions of the l1 regularization described above. Options ‘lbfgs’ and ‘owl’ are just convenience values that end up using the same solver following the same rules.

Notes

cuML's LogisticRegression uses a different solver than the equivalent Scikit-learn implementation, except when there is no penalty and solver=lbfgs is used in Scikit-learn. This can cause (smaller) differences in the coefficients and predictions of the model, similar to using different solvers in Scikit-learn.

For additional information, see Scikit-learn's LogisticRegression documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Examples

import cudf
import numpy as np

# Both import methods supported
# from cuml import LogisticRegression
from cuml.linear_model import LogisticRegression

X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype = np.float32)
X['col2'] = np.array([1,2,2,3], dtype = np.float32)
y = cudf.Series( np.array([0.0, 0.0, 1.0, 1.0], dtype = np.float32) )

reg = LogisticRegression()
reg.fit(X,y)

print("Coefficients:")
print(reg.coef_.to_output('cupy'))
print("Intercept:")
print(reg.intercept_.to_output('cupy'))

X_new = cudf.DataFrame()
X_new['col1'] = np.array([1,5], dtype = np.float32)
X_new['col2'] = np.array([2,5], dtype = np.float32)

preds = reg.predict(X_new)

print("Predictions:")
print(preds)

Output:

Coefficients:
            0.22309814
            0.21012752
Intercept:
            -0.7548761
Predictions:
            0    0.0
            1    1.0
Attributes
coef_: dev array, dim (n_classes, n_features) or (n_classes, n_features+1)

The estimated coefficients for the logistic regression model. Note: this includes the intercept as the last column if fit_intercept is True

intercept_: device array (n_classes, 1)

The independent term. If fit_intercept is False, it will be 0.

Methods

decision_function(self, X[, convert_dtype])

Gives confidence score for X

fit(self, X, y[, convert_dtype])

Fit the model with X and y.

get_param_names(self)

predict(self, X[, convert_dtype])

Predicts the y for X.

predict_log_proba(self, X[, convert_dtype])

Predicts the log class probabilities for each class in X

predict_proba(self, X[, convert_dtype])

Predicts the class probabilities for each class in X

score(self, X, y[, convert_dtype])

Calculates the accuracy metric score of the model for X.

decision_function(self, X, convert_dtype=False)

Gives confidence score for X

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
y: array-like (device)

Dense matrix (floats or doubles) of shape (n_samples, n_classes)

fit(self, X, y, convert_dtype=False)

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_param_names(self)
predict(self, X, convert_dtype=False)

Predicts the y for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

predict_log_proba(self, X, convert_dtype=False)

Predicts the log class probabilities for each class in X

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
y: array-like (device)

Dense matrix (floats or doubles) of shape (n_samples, n_classes)

predict_proba(self, X, convert_dtype=False)

Predicts the class probabilities for each class in X

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
y: array-like (device)

Dense matrix (floats or doubles) of shape (n_samples, n_classes)

score(self, X, y, convert_dtype=False)

Calculates the accuracy metric score of the model for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Observations for which labels score will be calculated. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Ground truth labels to compare predictions to for the score. Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Ridge Regression

class cuml.Ridge

Ridge extends LinearRegression by providing L2 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors and improve the conditioning of the problem.

cuML’s Ridge can take array-like objects, either in host as NumPy arrays or in device (as Numba or __cuda_array_interface__ compliant), in addition to cuDF objects. It provides 3 algorithms: SVD, Eig and CD to fit a linear model. In general SVD uses significantly more memory and is slower than Eig. If using CUDA 10.1, the memory difference is even bigger than in the other supported CUDA versions. However, SVD is more stable than Eig (default). CD uses Coordinate Descent and can be faster when data is large.

Parameters
alphafloat (default = 1.0)

Regularization strength - must be a positive float. Larger values specify stronger regularization. Array input will be supported later.

solver{‘eig’, ‘svd’, ‘cd’} (default = ‘eig’)

Eig uses an eigendecomposition of the covariance matrix and is much faster. SVD is slower, but guaranteed to be stable. CD (Coordinate Descent) is very fast and is suitable for large problems.

fit_interceptboolean (default = True)

If True, Ridge tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalizeboolean (default = False)

If True, the predictors in X will be normalized by dividing by their L2 norms. If False, no scaling will be done.

Notes

Ridge provides L2 regularization. This means that the coefficients can shrink to become very small, but not zero. This can cause issues of interpretability on the coefficients. Consider using Lasso, or thresholding small coefficients to zero.

Applications of Ridge

Ridge Regression is used in the same way as LinearRegression, but does not suffer from multicollinearity issues. Ridge is used in insurance premium prediction, stock market analysis and much more.

For additional docs, see Scikit-learn’s Ridge Regression.

Examples

import numpy as np
import cudf

# Both import methods supported
from cuml import Ridge
from cuml.linear_model import Ridge

alpha = np.array([1e-5])
ridge = Ridge(alpha = alpha, fit_intercept = True, normalize = False,
              solver = "eig")

X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype = np.float32)
X['col2'] = np.array([1,2,2,3], dtype = np.float32)

y = cudf.Series( np.array([6.0, 8.0, 9.0, 11.0], dtype = np.float32) )

result_ridge = ridge.fit(X, y)
print("Coefficients:")
print(result_ridge.coef_)
print("Intercept:")
print(result_ridge.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3,2], dtype = np.float32)
X_new['col2'] = np.array([5,5], dtype = np.float32)
preds = result_ridge.predict(X_new)

print("Predictions:")
print(preds)

Output:

Coefficients:

            0 1.0000001
            1 1.9999998

Intercept:
            3.0

Predictions:

            0 15.999999
            1 14.999999
Attributes
coef_array, shape (n_features)

The estimated coefficients for the linear regression model.

intercept_array

The independent term. If fit_intercept is False, it will be 0.

Methods

fit(self, X, y[, convert_dtype])

Fit the model with X and y.

get_param_names(self)

predict(self, X[, convert_dtype])

Predicts the y for X.

fit(self, X, y, convert_dtype=False)

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_param_names(self)
predict(self, X, convert_dtype=False)

Predicts the y for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

Lasso Regression

class cuml.Lasso

Lasso extends LinearRegression by providing L1 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can zero some of the coefficients for feature selection and improves the conditioning of the problem.

cuML’s Lasso can take array-like objects, either in host as NumPy arrays or in device (as Numba or __cuda_array_interface__ compliant), in addition to cuDF objects. It uses coordinate descent to fit a linear model.

Parameters
alphafloat (default = 1.0)

Constant that multiplies the L1 term. alpha = 0 is equivalent to an ordinary least squares, solved by the LinearRegression class. For numerical reasons, using alpha = 0 with the Lasso class is not advised; use the LinearRegression class instead.

fit_interceptboolean (default = True)

If True, Lasso tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalizeboolean (default = False)

If True, the predictors in X will be normalized by dividing by their L2 norms. If False, no scaling will be done.

max_iterint

The maximum number of iterations

tolfloat (default = 1e-3)

The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.

selection{‘cyclic’, ‘random’} (default=’cyclic’)

If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.

handlecuml.Handle

If it is None, a new one is created just for this class.

Notes

For additional docs, see scikit-learn's Lasso.

Examples

import numpy as np
import cudf
from cuml.linear_model import Lasso

ls = Lasso(alpha = 0.1)

X = cudf.DataFrame()
X['col1'] = np.array([0, 1, 2], dtype = np.float32)
X['col2'] = np.array([0, 1, 2], dtype = np.float32)

y = cudf.Series( np.array([0.0, 1.0, 2.0], dtype = np.float32) )

result_lasso = ls.fit(X, y)
print("Coefficients:")
print(result_lasso.coef_)
print("intercept:")
print(result_lasso.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3,2], dtype = np.float32)
X_new['col2'] = np.array([5,5], dtype = np.float32)
preds = result_lasso.predict(X_new)

print(preds)

Output:

Coefficients:

            0 0.85
            1 0.0

Intercept:
            0.149999

Preds:

            0 2.7
            1 1.85
Attributes
coef_array, shape (n_features)

The estimated coefficients for the linear regression model.

intercept_array

The independent term. If fit_intercept is False, it will be 0.

Methods

fit(self, X, y[, convert_dtype])

Fit the model with X and y.

get_params(self[, deep])

Scikit-learn style function that returns the estimator parameters.

predict(self, X[, convert_dtype])

Predicts the y for X.

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

fit(self, X, y, convert_dtype=False)

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_params(self, deep=True)

Scikit-learn style function that returns the estimator parameters.

Parameters
deepboolean (default = True)
predict(self, X, convert_dtype=False)

Predicts the y for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

Parameters
paramsdict of new params

ElasticNet Regression

class cuml.ElasticNet

ElasticNet extends LinearRegression with combined L1 and L2 regularizations on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors, force some coefficients to be small, and improve the conditioning of the problem.

cuML's ElasticNet accepts an array-like object or cuDF DataFrame and uses coordinate descent to fit a linear model.

Parameters
alphafloat (default = 1.0)

Constant that multiplies the L1 term. alpha = 0 is equivalent to an ordinary least squares, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised; use the LinearRegression object instead.

l1_ratio: float (default = 0.5)

The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.

fit_interceptboolean (default = True)

If True, ElasticNet tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalizeboolean (default = False)

If True, the predictors in X will be normalized by dividing by their L2 norms. If False, no scaling will be done.

max_iterint (default = 1000)

The maximum number of iterations

tolfloat (default = 1e-3)

The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.

selection{‘cyclic’, ‘random’} (default=’cyclic’)

If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.

handlecuml.Handle

If it is None, a new one is created just for this class.

output_type(optional) {‘input’, ‘cudf’, ‘cupy’, ‘numpy’} default = None

Use it to control output type of the results and attributes. If None it’ll inherit the output type set at the module level, cuml.output_type. If that has not been changed, by default the estimator will mirror the type of the data used for each fit or predict call. If set, the estimator will override the global option for its behavior.

Notes

For additional docs, see scikit-learn's ElasticNet.

Examples

import numpy as np
import cudf
from cuml.linear_model import ElasticNet

enet = ElasticNet(alpha = 0.1, l1_ratio=0.5)

X = cudf.DataFrame()
X['col1'] = np.array([0, 1, 2], dtype = np.float32)
X['col2'] = np.array([0, 1, 2], dtype = np.float32)

y = cudf.Series( np.array([0.0, 1.0, 2.0], dtype = np.float32) )

result_enet = enet.fit(X, y)
print("Coefficients:")
print(result_enet.coef_)
print("intercept:")
print(result_enet.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3,2], dtype = np.float32)
X_new['col2'] = np.array([5,5], dtype = np.float32)
preds = result_enet.predict(X_new)

print(preds)

Output:

Coefficients:

            0 0.448408
            1 0.443341

Intercept:
            0.1082506

Preds:

            0 3.67018
            1 3.22177
Attributes
coef_array, shape (n_features)

The estimated coefficients for the linear regression model.

intercept_array

The independent term. If fit_intercept is False, it will be 0.

Methods

fit(self, X, y[, convert_dtype])

Fit the model with X and y.

get_params(self[, deep])

Scikit-learn style function that returns the estimator parameters.

predict(self, X[, convert_dtype])

Predicts the y for X.

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

fit(self, X, y, convert_dtype=False)

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_params(self, deep=True)

Scikit-learn style function that returns the estimator parameters.

Parameters
deepboolean (default = True)
predict(self, X, convert_dtype=False)

Predicts the y for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

Parameters
paramsdict of new params

Mini Batch SGD Classifier

class cuml.MBSGDClassifier

Linear models (linear SVM, logistic regression, or linear regression) fitted by minimizing a regularized empirical loss with mini-batch SGD.

Parameters
loss{‘hinge’, ‘log’, ‘squared_loss’} (default = ‘squared_loss’)

‘hinge’ uses linear SVM

‘log’ uses logistic regression

‘squared_loss’ uses linear regression

penalty: {‘none’, ‘l1’, ‘l2’, ‘elasticnet’} (default = ‘none’)

‘none’ does not perform any regularization

‘l1’ performs L1 norm (Lasso) which minimizes the sum of the abs value of coefficients

‘l2’ performs L2 norm (Ridge) which minimizes the sum of the square of the coefficients

‘elasticnet’ performs Elastic Net regularization which is a weighted average of L1 and L2 norms

alpha: float (default = 0.0001)

The constant value which decides the degree of regularization

fit_interceptboolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

epochsint (default = 1000)

The number of times the model should iterate through the entire dataset during training (default = 1000)

tolfloat (default = 1e-3)

The training process will stop if current_loss > previous_loss - tol

shuffleboolean (default = True)

If True, the training data is shuffled after each epoch; if False, it is not.

eta0float (default = 0.001)

Initial learning rate

power_tfloat (default = 0.5)

The exponent used for calculating the invscaling learning rate

learning_rate{‘optimal’, ‘constant’, ‘invscaling’, ‘adaptive’} (default = ‘constant’)

optimal option will be supported in a future version

constant keeps the learning rate constant

adaptive changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs. The old learning rate is generally divided by 5

n_iter_no_changeint (default = 5)

The number of epochs to train without any improvement in the model

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’}, optional

Variable to control output type of the results and attributes of the estimators. If None, it’ll inherit the output type set at the module level, cuml.output_type. If set, the estimator will override the global option for its behavior.

Notes

For additional docs, see scikit-learn's SGDClassifier.

Examples

import numpy as np
import cudf
from cuml.linear_model import MBSGDClassifier as cumlMBSGDClassifier
X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype = np.float32)
X['col2'] = np.array([1,2,2,3], dtype = np.float32)
y = cudf.Series(np.array([1, 1, 2, 2], dtype=np.float32))
pred_data = cudf.DataFrame()
pred_data['col1'] = np.asarray([3, 2], dtype=np.float32)
pred_data['col2'] = np.asarray([5, 5], dtype=np.float32)
cu_mbsgd_classifier = cumlMBSGDClassifier(learning_rate='constant',
                                         eta0=0.05, epochs=2000,
                                         fit_intercept=True,
                                         batch_size=1, tol=0.0,
                                         penalty='l2',
                                         loss='squared_loss',
                                         alpha=0.5)
cu_mbsgd_classifier.fit(X, y)
cu_pred = cu_mbsgd_classifier.predict(pred_data).to_array()
print(" cuML intercept : ", cu_mbsgd_classifier.intercept_)
print(" cuML coef : ", cu_mbsgd_classifier.coef_)
print("cuML predictions : ", cu_pred)

Output:

cuML intercept :  0.7150013446807861
cuML coef :  0    0.27320495
            1     0.1875956
            dtype: float32
cuML predictions :  [1. 1.]

Methods

fit(self, X, y[, convert_dtype])

Fit the model with X and y.

get_params(self[, deep])

Scikit-learn style function that returns the estimator parameters.

predict(self, X[, convert_dtype])

Predicts the y for X.

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

fit(self, X, y, convert_dtype=False)

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_params(self, deep=True)

Scikit-learn style function that returns the estimator parameters.

Parameters
deepboolean (default = True)
predict(self, X, convert_dtype=False)

Predicts the y for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
y: Type specified by output_type

Dense vector (floats or doubles) of shape (n_samples, 1)

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

Parameters
paramsdict of new params

Mini Batch SGD Regressor

class cuml.MBSGDRegressor

Linear regression model fitted by minimizing a regularized empirical loss with mini-batch SGD.

Parameters
loss‘squared_loss’ (default = ‘squared_loss’)

‘squared_loss’ uses linear regression

penalty: ‘none’, ‘l1’, ‘l2’, ‘elasticnet’ (default = ‘none’)

‘none’ does not perform any regularization

‘l1’ performs L1 norm (Lasso) which minimizes the sum of the abs value of coefficients

‘l2’ performs L2 norm (Ridge) which minimizes the sum of the square of the coefficients

‘elasticnet’ performs Elastic Net regularization which is a weighted average of L1 and L2 norms

alpha: float (default = 0.0001)

The constant value which decides the degree of regularization

fit_interceptboolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

epochsint (default = 1000)

The number of times the model should iterate through the entire dataset during training (default = 1000)

tolfloat (default = 1e-3)

The training process will stop if current_loss > previous_loss - tol

shuffleboolean (default = True)

If True, the training data is shuffled after each epoch; if False, it is not.

eta0float (default = 0.001)

Initial learning rate

power_tfloat (default = 0.5)

The exponent used for calculating the invscaling learning rate

learning_rate{‘optimal’, ‘constant’, ‘invscaling’, ‘adaptive’} (default = ‘constant’)

optimal option will be supported in a future version

constant keeps the learning rate constant

adaptive changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs. The old learning rate is generally divided by 5

n_iter_no_changeint (default = 5)

The number of epochs to train without any improvement in the model

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’}, optional

Variable to control output type of the results and attributes of the estimators. If None, it’ll inherit the output type set at the module level, cuml.output_type. If set, the estimator will override the global option for its behavior.

Notes

For additional docs, see scikit-learn's SGDRegressor.

Examples

import numpy as np
import cudf
from cuml.linear_model import MBSGDRegressor as cumlMBSGDRegressor
X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype = np.float32)
X['col2'] = np.array([1,2,2,3], dtype = np.float32)
y = cudf.Series(np.array([1, 1, 2, 2], dtype=np.float32))
pred_data = cudf.DataFrame()
pred_data['col1'] = np.asarray([3, 2], dtype=np.float32)
pred_data['col2'] = np.asarray([5, 5], dtype=np.float32)
cu_mbsgd_regressor = cumlMBSGDRegressor(learning_rate='constant',
                                        eta0=0.05, epochs=2000,
                                        fit_intercept=True,
                                        batch_size=1, tol=0.0,
                                        penalty='l2',
                                        loss='squared_loss',
                                        alpha=0.5)
cu_mbsgd_regressor.fit(X, y)
cu_pred = cu_mbsgd_regressor.predict(pred_data).to_array()
print(" cuML intercept : ", cu_mbsgd_regressor.intercept_)
print(" cuML coef : ", cu_mbsgd_regressor.coef_)
print("cuML predictions : ", cu_pred)

Output:

cuML intercept :  0.7150013446807861
cuML coef :  0    0.27320495
            1     0.1875956
            dtype: float32
cuML predictions :  [2.4725943 2.1993892]

Methods

fit(self, X, y[, convert_dtype])

Fit the model with X and y.

get_params(self[, deep])

Scikit-learn style function that returns the estimator parameters.

predict(self, X[, convert_dtype])

Predicts the y for X.

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

fit(self, X, y, convert_dtype=False)

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_params(self, deep=True)

Scikit-learn style function that returns the estimator parameters.

Parameters
deepboolean (default = True)
predict(self, X, convert_dtype=False)

Predicts the y for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
y: Type specified by output_type

Dense vector (floats or doubles) of shape (n_samples, 1)

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

Parameters
paramsdict of new params

Stochastic Gradient Descent

class cuml.SGD

Stochastic Gradient Descent is a very common machine learning algorithm where one optimizes some cost function via gradient steps. This makes SGD very attractive for large problems when the exact solution is hard or even impossible to find.

cuML’s SGD algorithm accepts a numpy matrix or a cuDF DataFrame as the input dataset. The SGD algorithm currently works with linear regression, ridge regression and SVM models.

Parameters
loss‘hinge’, ‘log’, ‘squared_loss’ (default = ‘squared_loss’)

‘hinge’ uses linear SVM

‘log’ uses logistic regression

‘squared_loss’ uses linear regression

penalty: ‘none’, ‘l1’, ‘l2’, ‘elasticnet’ (default = ‘none’)

‘none’ does not perform any regularization

‘l1’ performs L1 norm (Lasso) which minimizes the sum of the abs value of coefficients

‘l2’ performs L2 norm (Ridge) which minimizes the sum of the square of the coefficients

‘elasticnet’ performs Elastic Net regularization which is a weighted average of L1 and L2 norms

alpha: float (default = 0.0001)

The constant value which decides the degree of regularization

fit_interceptboolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

epochsint (default = 1000)

The number of times the model should iterate through the entire dataset during training (default = 1000)

tolfloat (default = 1e-3)

The training process will stop if current_loss > previous_loss - tol

shuffleboolean (default = True)

If True, the training data is shuffled after each epoch; if False, it is not.

eta0float (default = 0.001)

Initial learning rate

power_tfloat (default = 0.5)

The exponent used for calculating the invscaling learning rate

learning_rate{‘optimal’, ‘constant’, ‘invscaling’, ‘adaptive’} (default = ‘constant’)

‘optimal’ option will be supported in a future version

‘constant’ keeps the learning rate constant

‘adaptive’ changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs. The old learning rate is generally divided by 5

n_iter_no_changeint (default = 5)

The number of epochs to train without any improvement in the model

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’}, optional

Variable to control output type of the results and attributes of the estimators. If None, it’ll inherit the output type set at the module level, cuml.output_type. If set, the estimator will override the global option for its behavior.

Examples

import numpy as np
import cudf
from cuml.solvers import SGD as cumlSGD
X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype = np.float32)
X['col2'] = np.array([1,2,2,3], dtype = np.float32)
y = cudf.Series(np.array([1, 1, 2, 2], dtype=np.float32))
pred_data = cudf.DataFrame()
pred_data['col1'] = np.asarray([3, 2], dtype=np.float32)
pred_data['col2'] = np.asarray([5, 5], dtype=np.float32)
cu_sgd = cumlSGD(learning_rate='constant', eta0=0.005, epochs=2000,
                fit_intercept=True, batch_size=2,
                tol=0.0, penalty='none', loss='squared_loss')
cu_sgd.fit(X, y)
cu_pred = cu_sgd.predict(pred_data).to_array()
print(" cuML intercept : ", cu_sgd.intercept_)
print(" cuML coef : ", cu_sgd.coef_)
print("cuML predictions : ", cu_pred)

Output:

cuML intercept :  0.004561662673950195
cuML coef :  0      0.9834546
            1    0.010128272
           dtype: float32
cuML predictions :  [3.0055666 2.0221121]

Methods

fit(self, X, y[, convert_dtype])

Fit the model with X and y.

predict(self, X[, convert_dtype])

Predicts the y for X.

predictClass(self, X[, convert_dtype])

Predicts the y for X.

fit(self, X, y, convert_dtype=False)

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

predict(self, X, convert_dtype=False)

Predicts the y for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
y: Type specified in output_type

Dense vector (floats or doubles) of shape (n_samples, 1)

predictClass(self, X, convert_dtype=False)

Predicts the y for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the predictClass method will automatically convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
yType specified in output_type

Dense vector (floats or doubles) of shape (n_samples, 1)

Random Forest

class cuml.ensemble.RandomForestClassifier

Implements a Random Forest classifier model which fits multiple decision tree classifiers in an ensemble.

Note that the underlying algorithm for tree node splits differs from that used in scikit-learn. By default, the cuML Random Forest uses a histogram-based algorithm to determine splits, rather than an exact count. You can tune the size of the histograms with the n_bins parameter.

Known Limitations: This is an early release of the cuML Random Forest code. It contains a few known limitations:

  • GPU-based inference is only supported if the model was trained with 32-bit (float32) datatypes. CPU-based inference may be used in this case as a slower fallback.

  • Very deep / very wide models may exhaust available GPU memory. Future versions of cuML will provide an alternative algorithm to reduce memory consumption.

Parameters
n_estimatorsint (default = 100)

Number of trees in the forest. (Default changed to 100 in cuML 0.11)

handlecuml.Handle

If it is None, a new one is created just for this class.

split_criterionThe criterion used to split nodes.

0 for GINI, 1 for ENTROPY. 2 and 3 are not valid for classification (default = 0)

split_algoint (default = 1)

The algorithm used to determine how nodes are split in the tree. 0 for HIST and 1 for GLOBAL_QUANTILE. HIST currently uses a slower tree-building algorithm, so GLOBAL_QUANTILE is recommended for most cases.

bootstrapboolean (default = True)

Control bootstrapping. If True, each tree in the forest is built on a bootstrapped sample with replacement. If False, sampling without replacement is done.

bootstrap_featuresboolean (default = False)

Controls bootstrapping for features: whether features are drawn with or without replacement.

rows_samplefloat (default = 1.0)

Ratio of dataset rows used while fitting each tree.

max_depthint (default = 16)

Maximum tree depth. A value of -1 denotes unlimited depth (i.e., until leaves are pure), but unlimited depth is not currently supported. Note that this default differs from scikit-learn's random forest, which defaults to unlimited depth.

max_leavesint (default = -1)

Maximum leaf nodes per tree. Soft constraint. Unlimited, if -1.

max_featuresint, float, or string (default = ‘auto’)

Ratio of the number of features (columns) to consider per node split. If int, the ratio is computed as max_features/n_features. If float, max_features is used directly as the fraction. If ‘auto’ then max_features=1/sqrt(n_features). If ‘sqrt’ then max_features=1/sqrt(n_features). If ‘log2’ then max_features=log2(n_features)/n_features.

n_binsint (default = 8)

Number of bins used by the split algorithm.

min_rows_per_nodeint or float (default = 2)

The minimum number of samples (rows) needed to split a node. If int, it is the number of sample rows. If float, then min_rows_per_node*n_rows is used.

min_impurity_decreasefloat (default = 0.0)

Minimum decrease in impurity required for a node to be split.

quantile_per_treeboolean (default = False)

Whether quantiles are computed for individual trees in the RF. Only relevant for GLOBAL_QUANTILE split_algo.

seedint (default = None)

Seed for the random number generator. Unseeded by default.

Examples

import numpy as np
from cuml.ensemble import RandomForestClassifier as cuRFC

X = np.random.normal(size=(10,4)).astype(np.float32)
y = np.asarray([0,1]*5, dtype=np.int32)

cuml_model = cuRFC(max_features=1.0,
                   n_bins=8,
                   n_estimators=40)
cuml_model.fit(X,y)
cuml_predict = cuml_model.predict(X)

print("Predicted labels : ", cuml_predict)

Output:

Predicted labels :  [0 1 0 1 0 1 0 1 0 1]

Methods

convert_to_fil_model(self[, output_class, …])

Create a Forest Inference (FIL) model from the trained cuML Random Forest model.

convert_to_treelite_model(self)

Converts the cuML RF model to a Treelite model

fit(self, X, y[, convert_dtype])

Perform Random Forest Classification on the input data

get_params(self[, deep])

Returns the value of all parameters required to configure this estimator as a dictionary.

predict(self, X[, predict_model, …])

Predicts the labels for X.

predict_proba(self, X[, output_class, …])

Predicts class probabilities for X.

print_detailed(self)

Prints the detailed information about the forest used to train and test the Random Forest model

print_summary(self)

Prints the summary of the forest used to train and test the model

score(self, X, y[, threshold, algo, …])

Calculates the accuracy metric score of the model for X.

set_params(self, **params)

Sets the value of parameters required to configure this estimator; it functions similarly to sklearn's set_params.

convert_to_fil_model(self, output_class=True, threshold=0.5, algo='auto', fil_sparse_format='auto')

Create a Forest Inference (FIL) model from the trained cuML Random Forest model.

Parameters
output_classboolean (default = True)

This is optional and required only while performing the predict operation on the GPU. If true, return a 1 or 0 depending on whether the raw prediction exceeds the threshold. If False, just return the raw prediction.

algostring (default = ‘auto’)

This is optional and required only while performing the predict operation on the GPU.

‘naive’ - simple inference using shared memory

‘tree_reorg’ - similar to naive but trees rearranged to be more coalescing-friendly

‘batch_tree_reorg’ - similar to tree_reorg but predicting multiple rows per thread block

‘auto’ - choose the algorithm automatically. Currently ‘batch_tree_reorg’ is used for dense storage and ‘naive’ for sparse storage

thresholdfloat (default = 0.5)

Threshold used for classification. Optional and required only while performing the predict operation on the GPU. It is applied if output_class == True, else it is ignored

fil_sparse_formatboolean or string (default = auto)

This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.

‘auto’ - choose the storage type automatically (currently True is chosen by auto)

False - create a dense forest

True - create a sparse forest, requires algo=’naive’ or algo=’auto’

Returns
fil_model :

A Forest Inference model which can be used to perform inferencing on the random forest model.
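Continuing the RandomForestClassifier example above, a minimal sketch (per the limitations listed earlier, FIL inference assumes the forest was trained on float32 data):

# cuml_model and X are taken from the RandomForestClassifier example above
fil_model = cuml_model.convert_to_fil_model(output_class=True)
fil_preds = fil_model.predict(X)  # thresholded 0/1 labels since output_class=True
print(fil_preds)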

convert_to_treelite_model(self)

Converts the cuML RF model to a Treelite model

Returns
tl_to_fil_modelTreelite version of this model
fit(self, X, y, convert_dtype=False)

Perform Random Forest Classification on the input data

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, 1)

Dense vector (int32) of shape (n_samples, 1). Acceptable formats: NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy. These labels should be contiguous integers from 0 to n_classes - 1.

convert_dtypebool, optional (default = False)

When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_params(self, deep=True)

Returns the value of all parameters required to configure this estimator as a dictionary.

Parameters
deepboolean (default = True)
predict(self, X, predict_model='GPU', output_class=True, threshold=0.5, algo='auto', num_classes=2, convert_dtype=True, fil_sparse_format='auto')

Predicts the labels for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

predict_modelString (default = ‘GPU’)

‘GPU’ to predict using the GPU, ‘CPU’ otherwise. The ‘GPU’ can only be used if the model was trained on float32 data and X is float32 or convert_dtype is set to True. Also the ‘GPU’ should only be used for binary classification problems.

output_classboolean (default = True)

Optional; only used when performing prediction on the GPU. If True, return a 1 or 0 depending on whether the raw prediction exceeds the threshold. If False, just return the raw prediction.

algostring (default = ‘auto’)

Optional; only used when performing prediction on the GPU.

‘naive’ - simple inference using shared memory
‘tree_reorg’ - similar to naive but trees rearranged to be more coalescing-friendly
‘batch_tree_reorg’ - similar to tree_reorg but predicting multiple rows per thread block
‘auto’ - choose the algorithm automatically. Currently ‘batch_tree_reorg’ is used for dense storage and ‘naive’ for sparse storage

thresholdfloat (default = 0.5)

Threshold used for classification. Optional; only used when performing prediction on the GPU. It is applied if output_class == True, otherwise it is ignored.

num_classesint (default = 2)

number of different classes present in the dataset

convert_dtypebool, optional (default = True)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

fil_sparse_formatboolean or string (default = auto)

This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.

‘auto’ - choose the storage type automatically (currently True is chosen by auto)
False - create a dense forest
True - create a sparse forest; requires algo=’naive’ or algo=’auto’

Returns
yNumPy

Dense vector (int) of shape (n_samples, 1)

predict_proba(self, X, output_class=True, threshold=0.5, algo='auto', num_classes=2, convert_dtype=True, fil_sparse_format='auto')

Predicts class probabilities for X. This function uses the GPU implementation of predict. Therefore, data with ‘dtype = np.float32’ and ‘num_classes = 2’ should be used while using this function. The option to use predict_proba for multi_class classification is not currently implemented. Please check cuml issue #1679 for more information.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

output_class: boolean (default = True)

Optional; only used when performing prediction on the GPU. If True, return a 1 or 0 depending on whether the raw prediction exceeds the threshold. If False, just return the raw prediction.

algostring (default = ‘auto’)

Optional; only used when performing prediction on the GPU.

‘naive’ - simple inference using shared memory
‘tree_reorg’ - similar to naive but trees rearranged to be more coalescing-friendly
‘batch_tree_reorg’ - similar to tree_reorg but predicting multiple rows per thread block
‘auto’ - choose the algorithm automatically. Currently ‘batch_tree_reorg’ is used for dense storage and ‘naive’ for sparse storage

thresholdfloat (default = 0.5)

Threshold used for classification. Optional; only used when performing prediction on the GPU. It is applied if output_class == True, otherwise it is ignored.

num_classesint (default = 2)

number of different classes present in the dataset

convert_dtypebool, optional (default = True)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

fil_sparse_formatboolean or string (default = auto)

This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.

‘auto’ - choose the storage type automatically (currently True is chosen by auto)
False - create a dense forest
True - create a sparse forest; requires algo=’naive’ or algo=’auto’

Returns
y(same as the input datatype)

Dense vector (float) of shape (n_samples, 1). The datatype of y depends on the value of the ‘output_type’ variable specified by the user while initializing the model.
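
A short sketch for the binary case (illustrative data only):

import numpy as np
from cuml.ensemble import RandomForestClassifier as cuRFC

X = np.random.random((100, 4)).astype(np.float32)
y = np.random.randint(0, 2, 100).astype(np.int32)

clf = cuRFC(n_estimators=40)
clf.fit(X, y)

probs = clf.predict_proba(X)  # per-class probabilities (binary case)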

print_detailed(self)

Prints the detailed information about the forest used to train and test the Random Forest model

print_summary(self)

Prints the summary of the forest used to train and test the model

score(self, X, y, threshold=0.5, algo='auto', num_classes=2, predict_model='GPU', convert_dtype=True, fil_sparse_format='auto')

Calculates the accuracy metric score of the model for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yNumPy

Dense vector (int) of shape (n_samples, 1)

algostring (default = ‘auto’)

Optional; only used when performing prediction on the GPU.

‘naive’ - simple inference using shared memory
‘tree_reorg’ - similar to naive but trees rearranged to be more coalescing-friendly
‘batch_tree_reorg’ - similar to tree_reorg but predicting multiple rows per thread block
‘auto’ - choose the algorithm automatically. Currently ‘batch_tree_reorg’ is used for dense storage and ‘naive’ for sparse storage

thresholdfloat

Threshold used for classification. Optional; only used when performing prediction on the GPU.

num_classesinteger

number of different classes present in the dataset

convert_dtypeboolean, default=True

Whether to convert the input data to the correct dtype automatically.

predict_modelString (default = ‘GPU’)

‘GPU’ to predict using the GPU, ‘CPU’ otherwise. The ‘GPU’ can only be used if the model was trained on float32 data and X is float32 or convert_dtype is set to True. Also the ‘GPU’ should only be used for binary classification problems.

fil_sparse_formatboolean or string (default = auto)

This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.

‘auto’ - choose the storage type automatically (currently True is chosen by auto)
False - create a dense forest
True - create a sparse forest; requires algo=’naive’ or algo=’auto’

Returns
accuracyfloat

Accuracy of the model [0.0 - 1.0]

set_params(self, **params)

Sets the value of parameters required to configure this estimator; it functions similarly to scikit-learn’s set_params.

Parameters
paramsdict of new params
class cuml.ensemble.RandomForestRegressor

Implements a Random Forest regressor model which fits multiple decision trees in an ensemble. Note that the underlying algorithm for tree node splits differs from that used in scikit-learn. By default, the cuML Random Forest uses a histogram-based algorithm to determine splits, rather than an exact count. You can tune the size of the histograms with the n_bins parameter.

Known Limitations: This is an early release of the cuML Random Forest code. It contains a few known limitations:

  • GPU-based inference is only supported if the model was trained with 32-bit (float32) datatypes. CPU-based inference may be used in this case as a slower fallback.

  • Very deep / very wide models may exhaust available GPU memory. Future versions of cuML will provide an alternative algorithm to reduce memory consumption.

Parameters
n_estimatorsint (default = 100)

Number of trees in the forest. (Default changed to 100 in cuML 0.11)

handlecuml.Handle

If it is None, a new one is created just for this class.

split_algoint (default = 1)

The algorithm to determine how nodes are split in the tree. 0 for HIST and 1 for GLOBAL_QUANTILE. HIST currently uses a slower tree-building algorithm, so GLOBAL_QUANTILE is recommended for most cases.

split_criterionint (default = 2)

The criterion used to split nodes. 0 for GINI, 1 for ENTROPY, 2 for MSE, or 3 for MAE. 0 and 1 are not valid for regression.

bootstrapboolean (default = True)
Control bootstrapping.

If True, each tree in the forest is built on a bootstrapped sample with replacement. If False, sampling without replacement is done.

bootstrap_featuresboolean (default = False)

Controls bootstrapping for features: whether features are drawn with or without replacement.

rows_samplefloat (default = 1.0)

Ratio of dataset rows used while fitting each tree.

max_depthint (default = 16)

Maximum tree depth. Unlimited (i.e., until leaves are pure) if -1. Unlimited depth is not supported with split_algo=1. Note that this default differs from scikit-learn’s random forest, which defaults to unlimited depth.

max_leavesint (default = -1)

Maximum leaf nodes per tree. Soft constraint. Unlimited, if -1.

max_featuresint, float, or string (default = ‘auto’)

Ratio of number of features (columns) to consider per node split. If int then max_features/n_features. If float then max_features is used as a fraction. If ‘auto’ then max_features=1.0. If ‘sqrt’ then max_features=1/sqrt(n_features). If ‘log2’ then max_features=log2(n_features)/n_features.

n_binsint (default = 8)

Number of bins used by the split algorithm.

min_rows_per_nodeint or float (default = 2)

The minimum number of samples (rows) needed to split a node. If int, it is the minimum number of sample rows. If float, then min_rows_per_node*n_rows is the minimum number of rows.

min_impurity_decreasefloat (default = 0.0)

The minimum decrease in impurity required for node to be split

accuracy_metricstring (default = ‘mse’)

Decides the metric used to evaluate the performance of the model: ‘median_ae’ for median of absolute errors, ‘mean_ae’ for mean of absolute errors, ‘mse’ for mean squared error.

quantile_per_treeboolean (default = False)

Whether quantile is computed for individual trees in the RF. Only relevant for GLOBAL_QUANTILE split_algo.

seedint (default = None)

Seed for the random number generator. Unseeded by default. Does not currently fully guarantee the exact same results.

Examples

import numpy as np
from cuml.test.utils import get_handle
from cuml.ensemble import RandomForestRegressor as curfc
X = np.asarray([[0,10],[0,20],[0,30],[0,40]], dtype=np.float32)
y = np.asarray([0.0,1.0,2.0,3.0], dtype=np.float32)
cuml_model = curfc(max_features=1.0, n_bins=8,
                   split_algo=0, min_rows_per_node=2,
                   n_estimators=40, accuracy_metric='mse')
cuml_model.fit(X,y)
cuml_score = cuml_model.score(X,y)
print("MSE score of cuml : ", cuml_score)

Output:

MSE score of cuml :  0.1123437201231765

Methods

convert_to_fil_model(self[, output_class, …])

Create a Forest Inference (FIL) model from the trained cuML Random Forest model.

convert_to_treelite_model(self)

Converts the cuML RF model to a Treelite model

fit(self, X, y[, convert_dtype])

Perform Random Forest Regression on the input data

get_params(self[, deep])

Returns the value of all parameters required to configure this estimator as a dictionary.

predict(self, X[, predict_model, algo, …])

Predicts the labels for X.

print_detailed(self)

Prints the detailed information about the forest used to train and test the Random Forest model

print_summary(self)

Prints the summary of the forest used to train and test the model

score(self, X, y[, algo, convert_dtype, …])

Calculates the accuracy metric score of the model for X.

set_params(self, **params)

Sets the value of parameters required to configure this estimator; it functions similarly to scikit-learn’s set_params.

convert_to_fil_model(self, output_class=False, algo='auto', fil_sparse_format='auto')

Create a Forest Inference (FIL) model from the trained cuML Random Forest model.

Parameters
output_classboolean (default = False)

Optional; only used when performing prediction on the GPU. If True, return a 1 or 0 depending on whether the raw prediction exceeds the threshold. If False, just return the raw prediction.

algostring (default = ‘auto’)

Optional; only used when performing prediction on the GPU.

‘naive’ - simple inference using shared memory
‘tree_reorg’ - similar to naive but trees rearranged to be more coalescing-friendly
‘batch_tree_reorg’ - similar to tree_reorg but predicting multiple rows per thread block
‘auto’ - choose the algorithm automatically. Currently ‘batch_tree_reorg’ is used for dense storage and ‘naive’ for sparse storage

fil_sparse_formatboolean or string (default = auto)

This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.

‘auto’ - choose the storage type automatically (currently True is chosen by auto)
False - create a dense forest
True - create a sparse forest; requires algo=’naive’ or algo=’auto’

Returns
fil_model :

A Forest Inference model which can be used to perform inferencing on the random forest model.

convert_to_treelite_model(self)

Converts the cuML RF model to a Treelite model

Returns
tl_to_fil_modelTreelite version of this model
fit(self, X, y, convert_dtype=False)

Perform Random Forest Regression on the input data

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy.

get_params(self, deep=True)

Returns the value of all parameters required to configure this estimator as a dictionary.

Parameters
deepboolean (default = True)
predict(self, X, predict_model='GPU', algo='auto', convert_dtype=True, fil_sparse_format='auto')

Predicts the labels for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

predict_modelString (default = ‘GPU’)

‘GPU’ to predict using the GPU, ‘CPU’ otherwise. The GPU can only be used if the model was trained on float32 data and X is float32 or convert_dtype is set to True.

algostring (default = ‘auto’)

Optional; only used when performing prediction on the GPU.

‘naive’ - simple inference using shared memory
‘tree_reorg’ - similar to naive but trees rearranged to be more coalescing-friendly
‘batch_tree_reorg’ - similar to tree_reorg but predicting multiple rows per thread block
‘auto’ - choose the algorithm automatically. Currently ‘batch_tree_reorg’ is used for dense storage and ‘naive’ for sparse storage

convert_dtypebool, optional (default = True)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

fil_sparse_formatboolean or string (default = auto)

This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.

‘auto’ - choose the storage type automatically (currently True is chosen by auto)
False - create a dense forest
True - create a sparse forest; requires algo=’naive’ or algo=’auto’

Returns
yNumPy

Dense vector (float) of shape (n_samples, 1)
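
A brief sketch contrasting the GPU and CPU prediction paths (illustrative data only):

import numpy as np
from cuml.ensemble import RandomForestRegressor as curfr

X = np.random.random((50, 4)).astype(np.float32)
y = np.random.random(50).astype(np.float32)

reg = curfr(n_estimators=20)
reg.fit(X, y)

preds_gpu = reg.predict(X)                       # default: predict_model='GPU' (FIL)
preds_cpu = reg.predict(X, predict_model='CPU')  # slower fallback, no float32 requirement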

print_detailed(self)

Prints the detailed information about the forest used to train and test the Random Forest model

print_summary(self)

Prints the summary of the forest used to train and test the model

score(self, X, y, algo='auto', convert_dtype=True, fil_sparse_format='auto', predict_model='GPU')

Calculates the accuracy metric score of the model for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yNumPy

Dense vector (floats or doubles) of shape (n_samples, 1)

algostring (default = ‘auto’)

Optional; only used when performing prediction on the GPU.

‘naive’ - simple inference using shared memory
‘tree_reorg’ - similar to naive but trees rearranged to be more coalescing-friendly
‘batch_tree_reorg’ - similar to tree_reorg but predicting multiple rows per thread block
‘auto’ - choose the algorithm automatically. Currently ‘batch_tree_reorg’ is used for dense storage and ‘naive’ for sparse storage

convert_dtypeboolean, default=True

Whether to convert the input data to the correct dtype automatically.

predict_modelString (default = ‘GPU’)

‘GPU’ to predict using the GPU, ‘CPU’ otherwise. The GPU can only be used if the model was trained on float32 data and X is float32 or convert_dtype is set to True.

fil_sparse_formatboolean or string (default = auto)

This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.

‘auto’ - choose the storage type automatically (currently True is chosen by auto)
False - create a dense forest
True - create a sparse forest; requires algo=’naive’ or algo=’auto’

Returns
mean_square_error, median_abs_error or mean_abs_errorfloat
set_params(self, **params)

Sets the value of parameters required to configure this estimator; it functions similarly to scikit-learn’s set_params.

Parameters
paramsdict of new params

Forest Inferencing

class cuml.ForestInference

ForestInference provides GPU-accelerated inference (prediction) for random forest and boosted decision tree models.

This module does not support training models. Rather, users should train a model in another package and save it in a treelite-compatible format. (See https://github.com/dmlc/treelite) Currently, LightGBM, XGBoost and SKLearn GBDT and random forest models are supported.

Users typically create a ForestInference object by loading a saved model file with ForestInference.load. It is also possible to create it from an SKLearn model using ForestInference.load_from_sklearn. The resulting object provides a predict method for carrying out inference.

Known limitations:
  • A single row of data should fit into the shared memory of a thread block, which means that more than 12288 features are not supported.

  • From sklearn.ensemble, only {RandomForest,GradientBoosting}{Classifier,Regressor} models are supported; other sklearn.ensemble models are currently not supported.

  • Importing large SKLearn models can be slow, as it is done in Python.

  • LightGBM categorical features are not supported.

  • Inference uses a dense matrix format, which is efficient for many problems but can be suboptimal for sparse datasets.

  • Only binary classification and regression are supported.

Parameters
handlecuml.Handle

If it is None, a new one is created just for this class.

Notes

For additional usage examples, see the sample notebook at https://github.com/rapidsai/cuml/blob/branch-0.14/notebooks/forest_inference_demo.ipynb

Examples

In the example below, synthetic data is copied to the device (GPU) before inference. ForestInference can also accept a NumPy array directly, at the cost of a slight performance overhead.

# Assume that the file 'xgb.model' contains a classifier model that was
# previously saved by XGBoost's save_model function.

import sklearn, sklearn.datasets, sklearn.metrics, numpy as np
from numba import cuda
from cuml import ForestInference

model_path = 'xgb.model'
X_test, y_test = sklearn.datasets.make_classification()
X_gpu = cuda.to_device(np.ascontiguousarray(X_test.astype(np.float32)))
fm = ForestInference.load(model_path, output_class=True)
fil_preds_gpu = fm.predict(X_gpu)
accuracy_score = sklearn.metrics.accuracy_score(y_test,
               np.asarray(fil_preds_gpu))

Methods

load(filename[, output_class, threshold, …])

Returns a FIL instance containing the forest saved in ‘filename’. This uses Treelite to load the saved model.

load_from_sklearn(skl_model[, output_class, …])

Creates a FIL model using the scikit-learn model passed to the function.

load_from_treelite_model(self, model[, …])

Creates a FIL model using the treelite model passed to the function.

load_using_treelite_handle(self, model_handle)

Returns a FIL instance created by converting a treelite model to a FIL model using the treelite ModelHandle passed.

predict(self, X[, preds])

Predicts the labels for X with the loaded forest model.

predict_proba(self, X[, preds])

Predicts the class probabilities for X with the loaded forest model.

static load(filename, output_class=False, threshold=0.5, algo='auto', storage_type='auto', model_type='xgboost', handle=None)

Returns a FIL instance containing the forest saved in ‘filename’. This uses Treelite to load the saved model.

Parameters
filenamestring

Path to saved model file in a treelite-compatible format (see https://treelite.readthedocs.io/en/latest/treelite-api.html for more information)

output_classbool (default=False)

If True, return a 1 or 0 depending on whether the raw prediction exceeds the threshold. If False, just return the raw prediction.

thresholdfloat (default=0.5)

Cutoff value above which a prediction is set to 1.0. Only used if the model is a classifier and output_class is True.

algostring (default=’auto’)

Which inference algorithm to use. See documentation in FIL.load_from_treelite_model

storage_typestring (default=’auto’)

In-memory storage format to be used for the FIL model. See documentation in FIL.load_from_treelite_model

model_typestring (default=”xgboost”)

Format of the saved treelite model to be loaded. It can be ‘xgboost’, ‘lightgbm’, or ‘protobuf’.

Returns
fil_model :

A Forest Inference model which can be used to perform inferencing on the model read from the file.

static load_from_sklearn(skl_model, output_class=False, threshold=0.5, algo='auto', storage_type='auto', handle=None)

Creates a FIL model using the scikit-learn model passed to the function. This function requires Treelite 0.90 to be installed.

Parameters
skl_modelThe scikit-learn model from which to build the FIL version.
output_class: boolean (default=False)

If True, return a 1 or 0 depending on whether the raw prediction exceeds the threshold. If False, just return the raw prediction.

algostring (default=’auto’)

Name of the algorithm (from the algo_t enum):

‘AUTO’ or ‘auto’ - choose the algorithm automatically; currently ‘BATCH_TREE_REORG’ is used for dense storage, and ‘NAIVE’ for sparse storage
‘NAIVE’ or ‘naive’ - simple inference using shared memory
‘TREE_REORG’ or ‘tree_reorg’ - similar to naive but trees rearranged to be more coalescing-friendly
‘BATCH_TREE_REORG’ or ‘batch_tree_reorg’ - similar to TREE_REORG but predicting multiple rows per thread block

thresholdfloat (default=0.5)

Threshold used for classification. It is applied only if output_class == True, otherwise it is ignored.

storage_typestring (default=’auto’)

In-memory storage format to be used for the FIL model:

‘AUTO’ or ‘auto’ - choose the storage type automatically (currently DENSE is always used)
‘DENSE’ or ‘dense’ - create a dense forest
‘SPARSE’ or ‘sparse’ - create a sparse forest; requires algo=’NAIVE’ or algo=’AUTO’

Returns
fil_model :

A Forest Inference model created from the scikit-learn model passed.
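
A hedged sketch of this path (the dataset and model below are illustrative):

import numpy as np
import sklearn.datasets
from sklearn.ensemble import GradientBoostingClassifier
from cuml import ForestInference

X, y = sklearn.datasets.make_classification(n_features=10)
X = np.ascontiguousarray(X.astype(np.float32))

skl_model = GradientBoostingClassifier()
skl_model.fit(X, y)

# Import the scikit-learn ensemble into FIL and predict on the GPU
fm = ForestInference.load_from_sklearn(skl_model, output_class=True)
fil_preds = fm.predict(X)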

load_from_treelite_model(self, model, output_class=False, algo='auto', threshold=0.5, storage_type='auto')

Creates a FIL model using the treelite model passed to the function.

Parameters
modelthe trained model information in the treelite format

loaded from a saved model using the treelite API (https://treelite.readthedocs.io/en/latest/treelite-api.html)

output_class: boolean (default=False)

If True, return a 1 or 0 depending on whether the raw prediction exceeds the threshold. If False, just return the raw prediction.

algostring (default=’auto’)

Name of the algorithm (from the algo_t enum):

‘AUTO’ or ‘auto’ - choose the algorithm automatically; currently ‘BATCH_TREE_REORG’ is used for dense storage, and ‘NAIVE’ for sparse storage
‘NAIVE’ or ‘naive’ - simple inference using shared memory
‘TREE_REORG’ or ‘tree_reorg’ - similar to naive but trees rearranged to be more coalescing-friendly
‘BATCH_TREE_REORG’ or ‘batch_tree_reorg’ - similar to TREE_REORG but predicting multiple rows per thread block

thresholdfloat (default=0.5)

Threshold used for classification. It is applied only if output_class == True, otherwise it is ignored.

storage_typestring (default=’auto’)

In-memory storage format to be used for the FIL model:

‘AUTO’ or ‘auto’ - choose the storage type automatically (currently DENSE is always used)
‘DENSE’ or ‘dense’ - create a dense forest
‘SPARSE’ or ‘sparse’ - create a sparse forest; requires algo=’NAIVE’ or algo=’AUTO’

Returns
fil_model :

A Forest Inference model which can be used to perform inferencing on the random forest/ XGBoost model.

load_using_treelite_handle(self, model_handle, output_class=False, algo='auto', storage_type='auto', threshold=0.5)

Returns a FIL instance created by converting a treelite model to a FIL model using the treelite ModelHandle passed.

Parameters
model_handleModelhandle to the treelite forest model

(See https://treelite.readthedocs.io/en/latest/treelite-api.html for more information)

output_classbool (default=False)

If True, return a 1 or 0 depending on whether the raw prediction exceeds the threshold. If False, just return the raw prediction.

thresholdfloat (default=0.5)

Cutoff value above which a prediction is set to 1.0. Only used if the model is a classifier and output_class is True.

algostring (default=’auto’)

Which inference algorithm to use. See documentation in FIL.load_from_treelite_model

storage_typestring (default=’auto’)

In-memory storage format to be used for the FIL model. See documentation in FIL.load_from_treelite_model

Returns
fil_model :

A Forest Inference model which can be used to perform inferencing on the random forest model.

predict(self, X, preds=None)

Predicts the labels for X with the loaded forest model. By default, the result is the raw floating point output from the model, unless output_class was set to True during model loading.

See the documentation of ForestInference.load for details.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy. For optimal performance, pass a device array with C-style layout.

preds: gpuarray or cudf.Series, shape = (n_samples,)

Optional ‘out’ location to store inference results

Returns
GPU array of length n_samples with inference results
(or ‘preds’ filled with inference results if preds was specified)
predict_proba(self, X, preds=None)

Predicts the class probabilities for X with the loaded forest model. The result is the raw floating point output from the model.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy. For optimal performance, pass a device array with C-style layout.

preds: gpuarray or cudf.Series, shape = (n_samples,2)

Optional ‘out’ location to store inference results (binary probability output).

Returns
GPU array of shape (n_samples,2) with inference results
(or ‘preds’ filled with inference results if preds was specified)
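
A short sketch mirroring the class-level example above (the ‘xgb.model’ file is the same hypothetical saved classifier):

import numpy as np
import sklearn.datasets
from cuml import ForestInference

X_test, y_test = sklearn.datasets.make_classification()
X = np.ascontiguousarray(X_test.astype(np.float32))

fm = ForestInference.load('xgb.model', output_class=True)
probs = fm.predict_proba(X)  # shape (n_samples, 2): probabilities for class 0 and class 1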

Coordinate Descent

class cuml.CD

Coordinate Descent (CD) is a very common optimization algorithm that minimizes along coordinate directions to find the minimum of a function.

cuML’s CD algorithm accepts a NumPy matrix or a cuDF DataFrame as the input dataset. The CD algorithm currently works with linear regression and ridge, lasso, and elastic-net penalties.

Parameters
loss‘squared_loss’ (Only ‘squared_loss’ is supported right now)

‘squared_loss’ uses linear regression

alpha: float (default = 0.0001)

The constant value which decides the degree of regularization. ‘alpha = 0’ is equivalent to ordinary least squares, solved by the LinearRegression object.

l1_ratio: float (default = 0.15)

The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.

fit_interceptboolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

max_iterint (default = 1000)

The number of times the model should iterate through the entire dataset during training.

tolfloat (default = 1e-3)

The tolerance for the optimization: if the updates are smaller than tol, the solver stops.

shuffleboolean (default = True)

If set to True, a random coefficient is updated every iteration rather than looping over features sequentially. This often leads to significantly faster convergence, especially when tol is higher than 1e-4.

Examples

import numpy as np
import cudf
from cuml.solvers import CD as cumlCD

cd = cumlCD(alpha=0.0)

X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype = np.float32)
X['col2'] = np.array([1,2,2,3], dtype = np.float32)

y = cudf.Series( np.array([6.0, 8.0, 9.0, 11.0], dtype = np.float32) )

reg = cd.fit(X,y)

print("Coefficients:")
print(reg.coef_)
print("intercept:")
print(reg.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3,2], dtype = np.float32)
X_new['col2'] = np.array([5,5], dtype = np.float32)

preds = cd.predict(X_new)

print("Preds:")
print(preds)

Output:

Coefficients:
            0 1.0019531
            1 1.9980469
Intercept:
            3.0
Preds:
            0 15.997
            1 14.995

Methods

fit(self, X, y[, convert_dtype])

Fit the model with X and y.

predict(self, X[, convert_dtype])

Predicts the y for X.

fit(self, X, y, convert_dtype=False)

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

predict(self, X, convert_dtype=False)

Predicts the y for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

Quasi-Newton

class cuml.QN

Quasi-Newton methods are used to find zeroes or local maxima and minima of functions; this class uses them to optimize a cost function.

Two algorithms are implemented underneath cuML’s QN class, and which one is executed depends on the following rule:

  • Orthant-Wise Limited Memory Quasi-Newton (OWL-QN) if there is l1 regularization

  • Limited Memory BFGS (L-BFGS) otherwise.

cuML’s QN class can take array-like objects, either in host as NumPy arrays or in device (as Numba or __cuda_array_interface__ compliant).

Parameters
loss: ‘sigmoid’, ‘softmax’, ‘squared_loss’ (default = ‘squared_loss’)

‘sigmoid’ loss used for single class logistic regression; ‘softmax’ loss used for multiclass logistic regression; ‘squared_loss’ used for normal/squared loss

fit_intercept: boolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

l1_strength: float (default = 0.0)

l1 regularization strength (if non-zero, will run OWL-QN, else L-BFGS). Note that, as in Scikit-learn, the bias will not be regularized.

l2_strength: float (default = 0.0)

l2 regularization strength. Note that, as in Scikit-learn, the bias will not be regularized.

max_iter: int (default = 1000)

Maximum number of iterations taken for the solvers to converge.

tol: float (default = 1e-3)

The training process will stop if current_loss > previous_loss - tol

linesearch_max_iter: int (default = 50)

Max number of linesearch iterations per outer iteration of the algorithm.

lbfgs_memory: int (default = 5)

Rank of the lbfgs inverse-Hessian approximation. Method will use O(lbfgs_memory * D) memory.

verboseint or boolean (default = False)

Controls verbose level of logging.

Notes

This class contains implementations of two popular Quasi-Newton methods: L-BFGS and OWL-QN (selected according to the rule described above).

Examples

import cudf
import numpy as np

# Both import methods supported
# from cuml import QN
from cuml.solvers import QN

X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype = np.float32)
X['col2'] = np.array([1,2,2,3], dtype = np.float32)
y = cudf.Series( np.array([0.0, 0.0, 1.0, 1.0], dtype = np.float32) )

solver = QN()
solver.fit(X,y)

# Note: for now, the coefficients also include the intercept in the
# last position if fit_intercept=True
print("Coefficients:")
print(solver.coef_.copy_to_host())
print("Intercept:")
print(solver.intercept_.copy_to_host())

X_new = cudf.DataFrame()
X_new['col1'] = np.array([1,5], dtype = np.float32)
X_new['col2'] = np.array([2,5], dtype = np.float32)

preds = solver.predict(X_new)

print("Predictions:")
print(preds)

Output:

Coefficients:
            10.647417
            0.3267412
            -17.158297
Intercept:
            -17.158297
Predictions:
            0    0.0
            1    1.0
Attributes
coef_array, shape (n_classes, n_features)

The estimated coefficients for the linear regression model. Note: shape is (n_classes, n_features + 1) if fit_intercept = True.

intercept_array (n_classes, 1)

The independent term. If fit_intercept is False, it will be 0.

Methods

fit(self, X, y[, convert_dtype])

Fit the model with X and y.

get_param_names(self)

predict(self, X[, convert_dtype])

Predicts the y for X.

score(self, X, y)

fit(self, X, y, convert_dtype=False)

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_param_names(self)
predict(self, X, convert_dtype=False)

Predicts the y for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

score(self, X, y)

Support Vector Machines

class cuml.svm.SVC(C-Support Vector Classification)

Construct an SVC classifier for training and predictions.

Parameters
handlecuml.Handle

If it is None, a new one is created for this class

Cfloat (default = 1.0)

Penalty parameter C

kernelstring (default=’rbf’)

Specifies the kernel function. Possible options: ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’. Currently precomputed kernels are not supported.

degreeint (default=3)

Degree of polynomial kernel function.

gammafloat or string (default = ‘scale’)

Coefficient for rbf, poly, and sigmoid kernels. You can specify the numeric value, or use one of the following options: ‘auto’ - gamma will be set to 1 / n_features; ‘scale’ - gamma will be set to 1 / (n_features * X.var()).

coef0float (default = 0.0)

Independent term in kernel function, only significant for poly and sigmoid.

tolfloat (default = 1e-3)

Tolerance for stopping criterion.

cache_sizefloat (default = 200.0)

Size of the kernel cache during training in MiB. The default is a conservative value; increase it to improve the training time, at the cost of a higher memory footprint. After training, the kernel cache is deallocated. During prediction, we also need a temporary space to store kernel matrix elements (this can be significant if n_support is large). The cache_size variable sets an upper limit to the prediction buffer as well.

max_iterint (default = 100*n_samples)

Limit the number of outer iterations in the solver

nochange_stepsint (default = 1000)

We monitor how much our stopping criterion changes during outer iterations. If it does not change (changes less than 1e-3*tol) for nochange_steps consecutive steps, then we stop training.

verboseint or boolean (default = False)

verbosity level

Notes

The solver uses the SMO method to fit the classifier. We use the Optimized Hierarchical Decomposition [1] variant of the SMO algorithm, similar to [2]

References

[1] J. Vanek et al. A GPU-Architecture Optimized Hierarchical Decomposition Algorithm for Support Vector Machine Training, IEEE Transactions on Parallel and Distributed Systems, vol 28, no 12, 3330, (2017)

[2] Z. Wen et al. ThunderSVM: A Fast SVM Library on GPUs and CPUs, Journal of Machine Learning Research, 19, 1-5 (2018) https://github.com/Xtra-Computing/thundersvm

Examples

import numpy as np
from cuml.svm import SVC
X = np.array([[1,1], [2,1], [1,2], [2,2], [1,3], [2,3]],
             dtype=np.float32)
y = np.array([-1, -1, 1, -1, 1, 1], dtype=np.float32)
clf = SVC(kernel='poly', degree=2, gamma='auto', C=1)
clf.fit(X, y)
print("Predicted labels:", clf.predict(X))

Output:

Predicted labels: [-1. -1.  1. -1.  1.  1.]
Attributes
n_support_int

The total number of support vectors. Note: this will change in the future to represent the number of support vectors for each class (like in Sklearn, see https://github.com/rapidsai/cuml/issues/956 )

support_int, shape = (n_support)

Device array of support vector indices

support_vectors_float, shape (n_support, n_cols)

Device array of support vectors

dual_coef_float, shape = (1, n_support)

Device array of coefficients for support vectors

intercept_int

The constant in the decision function

fit_status_int

0 if SVM is correctly fitted

coef_float, shape (1, n_cols)

Coefficients of the model (only available for the linear kernel).

For additional docs, see scikit-learn’s SVC (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

Methods

decision_function(self, X)

Calculates the decision function values for X.

fit(self, X, y)

Fit the model with X and y.

predict(self, X)

Predicts the class labels for X.

decision_function(self, X)

Calculates the decision function values for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
ycuDF Series

Dense vector (floats or doubles) of shape (n_samples, 1)
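
A brief sketch reusing the class-level example data: the sign of the decision function determines the predicted label.

import numpy as np
from cuml.svm import SVC

X = np.array([[1,1], [2,1], [1,2], [2,2], [1,3], [2,3]],
             dtype=np.float32)
y = np.array([-1, -1, 1, -1, 1, 1], dtype=np.float32)

clf = SVC(kernel='poly', degree=2, gamma='auto', C=1)
clf.fit(X, y)

df = clf.decision_function(X)  # signed distances from the separating surface
# predict(X) returns the labels corresponding to sign(df)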

fit(self, X, y)

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

predict(self, X)

Predicts the class labels for X. The returned y values are the class labels associated with sign(decision_function(X)).

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
ycuDF Series

Dense vector (floats or doubles) of shape (n_samples, 1)

class cuml.svm.SVR(Epsilon Support Vector Regression)

Construct an SVR regressor for training and predictions.

Parameters
handlecuml.Handle

If it is None, a new one is created for this class

Cfloat (default = 1.0)

Penalty parameter C

kernelstring (default=’rbf’)

Specifies the kernel function. Possible options: ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’. Currently precomputed kernels are not supported.

degreeint (default=3)

Degree of polynomial kernel function.

gammafloat or string (default = ‘scale’)

Coefficient for rbf, poly, and sigmoid kernels. You can specify the numeric value, or use one of the following options: ‘auto’ - gamma will be set to 1 / n_features; ‘scale’ - gamma will be set to 1 / (n_features * X.var()).

coef0float (default = 0.0)

Independent term in kernel function, only significant for poly and sigmoid.

tolfloat (default = 1e-3)

Tolerance for stopping criterion.

epsilon: float (default = 0.1)

epsilon parameter of the epsilon-SVR model. There is no penalty associated with points that are predicted within the epsilon-tube around the target values.

cache_sizefloat (default = 200.0)

Size of the kernel cache during training in MiB. The default is a conservative value; increase it to improve the training time, at the cost of a higher memory footprint. After training, the kernel cache is deallocated. During prediction, we also need a temporary space to store kernel matrix elements (this can be significant if n_support is large). The cache_size variable sets an upper limit to the prediction buffer as well.

max_iterint (default = 100*n_samples)

Limit the number of outer iterations in the solver

nochange_stepsint (default = 1000)

We monitor how much our stopping criterion changes during outer iterations. If it does not change (changes less than 1e-3*tol) for nochange_steps consecutive steps, then we stop training.

verboseint or boolean (default = False)

verbosity level

Notes

For additional docs, see Scikit-learn’s SVR.

The solver uses the SMO method to fit the regressor. We use the Optimized Hierarchical Decomposition [1] variant of the SMO algorithm, similar to [2]

References

[1] J. Vanek et al. A GPU-Architecture Optimized Hierarchical Decomposition Algorithm for Support Vector Machine Training, IEEE Transactions on Parallel and Distributed Systems, vol 28, no 12, 3330, (2017)

[2] Z. Wen et al. ThunderSVM: A Fast SVM Library on GPUs and CPUs, Journal of Machine Learning Research, 19, 1-5 (2018)

Examples

import numpy as np
from cuml.svm import SVR
X = np.array([[1], [2], [3], [4], [5]], dtype=np.float32)
y = np.array([1.1, 4, 5, 3.9, 1.], dtype = np.float32)
reg = SVR(kernel='rbf', gamma='scale', C=10, epsilon=0.1)
reg.fit(X, y)
print("Predicted values:", reg.predict(X))

Output:

Predicted values: [1.200474 3.8999617 5.100488 3.7995374 1.0995375]
Attributes
n_support_int

The total number of support vectors. Note: this will change in the future to represent the number of support vectors for each class (like in Sklearn, see Issue #956)

support_int, shape = [n_support]

Device array of support vector indices

support_vectors_float, shape [n_support, n_cols]

Device array of support vectors

dual_coef_float, shape = [1, n_support]

Device array of coefficients for support vectors

intercept_int

The constant in the decision function

fit_status_int

0 if SVM is correctly fitted

coef_float, shape [1, n_cols]

Coefficients of the model (only available for the linear kernel).

Methods

fit(self, X, y)

Fit the model with X and y.

predict(self, X)

Predicts the values for X.

score(self, X, y)

Return R^2 score of the prediction.

fit(self, X, y)

Fit the model with X and y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

predict(self, X)

Predicts the values for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
ycuDF Series

Dense vector (floats or doubles) of shape (n_samples, 1)

score(self, X, y)

Return R^2 score of the prediction.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of target values. Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
score: float R^2 score
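
A brief sketch reusing the class-level SVR example:

import numpy as np
from cuml.svm import SVR

X = np.array([[1], [2], [3], [4], [5]], dtype=np.float32)
y = np.array([1.1, 4, 5, 3.9, 1.], dtype=np.float32)

reg = SVR(kernel='rbf', gamma='scale', C=10, epsilon=0.1)
reg.fit(X, y)

print(reg.score(X, y))  # R^2 of the prediction on the training data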

Nearest Neighbors Classification

class cuml.neighbors.KNeighborsClassifier

K-Nearest Neighbors Classifier is an instance-based learning technique that keeps training samples around for prediction, rather than trying to learn a generalizable set of model parameters.

Parameters
n_neighborsint (default=5)

Default number of neighbors to query

verboseint or boolean (default = False)

Logging level

handlecumlHandle

The cumlHandle resources to use

algorithmstring (default=’brute’)

The query algorithm to use. Currently, only ‘brute’ is supported.

metricstring (default=’euclidean’)

Distance metric to use.

weightsstring (default=’uniform’)

Sample weights to use. Currently, only the uniform strategy is supported.

Notes

For additional docs, see scikit-learn’s KNeighborsClassifier.

Examples

from cuml.neighbors import KNeighborsClassifier

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=100, centers=5,
                  n_features=10)

knn = KNeighborsClassifier(n_neighbors=10)

X_train, X_test, y_train, y_test = train_test_split(
  X, y, train_size=0.80)

knn.fit(X_train, y_train)

knn.predict(X_test)

Methods

fit(self, X, y[, convert_dtype])

Fit a GPU index for k-nearest neighbors classifier model.

get_param_names(self)

predict(self, X[, convert_dtype])

Use the trained k-nearest neighbors classifier to predict the labels for X

predict_proba(self, X[, convert_dtype])

Use the trained k-nearest neighbors classifier to predict the label probabilities for X

score(self, X, y[, convert_dtype])

Compute the accuracy score using the given labels and the trained k-nearest neighbors classifier to predict the classes for X.

fit(self, X, y, convert_dtype=True)

Fit a GPU index for k-nearest neighbors classifier model.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, n_outputs)

Dense matrix (floats or doubles) of shape (n_samples, n_outputs). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = True)

When set to True, the fit method will automatically convert the inputs to np.float32.

get_param_names(self)
predict(self, X, convert_dtype=True)

Use the trained k-nearest neighbors classifier to predict the labels for X

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = True)

When set to True, the predict method will automatically convert the inputs to np.float32.

predict_proba(self, X, convert_dtype=True)

Use the trained k-nearest neighbors classifier to predict the label probabilities for X

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = True)

When set to True, the predict_proba method will automatically convert the inputs to np.float32.
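
A short sketch following the class-level example (illustrative data):

from cuml.neighbors import KNeighborsClassifier
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=100, centers=5, n_features=10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80)

knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)

probs = knn.predict_proba(X_test)  # one column of neighbor-vote fractions per class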

score(self, X, y, convert_dtype=True)

Compute the accuracy score using the given labels and the trained k-nearest neighbors classifier to predict the classes for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = True)

When set to True, the score method will automatically convert the inputs to np.float32.

Nearest Neighbors Regression

class cuml.neighbors.KNeighborsRegressor

K-Nearest Neighbors Regressor is an instance-based learning technique that keeps training samples around for prediction, rather than trying to learn a generalizable set of model parameters.

The K-Nearest Neighbors Regressor will compute the average of the labels for the k closest neighbors and use it as the label.

Parameters
n_neighborsint (default=5)

Default number of neighbors to query

verboseint or boolean (default = False)

Logging level

handlecumlHandle

The cumlHandle resources to use

algorithmstring (default=’brute’)

The query algorithm to use. Currently, only ‘brute’ is supported.

metricstring (default=’euclidean’)

Distance metric to use.

weightsstring (default=’uniform’)

Sample weights to use. Currently, only the uniform strategy is supported.

Notes

For additional docs, see scikit-learn’s KNeighborsRegressor.

Examples

from cuml.neighbors import KNeighborsRegressor

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=100, centers=5,
                  n_features=10)

knn = KNeighborsRegressor(n_neighbors=10)

X_train, X_test, y_train, y_test = train_test_split(
  X, y, train_size=0.80)

knn.fit(X_train, y_train)

knn.predict(X_test)

Output:

array([3.        , 1.        , 1.        , 3.79999995, 2.        ,
       0.        , 3.79999995, 3.79999995, 3.79999995, 0.        ,
       3.79999995, 0.        , 1.        , 2.        , 3.        ,
       1.        , 0.        , 0.        , 0.        , 2.        ,
       3.        , 3.        , 0.        , 3.        , 3.79999995,
       3.79999995, 3.79999995, 3.79999995, 3.        , 2.        ,
       3.79999995, 3.79999995, 0.        ])

Methods

fit(self, X, y[, convert_dtype])

Fit a GPU index for k-nearest neighbors regression model.

predict(self, X[, convert_dtype])

Use the trained k-nearest neighbors regression model to predict the labels for X

score(self, X, y[, convert_dtype])

Compute the R^2 score of the predictions for X using the given values y.

fit(self, X, y, convert_dtype=True)

Fit a GPU index for k-nearest neighbors regression model.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, n_outputs)

Dense matrix (floats or doubles) of shape (n_samples, n_outputs). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = True)

When set to True, the fit method will automatically convert the inputs to np.float32.

predict(self, X, convert_dtype=True)

Use the trained k-nearest neighbors regression model to predict the labels for X

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = True)

When set to True, the predict method will automatically convert the inputs to np.float32.

score(self, X, y, convert_dtype=True)

Compute the R^2 score of the predictions for X using the given values y.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = True)

When set to True, the score method will automatically convert the inputs to np.float32.

Clustering

K-Means Clustering

class cuml.KMeans

KMeans is a basic but powerful clustering method which is optimized via Expectation Maximization. It randomly selects K data points in X, and computes which samples are close to these points. For every cluster of points, a mean is computed (hence the name), and this becomes the new centroid.

cuML’s KMeans expects an array-like object or cuDF DataFrame, and supports the scalable KMeans++ initialization method. This method is more stable than randomly selecting K points.

Parameters
handlecuml.Handle

If it is None, a new one is created just for this class.

n_clustersint (default = 8)

The number of centroids or clusters you want.

max_iterint (default = 300)

The more iterations of EM, the more accurate, but slower.

tolfloat64 (default = 1e-4)

Stopping criterion when centroid means do not change much.

verboseint or boolean (default = False)

Logging level.

random_stateint (default = 1)

If you want results to be the same when you restart Python, select a state.

init‘scalable-kmeans++’, ‘k-means||’, ‘random’ or an ndarray (default = ‘scalable-k-means++’)

‘scalable-k-means++’ or ‘k-means||’: Uses fast and stable scalable kmeans++ initialization. ‘random’: Choose ‘n_cluster’ observations (rows) at random from data for the initial centroids. If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

n_init: int (default = 1)

Number of times the k-means algorithm will be run with different seeds. The final result will be from the run that produces the lowest inertia.

oversampling_factorint (default = 2)

The amount of points to sample in scalable k-means++ initialization for potential centroids. Increasing this value can lead to better initial centroids at the cost of memory. The total number of centroids sampled in scalable k-means++ is oversampling_factor * n_clusters * 8.

max_samples_per_batchint (default = 32768)

The number of data samples to use for batches of the pairwise distance computation. This computation is done throughout both fit and predict. The default should suit most cases. The total number of elements in the batched pairwise distance computation is max_samples_per_batch * n_clusters. It might become necessary to lower this number when n_clusters becomes prohibitively large.

Notes

KMeans requires n_clusters to be specified. This means one needs to approximately guess or know how many clusters a dataset has. If one is not sure, one can start with a small number of clusters, and visualize the resulting clusters with PCA, UMAP or T-SNE, and verify that they look appropriate.

Applications of KMeans

The biggest advantage of KMeans is its speed and simplicity. That is why KMeans is many practitioners’ first choice of clustering algorithm. KMeans has been extensively used when the number of clusters is approximately known, such as in big data clustering tasks, image segmentation and medical clustering.

For additional docs, see scikit-learn’s KMeans.

Examples

# Both import methods supported
from cuml import KMeans
from cuml.cluster import KMeans

import cudf
import numpy as np
import pandas as pd

def np2cudf(arr):
    # convert a NumPy array to a cuDF DataFrame
    df = pd.DataFrame({'fea%d' % i: arr[:, i] for i in range(arr.shape[1])})
    pdf = cudf.DataFrame()
    for c, column in enumerate(df):
        pdf[str(c)] = df[column]
    return pdf

a = np.asarray([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]],
               dtype=np.float32)
b = np2cudf(a)
print("input:")
print(b)

print("Calling fit")
kmeans_float = KMeans(n_clusters=2)
kmeans_float.fit(b)

print("labels:")
print(kmeans_float.labels_)
print("cluster_centers:")
print(kmeans_float.cluster_centers_)

Output:

input:

     0    1
 0  1.0  1.0
 1  1.0  2.0
 2  3.0  2.0
 3  4.0  3.0

Calling fit

labels:

   0    0
   1    0
   2    1
   3    1

cluster_centers:

   0    1
0  1.0  1.5
1  3.5  2.5
Attributes
cluster_centers_array

The coordinates of the final clusters. This represents the “mean” of each data cluster.

labels_array

Which cluster each datapoint belongs to.

Methods

fit(self, X[, sample_weight])

Compute k-means clustering with X.

fit_predict(self, X[, sample_weight])

Compute cluster centers and predict cluster index for each sample.

fit_transform(self, X[, convert_dtype])

Compute clustering and transform X to cluster-distance space.

get_param_names(self)

predict(self, X[, convert_dtype, sample_weight])

Predict the closest cluster each sample in X belongs to.

score(self, X)

Opposite of the value of X on the K-means objective.

transform(self, X[, convert_dtype])

Transform X to a cluster-distance space.

fit(self, X, sample_weight=None)

Compute k-means clustering with X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

sample_weightarray-like (device or host) shape = (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

fit_predict(self, X, sample_weight=None)

Compute cluster centers and predict cluster index for each sample.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

sample_weightarray-like (device or host) shape = (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

fit_transform(self, X, convert_dtype=False)

Compute clustering and transform X to cluster-distance space.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the fit_transform method will automatically convert the input to the data type which was used to train the model. This will increase memory used for the method.

get_param_names(self)
predict(self, X, convert_dtype=False, sample_weight=None)

Predict the closest cluster each sample in X belongs to.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
labelsarray
Which cluster each datapoint belongs to.
score(self, X)

Opposite of the value of X on the K-means objective.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
score: float

Opposite of the value of X on the K-means objective.

transform(self, X, convert_dtype=False)

Transform X to a cluster-distance space.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
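
After fitting, predict and transform compose naturally. The following is a minimal sketch (not from the original reference; the data and parameter values are illustrative):

import cupy as cp
from cuml import KMeans

X = cp.asarray([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]],
               dtype=cp.float32)

km = KMeans(n_clusters=2)
km.fit(X)

labels = km.predict(X)   # closest centroid index per sample
dists = km.transform(X)  # (n_samples, n_clusters) cluster-distance space
print(labels)
print(dists.shape)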

DBSCAN

class cuml.DBSCAN

DBSCAN is a very powerful yet fast clustering technique that finds clusters where data is concentrated. This allows DBSCAN to generalize to many problems if the datapoints tend to congregate in larger groups.

cuML’s DBSCAN expects an array-like object or cuDF DataFrame, and constructs an adjacency graph to compute the distances between close neighbours.

Parameters
epsfloat (default = 0.5)

The maximum distance between 2 points such that they reside in the same neighborhood.

handlecuml.Handle

If it is None, a new one is created just for this class

min_samplesint (default = 5)

The number of samples in a neighborhood required for a point to be considered an important core point (this count includes the point itself).

verboseint or boolean (default = False)

Logging level

max_mbytes_per_batch(optional) int64

Calculate batch size using no more than this number of megabytes for the pairwise distance computation. This enables the trade-off between runtime and memory usage for making the N^2 pairwise distance computations more tractable for large numbers of samples. If you are experiencing out of memory errors when running DBSCAN, you can set this value based on the memory size of your device. Note: this option does not set the maximum total memory used in the DBSCAN computation and so this value will not be able to be set to the total memory available on the device.

output_type(optional) {‘input’, ‘cudf’, ‘cupy’, ‘numpy’} default = None

Use it to control the output type of the results and attributes. If None, it will inherit the output type set at the module level, cuml.output_type. If that has not been changed, by default the estimator will mirror the type of the data used for each fit or predict call. If set, the estimator will override the global option for its behavior.

Notes

DBSCAN is very sensitive to the distance metric it is used with, and a large assumption is that datapoints need to be concentrated in groups for clusters to be constructed.

Applications of DBSCAN

DBSCAN’s main benefit is that the number of clusters is not a hyperparameter, and that it can find non-linearly shaped clusters. This also allows DBSCAN to be robust to noise. DBSCAN has been applied to analyzing particle collisions in the Large Hadron Collider, customer segmentation in marketing analyses, and much more.

For additional docs, see scikit-learn’s DBSCAN.

Examples

# Both import methods supported
from cuml import DBSCAN
from cuml.cluster import DBSCAN

import cudf
import numpy as np

gdf_float = cudf.DataFrame()
gdf_float['0'] = np.asarray([1.0,2.0,5.0], dtype = np.float32)
gdf_float['1'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
gdf_float['2'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)

dbscan_float = DBSCAN(eps = 1.0, min_samples = 1)
dbscan_float.fit(gdf_float)
print(dbscan_float.labels_)

Output:

0    0
1    1
2    2
Attributes
labels_array-like or cuDF series

Which cluster each datapoint belongs to. Noisy samples are labeled as -1. Format depends on cuml global output type and estimator output_type.

Methods

fit(self, X[, out_dtype])

Perform DBSCAN clustering from features.

fit_predict(self, X[, out_dtype])

Performs clustering on X and returns cluster labels.

get_param_names(self)

fit(self, X, out_dtype='int32')

Perform DBSCAN clustering from features.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

out_dtype: dtype (default = “int32”)

Determines the precision of the output labels array. Valid values are {“int32”, np.int32, “int64”, np.int64}; use “int64” when the number of samples may exceed the int32 range.
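
For very large inputs the labels can be requested in 64-bit precision. A brief sketch reusing the gdf_float frame from the example above (the parameter values are illustrative):

dbscan_64 = DBSCAN(eps=1.0, min_samples=1)
dbscan_64.fit(gdf_float, out_dtype='int64')
print(dbscan_64.labels_.dtype)  # int64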

fit_predict(self, X, out_dtype='int32')

Performs clustering on X and returns cluster labels.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features) Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
ycuDF Series, shape (n_samples)

cluster labels

get_param_names(self)

Dimensionality Reduction and Manifold Learning

Principal Component Analysis

class cuml.PCA

PCA (Principal Component Analysis) is a fundamental dimensionality reduction technique used to combine features in X in linear combinations such that each new component captures the most information or variance of the data. n_components is usually small, say 3, in which case it can be used for data visualization, data compression and exploratory analysis.

cuML’s PCA expects an array-like object or cuDF DataFrame, and provides 2 algorithms, Full and Jacobi. Full (default) uses a full eigendecomposition then selects the top K eigenvectors. The Jacobi algorithm is much faster as it iteratively tries to correct the top K eigenvectors, but might be less accurate.

Parameters
copyboolean (default = True)

If True, the data is copied and the mean is removed from the copy. If False, the data may be overwritten with its mean-centered version.

handlecuml.Handle

If it is None, a new one is created just for this class

iterated_powerint (default = 15)

Used in Jacobi solver. The more iterations, the more accurate, but slower.

n_componentsint (default = 1)

The number of top K singular vectors / values you want. Must be <= the number of columns.

random_stateint / None (default = None)

If you want results to be the same when you restart Python, select a state.

svd_solver‘full’ or ‘jacobi’ or ‘auto’ (default = ‘full’)

Full uses a eigendecomposition of the covariance matrix then discards components. Jacobi is much faster as it iteratively corrects, but is less accurate.

tolfloat (default = 1e-7)

Used if algorithm = “jacobi”. Smaller tolerance can increase accuracy, but will slow down the algorithm’s convergence.

verboseint or boolean (default = False)

Logging level

whitenboolean (default = False)

If True, de-correlates the components. This is done by dividing them by the corresponding singular values then multiplying by sqrt(n_samples). Whitening allows each component to have unit variance and removes multi-collinearity. It might be beneficial for downstream tasks like LinearRegression where correlated features cause problems.

Notes

PCA considers linear combinations of features, specifically those that maximize global variance structure. This means PCA is fantastic for global structure analyses, but weak for local relationships. Consider UMAP or T-SNE for a locally important embedding.

Applications of PCA

PCA is used extensively in practice for data visualization and data compression. It has been used to visualize extremely large word embeddings like Word2Vec and GloVe in 2 or 3 dimensions, large datasets of everyday objects and images, and used to distinguish between cancerous cells from healthy cells.

For additional docs, see scikit-learn’s PCA.

Examples

# Both import methods supported
from cuml import PCA
from cuml.decomposition import PCA

import cudf
import numpy as np

gdf_float = cudf.DataFrame()
gdf_float['0'] = np.asarray([1.0,2.0,5.0], dtype = np.float32)
gdf_float['1'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
gdf_float['2'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)

pca_float = PCA(n_components = 2)
pca_float.fit(gdf_float)

print(f'components: {pca_float.components_}')
print(f'explained variance: {pca_float.explained_variance_}')
exp_var = pca_float.explained_variance_ratio_
print(f'explained variance ratio: {exp_var}')

print(f'singular values: {pca_float.singular_values_}')
print(f'mean: {pca_float.mean_}')
print(f'noise variance: {pca_float.noise_variance_}')

trans_gdf_float = pca_float.transform(gdf_float)
print(f'transformed matrix: {trans_gdf_float}')

input_gdf_float = pca_float.inverse_transform(trans_gdf_float)
print(f'Input Matrix: {input_gdf_float}')

Output:

components:
            0           1           2
            0  0.69225764  -0.5102837 -0.51028395
            1 -0.72165036 -0.48949987  -0.4895003

explained variance:

            0   8.510402
            1 0.48959687

explained variance ratio:

             0   0.9456003
             1 0.054399658

singular values:

           0 4.1256275
           1 0.9895422

mean:

          0 2.6666667
          1 2.3333333
          2 2.3333333

noise variance:

      0  0.0

transformed matrix:
             0           1
             0   -2.8547091 -0.42891636
             1 -0.121316016  0.80743366
             2    2.9760244 -0.37851727

Input Matrix:
          0         1         2
          0 1.0000001 3.9999993       4.0
          1       2.0 2.0000002 1.9999999
          2 4.9999995 1.0000006       1.0
Attributes
components_array

The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X)

explained_variance_array

How much each component explains the variance in the data given by S**2

explained_variance_ratio_array

The percentage of the variance explained by each component, given by S**2/sum(S**2)

singular_values_array

The top K singular values. Remember all singular values >= 0

mean_array

The column-wise mean of X. Used to mean-center the data first.

noise_variance_float

From Bishop’s 1999 textbook. Used in later tasks like calculating the estimated covariance of X.

Methods

fit(self, X[, y])

Fit the model with X.

fit_transform(self, X[, y])

Fit the model with X and apply the dimensionality reduction on X.

get_param_names(self)

inverse_transform(self, X[, convert_dtype])

Transform data back to its original space.

transform(self, X[, convert_dtype])

Apply dimensionality reduction to X.

fit(self, X, y=None)

Fit the model with X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yignored
Returns
selfthe fitted PCA instance
fit_transform(self, X, y=None)

Fit the model with X and apply the dimensionality reduction on X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

training data (floats or doubles), where n_samples is the number of samples, and n_features is the number of features. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yignored
Returns
X_newcuDF DataFrame, shape (n_samples, n_components)
get_param_names(self)
inverse_transform(self, X, convert_dtype=False)

Transform data back to its original space.

In other words, return an input X_original whose transform would be X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_components)

New data (floats or doubles), where n_samples is the number of samples and n_components is the number of components. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the inverse_transform method will automatically convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
X_originalcuDF DataFrame, shape (n_samples, n_features)
transform(self, X, convert_dtype=False)

Apply dimensionality reduction to X.

X is projected on the first principal components previously extracted from a training set.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

New data (floats or doubles), where n_samples is the number of samples and n_features is the number of features. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the transform method will automatically convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
X_newcuDF DataFrame, shape (n_samples, n_components)

Truncated SVD

class cuml.TruncatedSVD

TruncatedSVD is used to compute the top K singular values and vectors of a large matrix X. It is much faster when n_components is small, such as in the use of PCA when 3 components is used for 3D visualization.

cuML’s TruncatedSVD expects an array-like object or cuDF DataFrame, and provides 2 algorithms, Full and Jacobi. Full (default) uses a full eigendecomposition then selects the top K singular vectors. The Jacobi algorithm is much faster as it iteratively tries to correct the top K singular vectors, but might be less accurate.

Parameters
algorithm‘full’ or ‘jacobi’ or ‘auto’ (default = ‘full’)

Full uses a eigendecomposition of the covariance matrix then discards components. Jacobi is much faster as it iteratively corrects, but is less accurate.

handlecuml.Handle

If it is None, a new one is created just for this class

n_componentsint (default = 1)

The number of top K singular vectors / values you want. Must be <= the number of columns.

n_iterint (default = 15)

Used in Jacobi solver. The more iterations, the more accurate, but slower.

random_stateint / None (default = None)

If you want results to be the same when you restart Python, select a state.

tolfloat (default = 1e-7)

Used if algorithm = “jacobi”. Smaller tolerance can increase accuracy, but will slow down the algorithm’s convergence.

verboseint or boolean (default = False)

Logging level

Notes

TruncatedSVD (the randomized version [Jacobi]) is fantastic when the number of components you want is much smaller than the number of features. The approximation to the largest singular values and vectors is very robust, however, this method loses a lot of accuracy when you want many, many components.

Applications of TruncatedSVD

TruncatedSVD is also known as Latent Semantic Indexing (LSI) which tries to find topics of a word count matrix. If X previously was centered with mean removal, TruncatedSVD is the same as TruncatedPCA. TruncatedSVD is also used in information retrieval tasks, recommendation systems and data compression.

For additional documentation, see scikit-learn’s TruncatedSVD docs.

Examples

# Both import methods supported
from cuml import TruncatedSVD
from cuml.decomposition import TruncatedSVD

import cudf
import numpy as np

gdf_float = cudf.DataFrame()
gdf_float['0'] = np.asarray([1.0,2.0,5.0], dtype = np.float32)
gdf_float['1'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
gdf_float['2'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)

tsvd_float = TruncatedSVD(n_components = 2, algorithm = "jacobi",
                          n_iter = 20, tol = 1e-9)
tsvd_float.fit(gdf_float)

print(f'components: {tsvd_float.components_}')
print(f'explained variance: {tsvd_float.explained_variance_}')
exp_var = tsvd_float.explained_variance_ratio_
print(f'explained variance ratio: {exp_var}')
print(f'singular values: {tsvd_float.singular_values_}')

trans_gdf_float = tsvd_float.transform(gdf_float)
print(f'Transformed matrix: {trans_gdf_float}')

input_gdf_float = tsvd_float.inverse_transform(trans_gdf_float)
print(f'Input matrix: {input_gdf_float}')

Output:

components:
           0           1          2
0 0.58725953  0.57233137  0.5723314
1 0.80939883 -0.41525528 -0.4152552
explained variance:
0  55.33908
1 16.660923

explained variance ratio:
0  0.7685983
1 0.23140171

singular values:
0  7.439024
1 4.0817795

Transformed Matrix:
            0            1
0   5.1659107    -2.512643
1   3.4638448    -0.042223275
2    4.0809603   3.2164836

Input matrix:
          0         1         2
0       1.0  4.000001  4.000001
1 2.0000005 2.0000005 2.0000007
2  5.000001 0.9999999 1.0000004
Attributes
components_array

The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X)

explained_variance_array

How much each component explains the variance in the data given by S**2

explained_variance_ratio_array

The percentage of the variance explained by each component, given by S**2/sum(S**2)

singular_values_array

The top K singular values. Remember all singular values >= 0

Methods

fit(self, X[, y])

Fit LSI model on training cudf DataFrame X.

fit_transform(self, X[, y])

Fit LSI model to X and perform dimensionality reduction on X.

get_param_names(self)

inverse_transform(self, X[, convert_dtype])

Transform X back to its original space.

transform(self, X[, convert_dtype])

Perform dimensionality reduction on X.

fit(self, X, y=None)

Fit LSI model on training cudf DataFrame X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

y : ignored

fit_transform(self, X, y=None)

Fit LSI model to X and perform dimensionality reduction on X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yignored
Returns
X_newcuDF DataFrame, shape (n_samples, n_components)

Reduced version of X as a dense cuDF DataFrame

get_param_names(self)
inverse_transform(self, X, convert_dtype=False)

Transform X back to its original space. Returns a cuDF DataFrame X_original whose transform would be X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_components)

Dense matrix (floats or doubles) of shape (n_samples, n_components). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the inverse_transform method will automatically convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
X_originalcuDF DataFrame, shape (n_samples, n_features)

Note that this is always a dense cuDF DataFrame.

transform(self, X, convert_dtype=False)

Perform dimensionality reduction on X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = False)

When set to True, the transform method will automatically convert the input to the data type which was used to train the model.

Returns
X_newcuDF DataFrame, shape (n_samples, n_components)

Reduced version of X. This will always be a dense DataFrame.

UMAP

class cuml.UMAP

Uniform Manifold Approximation and Projection (UMAP) finds a low-dimensional embedding of the data that approximates an underlying manifold.

Adapted from https://github.com/lmcinnes/umap/blob/master/umap/umap.py

Parameters
n_neighbors: float (optional, default 15)

The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.

n_components: int (optional, default 2)

The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.

n_epochs: int (optional, default None)

The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).

learning_rate: float (optional, default 1.0)

The initial learning rate for the embedding optimization.

init: string (optional, default ‘spectral’)
How to initialize the low dimensional embedding. Options are:
  • ‘spectral’: use a spectral embedding of the fuzzy 1-skeleton

  • ‘random’: assign initial embedding positions at random.

min_dist: float (optional, default 0.1)

The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result on a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.

spread: float (optional, default 1.0)

The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.

set_op_mix_ratio: float (optional, default 1.0)

Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial sets. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.

local_connectivity: int (optional, default 1)

The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.

repulsion_strength: float (optional, default 1.0)

Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.

negative_sample_rate: int (optional, default 5)

The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.

transform_queue_size: float (optional, default 4.0)

For transform operations (embedding new points using a trained model), this will control how aggressively to search for nearest neighbors. Larger values will result in slower performance but more accurate nearest neighbor evaluation.

a: float (optional, default None)

More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.

b: float (optional, default None)

More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.

hash_input: bool (optional, default = False)

UMAP can hash the training input so that exact embeddings are returned when transform is called on the same data upon which the model was trained. This enables consistent behavior between calling model.fit_transform(X) and calling model.fit(X).transform(X). Note that the CPU-based UMAP reference implementation does this by default. This feature is made optional in the GPU version due to the significant overhead in copying memory to the host for computing the hash.

random_stateint, RandomState instance or None, optional (default=None)

random_state is the seed used by the random number generator during embedding initialization and during sampling used by the optimizer. Note: Unfortunately, achieving a high amount of parallelism during the optimization stage often comes at the expense of determinism, since many floating-point additions are being made in parallel without a deterministic ordering. This causes slightly different results across training sessions, even when the same seed is used for random number generation. Setting a random_state will enable consistency of trained embeddings, allowing for reproducible results to 3 digits of precision, but will do so at the expense of potentially slower training and increased memory usage.

optim_batch_size: int (optional, default 100000 / n_components)

Used to maintain the consistency of embeddings for large datasets. The optimization step will be processed with at most optim_batch_size edges at once preventing inconsistencies. A lower batch size will yield more consistently repeatable embeddings at the cost of speed.

callback: An instance of GraphBasedDimRedCallback class to intercept the internal state of embeddings while they are being trained. Example of callback usage:

from cuml.internals import GraphBasedDimRedCallback

class CustomCallback(GraphBasedDimRedCallback):
    def on_preprocess_end(self, embeddings):
        print(embeddings.copy_to_host())

    def on_epoch_end(self, embeddings):
        print(embeddings.copy_to_host())

    def on_train_end(self, embeddings):
        print(embeddings.copy_to_host())

verboseint or boolean (default = False)

Controls verbosity of logging.

Notes

This module is heavily based on Leland McInnes’ reference UMAP package. However, there are a number of differences and features that are not yet implemented in cuml.umap:

  • Using a non-Euclidean distance metric (support for a fixed set of non-Euclidean metrics is planned for an upcoming release).

  • Using a pre-computed pairwise distance matrix (under consideration for future releases)

  • Manual initialization of initial embedding positions

In addition to these missing features, you should expect to see the final embeddings differing between cuml.umap and the reference UMAP. In particular, the reference UMAP uses an approximate kNN algorithm for large data sizes while cuml.umap always uses exact kNN.

Known issue: If a UMAP model has not yet been fit, it cannot be pickled. However, after fitting, a UMAP model may be pickled.
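
The class entry above lacks a usage example; the following minimal sketch follows the conventions of the other estimators in this reference (the blob sizes and parameter values are illustrative, not taken from the original docs):

from cuml import UMAP
from cuml.datasets import make_blobs

X, _ = make_blobs(n_samples=500, n_features=10, centers=5,
                  random_state=42)

reducer = UMAP(n_neighbors=15, n_components=2, min_dist=0.1)
embedding = reducer.fit_transform(X)
print(embedding.shape)  # (500, 2)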

Methods

find_ab_params(spread, min_dist)

Function taken from UMAP-learn : https://github.com/lmcinnes/umap Fit a, b params for the differentiable curve used in lower dimensional fuzzy simplicial complex construction.

fit(self, X[, y, convert_dtype, knn_graph])

Fit X into an embedded space.

fit_transform(self, X[, y, convert_dtype, …])

Fit X into an embedded space and return that transformed output.

transform(self, X[, convert_dtype, knn_graph])

Transform X into the existing embedded space and return that transformed output.

validate_hyperparams(self)

static find_ab_params(spread, min_dist)

Function taken from UMAP-learn : https://github.com/lmcinnes/umap Fit a, b params for the differentiable curve used in lower dimensional fuzzy simplicial complex construction. We want the smooth curve (from a pre-defined family with simple gradient) that best matches an offset exponential decay.
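
A quick sketch of calling the static helper directly (the arguments shown are the UMAP defaults for spread and min_dist):

from cuml import UMAP

a, b = UMAP.find_ab_params(spread=1.0, min_dist=0.1)
print(a, b)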

fit(self, X, y=None, convert_dtype=True, knn_graph=None)

Fit X into an embedded space.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

X contains a sample per row. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, 1)

y contains a label per row. Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

knn_graphsparse array-like (device or host)

shape=(n_samples, n_samples) A sparse array containing the k-nearest neighbors of X, where the columns are the nearest neighbor indices for each row and the values are their distances. It’s important that k>=n_neighbors, so that UMAP can model the neighbors from this graph, instead of building its own internally. Users using the knn_graph parameter provide UMAP with their own run of the KNN algorithm. This allows the user to pick a custom distance function (sometimes useful on certain datasets) whereas UMAP uses euclidean by default. The custom distance function should match the metric used to train the UMAP embeddings. Storing and reusing a knn_graph will also provide a speedup to the UMAP algorithm when performing a grid search. Acceptable formats: sparse SciPy ndarray, CuPy device ndarray; CSR/COO preferred, other formats will go through conversion to CSR
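
As a sketch of how such a graph might be assembled with cuML’s own brute-force KNN (the CSR construction below is illustrative and assumes CuPy outputs; depending on the global output type, kneighbors may return cuDF objects that need conversion first):

import cupy as cp
from cupyx.scipy.sparse import csr_matrix
from cuml import UMAP
from cuml.datasets import make_blobs
from cuml.neighbors import NearestNeighbors

k = 15
X, _ = make_blobs(n_samples=200, n_features=10, random_state=0)

nn = NearestNeighbors(n_neighbors=k)
nn.fit(X)
dists, inds = nn.kneighbors(X)  # each of shape (n_samples, k)

n = X.shape[0]
# row i owns entries [i*k, (i+1)*k) of the flattened arrays
data = cp.asarray(dists).ravel().astype(cp.float32)
indices = cp.asarray(inds).ravel().astype(cp.int32)
indptr = cp.arange(0, (n + 1) * k, k, dtype=cp.int32)
knn_graph = csr_matrix((data, indices, indptr), shape=(n, n))

umap_model = UMAP(n_neighbors=k)
umap_model.fit(X, knn_graph=knn_graph)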

fit_transform(self, X, y=None, convert_dtype=True, knn_graph=None)

Fit X into an embedded space and return that transformed output.

There is a subtle difference between calling fit_transform(X) and calling fit(X).transform(X). Calling fit_transform(X) will train the embeddings on X and return the embeddings. Calling fit(X).transform(X) will train the embeddings on X and then run a second optimization when transforming, so the two results can differ slightly.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

X contains a sample per row. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

knn_graphsparse array-like (device or host)

shape=(n_samples, n_samples) A sparse array containing the k-nearest neighbors of X, where the columns are the nearest neighbor indices for each row and the values are their distances. It’s important that k>=n_neighbors, so that UMAP can model the neighbors from this graph, instead of building its own internally. Users using the knn_graph parameter provide UMAP with their own run of the KNN algorithm. This allows the user to pick a custom distance function (sometimes useful on certain datasets) whereas UMAP uses euclidean by default. The custom distance function should match the metric used to train the UMAP embeddings. Storing and reusing a knn_graph will also provide a speedup to the UMAP algorithm when performing a grid search. Acceptable formats: sparse SciPy ndarray, CuPy device ndarray; CSR/COO preferred, other formats will go through conversion to CSR

Returns
X_newarray, shape (n_samples, n_components)

Embedding of the training data in low-dimensional space.

transform(self, X, convert_dtype=True, knn_graph=None)

Transform X into the existing embedded space and return that transformed output.

Please refer to the reference UMAP implementation for information on the differences between fit_transform() and running fit().transform().

Specifically, the transform() function is stochastic: https://github.com/lmcinnes/umap/issues/158

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

New data to be transformed. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

knn_graphsparse array-like (device or host)

shape=(n_samples, n_samples) A sparse array containing the k-nearest neighbors of X, where the columns are the nearest neighbor indices for each row and the values are their distances. It’s important that k>=n_neighbors, so that UMAP can model the neighbors from this graph, instead of building its own internally. Users using the knn_graph parameter provide UMAP with their own run of the KNN algorithm. This allows the user to pick a custom distance function (sometimes useful on certain datasets) whereas UMAP uses euclidean by default. The custom distance function should match the metric used to train the UMAP embeddings. Storing and reusing a knn_graph will also provide a speedup to the UMAP algorithm when performing a grid search. Acceptable formats: sparse SciPy ndarray, CuPy device ndarray; CSR/COO preferred, other formats will go through conversion to CSR

Returns
X_newarray, shape (n_samples, n_components)

Embedding of the new data in low-dimensional space.

validate_hyperparams(self)

Random Projections

class cuml.random_projection.GaussianRandomProjection

Gaussian Random Projection method derived from the BaseRandomProjection class.

Random projection is a dimensionality reduction technique. Random projection methods are powerful methods known for their simplicity, computational efficiency and restricted model size. This algorithm also has the advantage of preserving distances well between any two samples, and is thus suitable for methods having this requirement.

The components of the random matrix are drawn from N(0, 1 / n_components).

Parameters
handlecuml.Handle

If it is None, a new one is created just for this class

n_componentsint (default = ‘auto’)

Dimensionality of the target projection space. If set to ‘auto’, the parameter is deduced using the Johnson–Lindenstrauss lemma. The automatic deduction makes use of the number of samples and the eps parameter.

The Johnson–Lindenstrauss lemma can produce a very conservative n_components parameter, as it makes no assumptions on dataset structure.

epsfloat (default = 0.1)

Error tolerance during projection. Used by Johnson–Lindenstrauss automatic deduction when n_components is set to ‘auto’.

random_stateint (default = None)

Seed used to initialize the random generator

Notes

Inspired by Scikit-learn’s implementation : https://scikit-learn.org/stable/modules/random_projection.html

Attributes
gaussian_methodboolean

To be passed to base class in order to determine random matrix generation method
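
No example accompanies this class above; the following is a minimal sketch, assuming the usual fit/transform estimator interface (the sizes and eps value are illustrative):

import cupy as cp
from cuml.random_projection import GaussianRandomProjection

# 1000 samples in 512 dimensions; n_components='auto' picks the
# projection size via the Johnson-Lindenstrauss lemma
X = cp.random.rand(1000, 512, dtype=cp.float32)

grp = GaussianRandomProjection(eps=0.5, random_state=42)
grp.fit(X)
X_proj = grp.transform(X)
print(X_proj.shape)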

class cuml.random_projection.SparseRandomProjection

Sparse Random Projection method derived from the BaseRandomProjection class.

Random projection is a dimensionality reduction technique. Random projection methods are powerful methods known for their simplicity, computational efficiency and restricted model size. This algorithm also has the advantage of preserving distances well between any two samples, and is thus suitable for methods having this requirement.

A sparse random matrix is an alternative to the dense random projection matrix (e.g. Gaussian) that guarantees similar embedding quality while being much more memory efficient and allowing faster computation of the projected data (with sparse enough matrices). If we denote s = 1 / density, the components of the random matrix are drawn from:

  • -sqrt(s) / sqrt(n_components) with probability 1 / 2s

  • 0 with probability 1 - 1 / s

  • +sqrt(s) / sqrt(n_components) with probability 1 / 2s

Parameters
handlecuml.Handle

If it is None, a new one is created just for this class

n_componentsint (default = ‘auto’)

Dimensionality of the target projection space. If set to ‘auto’, the parameter is deduced using the Johnson–Lindenstrauss lemma. The automatic deduction makes use of the number of samples and the eps parameter.

The Johnson–Lindenstrauss lemma can produce a very conservative n_components parameter, as it makes no assumptions on dataset structure.

densityfloat in range (0, 1] (default = ‘auto’)

Ratio of non-zero component in the random projection matrix.

If density = ‘auto’, the value is set to the minimum density as recommended by Ping Li et al.: 1 / sqrt(n_features).

epsfloat (default = 0.1)

Error tolerance during projection. Used by Johnson–Lindenstrauss automatic deduction when n_components is set to ‘auto’.

dense_outputboolean (default = True)

If set to True transformed matrix will be dense otherwise sparse.

random_stateint (default = None)

Seed used to initialize the random generator

Notes

Inspired by Scikit-learn’s implementation : https://scikit-learn.org/stable/modules/random_projection.html

Attributes
gaussian_methodboolean

To be passed to base class in order to determine random matrix generation method
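
Analogously, a minimal sketch for the sparse variant under the same assumptions as the Gaussian example above:

import cupy as cp
from cuml.random_projection import SparseRandomProjection

X = cp.random.rand(1000, 512, dtype=cp.float32)

srp = SparseRandomProjection(eps=0.5, dense_output=True, random_state=42)
srp.fit(X)
X_proj = srp.transform(X)
print(X_proj.shape)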

random_projection.johnson_lindenstrauss_min_dim(n_samples, eps=0.1)

In mathematics, the Johnson–Lindenstrauss lemma states that high-dimensional data can be embedded into a lower-dimensional space while approximately preserving pairwise distances.

With p the random projection : (1 - eps) ||u - v||^2 < ||p(u) - p(v)||^2 < (1 + eps) ||u - v||^2

This function finds the minimum number of components to guarantee that the embedding is inside the eps error tolerance.

Parameters
n_samplesint

Number of samples.

epsfloat in (0,1) (default = 0.1)

Maximum distortion rate as defined by the Johnson-Lindenstrauss lemma.

Returns
n_componentsint

The minimal number of components to guarantee with good probability an eps-embedding with n_samples.
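
For instance (a brief sketch; the numbers are illustrative):

from cuml.random_projection import johnson_lindenstrauss_min_dim

# minimum safe dimensionality for 10,000 samples at 10% distortion
n_components = johnson_lindenstrauss_min_dim(n_samples=10000, eps=0.1)
print(n_components)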

TSNE

class cuml.TSNE

TSNE (T-Distributed Stochastic Neighbor Embedding) is an extremely powerful dimensionality reduction technique that aims to maintain local distances between data points. It is extremely robust to whatever dataset you give it, and is used in many areas including cancer research, music analysis and neural network weight visualizations.

Currently, cuML’s TSNE supports the fast Barnes Hut O(NlogN) TSNE approximation (derived from CannyLabs’ BH open source CUDA code). This allows TSNE to produce extremely fast embeddings when n_components = 2. cuML defaults to this algorithm. A slower but more accurate Exact algorithm is also provided.

Parameters
n_componentsint (default 2)

The output dimensionality size. Currently only size=2 is tested, but the ‘exact’ algorithm will support greater dimensionality in the future.

perplexityfloat (default 30.0)

Larger datasets require a larger value. Consider choosing different perplexity values from 5 to 50 and see the output differences.

early_exaggerationfloat (default 12.0)

Controls the space between clusters. Not critical to tune this.

learning_ratefloat (default 200.0)

The learning rate usually between (10, 1000). If this is too high, TSNE could look like a cloud / ball of points.

n_iterint (default 1000)

The more epochs, the more stable/accurate the final embedding.

n_iter_without_progressint (default 300)

When the KL Divergence becomes too small after some iterations, terminate TSNE early.

min_grad_normfloat (default 1e-07)

The minimum gradient norm for when TSNE will terminate early.

metricstr ‘euclidean’ only (default ‘euclidean’)

Currently only supports euclidean distance. Will support cosine in a future release.

initstr ‘random’ (default ‘random’)

Currently supports random initialization.

verboseint or boolean (default = False)

Level of verbosity. Most messages will be printed inside the Python Console.

random_stateint (default None)

Setting this can allow future runs of TSNE to look mostly the same. It is known that TSNE tends to have vastly different outputs on many runs. Try using PCA initialization (upcoming with change #1098) to possibly counteract this problem. It is known that small perturbations can directly change the result of the embedding for parallel TSNE implementations.

methodstr ‘barnes_hut’ or ‘exact’ (default ‘barnes_hut’)

Options are either barnes_hut or exact. It is recommended that you use the barnes hut approximation for superior O(nlogn) complexity.

anglefloat (default 0.5)

Tradeoff between accuracy and speed. Choose a value between 0.2 and 0.8, where values closer to one indicate full accuracy but slower speeds.

learning_rate_methodstr ‘adaptive’, ‘none’ or None (default ‘adaptive’)

Either adaptive or None. Uses a special adaptive method that tunes the learning rate, early exaggeration and perplexity automatically based on input size.

n_neighborsint (default 90)

The number of datapoints you want to use in the attractive forces. Smaller values are better for preserving local structure, whilst larger values can improve global structure preservation. Default is 90 (3 * perplexity).

perplexity_max_iterint (default 100)

The number of epochs used to find the best Gaussian bands.

exaggeration_iterint (default 250)

To promote the growth of clusters, set this higher.

pre_momentumfloat (default 0.5)

During the exaggeration iteration, more forcefully apply gradients.

post_momentumfloat (default 0.8)

During the late phases, less forcefully apply gradients.

handle(cuML Handle, default None)

You can pass in a past handle that was initialized, or we will create one for you anew!

References

  • van der Maaten, L.J.P. t-Distributed Stochastic Neighbor Embedding https://lvdmaaten.github.io/tsne/

  • van der Maaten, L.J.P.; Hinton, G.E. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9:2579-2605, 2008.

  • George C. Linderman, Manas Rachh, Jeremy G. Hoskins, Stefan Steinerberger, Yuval Kluger Efficient Algorithms for t-distributed Stochastic Neighborhood Embedding

Methods

fit(self, X[, convert_dtype])

Fit X into an embedded space.

fit_transform(self, X[, convert_dtype])

Fit X into an embedded space and return that transformed output.

fit(self, X, convert_dtype=True)

Fit X into an embedded space.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

X contains a sample per row. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = True)

When set to True, the fit method will automatically convert the inputs to np.float32.

fit_transform(self, X, convert_dtype=True)

Fit X into an embedded space and return that transformed output.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

X contains a sample per row. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = True)

When set to True, the fit_transform method will automatically convert the inputs to np.float32.

Returns
X_newarray, shape (n_samples, n_components)

Embedding of the training data in low-dimensional space.
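
No usage example is given above; here is a minimal sketch in the style of the other estimators (the blob sizes are illustrative):

from cuml import TSNE
from cuml.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, n_features=50, centers=5,
                  random_state=42)

tsne = TSNE(n_components=2, perplexity=30.0)
embedding = tsne.fit_transform(X)
print(embedding.shape)  # (1000, 2)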

Neighbors

Nearest Neighbors

class cuml.neighbors.NearestNeighbors

NearestNeighbors queries neighborhoods from a given set of datapoints. Currently, cuML supports k-NN queries, which define the neighborhood as the closest k neighbors to each query point.

Parameters
n_neighborsint (default=5)

Default number of neighbors to query

verboseint or boolean (default = False)

Logging level

handlecumlHandle

The cumlHandle resources to use

algorithmstring (default=’brute’)

The query algorithm to use. Currently, only ‘brute’ is supported.

metricstring (default=’euclidean’).

Distance metric to use.

Notes

For an additional example see the NearestNeighbors notebook.

For additional docs, see scikit-learn’s NearestNeighbors.

Examples

import cudf
from cuml.neighbors import NearestNeighbors
from cuml.datasets import make_blobs

X, _ = make_blobs(n_samples=25, centers=5,
                  n_features=10, random_state=42)

# build a cudf Dataframe
X_cudf = cudf.DataFrame.from_gpu_matrix(X)

# fit model
model = NearestNeighbors(n_neighbors=3)
model.fit(X)

# get 3 nearest neighbors
distances, indices = model.kneighbors(X_cudf)

# print results
print(indices)
print(distances)

Output:

indices:

     0   1   2
0    0  14  21
1    1  19   8
2    2   9  23
3    3  14  21
...

22  22  18  11
23  23  16   9
24  24  17  10

distances:

      0         1         2
0   0.0  4.883116  5.570006
1   0.0  3.047896  4.105496
2   0.0  3.558557  3.567704
3   0.0  3.806127  3.880100
...

22  0.0  4.210738  4.227068
23  0.0  3.357889  3.404269
24  0.0  3.428183  3.818043

Methods

fit(self, X[, convert_dtype])

Fit GPU index for performing nearest neighbor queries.

kneighbors(self[, X, n_neighbors, …])

Query the GPU index for the k nearest neighbors of column vectors in X.

fit(self, X, convert_dtype=True)

Fit GPU index for performing nearest neighbor queries.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = True)

When set to True, the fit method will automatically convert the inputs to np.float32.

kneighbors(self, X=None, n_neighbors=None, return_distance=True, convert_dtype=True)

Query the GPU index for the k nearest neighbors of column vectors in X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

n_neighborsInteger

Number of neighbors to search. If not provided, the n_neighbors from the model instance is used.

return_distance: Boolean

If False, distances will not be returned

convert_dtypebool, optional (default = True)

When set to True, the kneighbors method will automatically convert the inputs to np.float32.

Returns
distances: cuDF DataFrame or numpy ndarray

The distances of the k-nearest neighbors for each column vector in X

indices: cuDF DataFrame or numpy ndarray

The indices of the k-nearest neighbors for each column vector in X

Nearest Neighbors Classification

class cuml.neighbors.KNeighborsClassifier

K-Nearest Neighbors Classifier is an instance-based learning technique that keeps training samples around for prediction, rather than trying to learn a generalizable set of model parameters.

Parameters
n_neighborsint (default=5)

Default number of neighbors to query

verboseint or boolean (default = False)

Logging level

handlecumlHandle

The cumlHandle resources to use

algorithmstring (default=’brute’)

The query algorithm to use. Currently, only ‘brute’ is supported.

metricstring (default=’euclidean’).

Distance metric to use.

weightsstring (default=’uniform’)

Sample weights to use. Currently, only the uniform strategy is supported.

Notes

For additional docs, see scikit-learn’s KNeighborsClassifier.

Examples

from cuml.neighbors import KNeighborsClassifier

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=100, centers=5,
                  n_features=10)

knn = KNeighborsClassifier(n_neighbors=10)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80)

knn.fit(X_train, y_train)

knn.predict(X_test)
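
Extending the example, the probability and scoring interfaces can be exercised as follows (a sketch reusing the variables above):

probs = knn.predict_proba(X_test)  # per-class probabilities
acc = knn.score(X_test, y_test)    # classification accuracy
print(probs.shape, acc)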

Methods

fit(self, X, y[, convert_dtype])

Fit a GPU index for k-nearest neighbors classifier model.

get_param_names(self)

predict(self, X[, convert_dtype])

Use the trained k-nearest neighbors classifier to predict the labels for X

predict_proba(self, X[, convert_dtype])

Use the trained k-nearest neighbors classifier to predict the label probabilities for X

score(self, X, y[, convert_dtype])

Compute the accuracy score using the given labels and the trained k-nearest neighbors classifier to predict the classes for X.

fit(self, X, y, convert_dtype=True)

Fit a GPU index for k-nearest neighbors classifier model.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, n_outputs)

Dense matrix (floats or doubles) of shape (n_samples, n_outputs). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = True)

When set to True, the fit method will automatically convert the inputs to np.float32.

get_param_names(self)
predict(self, X, convert_dtype=True)

Use the trained k-nearest neighbors classifier to predict the labels for X

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = True)

When set to True, the predict method will automatically convert the inputs to np.float32.

predict_proba(self, X, convert_dtype=True)

Use the trained k-nearest neighbors classifier to predict the label probabilities for X

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = True)

When set to True, the predict_proba method will automatically convert the inputs to np.float32.

score(self, X, y, convert_dtype=True)

Compute the accuracy score using the given labels and the trained k-nearest neighbors classifier to predict the classes for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, n_outputs)

Dense matrix (floats or doubles) of shape (n_samples, n_outputs). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = True)

When set to True, the score method will automatically convert the inputs to np.float32.

Nearest Neighbors Regression

class cuml.neighbors.KNeighborsRegressor

K-Nearest Neighbors Regressor is an instance-based learning technique that keeps training samples around for prediction, rather than trying to learn a generalizable set of model parameters.

The K-Nearest Neighbors Regressor will compute the average of the labels for the k closest neighbors and use it as the label.

Parameters
n_neighborsint (default=5)

Default number of neighbors to query

verboseint or boolean (default = False)

Logging level

handlecumlHandle

The cumlHandle resources to use

algorithmstring (default=’brute’)

The query algorithm to use. Currently, only ‘brute’ is supported.

metricstring (default=’euclidean’).

Distance metric to use.

weightsstring (default=’uniform’)

Sample weights to use. Currently, only the uniform strategy is supported.

Notes

For additional docs, see scikit-learn’s KNeighborsRegressor.

Examples

from cuml.neighbors import KNeighborsRegressor

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=100, centers=5,
                  n_features=10)

knn = KNeighborsRegressor(n_neighbors=10)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80)

knn.fit(X_train, y_train)

knn.predict(X_test)

Output:

array([3.        , 1.        , 1.        , 3.79999995, 2.        ,
       0.        , 3.79999995, 3.79999995, 3.79999995, 0.        ,
       3.79999995, 0.        , 1.        , 2.        , 3.        ,
       1.        , 0.        , 0.        , 0.        , 2.        ,
       3.        , 3.        , 0.        , 3.        , 3.79999995,
       3.79999995, 3.79999995, 3.79999995, 3.        , 2.        ,
       3.79999995, 3.79999995, 0.        ])

Methods

fit(self, X, y[, convert_dtype])

Fit a GPU index for k-nearest neighbors regression model.

predict(self, X[, convert_dtype])

Use the trained k-nearest neighbors regression model to predict the labels for X

score(self, X, y[, convert_dtype])

Compute the R^2 (coefficient of determination) of the trained k-nearest neighbors regression model on the given test data and labels.

fit(self, X, y, convert_dtype=True)

Fit a GPU index for k-nearest neighbors regression model.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, n_outputs)

Dense matrix (floats or doubles) of shape (n_samples, n_outputs). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = True)

When set to True, the fit method will automatically convert the inputs to np.float32.

predict(self, X, convert_dtype=True)

Use the trained k-nearest neighbors regression model to predict the labels for X

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = True)

When set to True, the predict method will automatically convert the inputs to np.float32.

score(self, X, y, convert_dtype=True)

Compute the R^2 score using the given labels and the trained k-nearest neighbors regression model to predict values for X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, n_outputs)

Dense matrix (floats or doubles) of shape (n_samples, n_outputs). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

convert_dtypebool, optional (default = True)

When set to True, the score method will automatically convert the inputs to np.float32.
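Continuing the regressor example above (reusing knn, X_test, and y_test from that snippet), evaluating on the held-out split is a one-liner:

# Reuses knn, X_test, and y_test from the example earlier in this section.
print(knn.score(X_test, y_test))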

Time Series

HoltWinters

class cuml.ExponentialSmoothing

Implements a HoltWinters time series analysis model which is used in both forecasting future entries in a time series as well as in providing exponential smoothing, where weights are assigned against historical data with exponentially decreasing impact. This is done by analyzing three components of the data: level, trend, and seasonality.

Parameters
endogarray-like (device or host)

Acceptable formats: cuDF DataFrame, cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy. Note: cuDF DataFrame types assume data is in columns, while all other datatypes assume data is in rows. The endogenous dataset to be operated on.

seasonal‘additive’, ‘add’, ‘multiplicative’, ‘mul’ (default = ‘additive’)

Whether the seasonal trend should be calculated additively or multiplicatively.

seasonal_periodsint (default=2)

The seasonality of the data (how often it repeats). For monthly data this should be 12, for weekly data, this should be 7.

start_periodsint (default=2)

Number of seasons to be used for seasonal seed values

ts_numint (default=1)

The number of different time series that were passed in the endog param.

epsnp.number > 0 (default=2.24e-3)

The accuracy that gradient descent should achieve. Note that changing this value may affect the forecasted results.

handlecuml.Handle (default=None)

If it is None, a new one is created just for this class.

Examples

from cuml import ExponentialSmoothing
import cudf
import numpy as np
data = cudf.Series([1, 2, 3, 4, 5, 6,
                   7, 8, 9, 10, 11, 12,
                   2, 3, 4, 5, 6, 7,
                   8, 9, 10, 11, 12, 13,
                   3, 4, 5, 6, 7, 8, 9,
                   10, 11, 12, 13, 14],
                   dtype=np.float64)
cu_hw = ExponentialSmoothing(data, seasonal_periods=12)
cu_hw.fit()
cu_pred = cu_hw.forecast(4)
print('Forecasted points:', cu_pred)

Output

Forecasted points :
0    4.000143766093652
1    5.000000163513641
2    6.000000000174092
3    7.000000000000178

Methods

fit(self)

Perform fitting on the given endog dataset.

forecast(self[, h, index])

Forecasts future points based on the fitted model.

get_level(self[, index])

Returns the level component of the model.

get_season(self[, index])

Returns the season component of the model.

get_trend(self[, index])

Returns the trend component of the model.

score(self[, index])

Returns the score of the model.

fit(self)

Perform fitting on the given endog dataset. Calculates the level, trend, season, and SSE components.

forecast(self, h=1, index=None)

Forecasts future points based on the fitted model.

Parameters
hint (default=1)

The number of points for each series to be forecasted.

indexint (default=None)

The index of the time series from which you want forecasted points. If None, then a cudf.DataFrame of the forecasted points from all time series is returned.

Returns
predscudf.DataFrame or cudf.Series

Series of forecasted points if index is provided. DataFrame of all forecasted points if index=None.

get_level(self, index=None)

Returns the level component of the model.

Parameters
indexint (default=None)

The index of the time series from which the level will be returned. If None, then all level components are returned in a cudf.Series.

Returns
levelcudf.Series or cudf.DataFrame

The level component of the fitted model

get_season(self, index=None)

Returns the season component of the model.

Parameters
indexint (default=None)

The index of the time series from which the season will be returned. If None, then all season components are returned in a cudf.Series.

Returns
season: cudf.Series or cudf.DataFrame

The season component of the fitted model

get_trend(self, index=None)

Returns the trend component of the model.

Parameters
indexint (default=None)

The index of the time series from which the trend will be returned. If None, then all trend components are returned in a cudf.Series.

Returns
trendcudf.Series or cudf.DataFrame

The trend component of the fitted model.

score(self, index=None)

Returns the score of the model.

Note: currently returns the SSE rather than the gradient of the log-likelihood. See https://github.com/rapidsai/cuml/issues/876

Parameters
indexint (default=None)

The index of the time series from which the SSE will be returned. If None, then all SSEs are returned in a cudf.Series.

Returns
scorenp.float32, np.float64, or cudf.Series

The SSE of the fitted model.
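As a brief sketch, the fitted cu_hw model from the example above exposes its components and score like this (with a single series, index=None returns all components):

# Continues the ExponentialSmoothing example earlier in this section.
level = cu_hw.get_level()    # level component of the fitted model
trend = cu_hw.get_trend()    # trend component
season = cu_hw.get_season()  # seasonal component
sse = cu_hw.score()          # SSE of the fit (see the note above)
print(sse)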

ARIMA

class cuml.tsa.ARIMA

Implements a batched ARIMA model for in- and out-of-sample time-series prediction, with support for seasonality (SARIMA)

ARIMA stands for Auto-Regressive Integrated Moving Average. See https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average

This class can fit an ARIMA(p,d,q) or ARIMA(p,d,q)(P,D,Q)_s model to a batch of time series of the same length with no missing values. The implementation is designed to give the best performance when using large batches of time series.

Parameters
ydataframe or array-like (device or host)

The time series data, assumed to have each time series in columns. Acceptable formats: cuDF DataFrame, cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy.

orderTuple[int, int, int]

The ARIMA order (p, d, q) of the model

seasonal_order: Tuple[int, int, int, int]

The seasonal ARIMA order (P, D, Q, s) of the model

fit_interceptbool or int

Whether to include a constant trend mu in the model (default: True)

handlecuml.Handle

If it is None, a new one is created just for this instance

verboseint or boolean (default = False)

Controls verbose level of logging.

output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’}, optional

Variable to control output type of the results and attributes of the estimators. If None, it’ll inherit the output type set at the module level, cuml.output_type. If set, the estimator will override the global option for its behavior.

References

This class is heavily influenced by the Python library statsmodels, particularly statsmodels.tsa.statespace.sarimax.SARIMAX. See https://www.statsmodels.org/stable/statespace.html

Additionally the following book is a useful reference: “Time Series Analysis by State Space Methods”, J. Durbin, S.J. Koopman, 2nd Edition (2012).

Examples

import numpy as np
from cuml.tsa.arima import ARIMA

# Create seasonal data with a trend, a seasonal pattern and noise
n_obs = 100
np.random.seed(12)
x = np.linspace(0, 1, n_obs)
pattern = np.array([[0.05, 0.0], [0.07, 0.03],
                    [-0.03, 0.05], [0.02, 0.025]])
noise = np.random.normal(scale=0.01, size=(n_obs, 2))
y = (np.column_stack((0.5*x, -0.25*x)) + noise
     + np.tile(pattern, (25, 1)))

# Fit a seasonal ARIMA model
model = ARIMA(y, (0,1,1), (0,1,1,4), fit_intercept=False)
model.fit()

# Forecast
fc = model.forecast(10)
print(fc)

Output:

[[ 0.55204599 -0.25681163]
 [ 0.57430705 -0.2262438 ]
 [ 0.48120315 -0.20583011]
 [ 0.535594   -0.24060046]
 [ 0.57207541 -0.26695497]
 [ 0.59433647 -0.23638713]
 [ 0.50123257 -0.21597344]
 [ 0.55562342 -0.25074379]
 [ 0.59210483 -0.27709831]
 [ 0.61436589 -0.24653047]]
Attributes
orderTuple[int, int, int]

The ARIMA order (p, d, q) of the model

seasonal_order: Tuple[int, int, int, int]

The seasonal ARIMA order (P, D, Q, s) of the model

interceptbool or int

Whether the model includes a constant trend mu

d_y: device array

Time series data on device

num_samples: int

Number of observations

batch_size: int

Number of time series in the batch

dtype: numpy.dtype

Floating-point type of the data and parameters

niter: numpy.ndarray

After fitting, contains the number of iterations before convergence for each time series.

Methods

fit(self[, start_params, opt_disp, h, maxiter])

Fit the ARIMA model to each time series.

forecast(self, nsteps)

Forecast the given model nsteps into the future.

get_params(self)

Get the parameters of the model

pack(self)

Pack parameters of the model into a linearized vector x

predict(self[, start, end])

Compute in-sample and/or out-of-sample prediction for each series

set_params(self, params)

Set the parameters of the model

unpack(self, x)

Unpack linearized parameter vector x into the separate parameter arrays of the model

property aic

Akaike Information Criterion

property aicc

Corrected Akaike Information Criterion

property bic

Bayesian Information Criterion

property complexity

Model complexity (number of parameters)

fit(self, start_params: Optional[Mapping[str, object]] = None, opt_disp: int = -1, h: float = 1e-09, maxiter: int = 1000)

Fit the ARIMA model to each time series.

Parameters
start_paramsMapping[str, object] (optional)

A mapping (e.g. a dictionary) of parameter names and associated arrays. The key names are in {“mu”, “ar”, “ma”, “sar”, “sma”, “sigma2”}. The arrays have shape (batch_size,) for mu parameters and (n, batch_size) for any other type, where n is the corresponding number of parameters of this type. Pass None for automatic estimation (recommended).

opt_dispint
Fit diagnostic level (for L-BFGS solver):
  • -1 for no output (default)

  • 0<n<100 for output every n steps

  • n>100 for more detailed output

hfloat

Finite-differencing step size. The gradient is computed using second-order central differencing (a short numeric illustration follows this parameter list):

g = (f(x + h) - f(x - h)) / (2 * h) + O(h^2)

maxiterint

Maximum number of iterations of L-BFGS-B
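The central difference used for the gradient above can be checked with plain Python; this illustration is not cuML code, just the formula:

# Second-order central difference: g ≈ (f(x+h) - f(x-h)) / (2h),
# accurate to O(h^2). Checked against the exact derivative of x**3.
f = lambda x: x ** 3
x, h = 2.0, 1e-4
g = (f(x + h) - f(x - h)) / (2 * h)
print(g)  # ~12.0, matching the exact derivative 3 * x**2 at x = 2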

forecast(self, nsteps: ‘int’)

Forecast the given model nsteps into the future.

nstepsint

The number of steps to forecast beyond end of the given series

get_params(self) → Dict[str, numpy.ndarray]

Get the parameters of the model

property llf

Log-likelihood of a fit model. Shape: (batch_size,)

pack(self) → numpy.ndarray

Pack parameters of the model into a linearized vector x

predict(self, start=0, end=None)

Compute in-sample and/or out-of-sample prediction for each series

set_params(self, params: Mapping[str, object])

Set the parameters of the model

params: Mapping[str, np.ndarray]

A mapping (e.g. a dictionary) of parameter names and associated arrays. The key names are in {“mu”, “ar”, “ma”, “sar”, “sma”, “sigma2”}. The arrays have shape (batch_size,) for mu parameters and (n, batch_size) for any other type, where n is the corresponding number of parameters of this type.

unpack(self, x: Union[list, numpy.ndarray])

Unpack linearized parameter vector x into the separate parameter arrays of the model
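A minimal sketch, reusing the fitted model from the ARIMA example above, of the prediction and parameter utilities documented here:

# Names follow the method descriptions above; `model` is the fitted
# seasonal ARIMA from the earlier example.
y_in = model.predict()           # in-sample prediction for each series

params = model.get_params()      # mapping of parameter arrays ("ma", "sma", ...)
x = model.pack()                 # linearized parameter vector
model.unpack(x)                  # inverse of pack
model.set_params(params)         # restore parameters from the mapping

print(model.aic, model.bic)      # information criteria, one value per series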

Multi-Node, Multi-GPU Algorithms

K-Means Clustering

class cuml.dask.cluster.KMeans(client=None, verbose=False, **kwargs)

Multi-Node Multi-GPU implementation of KMeans.

This version minimizes data transfer by sharing only the centroids between workers in each iteration.

Predictions are done embarrassingly parallel, using cuML’s single-GPU version.

For more information on this implementation, refer to the documentation for single-GPU K-Means.

Parameters
handlecuml.Handle

If it is None, a new one is created just for this class.

n_clustersint (default = 8)

The number of centroids or clusters you want.

max_iterint (default = 300)

The more iterations of EM, the more accurate, but slower.

tolfloat (default = 1e-4)

Stopping criterion when centroid means do not change much.

verboseint or boolean (default = False)

Logging level for printing diagnostic information

random_stateint (default = 1)

If you want results to be the same when you restart Python, select a state.

init{‘scalable-k-means++’, ‘k-means||’, ‘random’, or an ndarray} (default = ‘scalable-k-means++’)

‘scalable-k-means++’ or ‘k-means||’: uses fast and stable scalable k-means++ initialization. ‘random’: choose ‘n_clusters’ observations (rows) at random from the data for the initial centroids. If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

oversampling_factorint (default = 2)

The number of points to sample in scalable k-means++ initialization for potential centroids. Increasing this value can lead to better initial centroids at the cost of memory. The total number of centroids sampled in scalable k-means++ is oversampling_factor * n_clusters * 8.

max_samples_per_batchint (default = 32768)

The number of data samples to use for batches of the pairwise distance computation. This computation is done throughout both fit and predict. The default should suit most cases. The total number of elements in the batched pairwise distance computation is max_samples_per_batch * n_clusters. It might become necessary to lower this number when n_clusters becomes prohibitively large.

Attributes
cluster_centers_cuDF DataFrame or CuPy ndarray

The coordinates of the final clusters. This represents the “mean” of each data cluster.

Methods

fit(self, X)

Fit a multi-node multi-GPU KMeans model

fit_predict(self, X[, delayed])

Compute cluster centers and predict cluster index for each sample.

fit_transform(self, X[, delayed])

Calls fit followed by transform using a distributed KMeans model

predict(self, X[, delayed])

Predict labels for the input

score(self, X)

Computes the inertia score for the trained KMeans centroids.

transform(self, X[, delayed])

Transforms the input into the learned centroid space

get_param_names

fit(self, X)

Fit a multi-node multi-GPU KMeans model

Parameters
XDask cuDF DataFrame or CuPy backed Dask Array
Training data to cluster.
fit_predict(self, X, delayed=True)

Compute cluster centers and predict cluster index for each sample.

Parameters
XDask cuDF DataFrame or CuPy backed Dask Array

Data to predict

Returns
result: Dask cuDF DataFrame or CuPy backed Dask Array

Distributed object containing predictions

fit_transform(self, X, delayed=True)

Calls fit followed by transform using a distributed KMeans model

Parameters
XDask cuDF DataFrame or CuPy backed Dask Array

Data to predict

delayedbool (default = True)

Whether to execute as a delayed task or an eager one.

Returns
result: Dask cuDF DataFrame or CuPy backed Dask Array

Distributed object containing the transformed data

predict(self, X, delayed=True)

Predict labels for the input

Parameters
XDask cuDF DataFrame or CuPy backed Dask Array

Data to predict

delayedbool (default = True)

Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.

Returns
result: Dask cuDF DataFrame or CuPy backed Dask Array

Distributed object containing predictions

score(self, X)

Computes the inertia score for the trained KMeans centroids.

Parameters
Xdask_cudf.Dataframe

Dataframe to compute score

Returns
Inertia score
transform(self, X, delayed=True)

Transforms the input into the learned centroid space

Parameters
XDask cuDF DataFrame or CuPy backed Dask Array

Data to predict

delayedbool (default = True)

Whether to execute as a delayed task or an eager one.

Returns
result: Dask cuDF DataFrame or CuPy backed Dask Array

Distributed object containing the transformed data
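The following sketch shows the typical fit/predict flow; the cluster setup mirrors the dask examples used elsewhere in this document, and the make_blobs arguments are illustrative:

# A minimal MNMG KMeans sketch (illustrative blob parameters).
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from cuml.dask.cluster import KMeans
from cuml.dask.datasets import make_blobs

cluster = LocalCUDACluster(threads_per_worker=1)
client = Client(cluster)

X, _ = make_blobs(1000, 10, centers=5, n_parts=2,
                  cluster_std=0.1, output='array')

model = KMeans(n_clusters=5)
labels = model.fit_predict(X)   # distributed labels, lazy by default
print(labels.compute())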

Nearest Neighbors

class cuml.dask.neighbors.NearestNeighbors(client=None, streams_per_handle=0, **kwargs)

Multi-node Multi-GPU NearestNeighbors Model.

Methods

fit(self, X)

Fit a multi-node multi-GPU Nearest Neighbors index

get_neighbors(self, n_neighbors)

Returns the default n_neighbors, initialized from the constructor, if n_neighbors is None.

kneighbors(self[, X, n_neighbors, …])

Query the distributed nearest neighbors index

fit(self, X)

Fit a multi-node multi-GPU Nearest Neighbors index

Parameters
Xdask_cudf.Dataframe
Returns
self: NearestNeighbors model
get_neighbors(self, n_neighbors)

Returns the default n_neighbors, initialized from the constructor, if n_neighbors is None.

Parameters
n_neighborsint

Number of neighbors

Returns
n_neighbors: int

Default n_neighbors if the n_neighbors parameter is None

kneighbors(self, X=None, n_neighbors=None, return_distance=True, _return_futures=False)

Query the distributed nearest neighbors index

Parameters
Xdask_cudf.Dataframe

Vectors to query. If not provided, neighbors of each indexed point are returned.

n_neighborsint

Number of neighbors to query for each row in X. If not provided, the n_neighbors on the model are used.

return_distanceboolean (default=True)

If false, only indices are returned

Returns
rettuple (dask_cudf.DataFrame, dask_cudf.DataFrame)

The first dask-cuDF DataFrame contains the distances, the second contains the indices.
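A short sketch of the workflow, assuming X_ddf is a dask_cudf DataFrame already distributed across the workers (the name is illustrative):

# Fit the distributed index, then query it; kneighbors returns a
# (distances, indices) pair of dask_cudf DataFrames by default.
from cuml.dask.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=3)
nn.fit(X_ddf)
distances, indices = nn.kneighbors(X_ddf)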

Principal Component Analysis

class cuml.dask.decomposition.PCA(client=None, verbose=False, **kwargs)

PCA (Principal Component Analysis) is a fundamental dimensionality reduction technique used to combine features in X in linear combinations such that each new component captures the most information or variance of the data. n_components is usually small, say 3, where it can be used for data visualization, data compression, and exploratory analysis.

cuML’s multi-node multi-gpu (MNMG) PCA expects a dask cuDF input, and provides a “Full” algorithm. It uses a full eigendecomposition then selects the top K eigenvectors.

Parameters
handlecuml.Handle

If it is None, a new one is created just for this class

n_componentsint (default = 1)

The number of top K singular vectors / values you want. Must be <= number(columns).

svd_solver‘full’

Only the Full algorithm is supported since it’s significantly faster on GPU than the other solvers, including randomized SVD.

verboseint or boolean (default = False)

Logging level

whitenboolean (default = False)

If True, de-correlates the components. This is done by dividing them by the corresponding singular values then multiplying by sqrt(n_samples). Whitening allows each component to have unit variance and removes multi-collinearity. It might be beneficial for downstream tasks like LinearRegression where correlated features cause problems.

Notes

PCA considers linear combinations of features, specifically those that maximise global variance structure. This means PCA is fantastic for global structure analyses, but weak for local relationships. Consider UMAP or T-SNE for a locally important embedding.

Applications of PCA

PCA is used extensively in practice for data visualization and data compression. It has been used to visualize extremely large word embeddings like Word2Vec and GloVe in 2 or 3 dimensions, large datasets of everyday objects and images, and to distinguish cancerous cells from healthy cells.

For an additional example see the PCA notebook. For additional docs, see scikitlearn’s PCA.

Examples

from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
import numpy as np
import cuml
from cuml.dask.decomposition import PCA
from cuml.dask.datasets import make_blobs

cluster = LocalCUDACluster(threads_per_worker=1)
client = Client(cluster)

nrows = 6
ncols = 3
n_parts = 2

X_cudf, _ = make_blobs(nrows, ncols, 1, n_parts,
                cluster_std=0.01,
                verbose=cuml.logger.level_info,
                random_state=10, dtype=np.float32)

wait(X_cudf)

print("Input Matrix")
print(X_cudf.compute())

cumlModel = PCA(n_components = 1, whiten=False)
XT = cumlModel.fit_transform(X_cudf)

print("Transformed Input Matrix")
print(XT.compute())

Output:

Input Matrix:
          0         1         2
          0 -6.520953  0.015584 -8.828546
          1 -6.507554  0.016524 -8.836799
          2 -6.518214  0.010457 -8.821301
          0 -6.520953  0.015584 -8.828546
          1 -6.507554  0.016524 -8.836799
          2 -6.518214  0.010457 -8.821301

Transformed Input Matrix:
                    0
          0 -0.003271
          1  0.011454
          2 -0.008182
          0 -0.003271
          1  0.011454
          2 -0.008182
Note: every time this code is run the output will be different, because the make_blobs function generates random matrices.

Attributes
components_array

The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X)

explained_variance_array

How much each component explains the variance in the data given by S**2

explained_variance_ratio_array

How much in % the variance is explained given by S**2/sum(S**2)

singular_values_array

The top K singular values. Remember all singular values >= 0

mean_array

The column-wise mean of X. Used to mean-center the data first.

noise_variance_float

From Bishop 1999’s Textbook. Used in later tasks like calculating the estimated covariance of X.

Methods

fit(self, X)

Fit the model with X.

fit_transform(self, X)

Fit the model with X and apply the dimensionality reduction on X.

inverse_transform(self, X[, delayed])

Transform data back to its original space.

transform(self, X[, delayed])

Apply dimensionality reduction to X.

get_param_names

fit(self, X)

Fit the model with X.

Parameters
Xdask cuDF input
fit_transform(self, X)

Fit the model with X and apply the dimensionality reduction on X.

Parameters
Xdask cuDF
Returns
X_newdask cuDF
inverse_transform(self, X, delayed=True)

Transform data back to its original space.

In other words, return an input X_original whose transform would be X.

Parameters
Xdask cuDF
Returns
X_originaldask cuDF
transform(self, X, delayed=True)

Apply dimensionality reduction to X.

X is projected on the first principal components previously extracted from a training set.

Parameters
Xdask cuDF
Returns
X_newdask cuDF
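Continuing the PCA example above, transform and inverse_transform compose into a round trip through the component space:

# Project into the learned space and map back; with n_components=1 the
# reconstruction only approximates the original input.
XT = cumlModel.transform(X_cudf)
X_restored = cumlModel.inverse_transform(XT)
print(X_restored.compute())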

Random Forest

class cuml.dask.ensemble.RandomForestClassifier(workers=None, client=None, verbose=False, n_estimators=10, seed=None, **kwargs)

Experimental API implementing a multi-GPU Random Forest classifier model which fits multiple decision tree classifiers in an ensemble. This uses Dask to partition data over multiple GPUs (possibly on different nodes).

Currently, this API makes the following assumptions:

  • The set of Dask workers used between instantiation, fit, and predict are all consistent

  • Training data comes in the form of cuDF dataframes, distributed so that each worker has at least one partition.

Future versions of the API will support more flexible data distribution and additional input types.

The distributed algorithm uses an embarrassingly-parallel approach. For a forest with N trees being built on w workers, each worker simply builds N/w trees on the data it has available locally. In many cases, partitioning the data so that each worker builds trees on a subset of the total dataset works well, but it generally requires the data to be well-shuffled in advance. Alternatively, callers can replicate all of the data across workers so that rf.fit receives w partitions, each containing the same data. This would produce results approximately identical to single-GPU fitting.

Please check the single-GPU implementation of Random Forest classifier for more information about the underlying algorithm.

Parameters
n_estimatorsint (default = 10)

total number of trees in the forest (not per-worker)

handlecuml.Handle

If it is None, a new one is created just for this class.

split_criterionint (default = 0)

The criterion used to split nodes. 0 for GINI, 1 for ENTROPY, 4 for CRITERION_END. 2 and 3 are not valid for classification.

split_algoint (default = 1)

0 for HIST and 1 for GLOBAL_QUANTILE. The algorithm to determine how nodes are split in the tree.

bootstrapboolean (default = True)

Control bootstrapping. If set, each tree in the forest is built on a bootstrapped sample with replacement. If false, sampling without replacement is done.

bootstrap_featuresboolean (default = False)

Control bootstrapping for features. If features are drawn with or without replacement

rows_samplefloat (default = 1.0)

Ratio of dataset rows used while fitting each tree.

max_depthint (default = -1)

Maximum tree depth. Unlimited (i.e, until leaves are pure), if -1.

max_leavesint (default = -1)

Maximum leaf nodes per tree. Soft constraint. Unlimited, if -1.

max_featuresint, float, or string (default = ‘auto’)

Ratio of number of features (columns) to consider per node split.

n_binsint (default = 8)

Number of bins used by the split algorithm.

min_rows_per_nodeint (default = 2)

The minimum number of samples (rows) needed to split a node.

quantile_per_treeboolean (default = False)

Whether quantile is computed for individual RF trees. Only relevant for GLOBAL_QUANTILE split_algo.

n_streamsint (default = 4 )

Number of parallel streams used for forest building

workersoptional, list of strings

Dask addresses of workers to use for computation. If None, all available Dask workers will be used.

seedint (default = None)

Base seed for the random number generator. Unseeded by default. Does not currently fully guarantee the exact same results.

Examples

For usage examples, please see the RAPIDS notebooks repository: https://github.com/rapidsai/notebooks/blob/branch-0.12/cuml/random_forest_mnmg_demo.ipynb
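Complementing the notebook above, a minimal sketch (this assumes a dask Client connected to a dask-cuda cluster with n_workers workers; the data and names here are illustrative):

# Build tiny cuDF inputs, distribute one partition per worker, then
# fit and predict with the distributed forest.
import numpy as np
import cudf
import dask_cudf
from cuml.dask.ensemble import RandomForestClassifier

X_cudf = cudf.DataFrame({'a': np.random.rand(1000).astype(np.float32),
                         'b': np.random.rand(1000).astype(np.float32)})
y_cudf = cudf.Series(np.random.randint(0, 2, 1000).astype(np.int32))

X_dask = dask_cudf.from_cudf(X_cudf, npartitions=n_workers)
y_dask = dask_cudf.from_cudf(y_cudf, npartitions=n_workers)

rf = RandomForestClassifier(n_estimators=10)
rf.fit(X_dask, y_dask)
preds = rf.predict(X_dask)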

Methods

fit(self, X, y[, convert_dtype])

Fit the input data with a Random Forest classifier

get_params(self[, deep])

Returns the value of all parameters required to configure this estimator as a dictionary.

predict(self, X[, output_class, algo, …])

Predicts the labels for X.

predict_model_on_cpu(self, X[, convert_dtype])

Predicts the labels for X.

predict_proba(self, X[, delayed])

Predicts the probability of each class for X.

print_summary(self)

Print the summary of the forest used to train and test the model.

set_params(self, **params)

Sets the value of parameters required to configure this estimator; it functions similarly to sklearn's set_params.

predict_using_fil

fit(self, X, y, convert_dtype=False)

Fit the input data with a Random Forest classifier

IMPORTANT: X is expected to be partitioned with at least one partition on each Dask worker being used by the forest (self.workers).

If a worker has multiple data partitions, they will be concatenated before fitting, which will lead to additional memory usage. To minimize memory consumption, ensure that each worker has exactly one partition.

When persisting data, you can use cuml.dask.common.utils.persist_across_workers to simplify this:

X_dask_cudf = dask_cudf.from_cudf(X_cudf, npartitions=n_workers)
y_dask_cudf = dask_cudf.from_cudf(y_cudf, npartitions=n_workers)
X_dask_cudf, y_dask_cudf = persist_across_workers(dask_client,
                                                  [X_dask_cudf,
                                                   y_dask_cudf])

This is equivalent to calling persist with the data and workers:

X_dask_cudf, y_dask_cudf = dask_client.persist(
    [X_dask_cudf, y_dask_cudf],
    workers={X_dask_cudf: workers, y_dask_cudf: workers})

Parameters
XDask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features)

Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).

yDask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)

Labels of training examples. y must be partitioned the same way as X

convert_dtypebool, optional (default = False)

When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_params(self, deep=True)

Returns the value of all parameters required to configure this estimator as a dictionary.

Parameters
deepboolean (default = True)
predict(self, X, output_class=True, algo='auto', threshold=0.5, convert_dtype=True, predict_model='GPU', fil_sparse_format='auto', delayed=True)

Predicts the labels for X.

GPU-based prediction in a multi-node, multi-GPU context works by sending the sub-forest from each worker to the client, concatenating these into one forest with the full n_estimators set of trees, and sending this combined forest to the workers, which will each infer on their local set of data. Within the worker, this uses the cuML Forest Inference Library (cuml.fil) for high-throughput prediction.

This allows inference to scale to large datasets, but the forest transmission incurs overheads for very large trees. For inference on small datasets, this overhead may dominate prediction time.

The ‘CPU’ fallback method works with sub-forests in-place, broadcasting the datasets to all workers and combining predictions via a voting method at the end. This method is slower on a per-row basis but may be faster for problems with many trees and few rows.

In the 0.15 cuML release, inference will be updated with much faster tree transfer. Preliminary builds with this updated approach will be available from rapids.ai

Parameters
XDask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features)

Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).

output_classboolean (default = True)

This is optional and required only while performing the predict operation on the GPU. If true, return a 1 or 0 depending on whether the raw prediction exceeds the threshold. If False, just return the raw prediction.

algostring (default = ‘auto’)

This is optional and required only while performing the predict operation on the GPU.

  • ‘naive’ - simple inference using shared memory

  • ‘tree_reorg’ - similar to naive but trees rearranged to be more coalescing-friendly

  • ‘batch_tree_reorg’ - similar to tree_reorg but predicting multiple rows per thread block

  • ‘auto’ - choose the algorithm automatically. Currently ‘batch_tree_reorg’ is used for dense storage and ‘naive’ for sparse storage

thresholdfloat (default = 0.5)

Threshold used for classification. Optional and required only while performing the predict operation on the GPU, that is, for predict_model=’GPU’. It is applied if output_class == True, else it is ignored.

convert_dtypebool, optional (default = True)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

predict_modelString (default = ‘GPU’)

‘GPU’ to predict using the GPU, ‘CPU’ otherwise. The GPU can only be used if the model was trained on float32 data and X is float32 or convert_dtype is set to True.

fil_sparse_formatboolean or string (default = auto)

This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.

  • ‘auto’ - choose the storage type automatically (currently True is chosen by auto)

  • False - create a dense forest

  • True - create a sparse forest; requires algo=’naive’ or algo=’auto’

delayedbool (default = True)

Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one. It is not required while using predict_model=’CPU’.

Returns
yDask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)
predict_model_on_cpu(self, X, convert_dtype=True)

Predicts the labels for X.

Parameters
XDask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features)

Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).

convert_dtypebool, optional (default = True)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

Returns
yDask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)
predict_proba(self, X, delayed=True, **kwargs)

Predicts the probability of each class for X.

See the documentation of predict for notes on performance.

Parameters
XDask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features)

Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).

predict_modelString (default = ‘GPU’)

‘GPU’ to predict using the GPU, ‘CPU’ otherwise. The ‘GPU’ can only be used if the model was trained on float32 data and X is float32 or convert_dtype is set to True. Also the ‘GPU’ should only be used for binary classification problems.

output_classboolean (default = True)

This is optional and required only while performing the predict operation on the GPU. If true, return a 1 or 0 depending on whether the raw prediction exceeds the threshold. If False, just return the raw prediction.

algostring (default = ‘auto’)

This is optional and required only while performing the predict operation on the GPU.

  • ‘naive’ - simple inference using shared memory

  • ‘tree_reorg’ - similar to naive but trees rearranged to be more coalescing-friendly

  • ‘batch_tree_reorg’ - similar to tree_reorg but predicting multiple rows per thread block

  • ‘auto’ - choose the algorithm automatically. Currently ‘batch_tree_reorg’ is used for dense storage and ‘naive’ for sparse storage

thresholdfloat (default = 0.5)

Threshold used for classification. Optional and required only while performing the predict operation on the GPU. It is applied if output_class == True, else it is ignored

num_classesint (default = 2)

number of different classes present in the dataset

convert_dtypebool, optional (default = True)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

fil_sparse_formatboolean or string (default = auto)

This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.

  • ‘auto’ - choose the storage type automatically (currently True is chosen by auto)

  • False - create a dense forest

  • True - create a sparse forest; requires algo=’naive’ or algo=’auto’

Returns
yDask cuDF dataframe or CuPy backed Dask Array (n_rows, n_classes)

print_summary(self)

Print the summary of the forest used to train and test the model.

set_params(self, **params)

Sets the value of parameters required to configure this estimator; it functions similarly to sklearn's set_params.

Parameters
paramsdict of new params.
class cuml.dask.ensemble.RandomForestRegressor(workers=None, client=None, verbose=False, n_estimators=10, seed=None, **kwargs)

Experimental API implementing a multi-GPU Random Forest regressor model which fits multiple decision tree regressors in an ensemble. This uses Dask to partition data over multiple GPUs (possibly on different nodes).

Currently, this API makes the following assumptions:

  • The set of Dask workers used between instantiation, fit, and predict are all consistent

  • Training data comes in the form of cuDF dataframes, distributed so that each worker has at least one partition.

Future versions of the API will support more flexible data distribution and additional input types. User-facing APIs are expected to change in upcoming versions.

The distributed algorithm uses an embarrassingly-parallel approach. For a forest with N trees being built on w workers, each worker simply builds N/w trees on the data it has available locally. In many cases, partitioning the data so that each worker builds trees on a subset of the total dataset works well, but it generally requires the data to be well-shuffled in advance. Alternatively, callers can replicate all of the data across workers so that rf.fit receives w partitions, each containing the same data. This would produce results approximately identical to single-GPU fitting.

Please check the single-GPU implementation of Random Forest regressor for more information about the underlying algorithm.

Parameters
n_estimatorsint (default = 10)

total number of trees in the forest (not per-worker)

handlecuml.Handle

If it is None, a new one is created just for this class.

split_algoint (default = 1)

0 for HIST, 1 for GLOBAL_QUANTILE. The type of algorithm used to create the trees.

split_criterionint (default = 2)

The criterion used to split nodes. 0 for GINI, 1 for ENTROPY, 2 for MSE, 3 for MAE, and 4 for CRITERION_END. 0 and 1 are not valid for regression.

bootstrapboolean (default = True)

Control bootstrapping. If set, each tree in the forest is built on a bootstrapped sample with replacement. If false, sampling without replacement is done.

bootstrap_featuresboolean (default = False)

Control bootstrapping for features. If features are drawn with or without replacement

rows_samplefloat (default = 1.0)

Ratio of dataset rows used while fitting each tree.

max_depthint (default = -1)

Maximum tree depth. Unlimited (i.e, until leaves are pure), if -1.

max_leavesint (default = -1)

Maximum leaf nodes per tree. Soft constraint. Unlimited, if -1.

max_featuresint or float or string or None (default = ‘auto’)

Ratio of number of features (columns) to consider per node split. If int then max_features/n_features. If float then max_features is a fraction. If ‘auto’ then max_features=n_features which is 1.0. If ‘sqrt’ then max_features=1/sqrt(n_features). If ‘log2’ then max_features=log2(n_features)/n_features. If None, then max_features=n_features which is 1.0.

n_binsint (default = 8)

Number of bins used by the split algorithm.

min_rows_per_nodeint or float (default = 2)

The minimum number of samples (rows) needed to split a node. If int, it is the minimum number of sample rows. If float, it is min_rows_per_node*n_rows.

accuracy_metricstring (default = ‘mse’)

Decides the metric used to evaluate the performance of the model.

  • ‘median_ae’ - median of absolute error

  • ‘mean_ae’ - mean of absolute error

  • ‘mse’ - mean squared error

n_streamsint (default = 4 )

Number of parallel streams used for forest building

workersoptional, list of strings

Dask addresses of workers to use for computation. If None, all available Dask workers will be used.

seedint (default = None)

Base seed for the random number generator. Unseeded by default. Does not currently fully guarantee the exact same results.

Methods

fit(self, X, y[, convert_dtype])

Fit the input data with a Random Forest regression model

get_params(self[, deep])

Returns the value of all parameters required to configure this estimator as a dictionary.

predict(self, X[, predict_model, algo, …])

Predicts the regressor outputs for X.

print_summary(self)

Print the summary of the forest used to train and test the model.

set_params(self, **params)

Sets the value of parameters required to configure this estimator; it functions similarly to sklearn's set_params.

predict_model_on_cpu

predict_using_fil

fit(self, X, y, convert_dtype=False)

Fit the input data with a Random Forest regression model

IMPORTANT: X is expected to be partitioned with at least one partition on each Dask worker being used by the forest (self.workers).

When persisting data, you can use cuml.dask.common.utils.persist_across_workers to simplify this:

X_dask_cudf = dask_cudf.from_cudf(X_cudf, npartitions=n_workers)
y_dask_cudf = dask_cudf.from_cudf(y_cudf, npartitions=n_workers)
X_dask_cudf, y_dask_cudf = persist_across_workers(dask_client,
                                                  [X_dask_cudf,
                                                   y_dask_cudf])

This is equivalent to calling persist with the data and workers:

X_dask_cudf, y_dask_cudf = dask_client.persist(
    [X_dask_cudf, y_dask_cudf],
    workers={X_dask_cudf: workers, y_dask_cudf: workers})

Parameters
XDask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features)

Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).

yDask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)

Labels of training examples. y must be partitioned the same way as X

convert_dtypebool, optional (default = False)

When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_params(self, deep=True)

Returns the value of all parameters required to configure this estimator as a dictionary.

Parameters
deepboolean (default = True)
predict(self, X, predict_model='GPU', algo='auto', convert_dtype=True, fil_sparse_format='auto', delayed=True)

Predicts the regressor outputs for X.

GPU-based prediction in a multi-node, multi-GPU context works by sending the sub-forest from each worker to the client, concatenating these into one forest with the full n_estimators set of trees, and sending this combined forest to the workers, which will each infer on their local set of data. This allows inference to scale to large datasets, but the forest transmission incurs overheads for very large trees. For inference on small datasets, this overhead may dominate prediction time. Within the worker, this uses the cuML Forest Inference Library (cuml.fil) for high-throughput prediction.

The ‘CPU’ fallback method works with sub-forests in-place, broadcasting the datasets to all workers and combining predictions via an averaging method at the end. This method is slower on a per-row basis but may be faster for problems with many trees and few rows.

In the 0.15 cuML release, inference will be updated with much faster tree transfer.

Parameters
XDask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features)

Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).

algostring (default = ‘auto’)

This is optional and required only while performing the predict operation on the GPU.

  • ‘naive’ - simple inference using shared memory

  • ‘tree_reorg’ - similar to naive but trees rearranged to be more coalescing-friendly

  • ‘batch_tree_reorg’ - similar to tree_reorg but predicting multiple rows per thread block

  • ‘auto’ - choose the algorithm automatically. Currently ‘batch_tree_reorg’ is used for dense storage and ‘naive’ for sparse storage

convert_dtypebool, optional (default = True)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

predict_modelString (default = ‘GPU’)

‘GPU’ to predict using the GPU, ‘CPU’ otherwise. The GPU can only be used if the model was trained on float32 data and X is float32 or convert_dtype is set to True.

fil_sparse_formatboolean or string (default = auto)

This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.

  • ‘auto’ - choose the storage type automatically (currently True is chosen by auto)

  • False - create a dense forest

  • True - create a sparse forest; requires algo=’naive’ or algo=’auto’

delayedbool (default = True)

Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.

Returns
yDask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)
print_summary(self)

Print the summary of the forest used to train and test the model.

set_params(self, **params)

Sets the value of parameters required to configure this estimator; it functions similarly to sklearn's set_params.

Parameters
paramsdict of new params
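Usage mirrors the classifier sketch earlier in this section; here the target would hold continuous values instead of class labels (X_dask and y_dask are the same illustrative names used above):

# A minimal regressor sketch under the same dask setup assumptions as
# the classifier sketch above.
from cuml.dask.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(n_estimators=10)
rfr.fit(X_dask, y_dask)                  # y_dask: continuous target
preds = rfr.predict(X_dask, delayed=True)
print(preds.compute())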

Truncated SVD

class cuml.dask.decomposition.TruncatedSVD(client=None, **kwargs)
Parameters
handlecuml.Handle

If it is None, a new one is created just for this class

n_componentsint (default = 1)

The number of top K singular vectors / values you want. Must be <= number(columns).

svd_solver‘full’

Only the Full algorithm is supported since it’s significantly faster on GPU than the other solvers, including randomized SVD.

verboseint or boolean (default = False)

Logging level

Examples

from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
import numpy as np
import cuml
from cuml.dask.decomposition import TruncatedSVD
from cuml.dask.datasets import make_blobs

cluster = LocalCUDACluster(threads_per_worker=1)
client = Client(cluster)

nrows = 6
ncols = 3
n_parts = 2

X_cudf, _ = make_blobs(nrows, ncols, 1, n_parts,
                cluster_std=1.8,
                verbose=cuml.logger.level_info,
                random_state=10, dtype=np.float32)

wait(X_cudf)

print("Input Matrix")
print(X_cudf.compute())

cumlModel = TruncatedSVD(n_components = 1)
XT = cumlModel.fit_transform(X_cudf)

print("Transformed Input Matrix")
print(XT.compute())

Output:

Input Matrix:
                    0         1          2
          0 -8.519647 -8.519222  -8.865648
          1 -6.107700 -8.350124 -10.351215
          2 -8.026635 -9.442240  -7.561770
          0 -8.519647 -8.519222  -8.865648
          1 -6.107700 -8.350124 -10.351215
          2 -8.026635 -9.442240  -7.561770

Transformed Input Matrix:
                     0
          0  14.928891
          1  14.487295
          2  14.431235
          0  14.928891
          1  14.487295
          2  14.431235
Note: every time this code is run the output will be different, because the make_blobs function generates random matrices.

Attributes
components_array

The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X)

explained_variance_array

How much each component explains the variance in the data given by S**2

explained_variance_ratio_array

How much in % the variance is explained given by S**2/sum(S**2)

singular_values_array

The top K singular values. Remember all singular values >= 0

Methods

fit(self, X[, _transform])

Fit the model with X.

fit_transform(self, X)

Fit the model with X and apply the dimensionality reduction on X.

inverse_transform(self, X[, delayed])

Transform data back to its original space.

transform(self, X[, delayed])

Apply dimensionality reduction to X.

get_param_names

fit(self, X, _transform=False)

Fit the model with X.

Parameters
Xdask cuDF input
fit_transform(self, X)

Fit the model with X and apply the dimensionality reduction on X.

Parameters
Xdask cuDF
Returns
X_newdask cuDF
inverse_transform(self, X, delayed=True)

Transform data back to its original space.

In other words, return an input X_original whose transform would be X.

Parameters
Xdask cuDF
Returns
X_originaldask cuDF
transform(self, X, delayed=True)

Apply dimensionality reduction to X.

X is projected on the first principal components previously extracted from a training set.

Parameters
Xdask cuDF
Returns
X_newdask cuDF
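Continuing the TruncatedSVD example above, inverse_transform maps the projection back to the original space:

# Round trip through the singular-vector space; with n_components=1 the
# reconstruction is only approximate.
X_restored = cumlModel.inverse_transform(XT)
print(X_restored.compute())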

Manifold

class cuml.dask.manifold.UMAP(model, client=None, **kwargs)

Uniform Manifold Approximation and Projection Finds a low dimensional embedding of the data that approximates an underlying manifold.

Adapted from https://github.com/lmcinnes/umap/blob/master/umap/umap.py

Notes

This module is heavily based on Leland McInnes’ reference UMAP package. However, there are a number of differences and features that are not yet implemented in cuml.umap:

  • Using a non-Euclidean distance metric (support for a fixed set of non-Euclidean metrics is planned for an upcoming release).

  • Using a pre-computed pairwise distance matrix (under consideration for future releases)

  • Manual initialization of initial embedding positions

In addition to these missing features, you should expect to see the final embeddings differing between cuml.umap and the reference UMAP. In particular, the reference UMAP uses an approximate kNN algorithm for large data sizes while cuml.umap always uses exact kNN.

Known issue: If a UMAP model has not yet been fit, it cannot be pickled

References

  • Leland McInnes, John Healy, James Melville

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction https://arxiv.org/abs/1802.03426

Examples

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from cuml.dask.datasets import make_blobs
from cuml.manifold import UMAP
from cuml.dask.manifold import UMAP as MNMG_UMAP
import numpy as np

cluster = LocalCUDACluster(threads_per_worker=1)
client = Client(cluster)

X, y = make_blobs(1000, 10,
                centers=42,
                cluster_std=0.1,
                dtype=np.float32,
                n_parts=2,
                output='array')

local_model = UMAP()

selection = np.random.choice(1000, 100)
X_train = X[selection].compute()
y_train = y[selection].compute()

local_model.fit(X_train, y=y_train)

distributed_model = MNMG_UMAP(local_model)
embedding = distributed_model.transform(X)
Note: every time this code is run, the output will be different because the make_blobs function generates random matrices.

Methods

transform(self, X[, convert_dtype])

Transform X into the existing embedded space and return that transformed output.

transform(self, X, convert_dtype=True)

Transform X into the existing embedded space and return that transformed output.

Please refer to the reference UMAP implementation for information on the differences between fit_transform() and running fit() followed by transform().

Specifically, the transform() function is stochastic: https://github.com/lmcinnes/umap/issues/158

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

New data to be transformed. Acceptable formats: dask cuDF, dask CuPy/NumPy/Numba Array

Returns
X_newarray, shape (n_samples, n_components)

Embedding of the new data in low-dimensional space.

Linear Models

class cuml.dask.linear_model.LinearRegression(client=None, verbose=False, **kwargs)

LinearRegression is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.

cuML’s dask Linear Regression (multi-node multi-GPU) expects a dask cuDF DataFrame and provides an eigendecomposition-based algorithm, Eig, to fit a linear model. (SVD, which is more stable than Eig, will be added in an upcoming version.) The Eig algorithm is usually preferred when X is a tall and skinny matrix. As the number of features in X increases, the accuracy of the Eig algorithm drops.

This is an experimental implementation of dask Linear Regression. It supports input X that has more than one column. Single-column input X will be supported after the SVD algorithm is added in an upcoming version.

Parameters
algorithm‘eig’

Eig uses an eigendecomposition of the covariance matrix and is much faster. SVD is slower, but guaranteed to be stable.

fit_interceptboolean (default = True)

LinearRegression adds an additional term c to correct for the global mean of y, modeling the response as “x * beta + c”. If False, the model expects that you have centered the data.

normalizeboolean (default = False)

If True, the predictors in X will be normalized by dividing by its L2 norm. If False, no scaling will be done.

Attributes
coef_cuDF series, shape (n_features)

The estimated coefficients for the linear regression model.

intercept_array

The independent term. If fit_intercept_ is False, will be 0.

Methods

fit(self, X, y)

Fit the model with X and y.

predict(self, X[, delayed])

Make predictions for X and returns a dask collection.

get_param_names

fit(self, X, y)

Fit the model with X and y.

Parameters
XDask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features)

Features for regression

yDask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)

Labels (outcome values)

predict(self, X, delayed=True)

Make predictions for X and returns a dask collection.

Parameters
XDask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features)

Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).

delayedbool (default = True)

Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.

Returns
yDask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)
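A minimal sketch, assuming X_dask and y_dask are distributed inputs of the shapes described above (the names are illustrative):

# Fit the distributed OLS model and make lazy predictions.
from cuml.dask.linear_model import LinearRegression

ols = LinearRegression(fit_intercept=True, normalize=False)
ols.fit(X_dask, y_dask)
y_hat = ols.predict(X_dask)   # delayed=True by default
print(y_hat.compute())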
class cuml.dask.linear_model.Ridge(client=None, verbose=False, **kwargs)

Ridge extends LinearRegression by providing L2 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors, and improves the conditioning of the problem.

cuML’s dask Ridge (multi-node multi-GPU) expects a dask cuDF DataFrame and provides an eigendecomposition-based algorithm, Eig, to fit a linear model. (SVD, which is more stable than Eig, will be added in an upcoming version.) The Eig algorithm is usually preferred when X is a tall and skinny matrix. As the number of features in X increases, the accuracy of the Eig algorithm drops.

This is an experimental implementation of dask Ridge Regression. It supports input X that has more than one column. Single-column input X will be supported after the SVD algorithm is added in an upcoming version.

Parameters
alphafloat (default = 1.0)

Regularization strength - must be a positive float. Larger values specify stronger regularization. Array input will be supported later.

solver{‘eig’}

Eig uses an eigendecomposition of the covariance matrix and is much faster. Other solvers will be supported in the future.

fit_interceptboolean (default = True)

If True, Ridge adds an additional term c to correct for the global mean of y, modeling the response as “x * beta + c”. If False, the model expects that you have centered the data.

normalizeboolean (default = False)

If True, the predictors in X will be normalized by dividing by its L2 norm. If False, no scaling will be done.

Attributes
coef_array, shape (n_features)

The estimated coefficients for the linear regression model.

intercept_array

The independent term. If fit_intercept_ is False, will be 0.

Methods

fit(self, X, y)

Fit the model with X and y.

predict(self, X[, delayed])

Make predictions for X and returns a dask collection.

get_param_names

fit(self, X, y)

Fit the model with X and y.

Parameters
XDask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features)

Features for regression

yDask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)

Labels (outcome values)

predict(self, X, delayed=True)

Make predictions for X and returns a dask collection.

Parameters
XDask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features)

Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).

delayedbool (default = True)

Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.

Returns
yDask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)
class cuml.dask.linear_model.Lasso(client=None, **kwargs)

Lasso extends LinearRegression by providing L1 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can zero some of the coefficients for feature selection and improves the conditioning of the problem.

cuML’s Lasso accepts an array-like object or cuDF DataFrame and uses coordinate descent to fit a linear model.

Parameters
alphafloat (default = 1.0)

Constant that multiplies the L1 term. alpha = 0 is equivalent to an ordinary least square, solved by the LinearRegression class. For numerical reasons, using alpha = 0 with the Lasso class is not advised. Given this, you should use the LinearRegression class.

fit_interceptboolean (default = True)

If True, Lasso tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalizeboolean (default = False)

If True, the predictors in X will be normalized by dividing by its L2 norm. If False, no scaling will be done.

max_iterint (default = 1000)

The maximum number of iterations

tolfloat (default = 1e-3)

The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.

selection{‘cyclic’, ‘random’} (default=’cyclic’)

If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially. Setting this to ‘random’ often leads to significantly faster convergence, especially when tol is higher than 1e-4.

Attributes
coef_array, shape (n_features)

The estimated coefficients for the linear regression model.

intercept_array

The independent term. If fit_intercept is False, will be 0.

For additional docs, see scikit-learn’s Lasso: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

Methods

fit(self, X, y)

Fit the model with X and y.

predict(self, X[, delayed])

Predicts the y for X.

fit(self, X, y)

Fit the model with X and y.

Parameters
XDask cuDF DataFrame or CuPy backed Dask Array

Dense matrix (floats or doubles) of shape (n_samples, n_features).

yDask cuDF DataFrame or CuPy backed Dask Array

Dense matrix (floats or doubles) of shape (n_samples, 1).

predict(self, X, delayed=True)

Predicts the y for X.

Parameters
XDask cuDF DataFrame or CuPy backed Dask Array

Dense matrix (floats or doubles) of shape (n_samples, n_features).

delayedbool (default = True)

Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.

Returns
yDask cuDF DataFrame or CuPy backed Dask Array

Dense matrix (floats or doubles) of shape (n_samples, 1).
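
A minimal, illustrative sketch of the estimator above, reusing the cluster and the dask_cudf inputs X_dask and y_dask from the Ridge sketch; the hyperparameter values are placeholders.

# Illustrative sketch: fit the Dask Lasso estimator.
from cuml.dask.linear_model import Lasso

lasso = Lasso(alpha=0.1, max_iter=1000, selection="cyclic")
lasso.fit(X_dask, y_dask)
print(lasso.predict(X_dask, delayed=False))   # eager (non-lazy) prediction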

class cuml.dask.linear_model.ElasticNet(client=None, **kwargs)

ElasticNet extends LinearRegression with combined L1 and L2 regularizations on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors, force some coefficients to be small, and improves the conditioning of the problem.

cuML’s ElasticNet accepts an array-like object or cuDF DataFrame and uses coordinate descent to fit a linear model.

Parameters
alphafloat (default = 1.0)

Constant that multiplies the L1 term. alpha = 0 is equivalent to ordinary least squares, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the ElasticNet object is not advised; in that case, use the LinearRegression object instead.

l1_ratio: float (default = 0.5)

The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.

fit_interceptboolean (default = True)

If True, ElasticNet tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalizeboolean (default = False)

If True, the predictors in X will be normalized by dividing each by its L2 norm. If False, no scaling will be done.

max_iterint (default = 1000)

The maximum number of iterations

tolfloat (default = 1e-3)

The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.

selection{‘cyclic’, ‘random’} (default=’cyclic’)

If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially. Setting this to ‘random’ often leads to significantly faster convergence, especially when tol is higher than 1e-4.

handlecuml.Handle

If it is None, a new one is created just for this class.

output_type(optional) {‘input’, ‘cudf’, ‘cupy’, ‘numpy’} default = None

Use it to control the output type of the results and attributes. If None, it inherits the output type set at the module level (cuml.output_type). If that has not been changed, the estimator will, by default, mirror the type of the data used for each fit or predict call. If set, the estimator overrides the global option for its own behavior.

Attributes
coef_array, shape (n_features)

The estimated coefficients for the linear regression model.

intercept_array

The independent term. If fit_intercept is False, will be 0.

For additional docs, see scikit-learn’s ElasticNet: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html

Methods

fit(self, X, y)

Fit the model with X and y.

predict(self, X[, delayed])

Predicts the y for X.

fit(self, X, y)

Fit the model with X and y.

Parameters
XDask cuDF DataFrame or CuPy backed Dask Array

Dense matrix (floats or doubles) of shape (n_samples, n_features).

yDask cuDF DataFrame or CuPy backed Dask Array

Dense matrix (floats or doubles) of shape (n_samples, 1).

predict(self, X, delayed=True)

Predicts the y for X.

Parameters
XDask cuDF DataFrame or CuPy backed Dask Array

Dense matrix (floats or doubles) of shape (n_samples, n_features).

delayedbool (default = True)

Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.

Returns
yDask cuDF DataFrame or CuPy backed Dask Array

Dense matrix (floats or doubles) of shape (n_samples, 1).
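
A minimal, illustrative sketch of the estimator above, again reusing X_dask and y_dask from the Ridge sketch; the alpha and l1_ratio values are placeholders.

# Illustrative sketch: fit the Dask ElasticNet estimator.
from cuml.dask.linear_model import ElasticNet

enet = ElasticNet(alpha=0.1, l1_ratio=0.5)   # even mix of L1 and L2
enet.fit(X_dask, y_dask)
print(enet.predict(X_dask).compute())        # delayed=True by default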

Solvers

class cuml.dask.solvers.CD(client=None, **kwargs)

Model-parallel, multi-GPU linear regression model fit with coordinate descent (CD). Currently, only single-process multi-GPU operation is supported.

Methods

fit(self, X, y)

Fit the model with X and y.

predict(self, X[, delayed])

Make predictions for X and returns a dask collection.

fit(self, X, y)

Fit the model with X and y.

Parameters
XDask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features)

Features for regression

yDask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)

Labels (outcome values)

predict(self, X, delayed=True)

Make predictions for X and returns a dask collection.

Parameters
XDask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features)

Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).

delayedbool (default = True)

Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.

Returns
yDask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)
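
A minimal, illustrative sketch of the solver above, reusing X_dask and y_dask from the Ridge sketch; the keyword values are placeholders passed through to the underlying single-GPU coordinate-descent solver.

# Illustrative sketch: fit the Dask CD solver.
from cuml.dask.solvers import CD

cd = CD(fit_intercept=True, max_iter=1000)
cd.fit(X_dask, y_dask)
print(cd.predict(X_dask).compute())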

Dask Base Classes and Mixins

class cuml.dask.common.base.BaseEstimator(client=None, verbose=False, **kwargs)
class cuml.dask.common.base.DelayedParallelFunc
class cuml.dask.common.base.DelayedPredictionMixin
class cuml.dask.common.base.DelayedTransformMixin
class cuml.dask.common.base.DelayedInverseTransformMixin