cuML API Reference
Module Configuration
Output Data Type Configuration
cuml.common.memory_utils.set_global_output_type(output_type)
Method to set the global output type of cuML’s single-GPU estimators. It will be used by all estimators unless overridden in their initialization with their own output_type parameter. It can also be overridden by the context manager method using_output_type().
 Parameters
 output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’} (default = ‘input’)
Desired output type of results and attributes of the estimators.
'input' will mean that the parameters and methods will mirror the format of the data sent to the estimators/methods as much as possible. Specifically:
Input type                             Output type
-------------------------------------  ------------------------
cuDF DataFrame or Series               cuDF DataFrame or Series
NumPy arrays                           NumPy arrays
Pandas DataFrame or Series             NumPy arrays
Numba device arrays                    Numba device arrays
CuPy arrays                            CuPy arrays
Other __cuda_array_interface__ objs    CuPy arrays
'cudf' will return cuDF Series for single dimensional results and DataFrames for the rest.
'cupy' will return CuPy arrays.
'numpy' will return NumPy arrays.
Notes
The 'cupy' and 'numba' options (as well as 'input' when using Numba and CuPy ndarrays for input) have the least overhead. cuDF adds memory consumption and processing time needed to build the Series and DataFrames. 'numpy' has the biggest overhead due to the need to transfer data to CPU memory.
Examples
    import cuml
    import cupy as cp

    ary = [[1.0, 4.0, 4.0], [2.0, 2.0, 2.0], [5.0, 1.0, 1.0]]
    ary = cp.asarray(ary)

    # Applies to every estimator created from here on
    cuml.set_global_output_type('cudf')
    dbscan_float = cuml.DBSCAN(eps=1.0, min_samples=1)
    dbscan_float.fit(ary)

    print("cuML output type")
    print(dbscan_float.labels_)
    print(type(dbscan_float.labels_))
Output:
    cuML output type
    0    0
    1    1
    2    2
    dtype: int32
    <class 'cudf.core.series.Series'>
cuml.common.memory_utils.using_output_type(output_type)
Context manager method to set cuML’s global output type inside a with statement. The output type is reset to its prior value once the with code block has finished executing.
 Parameters
 output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’} (default = ‘input’)
Desired output type of results and attributes of the estimators.
'input' will mean that the parameters and methods will mirror the format of the data sent to the estimators/methods as much as possible. Specifically:
Input type                             Output type
-------------------------------------  ------------------------
cuDF DataFrame or Series               cuDF DataFrame or Series
NumPy arrays                           NumPy arrays
Pandas DataFrame or Series             NumPy arrays
Numba device arrays                    Numba device arrays
CuPy arrays                            CuPy arrays
Other __cuda_array_interface__ objs    CuPy arrays
'cudf' will return cuDF Series for single dimensional results and DataFrames for the rest.
'cupy' will return CuPy arrays.
'numpy' will return NumPy arrays.
Examples
    import cuml
    import cupy as cp

    ary = [[1.0, 4.0, 4.0], [2.0, 2.0, 2.0], [5.0, 1.0, 1.0]]
    ary = cp.asarray(ary)

    with cuml.using_output_type('cudf'):
        dbscan_float = cuml.DBSCAN(eps=1.0, min_samples=1)
        dbscan_float.fit(ary)

        print("cuML output inside 'with' context")
        print(dbscan_float.labels_)
        print(type(dbscan_float.labels_))

    # use cuml again outside the context manager
    dbscan_float2 = cuml.DBSCAN(eps=1.0, min_samples=1)
    dbscan_float2.fit(ary)

    print("cuML default output")
    print(dbscan_float2.labels_)
    print(type(dbscan_float2.labels_))
Output:
    cuML output inside 'with' context
    0    0
    1    1
    2    2
    dtype: int32
    <class 'cudf.core.series.Series'>
    cuML default output
    [0 1 2]
    <class 'cupy.core.core.ndarray'>
Verbosity Levels
cuML follows a verbosity model similar to Scikit-learn’s: the verbose parameter can be a boolean or a numeric value, and higher numeric values mean more verbosity. The exact values can be set directly, or through the cuml.common.logger module, and they are:
Numeric value  cuml.common.logger value           Verbosity level
-------------  ---------------------------------  ---------------------------------------------------------------
0              cuml.common.logger.level_off       Disables all log messages
1              cuml.common.logger.level_critical  Enables only critical messages
2              cuml.common.logger.level_error     Enables all messages up to and including errors.
3              cuml.common.logger.level_warn      Enables all messages up to and including warnings.
4 or False     cuml.common.logger.level_info      Enables all messages up to and including information messages.
5 or True      cuml.common.logger.level_debug     Enables all messages up to and including debug messages.
6              cuml.common.logger.level_trace     Enables all messages up to and including trace messages.
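As a rough illustration of the table, the sketch below maps a verbose argument to its numeric level. The numbers are copied from the table; `LEVELS`, `resolve_verbosity` and `is_enabled` are hypothetical helpers written for this example and are not part of cuml.common.logger.

```python
# Numeric levels copied from the verbosity table; the dict is a local stand-in,
# not an import from cuml.common.logger.
LEVELS = {
    "level_off": 0,
    "level_critical": 1,
    "level_error": 2,
    "level_warn": 3,
    "level_info": 4,
    "level_debug": 5,
    "level_trace": 6,
}


def resolve_verbosity(verbose):
    """Map a bool or int `verbose` value to a numeric level (hypothetical helper)."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(verbose, bool):
        return LEVELS["level_debug"] if verbose else LEVELS["level_info"]
    if isinstance(verbose, int) and 0 <= verbose <= 6:
        return verbose
    raise ValueError(f"unsupported verbose value: {verbose!r}")


def is_enabled(message_level, verbose):
    """A message is emitted when its level does not exceed the configured level."""
    return 0 < message_level <= resolve_verbosity(verbose)


print(resolve_verbosity(True), resolve_verbosity(False), resolve_verbosity(3))
```

Under this reading, verbose=False still lets warnings through, because level_warn (3) is below level_info (4).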
Preprocessing, Metrics, and Utilities
Model Selection and Data Splitting
cuml.preprocessing.model_selection.train_test_split(X, y=None, test_size: Optional[Union[float, int]] = None, train_size: Optional[Union[float, int]] = None, shuffle: bool = True, random_state: Optional[Union[int, cupy.random._generator.RandomState, numpy.random.mtrand.RandomState]] = None, seed: Optional[Union[int, cupy.random._generator.RandomState, numpy.random.mtrand.RandomState]] = None, stratify=None)
Partitions device data into four collated objects, mimicking Scikit-learn’s train_test_split.
 Parameters
 X : cudf.DataFrame or cuda_array_interface compliant device array
Data to split, has shape (n_samples, n_features)
 y : str, cudf.Series or cuda_array_interface compliant device array
Set of labels for the data, either a series of shape (n_samples) or the string label of a column in X (if it is a cuDF DataFrame) containing the labels
 train_size : float or int, optional
If float, represents the proportion [0, 1] of the data to be assigned to the training set. If an int, represents the number of instances to be assigned to the training set. Defaults to 0.8
 shuffle : bool, optional
Whether or not to shuffle inputs before splitting
 random_state : int, CuPy RandomState or NumPy RandomState, optional
If shuffle is true, seeds the generator. Unseeded by default
 seed : int, CuPy RandomState or NumPy RandomState, optional
If shuffle is true, seeds the generator. Unseeded by default
Deprecated since version 0.11: Parameter seed is deprecated and will be removed in 0.17. Please use random_state instead.
 stratify : bool, optional
Whether to stratify the input data based on class labels. None by default
 Returns
 X_train, X_test, y_train, y_test : cudf.DataFrame or array-like objects
Partitioned dataframes if X and y were cuDF objects. If y was provided as a column name, the column was dropped from X. Partitioned numba device arrays if X and y were Numba device arrays. Partitioned CuPy arrays for any other input.
Examples
    import cudf
    from cuml.model_selection import train_test_split

    # Generate some sample data
    df = cudf.DataFrame({'x': range(10), 'y': [0, 1] * 5})
    print(f'Original data: {df.shape[0]} elements')

    # Suppose we want an 80/20 split
    X_train, X_test, y_train, y_test = train_test_split(df, 'y', train_size=0.8)
    print(f'X_train: {X_train.shape[0]} elements')
    print(f'X_test: {X_test.shape[0]} elements')
    print(f'y_train: {y_train.shape[0]} elements')
    print(f'y_test: {y_test.shape[0]} elements')

    # Alternatively, if our labels are stored separately
    labels = df['y']
    df = df.drop(['y'], axis=1)

    # we can also do
    X_train, X_test, y_train, y_test = train_test_split(df, labels, train_size=0.8)
Output:
    Original data: 10 elements
    X_train: 8 elements
    X_test: 2 elements
    y_train: 8 elements
    y_test: 2 elements
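The float-versus-int meaning of train_size can be sketched in plain Python. `split_sizes` is a hypothetical helper, and truncating the float case with int() is an assumption made for the example; cuML's own rounding may differ.

```python
def split_sizes(n_samples, train_size=0.8):
    """Return (n_train, n_test) for a float proportion or an int count.

    Hypothetical helper: floats are proportions in [0, 1], ints are
    absolute numbers of training instances.
    """
    if isinstance(train_size, float):
        if not 0.0 <= train_size <= 1.0:
            raise ValueError("float train_size must lie in [0, 1]")
        n_train = int(n_samples * train_size)  # assumed truncation
    else:
        if not 0 <= train_size <= n_samples:
            raise ValueError("int train_size must lie in [0, n_samples]")
        n_train = int(train_size)
    return n_train, n_samples - n_train


print(split_sizes(10, 0.8))  # proportion of the data
print(split_sizes(10, 7))    # absolute number of instances
```

With 10 samples and train_size=0.8 this gives the 8/2 split shown in the output above.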
Feature and Label Encoding (Single-GPU)
class cuml.preprocessing.LabelEncoder.LabelEncoder(*, handle_unknown='error', handle=None, verbose=False, output_type=None)
An nvcategory based implementation of ordinal label encoding
 Parameters
 handle_unknown : {‘error’, ‘ignore’}, default=’error’
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform or inverse transform, the resulting encoding will be null.
 handle : cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 verbose : int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
Examples
Converting a categorical implementation to a numerical one
    import cudf
    from cudf import DataFrame, Series
    from cuml.preprocessing import LabelEncoder

    data = DataFrame({'category': ['a', 'b', 'c', 'd']})

    # There are two functionally equivalent ways to do this
    le = LabelEncoder()
    le.fit(data.category)  # le = le.fit(data.category) also works
    encoded = le.transform(data.category)
    print(encoded)

    # This method is preferred
    le = LabelEncoder()
    encoded = le.fit_transform(data.category)
    print(encoded)

    # We can assign this to a new column
    data = data.assign(encoded=encoded)
    print(data.head())

    # We can also encode more data
    test_data = Series(['c', 'a'])
    encoded = le.transform(test_data)
    print(encoded)

    # After train, ordinal label can be inverse_transform() back to
    # string labels
    ord_label = cudf.Series([0, 0, 1, 2, 1])
    str_label = le.inverse_transform(ord_label)
    print(str_label)
Output:
    0    0
    1    1
    2    2
    3    3
    dtype: int64
    0    0
    1    1
    2    2
    3    3
    dtype: int32
      category  encoded
    0        a        0
    1        b        1
    2        c        2
    3        d        3
    0    2
    1    0
    dtype: int64
    0    a
    1    a
    2    b
    3    c
    4    b
    dtype: object
Methods
fit(y[, _classes])      Fit a LabelEncoder (nvcategory) instance to a set of categories
fit_transform(y[, z])   Simultaneously fit and transform an input
get_param_names(self)   Returns a list of hyperparameter names owned by this class.
inverse_transform(y)    Revert ordinal label to original label
transform(y)            Transform an input into its categorical keys.
fit(y, _classes=None)
Fit a LabelEncoder (nvcategory) instance to a set of categories
 Parameters
 y : cudf.Series
Series containing the categories to be encoded. Its elements may or may not be unique
 _classes : int or None.
Passed by the dask client when dask LabelEncoder is used.
 Returns
 self : LabelEncoder
A fitted instance of itself to allow method chaining
fit_transform(y: cudf.core.series.Series, z=None) → cudf.core.series.Series
Simultaneously fit and transform an input
This is functionally equivalent to (but faster than) LabelEncoder().fit(y).transform(y)
get_param_names(self)
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in turn owns. This is to simplify the implementation of get_params and set_params methods.
inverse_transform(y: cudf.core.series.Series) → cudf.core.series.Series
Revert ordinal label to original label
 Parameters
 y : cudf.Series, dtype=int32
Ordinal labels to be reverted
 Returns
 reverted : cudf.Series
Reverted labels
transform(y: cudf.core.series.Series) → cudf.core.series.Series
Transform an input into its categorical keys.
This is intended for use with small inputs relative to the size of the dataset. For fitting and transforming an entire dataset, prefer fit_transform.
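To make the fit / transform / inverse_transform relationship and the handle_unknown option concrete, here is a CPU-only sketch using plain lists. It is an illustration of ordinal encoding in general, not cuML's nvcategory-based code; the three function names are hypothetical, sorting the categories is an assumption that happens to match the example output above, and None stands in for a null entry.

```python
def fit_classes(values):
    """Learn the categories; sorted order is an assumption of this sketch."""
    return sorted(set(values))


def transform(values, classes, handle_unknown="error"):
    """Replace each value by the index of its category."""
    index = {c: i for i, c in enumerate(classes)}
    out = []
    for v in values:
        if v in index:
            out.append(index[v])
        elif handle_unknown == "ignore":
            out.append(None)  # stands in for a null encoding
        else:
            raise KeyError(f"unknown category: {v!r}")
    return out


def inverse_transform(codes, classes):
    """Map ordinal codes back to the original labels."""
    return [classes[c] for c in codes]


classes = fit_classes(["a", "b", "c", "d"])
print(transform(["c", "a"], classes))
print(inverse_transform([0, 0, 1, 2, 1], classes))
print(transform(["z", "a"], classes, handle_unknown="ignore"))
```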
class cuml.preprocessing.LabelBinarizer(*, neg_label=0, pos_label=1, sparse_output=False, handle=None, verbose=False, output_type=None)
A multi-class dummy encoder for labels.
 Parameters
 neg_label : integer
label to be used as the negative binary label
 pos_label : integer
label to be used as the positive binary label
 sparse_output : bool
whether to return sparse arrays for transformed output
 handle : cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 verbose : int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
Examples
Create an array with labels and dummy encode them
    import cupy as cp
    import cupyx
    from cuml.preprocessing import LabelBinarizer

    labels = cp.asarray([0, 5, 10, 7, 2, 4, 1, 0, 0, 4, 3, 2, 1],
                        dtype=cp.int32)

    lb = LabelBinarizer()
    encoded = lb.fit_transform(labels)
    print(str(encoded))

    decoded = lb.inverse_transform(encoded)
    print(str(decoded))
Output:
    [[1 0 0 0 0 0 0 0]
     [0 0 0 0 0 1 0 0]
     [0 0 0 0 0 0 0 1]
     [0 0 0 0 0 0 1 0]
     [0 0 1 0 0 0 0 0]
     [0 0 0 0 1 0 0 0]
     [0 1 0 0 0 0 0 0]
     [1 0 0 0 0 0 0 0]
     [1 0 0 0 0 0 0 0]
     [0 0 0 0 1 0 0 0]
     [0 0 0 1 0 0 0 0]
     [0 0 1 0 0 0 0 0]
     [0 1 0 0 0 0 0 0]]
    [ 0  5 10  7  2  4  1  0  0  4  3  2  1]
 Attributes
 classes_
Python descriptor object to control getting/setting CumlArray attributes on Base objects. See the Estimator Guide for an in depth guide.
Methods
fit(y)                             Fit label binarizer
fit_transform(y)                   Fit label binarizer and transform multi-class labels to their dummy-encoded representation.
get_param_names(self)              Returns a list of hyperparameter names owned by this class.
inverse_transform(y[, threshold])  Transform binary labels back to original multi-class labels
transform(y)                       Transform multi-class labels to their dummy-encoded representation labels.
fit(y) → cuml.preprocessing.label.LabelBinarizer
Fit label binarizer
 Parameters
 y : array of shape [n_samples,] or [n_samples, n_classes]
Target values. A 2-D matrix should contain only 0 and 1 and represents multilabel classification.
 Returns
 self : returns an instance of self.
fit_transform(y) → cuml.common.array_sparse.SparseCumlArray
Fit label binarizer and transform multi-class labels to their dummy-encoded representation.
 Parameters
 y : array of shape [n_samples,] or [n_samples, n_classes]
 Returns
 arr : array with encoded labels
get_param_names(self)
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in turn owns. This is to simplify the implementation of get_params and set_params methods.
cuml.preprocessing.label_binarize(y, classes, neg_label=0, pos_label=1, sparse_output=False) → cuml.common.array_sparse.SparseCumlArray
A stateless helper function to dummy encode multi-class labels.
 Parameters
 y : array-like of size [n_samples,] or [n_samples, n_classes]
 classes : the set of unique classes in the input
 neg_label : integer, the negative value for transformed output
 pos_label : integer, the positive value for transformed output
 sparse_output : bool, whether to return sparse array
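The roles of classes, neg_label and pos_label can be seen in a small dense, CPU-only sketch. This is not cuML's GPU implementation: `label_binarize_sketch` and `inverse_sketch` are hypothetical names, the output is a list of lists rather than a (sparse) array, and only the one-dimensional multi-class case is covered.

```python
def label_binarize_sketch(y, classes, neg_label=0, pos_label=1):
    """One row per sample, one column per class (dense, pure-Python sketch)."""
    column = {c: j for j, c in enumerate(classes)}
    rows = []
    for label in y:
        row = [neg_label] * len(classes)
        row[column[label]] = pos_label
        rows.append(row)
    return rows


def inverse_sketch(rows, classes):
    """Recover each label as the class whose column holds the row maximum."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in rows]


labels = [0, 5, 10, 7, 2, 4, 1, 0, 0, 4, 3, 2, 1]
classes = sorted(set(labels))  # [0, 1, 2, 3, 4, 5, 7, 10]
encoded = label_binarize_sketch(labels, classes)
decoded = inverse_sketch(encoded, classes)
print(encoded[1])  # the row for label 5
print(decoded)
```

The labels are the same ones used in the LabelBinarizer example, so the rows can be compared with the output printed there.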
class cuml.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'float'>, handle_unknown='error', handle=None, verbose=False, output_type=None)
Encode categorical features as a one-hot numeric array. The input to this estimator should be a cuDF.DataFrame or a cupy.ndarray, denoting the unique values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter). By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.
Note
A one-hot encoding of y labels should use a LabelBinarizer instead.
 Parameters
 categories : ‘auto’, a cupy.ndarray or a cudf.DataFrame, default=’auto’
Categories (unique values) per feature:
‘auto’ : Determine categories automatically from the training data.
DataFrame/ndarray : categories[col] holds the categories expected in the feature col.
 drop : ‘first’, None, a dict or a list, default=None
Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.
None : retain all features (the default).
‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.
dict/list : drop[col] is the category in feature col that should be dropped.
 sparse : bool, default=True
This feature is not fully supported by cupy yet, causing incorrect values when computing one hot encodings. See https://github.com/cupy/cupy/issues/3223
 dtype : number type, default=np.float
Desired datatype of transform’s output.
 handle_unknown : {‘error’, ‘ignore’}, default=’error’
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
 handle : cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 verbose : int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
 Attributes
 drop_idx_ : array of shape (n_features,)
drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature. None if all the transformed features will be retained.
Methods
fit(X)                 Fit OneHotEncoder to X.
fit_transform(X[, y])  Fit OneHotEncoder to X, then transform X.
get_param_names(self)  Returns a list of hyperparameter names owned by this class.
inverse_transform(X)   Convert the data back to the original representation.
transform(X)           Transform X using one-hot encoding.
property categories_
Returns categories used for the one hot encoding in the correct order.
fit(X)
Fit OneHotEncoder to X.
 Parameters
 X : cuDF.DataFrame or cupy.ndarray, shape = (n_samples, n_features)
The data to determine the categories of each feature.
 Returns
 self
fit_transform(X, y=None)
Fit OneHotEncoder to X, then transform X. Equivalent to fit(X).transform(X).
 Parameters
 X : cudf.DataFrame or cupy.ndarray, shape = (n_samples, n_features)
The data to encode.
 Returns
 X_out : sparse matrix if sparse=True else a 2-d array
Transformed input.
get_param_names(self)
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends its extra set of parameters that it in turn owns. This is to simplify the implementation of get_params and set_params methods.
inverse_transform(X)
Convert the data back to the original representation. In case unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category.
The return type is the same as the type of the input used by the first call to fit on this estimator instance.
 Parameters
 X : array-like or sparse matrix, shape [n_samples, n_encoded_features]
The transformed data.
 Returns
 X_tr : cudf.DataFrame or cupy.ndarray
Inverse transformed array.
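The effect of drop='first' and handle_unknown='ignore' is easiest to see on a single feature. The sketch below is a dense, pure-Python illustration and makes no claim about how cuML computes the encoding on the GPU; `one_hot_sketch` and its `drop_first` flag are hypothetical names introduced for the example.

```python
def one_hot_sketch(column, categories, drop_first=False, handle_unknown="error"):
    """One-hot encode a single feature given its ordered list of categories."""
    kept = categories[1:] if drop_first else categories
    position = {c: j for j, c in enumerate(kept)}
    known = set(categories)
    rows = []
    for value in column:
        if value not in known and handle_unknown != "ignore":
            raise ValueError(f"unknown category: {value!r}")
        # Unknown (ignored) and dropped categories both stay all zeros.
        row = [0.0] * len(kept)
        if value in position:
            row[position[value]] = 1.0
        rows.append(row)
    return rows


cats = ["a", "b", "c"]
print(one_hot_sketch(["a", "c", "b"], cats))
print(one_hot_sketch(["a", "c", "b"], cats, drop_first=True))
print(one_hot_sketch(["z"], cats, handle_unknown="ignore"))
```

Note that with drop_first=True the dropped category and an ignored unknown category produce the same all-zero row, which is one reason to combine the two options with care.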
class cuml.preprocessing.TargetEncoder.TargetEncoder(n_folds=4, smooth=0, seed=42, split_method='interleaved', output_type='auto')
A cudf based implementation of target encoding [1], which replaces one or multiple categorical variables, ‘Xs’, with the average of the corresponding values of the target variable, ‘Y’. The input data is grouped by the columns Xs, and the aggregated mean value of Y of each group is calculated to replace each value of Xs. Several optimizations are applied to prevent label leakage and parallelize the execution.
 Parameters
 n_folds : int (default=4)
Default number of folds for fitting training data. To prevent label leakage in fit, we split data into n_folds and encode one fold using the target variables of the remaining folds.
 smooth : float (default=0)
0 <= smooth <= 1
Percentage of samples to smooth the encoding
 seed : int (default=42)
Random seed
 split_method : {‘random’, ‘continuous’, ‘interleaved’}, default=’interleaved’
Method to split train data into n_folds.
‘random’: random split.
‘continuous’: consecutive samples are grouped into one fold.
‘interleaved’: samples are assigned to each fold in a round robin way.
 output_type : {‘cupy’, ‘numpy’, ‘auto’}, default = ‘auto’
The data type of output. If ‘auto’, it matches input data.
References
Examples
Converting a categorical implementation to a numerical one
    from cudf import DataFrame, Series
    from cuml.preprocessing.TargetEncoder import TargetEncoder

    train = DataFrame({'category': ['a', 'b', 'b', 'a'],
                       'label': [1, 0, 1, 1]})
    test = DataFrame({'category': ['a', 'c', 'b', 'a']})

    encoder = TargetEncoder()
    train_encoded = encoder.fit_transform(train.category, train.label)
    test_encoded = encoder.transform(test.category)
    print(train_encoded)
    print(test_encoded)
Output:
    [1. 1. 0. 1.]
    [1.   0.75 0.5  1.  ]
Methods
fit(x, y)            Fit a TargetEncoder instance to a set of categories
fit_transform(x, y)  Simultaneously fit and transform an input
transform(x)         Transform an input into its categorical keys.
fit(x, y)
Fit a TargetEncoder instance to a set of categories
 Parameters
 x : cudf.Series or cudf.DataFrame or cupy.ndarray
Categories to be encoded. Its elements may or may not be unique
 y : cudf.Series or cupy.ndarray
Series containing the target variable.
 Returns
 self : TargetEncoder
A fitted instance of itself to allow method chaining
fit_transform(x, y)
Simultaneously fit and transform an input
This is functionally equivalent to (but faster than) TargetEncoder().fit(x, y).transform(x)
transform(x)
Transform an input into its categorical keys.
This is intended for test data. For fitting and transforming the training data, prefer fit_transform.
 Parameters
 x : cudf.Series
Input keys to be transformed. Its values do not have to match the categories given to fit
 Returns
 encoded : cupy.ndarray
The ordinally encoded input series
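The fold-based scheme described for TargetEncoder can be sketched on the CPU with plain lists. This is an illustration under stated assumptions, not cuML's cudf-based implementation: only the 'interleaved' split with smooth=0 is modelled, the function names are hypothetical, and falling back to the global target mean for categories with no usable rows is an assumption that is consistent with the example output above.

```python
def fold_ids(n_samples, n_folds):
    """'interleaved' split: sample i goes to fold i % n_folds (round robin)."""
    return [i % n_folds for i in range(n_samples)]


def group_means(xs, ys):
    """Mean of the target for each category."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[x] = sums.get(x, 0.0) + y
        counts[x] = counts.get(x, 0) + 1
    return {x: sums[x] / counts[x] for x in sums}


def fit_transform_sketch(xs, ys, n_folds=4):
    """Encode each training row using targets from the other folds only."""
    folds = fold_ids(len(xs), n_folds)
    global_mean = sum(ys) / len(ys)
    encoded = []
    for i, x in enumerate(xs):
        other_x = [xs[j] for j in range(len(xs)) if folds[j] != folds[i]]
        other_y = [ys[j] for j in range(len(xs)) if folds[j] != folds[i]]
        # Assumed fallback: global mean when the category is absent elsewhere.
        encoded.append(group_means(other_x, other_y).get(x, global_mean))
    return encoded


def transform_sketch(train_xs, train_ys, test_xs):
    """Encode test rows with means computed over the whole training set."""
    means = group_means(train_xs, train_ys)
    global_mean = sum(train_ys) / len(train_ys)
    return [means.get(x, global_mean) for x in test_xs]


train_x, train_y = ["a", "b", "b", "a"], [1, 0, 1, 1]
print(fit_transform_sketch(train_x, train_y))
print(transform_sketch(train_x, train_y, ["a", "c", "b", "a"]))
```

Excluding a row's own fold is what prevents label leakage: no training row is ever encoded with its own target value.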
Text Preprocessing (Single-GPU)
class cuml.preprocessing.text.stem.PorterStemmer(mode='NLTK_EXTENSIONS')
A word stemmer based on the Porter stemming algorithm.
Porter, M. “An algorithm for suffix stripping.” Program 14.3 (1980): 130-137.
See http://www.tartarus.org/~martin/PorterStemmer/ for the homepage of the algorithm.
Martin Porter has endorsed several modifications to the Porter algorithm since writing his original paper, and those extensions are included in the implementations on his website. Additionally, others have proposed further improvements to the algorithm, including NLTK contributors. Only the mode below is currently supported:
PorterStemmer.NLTK_EXTENSIONS
Implementation that includes further improvements devised by NLTK contributors or taken from other modified implementations found on the web.
 Parameters
 mode : Modes of stemming (only NLTK_EXTENSIONS is supported currently), default “NLTK_EXTENSIONS”
Examples
    import cudf
    from cuml.preprocessing.text.stem import PorterStemmer

    stemmer = PorterStemmer()
    word_str_ser = cudf.Series(['revival', 'singing', 'adjustable'])
    print(stemmer.stem(word_str_ser))
Output:
    0     reviv
    1      sing
    2    adjust
    dtype: object
Methods
stem(word_str_ser)  Stem Words using Porter stemmer
Feature and Label Encoding (Dask-based Multi-GPU)
class cuml.dask.preprocessing.LabelBinarizer(*, client=None, **kwargs)
A distributed version of LabelBinarizer for one-hot encoding a collection of labels.
Examples
Create an array with labels and dummy encode them
    import cupy as cp
    import cupyx
    from cuml.dask.preprocessing import LabelBinarizer

    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client
    import dask

    cluster = LocalCUDACluster()
    client = Client(cluster)

    labels = cp.asarray([0, 5, 10, 7, 2, 4, 1, 0, 0, 4, 3, 2, 1],
                        dtype=cp.int32)
    labels = dask.array.from_array(labels)

    lb = LabelBinarizer()
    encoded = lb.fit_transform(labels)
    print(str(encoded.compute()))

    decoded = lb.inverse_transform(encoded)
    print(str(decoded.compute()))
Output:
    [[1 0 0 0 0 0 0 0]
     [0 0 0 0 0 1 0 0]
     [0 0 0 0 0 0 0 1]
     [0 0 0 0 0 0 1 0]
     [0 0 1 0 0 0 0 0]
     [0 0 0 0 1 0 0 0]
     [0 1 0 0 0 0 0 0]
     [1 0 0 0 0 0 0 0]
     [1 0 0 0 0 0 0 0]
     [0 0 0 0 1 0 0 0]
     [0 0 0 1 0 0 0 0]
     [0 0 1 0 0 0 0 0]
     [0 1 0 0 0 0 0 0]]
    [ 0  5 10  7  2  4  1  0  0  4  3  2  1]
Methods
fit(y)                             Fit label binarizer
fit_transform(y)                   Fit the label encoder and return transformed labels
inverse_transform(y[, threshold])  Invert a set of encoded labels back to original labels
transform(y)                       Transform and return encoded labels
fit(y)
Fit label binarizer
 Parameters
 y : Dask.Array of shape [n_samples,] or [n_samples, n_classes], chunked by row
Target values. A 2-D matrix should contain only 0 and 1 and represents multilabel classification.
 Returns
 self : returns an instance of self.
fit_transform(y)
Fit the label encoder and return transformed labels
 Parameters
 y : Dask.Array of shape [n_samples,] or [n_samples, n_classes]
Target values. A 2-D matrix should contain only 0 and 1 and represents multilabel classification.
 Returns
 arr : Dask.Array backed by CuPy arrays containing encoded labels
class cuml.dask.preprocessing.OneHotEncoder(*, client=None, verbose=False, **kwargs)
Encode categorical features as a one-hot numeric array. The input to this transformer should be a dask_cuDF.DataFrame or cupy dask.Array, denoting the values taken on by categorical features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter). By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.
 Parameters
 categories : ‘auto’, cupy.ndarray or cudf.DataFrame, default=’auto’
Categories (unique values) per feature. All categories are expected to fit on one GPU.
‘auto’ : Determine categories automatically from the training data.
DataFrame/ndarray : categories[col] holds the categories expected in the feature col.
 drop : ‘first’, None or a dict, default=None
Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.
None : retain all features (the default).
‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.
Dict : drop[col] is the category in feature col that should be dropped.
 sparse : bool, default=False
This feature was deactivated and will raise an exception when True. The reason is that sparse matrices are not fully supported by cupy yet, causing incorrect values when computing one hot encodings. See https://github.com/cupy/cupy/issues/3223
 dtype : number type, default=np.float
Desired datatype of transform’s output.
 handle_unknown : {‘error’, ‘ignore’}, default=’error’
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
Methods
fit(X)                          Fit a multi-node multi-GPU OneHotEncoder to X.
fit_transform(X[, delayed])     Fit OneHotEncoder to X, then transform X.
inverse_transform(X[, delayed]) Convert the data back to the original representation.
transform(X[, delayed])         Transform X using one-hot encoding.
fit(X)
Fit a multi-node multi-GPU OneHotEncoder to X.
 Parameters
 X : Dask cuDF DataFrame or CuPy backed Dask Array
The data to determine the categories of each feature.
 Returns
 self
fit_transform(X, delayed=True)
Fit OneHotEncoder to X, then transform X. Equivalent to fit(X).transform(X).
 Parameters
 X : Dask cuDF DataFrame or CuPy backed Dask Array
The data to encode.
 delayed : bool (default = True)
Whether to execute as a delayed task or eager.
 Returns
 out : Dask cuDF DataFrame or CuPy backed Dask Array
Distributed object containing the transformed data
inverse_transform(X, delayed=True)
Convert the data back to the original representation. In case unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category.
 Parameters
 X : CuPy backed Dask Array, shape [n_samples, n_encoded_features]
The transformed data.
 delayed : bool (default = True)
Whether to execute as a delayed task or eager.
 Returns
 X_tr : Dask cuDF DataFrame or CuPy backed Dask Array
Distributed object containing the inverse transformed array.
transform(X, delayed=True)
Transform X using one-hot encoding.
 Parameters
 X : Dask cuDF DataFrame or CuPy backed Dask Array
The data to encode.
 delayed : bool (default = True)
Whether to execute as a delayed task or eager.
 Returns
 out : Dask cuDF DataFrame or CuPy backed Dask Array
Distributed object containing the transformed input.
Feature Extraction (Single-GPU)
class cuml.feature_extraction.text.CountVectorizer(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float32'>, delimiter=' ')
Convert a collection of text documents to a matrix of token counts
If you do not provide an a-priori dictionary then the number of features will be equal to the vocabulary size found by analyzing the data.
 Parameters
 lowercaseboolean, True by default
Convert all characters to lowercase before tokenizing.
 preprocessorcallable or None (default)
Override the preprocessing (string transformation) stage while preserving the tokenizing and ngrams generation steps.
 stop_wordsstring {‘english’}, list, or None (default)
If ‘english’, a builtin stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the input documents. If None, no stop words will be used. max_df can be set to a value to automatically detect and filter stop words based on intra corpus document frequency of terms.
 ngram_rangetuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of nvalues for different word ngrams or char ngrams to be extracted. All values of n such such that min_n <= n <= max_n will be used. For example an
ngram_range
of(1, 1)
means only unigrams,(1, 2)
means unigrams and bigrams, and(2, 2)
means only bigrams. analyzerstring, {‘word’, ‘char’, ‘char_wb’}
Whether the feature should be made of word ngram or character ngrams. Option ‘char_wb’ creates character ngrams only from text inside word boundaries; ngrams at the edges of words are padded with space.
 max_dffloat in range [0.0, 1.0] or int, default=1.0
When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
 min_dffloat in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
 max_featuresint or None, default=None
If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.
 vocabularycudf.Series, optional
If not given, a vocabulary is determined from the input documents.
 binaryboolean, default=False
If True, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
 dtypetype, optional
Type of the matrix returned by fit_transform() or transform().
 delimiterstr, whitespace by default
String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.
 Attributes
 vocabulary_cudf.Series[str]
Array mapping from feature integer indices to feature name.
 stop_words_cudf.Series[str]
 Terms that were ignored because they either:
occurred in too many documents (max_df)
occurred in too few documents (min_df)
were cut off by feature selection (max_features).
This is only available if no vocabulary was given.
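The way min_df, max_df and max_features decide which terms end up in vocabulary_ and which in stop_words_ can be sketched in plain Python. This is a CPU-only illustration of the rules stated above, not cuML's implementation; the helper name filter_vocabulary and the whitespace tokenizer are assumptions made for the example.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=1, max_df=1.0, max_features=None):
    """Hypothetical helper: apply min_df / max_df / max_features to a corpus."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Term frequency across the corpus, used to rank terms for max_features.
    tf = Counter(term for tokens in tokenized for term in tokens)

    # Floats are proportions of documents, integers are absolute counts.
    lo = min_df if isinstance(min_df, int) else min_df * n_docs
    hi = max_df if isinstance(max_df, int) else max_df * n_docs

    # Terms strictly below lo or strictly above hi are ignored.
    kept = {t for t in df if lo <= df[t] <= hi}
    if max_features is not None:
        ranked = sorted(kept, key=lambda t: (-tf[t], t))
        kept = set(ranked[:max_features])

    vocabulary = sorted(kept)
    stop_words = sorted(set(df) - kept)
    return vocabulary, stop_words


docs = ["the cat sat", "the dog sat", "the bird flew"]
vocabulary, stop_words = filter_vocabulary(docs, min_df=2, max_df=0.9)
print(vocabulary)   # terms kept
print(stop_words)   # terms ignored
```

Here "the" appears in every document and exceeds max_df, while the terms seen in a single document fall below min_df, so only "sat" survives.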
Methods
fit(raw_documents): Build a vocabulary of all tokens in the raw documents.
fit_transform(raw_documents): Build the vocabulary and return document-term matrix.
get_feature_names(): Array mapping from feature integer indices to feature name.
inverse_transform(X): Return terms per document with nonzero entries in X.
transform(raw_documents): Transform documents to document-term matrix.
fit
(raw_documents)[source]¶Build a vocabulary of all tokens in the raw documents.
 Parameters
 raw_documentscudf.Series
A Series of string documents
 Returns
 self
fit_transform
(raw_documents)[source]¶Build the vocabulary and return document-term matrix.
Equivalent to self.fit(X).transform(X) but preprocesses X only once.
 Parameters
 raw_documentscudf.Series
A Series of string documents
 Returns
 Xcupy csr array of shape (n_samples, n_features)
Document-term matrix.
get_feature_names
()[source]¶Array mapping from feature integer indices to feature name.
 Returns
 feature_namesSeries
A list of feature names.
inverse_transform
(X)[source]¶Return terms per document with nonzero entries in X.
 Parameters
 Xarray-like of shape (n_samples, n_features)
Document-term matrix.
 Returns
 X_invlist of cudf.Series of shape (n_samples,)
List of Series of terms.
transform
(raw_documents)[source]¶Transform documents to document-term matrix.
Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.
 Parameters
 raw_documentscudf.Series
A Series of string documents
 Returns
 Xcupy csr array of shape (n_samples, n_features)
Document-term matrix.
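As a rough illustration of how ngram_range and a fitted vocabulary turn a document into one row of the document-term matrix, the following CPU-only sketch may help. It is not cuML code: word_ngrams and count_row are hypothetical helpers, tokenization is a plain whitespace split, and the row is a dense Python list rather than a sparse CuPy matrix.

```python
from collections import Counter


def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return all word n-grams with min_n <= n <= max_n, joined by spaces."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def count_row(doc, vocabulary, ngram_range=(1, 1), binary=False):
    """Count vocabulary terms in one document (one dense row of the matrix)."""
    counts = Counter(word_ngrams(doc.lower().split(), ngram_range))
    row = [counts[term] for term in vocabulary]
    # binary=True clips every non-zero count to 1.
    return [min(c, 1) for c in row] if binary else row


tokens = "the cat sat".split()
unigrams = word_ngrams(tokens, (1, 1))    # only unigrams
uni_and_bi = word_ngrams(tokens, (1, 2))  # unigrams and bigrams
bigrams = word_ngrams(tokens, (2, 2))     # only bigrams

vocabulary = ["cat", "the", "the cat"]
row = count_row("The cat and the hat", vocabulary, ngram_range=(1, 2))
print(row)
```

Column j of the row counts how often vocabulary[j] occurs in the document, which is the mapping that get_feature_names() exposes.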
 class
cuml.feature_extraction.text.
HashingVectorizer
(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, dtype=<class 'numpy.float32'>, delimiter=' ')[source]¶Convert a collection of text documents to a matrix of token occurrences
It turns a collection of text documents into a cupyx.scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm=’l1’ or projected on the euclidean unit sphere if norm=’l2’.
This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index mapping.
This strategy has several advantages:
it is very low memory and scalable to large datasets, as there is no need to store a vocabulary dictionary in memory, which is even more important on GPUs, which are often memory constrained
it is fast to pickle and unpickle as it holds no state besides the constructor parameters
it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.
There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):
there is no way to compute the inverse transform (from feature indices to string feature names), which can be a problem when trying to introspect which features are most important to a model.
there can be collisions: distinct tokens can be mapped to the same feature index. However, in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
no IDF weighting, as this would render the transformer stateful.
The hash function employed is the signed 32-bit version of Murmurhash3.
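The hashing trick described above can be sketched in a few lines of plain Python. This is an illustration only: zlib.crc32 stands in for MurmurHash3 (which is not in the standard library), and deriving the sign from the signed hash value is an assumption modeled on the scikit-learn approach. The function name hashed_features is hypothetical.

```python
import zlib


def hashed_features(tokens, n_features=16, alternate_sign=True, binary=False):
    """Map tokens to a fixed-width row without storing any vocabulary.

    zlib.crc32 is a stand-in hash for illustration; the vectorizer itself
    is documented as using signed 32-bit MurmurHash3.
    """
    row = [0] * n_features
    for token in tokens:
        h = zlib.crc32(token.encode("utf-8"))
        # Reinterpret the unsigned 32-bit value as a signed 32-bit integer.
        if h >= 2 ** 31:
            h -= 2 ** 32
        index = abs(h) % n_features
        # With alternate_sign, negative hashes contribute -1 instead of +1.
        sign = -1 if (alternate_sign and h < 0) else 1
        row[index] += sign
    if binary:
        row = [1 if v != 0 else 0 for v in row]
    return row


doc = "this is the first document".split()
row_a = hashed_features(doc)
row_b = hashed_features(doc)
print(row_a)
```

Because the column index depends only on the token and n_features, two calls on the same document give the same row with no fitting step, which is what makes the transformer stateless. Distinct tokens that land on the same index collide, which is the drawback noted above.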
 Parameters
 lowercasebool, default=True
Convert all characters to lowercase before tokenizing.
 preprocessorcallable or None (default)
Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
 stop_wordsstring {‘english’}, list, default=None
If ‘english’, a built-in stop word list for English is used. There are several known issues with ‘english’ and you should consider an alternative. If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.
 ngram_rangetuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
 analyzerstring, {‘word’, ‘char’, ‘char_wb’}
Whether the feature should be made of word n-grams or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
 n_featuresint, default=(2 ** 20)
The number of features (columns) in the output matrices. Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners.
 binarybool, default=False
If True, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
 norm{‘l1’, ‘l2’}, default=’l2’
Norm used to normalize term vectors. None for no normalization.
 alternate_signbool, default=True
When True, an alternating sign is added to the features so as to approximately conserve the inner product in the hashed space even for small n_features. This approach is similar to sparse random projection.
 dtypetype, optional
Type of the matrix returned by fit_transform() or transform().
 delimiterstr, whitespace by default
String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.
Examples
    from cuml.feature_extraction.text import HashingVectorizer

    corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?',
    ]
    vectorizer = HashingVectorizer(n_features=2**4)
    X = vectorizer.fit_transform(corpus)
    print(X.shape)
Output:
    (4, 16)
Methods
fit(X[, y]): This method only checks the input type and the model parameter.
fit_transform(X[, y]): Transform a sequence of documents to a document-term matrix.
partial_fit(X[, y]): Does nothing: this transformer is stateless. This method is just there to mark the fact that this transformer can work in a streaming setup.
transform(raw_documents): Transform documents to document-term matrix.
fit
(X, y=None)[source]¶This method only checks the input type and the model parameter. It does not do anything meaningful as this transformer is stateless.
 Parameters
 Xcudf.Series
A Series of string documents
fit_transform
(X, y=None)[source]¶Transform a sequence of documents to a document-term matrix.
 Parameters
 Xiterable over raw text documents, length = n_samples
Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed.
 yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
 Returns
 Xsparse CuPy CSR matrix of shape (n_samples, n_features)
Document-term matrix.
partial_fit
(X, y=None)[source]¶Does nothing: this transformer is stateless. This method is just there to mark the fact that this transformer can work in a streaming setup.
 Parameters
 Xcudf.Series
A Series of string documents
transform
(raw_documents)[source]¶Transform documents to document-term matrix.
Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.
 Parameters
 raw_documentscudf.Series
A Series of string documents
 Returns
 Xsparse CuPy CSR matrix of shape (n_samples, n_features)
Document-term matrix.
 class
cuml.feature_extraction.text.
TfidfVectorizer
(input=None, encoding=None, decode_error=None, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=None, ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float32'>, delimiter=' ', norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)[source]¶Convert a collection of raw documents to a matrix of TF-IDF features.
Equivalent to CountVectorizer followed by TfidfTransformer.
 Parameters
 lowercaseboolean, True by default
Convert all characters to lowercase before tokenizing.
 preprocessorcallable or None (default)
Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
 stop_wordsstring {‘english’}, list, or None (default)
If ‘english’, a built-in stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the input documents. If None, no stop words will be used. max_df can be set to a value to automatically detect and filter stop words based on intra-corpus document frequency of terms.
 ngram_rangetuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
 analyzerstring, {‘word’, ‘char’, ‘char_wb’}
Whether the feature should be made of word n-grams or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
 max_dffloat in range [0.0, 1.0] or int, default=1.0
When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
 min_dffloat in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
 max_featuresint or None, default=None
If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.
 vocabularycudf.Series, optional
If not given, a vocabulary is determined from the input documents.
 binaryboolean, default=False
If True, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
 dtypetype, optional
Type of the matrix returned by fit_transform() or transform().
 delimiterstr, whitespace by default
String used as a replacement for stop words if stop_words is not None. Typically the delimiting character between words is a good choice.
 norm{‘l1’, ‘l2’}, default=’l2’
 Each output row will have unit norm, either:
‘l2’: Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.
‘l1’: Sum of absolute values of vector elements is 1.
 use_idfbool, default=True
Enable inverse-document-frequency reweighting.
 smooth_idfbool, default=True
Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
 sublinear_tfbool, default=False
Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
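How use_idf, smooth_idf, sublinear_tf and norm combine can be sketched on a small dense count matrix. The sketch below is CPU-only and is not cuML's implementation; the idf formula is an assumption taken from scikit-learn (on which this class is said to be based), and tfidf_rows is a hypothetical helper name.

```python
import math


def tfidf_rows(counts, use_idf=True, smooth_idf=True, sublinear_tf=False, norm="l2"):
    """Weight a dense count matrix (list of rows) following the parameters above.

    Assumed idf formula (from scikit-learn):
      smooth_idf=True:  idf = ln((1 + n) / (1 + df)) + 1
      smooth_idf=False: idf = ln(n / df) + 1
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency of each term (column).
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

    if use_idf:
        s = 1 if smooth_idf else 0  # the "extra document" containing every term
        idf = [math.log((n_docs + s) / (df[j] + s)) + 1.0 for j in range(n_terms)]
    else:
        idf = [1.0] * n_terms

    out = []
    for row in counts:
        # Sublinear scaling replaces tf with 1 + log(tf) for non-zero counts.
        tf = [(1.0 + math.log(c)) if (sublinear_tf and c > 0) else float(c) for c in row]
        w = [tf[j] * idf[j] for j in range(n_terms)]
        if norm == "l2":
            scale = math.sqrt(sum(v * v for v in w))
        elif norm == "l1":
            scale = sum(abs(v) for v in w)
        else:
            scale = 1.0
        out.append([v / scale for v in w] if scale else w)
    return out, idf


counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]
weighted, idf = tfidf_rows(counts)
print(idf)
print(weighted[0])
```

With smooth_idf=True, the second column of this matrix appears in a single document and still gets a finite weight, and a term with a document frequency of zero would not cause a division by zero.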
Notes
The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling.
This class is largely based on scikit-learn 0.23.1’s TfidfVectorizer code, which is provided under the BSD-3 license.
 Attributes
 idf_array of shape (n_features)
The inverse document frequency (IDF) vector; only defined if use_idf is True.
 vocabulary_cudf.Series[str]
Array mapping from feature integer indices to feature name.
 stop_words_cudf.Series[str]
 Terms that were ignored because they either:
occurred in too many documents (max_df)
occurred in too few documents (min_df)
were cut off by feature selection (max_features).
This is only available if no vocabulary was given.
Methods
fit(raw_documents): Learn vocabulary and idf from training set.
fit_transform(raw_documents): Learn vocabulary and idf, return document-term matrix.
transform(raw_documents): Transform documents to document-term matrix.
fit
(raw_documents)[source]¶Learn vocabulary and idf from training set.
 Parameters
 raw_documentscudf.Series
A Series of string documents
 Returns
 selfobject
Fitted vectorizer.
fit_transform
(raw_documents)[source]¶Learn vocabulary and idf, return document-term matrix. This is equivalent to fit followed by transform, but more efficiently implemented.
 Parameters
 raw_documentscudf.Series
A Series of string documents
 Returns
 Xcupy csr array of shape (n_samples, n_features)
Tf-idf-weighted document-term matrix.
transform
(raw_documents)[source]¶Transform documents to document-term matrix. Uses the vocabulary and document frequencies (df) learned by fit (or fit_transform).
 Parameters
 raw_documentscudf.Series
A Series of string documents
 Returns
 Xcupy csr array of shape (n_samples, n_features)
Tf-idf-weighted document-term matrix.
Feature Extraction (Dask-based Multi-GPU)¶
 class
cuml.dask.feature_extraction.text.
TfidfTransformer
(*, client=None, verbose=False, **kwargs)[source]¶Distributed TF-IDF transformer
Examples
    import cupy as cp
    import dask.array
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client
    from cuml.dask.common import to_sparse_dask_array
    from cuml.dask.naive_bayes import MultinomialNB
    from cuml.dask.feature_extraction.text import TfidfTransformer

    # Create a local CUDA cluster
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Load corpus
    twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
    cv = CountVectorizer()
    xformed = cv.fit_transform(twenty_train.data).astype(cp.float32)
    X = to_sparse_dask_array(xformed, client)
    y = dask.array.from_array(twenty_train.target, asarray=False, fancy=False).astype(cp.int32)

    # Fit the distributed transformer and weight the counts
    multi_gpu_transformer = TfidfTransformer()
    X_transformed = multi_gpu_transformer.fit_transform(X)
    X_transformed.compute_chunk_sizes()

    # Train and score a classifier on the weighted features
    model = MultinomialNB()
    model.fit(X_transformed, y)
    model.score(X_transformed, y)
Output:
    array(0.93264981)
Methods
fit(X): Fit distributed TF-IDF Transformer
fit_transform(X): Fit distributed TF-IDF Transformer and then transform the given set of data samples.
transform(X): Use distributed TF-IDF Transformer to transform the given set of data samples.
fit
(X)[source]¶Fit distributed TF-IDF Transformer
 Parameters
 Xdask.Array with blocks containing dense or sparse cupy arrays
 Returns
 cuml.dask.feature_extraction.text.TfidfTransformer instance
Dataset Generation (Single-GPU)¶
 random_state¶
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.
cuml.datasets.
make_blobs
(n_samples=100, n_features=2, centers=None, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None, return_centers=False, order='F', dtype='float32')[source]¶Generate isotropic Gaussian blobs for clustering.
 Parameters
 n_samplesint or array-like, optional (default=100)
If int, it is the total number of points equally divided among clusters. If array-like, each element of the sequence indicates the number of samples per cluster.
 n_featuresint, optional (default=2)
The number of features for each sample.
 centersint or array of shape [n_centers, n_features], optional (default=None)
The number of centers to generate, or the fixed center locations. If n_samples is an int and centers is None, 3 centers are generated. If n_samples is array-like, centers must be either None or an array of length equal to the length of n_samples.
 cluster_stdfloat or sequence of floats, optional (default=1.0)
The standard deviation of the clusters.
 center_boxpair of floats (min, max), optional (default=(-10.0, 10.0))
The bounding box for each cluster center when centers are generated at random.
 shuffleboolean, optional (default=True)
Shuffle the samples.
 random_stateint, RandomState instance, default=None
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.
 return_centersbool, optional (default=False)
If True, then return the centers of each cluster
 order: str, optional (default=’F’)
The order of the generated samples
 dtypestr, optional (default=’float32’)
Dtype of the generated samples
 Returns
 Xdevice array of shape [n_samples, n_features]
The generated samples.
 ydevice array of shape [n_samples]
The integer labels for cluster membership of each sample.
 centersdevice array, shape [n_centers, n_features]
The centers of each cluster. Only returned if return_centers=True.
See also
make_classification
a more intricate variant
Examples
    >>> from sklearn.datasets import make_blobs
    >>> X, y = make_blobs(n_samples=10, centers=3, n_features=2,
    ...                   random_state=0)
    >>> print(X.shape)
    (10, 2)
    >>> y
    array([0, 0, 1, 0, 2, 2, 2, 1, 1, 0])
    >>> X, y = make_blobs(n_samples=[3, 3, 4], centers=None, n_features=2,
    ...                   random_state=0)
    >>> print(X.shape)
    (10, 2)
    >>> y
    array([0, 1, 2, 0, 2, 2, 2, 1, 1, 0])
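To make the roles of n_samples, centers, cluster_std and center_box concrete, here is a CPU-only sketch of isotropic Gaussian blob generation using only the standard library. It is not cuML's GPU implementation: blobs_sketch is a hypothetical name, and handing the remainder of an uneven split to the first clusters is an assumption borrowed from scikit-learn.

```python
import random


def blobs_sketch(n_samples=100, n_features=2, centers=3, cluster_std=1.0,
                 center_box=(-10.0, 10.0), shuffle=True, random_state=None):
    """CPU-only sketch of isotropic Gaussian blob generation (not cuML code)."""
    rng = random.Random(random_state)
    if isinstance(n_samples, int):
        # Split the total evenly; the remainder goes to the first clusters
        # (assumption borrowed from scikit-learn).
        base, extra = divmod(n_samples, centers)
        per_cluster = [base + (1 if i < extra else 0) for i in range(centers)]
    else:
        # Array-like: one entry per cluster.
        per_cluster = list(n_samples)

    # Draw each cluster center uniformly inside the bounding box.
    lo, hi = center_box
    center_points = [[rng.uniform(lo, hi) for _ in range(n_features)]
                     for _ in per_cluster]

    X, y = [], []
    for label, (count, center) in enumerate(zip(per_cluster, center_points)):
        for _ in range(count):
            # Same standard deviation on every axis makes the blob isotropic.
            X.append([rng.gauss(c, cluster_std) for c in center])
            y.append(label)

    if shuffle:
        order = list(range(len(X)))
        rng.shuffle(order)
        X = [X[i] for i in order]
        y = [y[i] for i in order]
    return X, y, center_points


X, y, centers = blobs_sketch(n_samples=10, centers=3, random_state=0)
print(len(X), len(X[0]))
print(sorted(y))
```

Passing a list such as n_samples=[3, 3, 4] instead of an int fixes the size of each cluster directly, mirroring the array-like form described above.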
cuml.datasets.
make_classification
(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None, order='F', dtype='float32', _centroids=None, _informative_covariance=None, _redundant_covariance=None, _repeated_indices=None)[source]¶Generate a random n-class classification problem.
This initially creates clusters of points normally distributed (std=1) about vertices of an n_informative-dimensional hypercube with sides of length 2*class_sep and assigns an equal number of clusters to each class. It introduces interdependence between these features and adds various types of further noise to the data.
Without shuffling, X horizontally stacks features in the following order: the primary n_informative features, followed by n_redundant linear combinations of the informative features, followed by n_repeated duplicates, drawn randomly with replacement from the informative and redundant features. The remaining features are filled with random noise. Thus, without shuffling, all useful features are contained in the columns X[:, :n_informative + n_redundant + n_repeated].
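The column ordering described above (when shuffle=False) can be written down as index ranges. The helper below is a hypothetical illustration, not part of cuML; it only computes which columns hold which kind of feature.

```python
def feature_layout(n_features=20, n_informative=2, n_redundant=2, n_repeated=0):
    """Column ranges of each feature kind when shuffle=False."""
    n_useful = n_informative + n_redundant + n_repeated
    if n_useful > n_features:
        raise ValueError("informative + redundant + repeated exceeds n_features")
    a = n_informative
    b = a + n_redundant
    return {
        "informative": range(0, a),
        "redundant": range(a, b),
        "repeated": range(b, n_useful),
        "noise": range(n_useful, n_features),
    }


layout = feature_layout(n_features=10, n_informative=3, n_redundant=2, n_repeated=1)
for kind, cols in layout.items():
    print(kind, list(cols))
```

Everything before the start of the "noise" range corresponds to the slice X[:, :n_informative + n_redundant + n_repeated].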
 Parameters
 n_samplesint, optional (default=100)
The number of samples.
 n_featuresint, optional (default=20)
The total number of features. These comprise n_informative informative features, n_redundant redundant features, n_repeated duplicated features and n_features-n_informative-n_redundant-n_repeated useless features drawn at random.
 n_informativeint, optional (default=2)
The number of informative features. Each class is composed of a number of gaussian clusters each located around the vertices of a hypercube in a subspace of dimension n_informative. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined within each cluster in order to add covariance. The clusters are then placed on the vertices of the hypercube.
 n_redundantint, optional (default=2)
The number of redundant features. These features are generated as random linear combinations of the informative features.
 n_repeatedint, optional (default=0)
The number of duplicated features, drawn randomly from the informative and the redundant features.
 n_classesint, optional (default=2)
The number of classes (or labels) of the classification problem.
 n_clusters_per_classint, optional (default=2)
The number of clusters per class.
 weightsarray-like of shape (n_classes,) or (n_classes - 1,), (default=None)
The proportions of samples assigned to each class. If None, then classes are balanced. Note that if len(weights) == n_classes - 1, then the last class weight is automatically inferred. More than n_samples samples may be returned if the sum of weights exceeds 1.
 flip_yfloat, optional (default=0.01)
The fraction of samples whose class is assigned randomly. Larger values introduce noise in the labels and make the classification task harder.
 class_sepfloat, optional (default=1.0)
The factor multiplying the hypercube size. Larger values spread out the clusters/classes and make the classification task easier.
 hypercubeboolean, optional (default=True)
If True, the clusters are put on the vertices of a hypercube. If False, the clusters are put on the vertices of a random polytope.
 shiftfloat, array of shape [n_features] or None, optional (default=0.0)
Shift features by the specified value. If None, then features are shifted by a random value drawn in [-class_sep, class_sep].
 scalefloat, array of shape [n_features] or None, optional (default=1.0)
Multiply features by the specified value. If None, then features are scaled by a random value drawn in [1, 100]. Note that scaling happens after shifting.
 shuffleboolean, optional (default=True)
Shuffle the samples and the features.
 random_stateint, RandomState instance or None (default)
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See Glossary.
 order: str, optional (default=’F’)
The order of the generated samples
 dtypestr, optional (default=’float32’)
Dtype of the generated samples
 _centroids: array of centroids of shape (n_clusters, n_informative)
 _informative_covariance: array for covariance between informative features
of shape (n_clusters, n_informative, n_informative)
 _redundant_covariance: array for covariance between redundant features
of shape (n_informative, n_redundant)
 _repeated_indices: array of indices for the repeated features
of shape (n_repeated, )
 Returns
 Xdevice array of shape [n_samples, n_features]
The generated samples.
 ydevice array of shape [n_samples]
The integer labels for class membership of each sample.
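The rule for the weights parameter, including the inference of the last class weight when only n_classes - 1 values are given, can be sketched as follows. class_weights is a hypothetical helper written for illustration and is not a cuML function.

```python
def class_weights(n_classes, weights=None):
    """Resolve the per-class proportions described for ``weights``."""
    if weights is None:
        # Balanced classes.
        return [1.0 / n_classes] * n_classes
    weights = list(weights)
    if len(weights) == n_classes - 1:
        # Infer the last class weight so that the proportions sum to 1.
        weights.append(1.0 - sum(weights))
    if len(weights) != n_classes:
        raise ValueError("weights must have n_classes or n_classes - 1 entries")
    return weights


balanced = class_weights(2)
inferred = class_weights(3, [0.2, 0.3])
print(balanced)
print(inferred)
```

When all n_classes weights are passed explicitly they are used as given, which is why a sum above 1 can yield more than n_samples samples.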
Notes
The algorithm is adapted from Guyon [1] and was designed to generate the “Madelon” dataset. How we optimized for GPUs:
Firstly, we generate X from a standard univariate instead of zeros. This saves memory as we don’t need to generate univariates each time for each feature class (informative, repeated, etc.) while also providing the added speedup of generating a big matrix on GPU.
We generate order=F construction. We exploit the fact that X is generated from a univariate normal, and covariance is introduced with matrix multiplications. This means we can generate X as a 1D array and just reshape it to the desired order, which only updates the metadata and eliminates copies.
Lastly, we also shuffle by construction. Centroid indices are permuted for each sample, and then we construct the data for each centroid. This shuffle works for both order=C and order=F and eliminates any need for secondary copies.
References
[1] I. Guyon, “Design of experiments for the NIPS 2003 variable selection benchmark”, 2003.
Examples
    from cuml.datasets.classification import make_classification

    X, y = make_classification(n_samples=10, n_features=4,
                               n_informative=2, n_classes=2)

    print("X:")
    print(X)

    print("y:")
    print(y)
Output:
    X:
    [[2.3249989 0.8679415 1.1511791 1.3525577 ]
     [ 2.2933831 1.3743551 0.63128835 0.84648645]
     [ 1.6361488 1.3233329 0.807027 0.894092 ]
     [1.0093077 0.9990691 0.00808992 0.00950443]
     [ 0.99803793 2.068382 0.49570698 0.8462848 ]
     [1.2750955 0.9725835 0.2390058 0.28081596]
     [1.3635055 0.9637669 0.31582272 0.37106958]
     [ 1.1893625 2.227583 0.48750278 0.8737561 ]
     [0.05753583 1.0939395 0.8188342 0.9620734 ]
     [ 0.47910076 0.7648213 0.17165393 0.26144698]]
    y:
    [0 1 0 0 1 0 0 1 0 1]
cuml.datasets.
make_regression
(n_samples=100, n_features=2, n_informative=2, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None, dtype='single', handle=None) → typing.Union[typing.Tuple[CumlArray, CumlArray], typing.Tuple[CumlArray, CumlArray, CumlArray]][source]¶Generate a random regression problem.
See https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html
 Parameters
 n_samplesint, optional (default=100)
The number of samples.
 n_featuresint, optional (default=2)
The number of features.
 n_informativeint, optional (default=2)
The number of informative features, i.e., the number of features used to build the linear model used to generate the output.
 n_targetsint, optional (default=1)
The number of regression targets, i.e., the dimension of the y output vector associated with a sample. By default, the output is a scalar.
 biasfloat, optional (default=0.0)
The bias term in the underlying linear model.
 effective_rankint or None, optional (default=None)
 if not None:
The approximate number of singular vectors required to explain most of the input data by linear combinations. Using this kind of singular spectrum in the input allows the generator to reproduce the correlations often observed in practice.
 if None:
The input set is well conditioned, centered and gaussian with unit variance.
 tail_strengthfloat between 0.0 and 1.0, optional (default=0.5)
The relative importance of the fat noisy tail of the singular values profile if effective_rank is not None.
 noisefloat, optional (default=0.0)
The standard deviation of the gaussian noise applied to the output.
 shuffleboolean, optional (default=True)
Shuffle the samples and the features.
 coefboolean, optional (default=False)
If True, the coefficients of the underlying linear model are returned.
 random_stateint, RandomState instance or None (default)
Seed for the random number generator for dataset creation.
 dtype: string or numpy dtype (default: ‘single’)
Type of the data. Possible values: float32, float64, ‘single’, ‘float’ or ‘double’.
 handle: cuml.Handle
If it is None, a new one is created just for this function call
 Returns
 outdevice array of shape [n_samples, n_features]
The input samples.
 valuesdevice array of shape [n_samples, n_targets]
The output values.
 coefdevice array of shape [n_features, n_targets], optional
The coefficient of the underlying linear model. It is returned only if coef is True.
Examples
    from cuml.datasets.regression import make_regression
    from cuml.linear_model import LinearRegression

    # Create regression problem
    data, values = make_regression(n_samples=200, n_features=12,
                                   n_informative=7, bias=4.2, noise=0.3)

    # Perform a linear regression on this problem
    lr = LinearRegression(fit_intercept=True, normalize=False, algorithm="eig")
    reg = lr.fit(data, values)
    print(reg.coef_)
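The underlying linear model that n_informative, bias, noise and coef refer to can be sketched without a GPU. The following is a simplified, single-target illustration and not cuML's implementation: regression_sketch is a hypothetical name, effective_rank and shuffle are omitted, and drawing coefficients uniformly from [0, 100) is an assumption taken from scikit-learn.

```python
import random


def regression_sketch(n_samples=100, n_features=2, n_informative=2,
                      bias=0.0, noise=0.0, random_state=None):
    """CPU-only sketch: y = X . coef + bias + Gaussian noise (single target)."""
    rng = random.Random(random_state)
    # Well-conditioned, centered Gaussian inputs with unit variance.
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    # Only the first n_informative features get a non-zero coefficient.
    # The uniform(0, 100) range is an assumption from scikit-learn.
    coef = [rng.uniform(0.0, 100.0) if j < n_informative else 0.0
            for j in range(n_features)]
    # noise is the standard deviation of the Gaussian noise added to the output.
    y = [sum(x * c for x, c in zip(row, coef)) + bias + rng.gauss(0.0, noise)
         for row in X]
    return X, y, coef


X, y, coef = regression_sketch(n_samples=5, n_features=4, n_informative=2,
                               bias=4.2, noise=0.0, random_state=0)
print(coef)
print(y)
```

With noise=0.0 each target is exactly the dot product of its row with coef plus the bias, so a linear regression with an intercept can recover coef and bias from the data.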
cuml.datasets.
make_arima
(batch_size=1000, n_obs=100, order=(1, 1, 1), seasonal_order=(0, 0, 0, 0), intercept=False, random_state=None, dtype='double', output_type='cupy', handle=None)[source]¶Generates a dataset of time series by simulating an ARIMA process of a given order.
 Parameters
 batch_size: int
Number of time series to generate
 n_obs: int
Number of observations per series
 orderTuple[int, int, int]
Order (p, d, q) of the simulated ARIMA process
 seasonal_order: Tuple[int, int, int, int]
Seasonal ARIMA order (P, D, Q, s) of the simulated ARIMA process
 intercept: bool or int
Whether to include a constant trend mu in the simulated ARIMA process
 random_state: int, RandomState instance or None (default)
Seed for the random number generator for dataset creation.
 dtype: string or numpy dtype (default: ‘double’)
Type of the data. Possible values: float32, float64, ‘single’, ‘float’ or ‘double’
 output_type: {‘cudf’, ‘cupy’, ‘numpy’}
Type of the returned dataset
Deprecated since version 0.17: output_type is deprecated in 0.17 and will be removed in 0.18. Please use the module level output type control, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
 handle: cuml.Handle
If it is None, a new one is created just for this function call
 Returns
 out: array-like, shape (n_obs, batch_size)
Array of the requested type containing the generated dataset
Examples
    from cuml.datasets import make_arima

    y = make_arima(1000, 100, (2, 1, 2), (0, 1, 2, 12), 0)
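For readers unfamiliar with what "simulating an ARIMA process of a given order" involves, here is a CPU-only sketch for the default order (1, 1, 1): an ARMA(1, 1) recursion on the differenced series, integrated once. It is a generic textbook simulation, not cuML's generator; arima_111_sketch and the values of phi, theta and mu are illustrative assumptions.

```python
import random


def arima_111_sketch(n_obs=100, phi=0.5, theta=0.3, mu=0.0, random_state=None):
    """Simulate one ARIMA(1, 1, 1) series: an ARMA(1, 1) process integrated once.

    phi, theta and mu are illustrative parameters chosen for this sketch.
    """
    rng = random.Random(random_state)
    diffs = []
    prev_w, prev_e = 0.0, 0.0
    for _ in range(n_obs - 1):
        e = rng.gauss(0.0, 1.0)
        # ARMA(1, 1) on the differenced series:
        # w_t = mu + phi * w_{t-1} + e_t + theta * e_{t-1}
        w = mu + phi * prev_w + e + theta * prev_e
        diffs.append(w)
        prev_w, prev_e = w, e
    # d = 1: integrate (cumulative sum) the differenced series once.
    series = [0.0]
    for w in diffs:
        series.append(series[-1] + w)
    return series


# A small batch of independent series, one per seed.
batch = [arima_111_sketch(n_obs=100, random_state=seed) for seed in range(4)]
# Arrange as (n_obs, batch_size) to mirror the documented output shape.
columns = list(zip(*batch))
print(len(columns), len(columns[0]))
```

Each column of the transposed result is one simulated series, matching the (n_obs, batch_size) layout described under Returns.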
Dataset Generation (Dask-based Multi-GPU)¶
cuml.dask.datasets.blobs.
make_blobs
(n_samples=100, n_features=2, centers=None, cluster_std=1.0, n_parts=None, center_box=(-10, 10), shuffle=True, random_state=None, return_centers=False, verbose=False, order='F', dtype='float32', client=None)[source]¶Makes labeled Dask-Cupy arrays containing blobs for a randomly generated set of centroids.
This function calls make_blobs from cuml.datasets on each Dask worker and aggregates them into a single Dask Dataframe.
For more information, see Scikit-learn’s make_blobs.
 Parameters
 n_samplesint
number of rows
 n_featuresint
number of features
 centersint or array of shape [n_centers, n_features], optional (default=None)
The number of centers to generate, or the fixed center locations. If n_samples is an int and centers is None, 3 centers are generated. If n_samples is array-like, centers must be either None or an array of length equal to the length of n_samples.
 cluster_stdfloat (default = 1.0)
standard deviation of points around centroid
 n_partsint (default = None)
number of partitions to generate (this can be greater than the number of workers)
 center_boxtuple (int, int) (default = (-10, 10))
the bounding box which constrains all the centroids
 random_stateint (default = None)
sets random seed (or use None to reinitialize each time)
 return_centersbool, optional (default=False)
If True, then return the centers of each cluster
 verboseint or boolean (default = False)
Logging level.
 shufflebool (default=True)
Shuffles the samples on each worker.
 order: str, optional (default=’F’)
The order of the generated samples
 dtypestr, optional (default=’float32’)
Dtype of the generated samples
 clientdask.distributed.Client (optional)
Dask client to use
 Returns
 Xdask.array backed by CuPy array of shape [n_samples, n_features]
The input samples.
 ydask.array backed by CuPy array of shape [n_samples]
The output values.
 centersdask.array backed by CuPy array of shape [n_centers, n_features], optional
The centers of the underlying blobs. It is returned only if return_centers is True.
cuml.dask.datasets.classification.
make_classification
(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None, order='F', dtype='float32', n_parts=None, client=None)[source]¶Generate a random n-class classification problem.
This initially creates clusters of points normally distributed (std=1) about vertices of an n_informative-dimensional hypercube with sides of length 2 * class_sep and assigns an equal number of clusters to each class. It introduces interdependence between these features and adds various types of further noise to the data.
Without shuffling, X horizontally stacks features in the following order: the primary n_informative features, followed by n_redundant linear combinations of the informative features, followed by n_repeated duplicates, drawn randomly with replacement from the informative and redundant features. The remaining features are filled with random noise. Thus, without shuffling, all useful features are contained in the columns X[:, :n_informative + n_redundant + n_repeated].
 Parameters
 n_samplesint, optional (default=100)
The number of samples.
 n_featuresint, optional (default=20)
The total number of features. These comprise n_informative informative features, n_redundant redundant features, n_repeated duplicated features and n_features - n_informative - n_redundant - n_repeated useless features drawn at random.
 n_informativeint, optional (default=2)
The number of informative features. Each class is composed of a number of gaussian clusters each located around the vertices of a hypercube in a subspace of dimension n_informative. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined within each cluster in order to add covariance. The clusters are then placed on the vertices of the hypercube.
 n_redundantint, optional (default=2)
The number of redundant features. These features are generated as random linear combinations of the informative features.
 n_repeatedint, optional (default=0)
The number of duplicated features, drawn randomly from the informative and the redundant features.
 n_classesint, optional (default=2)
The number of classes (or labels) of the classification problem.
 n_clusters_per_classint, optional (default=2)
The number of clusters per class.
 weightsarray-like of shape (n_classes,) or (n_classes - 1,), (default=None)
The proportions of samples assigned to each class. If None, then classes are balanced. Note that if len(weights) == n_classes - 1, then the last class weight is automatically inferred. More than n_samples samples may be returned if the sum of weights exceeds 1.
 flip_yfloat, optional (default=0.01)
The fraction of samples whose class is assigned randomly. Larger values introduce noise in the labels and make the classification task harder.
 class_sepfloat, optional (default=1.0)
The factor multiplying the hypercube size. Larger values spread out the clusters/classes and make the classification task easier.
 hypercubeboolean, optional (default=True)
If True, the clusters are put on the vertices of a hypercube. If False, the clusters are put on the vertices of a random polytope.
 shiftfloat, array of shape [n_features] or None, optional (default=0.0)
Shift features by the specified value. If None, then features are shifted by a random value drawn in [-class_sep, class_sep].
 scalefloat, array of shape [n_features] or None, optional (default=1.0)
Multiply features by the specified value. If None, then features are scaled by a random value drawn in [1, 100]. Note that scaling happens after shifting.
 shuffleboolean, optional (default=True)
Shuffle the samples and the features.
 random_stateint, RandomState instance or None (default)
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See Glossary.
 order: str, optional (default=’F’)
The order of the generated samples
 dtypestr, optional (default=’float32’)
Dtype of the generated samples
 n_partsint (default = None)
number of partitions to generate (this can be greater than the number of workers)
 Returns
 Xdask.array backed by CuPy array of shape [n_samples, n_features]
The generated samples.
 ydask.array backed by CuPy array of shape [n_samples]
The integer labels for class membership of each sample.
Notes
How we extended the dask MNMG version from the single GPU version:
We generate centroids of shape (n_centroids, n_informative)
We generate an informative covariance of shape (n_centroids, n_informative, n_informative)
We generate a redundant covariance of shape (n_informative, n_redundant)
We generate the indices for the repeated features
We pass along the references to the futures of the above arrays with each part to the single GPU cuml.datasets.classification.make_classification so that each part (and worker) has access to the correct values to generate data from the same covariances
Examples
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.datasets.classification import make_classification
cluster = LocalCUDACluster()
client = Client(cluster)
X, y = make_classification(n_samples=10, n_features=4, n_informative=2, n_classes=2)
print("X:")
print(X.compute())
print("y:")
print(y.compute())
Output:
X:
[[1.6990056 0.8241044 0.06997631 0.45107925]
 [1.8105277 1.7829906 0.492909 0.05390119]
 [0.18290454 0.6155432 0.6667889 1.0053712 ]
 [2.7530136 0.888528 0.5023055 1.3983376 ]
 [0.9788184 0.89851004 0.10802134 0.10021686]
 [0.76883423 1.0689086 0.01249526 0.1404741 ]
 [1.5676656 0.83082974 0.03072987 0.34499463]
 [0.9381793 1.0971068 0.07465998 0.02618019]
 [1.3021476 0.87076336 0.02249984 0.15187258]
 [ 1.1820307 1.7524253 1.5087451 2.4626074 ]]
y:
[0 1 0 0 0 0 0 0 0 1]
cuml.dask.datasets.regression.
make_low_rank_matrix
(n_samples=100, n_features=100, effective_rank=10, tail_strength=0.5, random_state=None, n_parts=1, n_samples_per_part=None, dtype='float32')[source]¶Generate a mostly low rank matrix with bell-shaped singular values
 Parameters
 n_samplesint, optional (default=100)
The number of samples.
 n_featuresint, optional (default=100)
The number of features.
 effective_rankint, optional (default=10)
The approximate number of singular vectors required to explain most of the data by linear combinations.
 tail_strengthfloat between 0.0 and 1.0, optional (default=0.5)
The relative importance of the fat noisy tail of the singular values profile.
 random_stateint, CuPy RandomState instance, Dask RandomState instance or None (default)
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.
 n_partsint, optional (default=1)
The number of parts of work.
 dtype: str, optional (default=’float32’)
dtype of generated data
 Returns
 XDaskCuPy array of shape [n_samples, n_features]
The matrix.
cuml.dask.datasets.regression.
make_regression
(n_samples=100, n_features=100, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=False, coef=False, random_state=None, n_parts=1, n_samples_per_part=None, order='F', dtype='float32', client=None, use_full_low_rank=True)[source]¶Generate a random regression problem.
The input set can either be well conditioned (by default) or have a low rank-fat tail singular profile.
The output is generated by applying a (potentially biased) random linear regression model with “n_informative” non-zero regressors to the previously generated input and some gaussian centered noise with some adjustable scale.
 Parameters
 n_samplesint, optional (default=100)
The number of samples.
 n_featuresint, optional (default=100)
The number of features.
 n_informativeint, optional (default=10)
The number of informative features, i.e., the number of features used to build the linear model used to generate the output.
 n_targetsint, optional (default=1)
The number of regression targets, i.e., the dimension of the y output vector associated with a sample. By default, the output is a scalar.
 biasfloat, optional (default=0.0)
The bias term in the underlying linear model.
 effective_rankint or None, optional (default=None)
 if not None:
The approximate number of singular vectors required to explain most of the input data by linear combinations. Using this kind of singular spectrum in the input allows the generator to reproduce the correlations often observed in practice.
 if None:
The input set is well conditioned, centered and gaussian with unit variance.
 tail_strengthfloat between 0.0 and 1.0, optional (default=0.5)
The relative importance of the fat noisy tail of the singular values profile if “effective_rank” is not None.
 noisefloat, optional (default=0.0)
The standard deviation of the gaussian noise applied to the output.
 shuffleboolean, optional (default=False)
Shuffle the samples and the features.
 coefboolean, optional (default=False)
If True, the coefficients of the underlying linear model are returned.
 random_stateint, CuPy RandomState instance, Dask RandomState instance or None (default)
Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls.
 n_partsint, optional (default=1)
The number of parts of work.
 orderstr, optional (default=’F’)
Row-major or Col-major
 dtype: str, optional (default=’float32’)
dtype of generated data
 use_full_low_rankboolean (default=True)
Whether to use the entire dataset to generate the low rank matrix. If False, it creates a low rank covariance and uses the corresponding covariance to generate a multivariate normal distribution on the remaining chunks
 Returns
 XDaskCuPy array of shape [n_samples, n_features]
The input samples.
 yDaskCuPy array of shape [n_samples] or [n_samples, n_targets]
The output values.
 coefDaskCuPy array of shape [n_features] or [n_features, n_targets], optional
The coefficient of the underlying linear model. It is returned only if coef is True.
Notes
 Known Performance Limitations:
When effective_rank is set and use_full_low_rank is True, we cannot generate order F by construction, and an explicit transpose is performed on each part. This may cause memory to spike (other parameters make order F by construction).
When n_targets > 1 and order = 'F' as above, we have to explicitly transpose the y array. If coef = True, then we also explicitly transpose the ground_truth array.
When shuffle = True and order = 'F', there are memory spikes to shuffle the F order arrays.
Note
If out-of-memory errors are encountered in any of the above configurations, try increasing the n_parts parameter.
Array Wrappers (Internal API)¶

class
cuml.common.
CumlArray
(data=None, owner=None, dtype=None, shape=None, order=None)[source]¶ Array represents an abstracted array allocation. It can be instantiated by itself, creating an rmm.DeviceBuffer underneath, or can be instantiated by __cuda_array_interface__ or __array_interface__ compliant arrays, in which case it’ll keep a reference to that data underneath. It can also be created from a pointer by specifying the characteristics of the array; in that case the owner of the data referred to by the pointer should be specified explicitly.
 Parameters
 datarmm.DeviceBuffer, cudf.Buffer, array_like, int, bytes, bytearray or memoryview
An array-like object or integer representing a device or host pointer to pre-allocated memory.
 ownerobject, optional
Python object to which the lifetime of the memory allocation is tied. If provided, a reference to this object is kept in this Buffer.
 dtypedatatype, optional
Any object that can be interpreted as a numpy or cupy data type.
 shapeint or tuple of ints, optional
Shape of created array.
 order: string, optional
Whether to create an F-major or C-major array.
Notes
cuml Array is not meant as an end-user array library. It is meant for cuML/RAPIDS developer consumption, and therefore contains only the minimum functionality. Its functionality is hidden by base.pyx to provide automatic output format conversion so that users see the important attributes in whatever format they prefer.
Todo: support cuda streams in the constructor. See: https://github.com/rapidsai/cuml/issues/1712 https://github.com/rapidsai/cuml/pull/1396
 Attributes
 ptrint
Pointer to the data
 sizeint
Size of the array data in bytes
 _ownerPython Object
Object that owns the data of the array
 shapetuple of ints
Shape of the array
 order{‘F’, ‘C’}
‘F’ or ‘C’ to indicate Fortran-major or C-major order of the array
 stridestuple of ints
Strides of the data
 __cuda_array_interface__dictionary
__cuda_array_interface__ to interop with other libraries.
Methods
empty
(shape, dtype[, order])Create an empty Array with an allocated but uninitialized DeviceBuffer
full
(shape, value, dtype[, order])Create an Array with an allocated DeviceBuffer initialized to value.
ones
(shape[, dtype, order])Create an Array with an allocated DeviceBuffer initialized to ones.
to_output
([output_type, output_dtype])Convert array to output format
zeros
(shape[, dtype, order])Create an Array with an allocated DeviceBuffer initialized to zeros.
item
serialize

classmethod
empty
(shape, dtype, order='F')[source]¶ Create an empty Array with an allocated but uninitialized DeviceBuffer
 Parameters
 dtypedatatype, optional
Any object that can be interpreted as a numpy or cupy data type.
 shapeint or tuple of ints, optional
Shape of created array.
 order: string, optional
Whether to create an F-major or C-major array.

classmethod
full
(shape, value, dtype, order='F')[source]¶ Create an Array with an allocated DeviceBuffer initialized to value.
 Parameters
 dtypedatatype, optional
Any object that can be interpreted as a numpy or cupy data type.
 shapeint or tuple of ints, optional
Shape of created array.
 order: string, optional
Whether to create an F-major or C-major array.

classmethod
ones
(shape, dtype='float32', order='F')[source]¶ Create an Array with an allocated DeviceBuffer initialized to ones.
 Parameters
 dtypedatatype, optional
Any object that can be interpreted as a numpy or cupy data type.
 shapeint or tuple of ints, optional
Shape of created array.
 order: string, optional
Whether to create an F-major or C-major array.

to_output
(output_type='cupy', output_dtype=None)[source]¶ Convert array to output format
 Parameters
 output_typestring
Format to convert the array to. Acceptable formats are:
‘cupy’ - to cupy array
‘numpy’ - to numpy (host) array
‘numba’ - to numba device array
‘dataframe’ - to cuDF DataFrame
‘series’ - to cuDF Series
‘cudf’ - to cuDF Series if array is single dimensional, to DataFrame otherwise
 output_dtypestring, optional
Optionally cast the array to a specified dtype, creating a copy if necessary.

classmethod
zeros
(shape, dtype='float32', order='F')[source]¶ Create an Array with an allocated DeviceBuffer initialized to zeros.
 Parameters
 dtypedatatype, optional
Any object that can be interpreted as a numpy or cupy data type.
 shapeint or tuple of ints, optional
Shape of created array.
 order: string, optional
Whether to create an F-major or C-major array.
Metrics (regression, classification, and distance)¶
cuml.metrics.regression.
mean_absolute_error
(y_true, y_pred, sample_weight=None, multioutput='uniform_average')[source]¶Mean absolute error regression loss
Be careful when using this metric with float32 inputs as the result can be slightly incorrect because of floating point precision if the input is large enough. float64 will have lower numerical error.
 Parameters
 y_truearray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.
 y_predarray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Estimated target values.
 sample_weightarray-like (device or host) shape = (n_samples,), optional
Sample weights.
 multioutputstring in [‘raw_values’, ‘uniform_average’] or array-like of shape (n_outputs)
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.
‘raw_values’ : Returns a full set of errors in case of multioutput input.
‘uniform_average’ : Errors of all outputs are averaged with uniform weight.
 Returns
 lossfloat or ndarray of floats
If multioutput is ‘raw_values’, then mean absolute error is returned for each output separately. If multioutput is ‘uniform_average’ or an ndarray of weights, then the weighted average of all output errors is returned.
MAE output is a non-negative floating point value. The best value is 0.0.
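The aggregation rules above can be illustrated with a small plain-Python sketch. It is not cuML's implementation: the function below is a hypothetical reference that assumes list inputs, and it assumes that an array-like multioutput is applied as a weighted average normalized by the sum of the weights.

```python
def mean_absolute_error(y_true, y_pred, sample_weight=None,
                        multioutput="uniform_average"):
    """Plain-Python reference sketch; rows are samples, columns are outputs."""
    # Promote 1-D input to a single-output 2-D layout.
    rows_true = [r if isinstance(r, (list, tuple)) else [r] for r in y_true]
    rows_pred = [r if isinstance(r, (list, tuple)) else [r] for r in y_pred]
    n_outputs = len(rows_true[0])
    weights = sample_weight or [1.0] * len(rows_true)
    total = sum(weights)
    # Weighted mean of absolute errors, computed separately per output column.
    per_output = [
        sum(w * abs(t[j] - p[j])
            for w, t, p in zip(weights, rows_true, rows_pred)) / total
        for j in range(n_outputs)
    ]
    if multioutput == "raw_values":
        return per_output
    if multioutput == "uniform_average":
        return sum(per_output) / n_outputs
    # Assumption: any other value is a sequence of per-output weights.
    return sum(w * e for w, e in zip(multioutput, per_output)) / sum(multioutput)


y_true = [[0.5, 1.0], [-1.0, 1.0], [7.0, -6.0]]
y_pred = [[0.0, 2.0], [-1.0, 2.0], [8.0, -5.0]]
raw = mean_absolute_error(y_true, y_pred, multioutput="raw_values")
avg = mean_absolute_error(y_true, y_pred)
print(raw, avg)
```

With 'raw_values' the result has one entry per output column; with 'uniform_average' those entries are averaged into a single float.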
cuml.metrics.regression.
mean_squared_error
(y_true, y_pred, sample_weight=None, multioutput='uniform_average', squared=True)[source]¶Mean squared error regression loss
Be careful when using this metric with float32 inputs as the result can be slightly incorrect because of floating point precision if the input is large enough. float64 will have lower numerical error.
 Parameters
 y_truearray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.
 y_predarray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Estimated target values.
 sample_weightarray-like (device or host) shape = (n_samples,), optional
Sample weights.
 multioutputstring in [‘raw_values’, ‘uniform_average’] or array-like of shape (n_outputs)
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.
‘raw_values’ : Returns a full set of errors in case of multioutput input.
‘uniform_average’ : Errors of all outputs are averaged with uniform weight.
 squaredboolean value, optional (default = True)
If True returns MSE value, if False returns RMSE value.
 Returns
 lossfloat or ndarray of floats
A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
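As a rough illustration of the squared flag, the following plain-Python sketch computes MSE and RMSE for 1-D, unweighted input. It is a hypothetical reference under those assumptions, not the cuML code path.

```python
import math


def mean_squared_error(y_true, y_pred, squared=True):
    """1-D, unweighted reference sketch: MSE if squared, else RMSE."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return mse if squared else math.sqrt(mse)


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
mse = mean_squared_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred, squared=False)
print(mse, rmse)
```

RMSE is in the same units as the target, which often makes it easier to interpret than MSE.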
cuml.metrics.regression.
mean_squared_log_error
(y_true, y_pred, sample_weight=None, multioutput='uniform_average', squared=True)[source]¶Mean squared log error regression loss
Be careful when using this metric with float32 inputs as the result can be slightly incorrect because of floating point precision if the input is large enough. float64 will have lower numerical error.
 Parameters
 y_truearray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.
 y_predarray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Estimated target values.
 sample_weightarray-like (device or host) shape = (n_samples,), optional
Sample weights.
 multioutputstring in [‘raw_values’, ‘uniform_average’] or array-like of shape (n_outputs)
Defines aggregating of multiple output values. Array-like value defines weights used to average errors.
‘raw_values’ : Returns a full set of errors in case of multioutput input.
‘uniform_average’ : Errors of all outputs are averaged with uniform weight.
 squaredboolean value, optional (default = True)
If True returns the MSLE value, if False returns the RMSLE value.
 Returns
 lossfloat or ndarray of floats
A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
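The sketch below shows one common definition of this loss, assuming the scikit-learn convention of averaging the squared differences of log(1 + y). That convention is an assumption here; the text above does not spell out the formula. Inputs are assumed 1-D, unweighted, and non-negative.

```python
import math


def mean_squared_log_error(y_true, y_pred, squared=True):
    """Reference sketch: mean of (log(1 + y_true) - log(1 + y_pred))**2."""
    msle = sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)
    return msle if squared else math.sqrt(msle)


y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
msle = mean_squared_log_error(y_true, y_pred)
rmsle = mean_squared_log_error(y_true, y_pred, squared=False)
print(msle, rmsle)
```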
cuml.metrics.regression.
r2_score
(y, y_hat, convert_dtype=True, handle=None) → double[source]¶Calculates r2 score between y and y_hat
 Parameters
 yarray-like (device or host) shape = (n_samples, 1)
Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
 y_hatarray-like (device or host) shape = (n_samples, 1)
Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
 convert_dtypebool, optional (default = True)
When set to True, the method will, when necessary, convert y_hat to be the same data type as y if they differ. This will increase memory used for the method.
 Returns
 r2 scoredouble
The r2 score of y_hat with respect to y
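For orientation, the r2 score is conventionally defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The following plain-Python sketch assumes that standard definition and list inputs; it is a hypothetical reference, not the cuML routine.

```python
def r2_score(y, y_hat):
    """Reference sketch: 1 - SS_res / SS_tot for 1-D inputs."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


y = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]
score = r2_score(y, y_hat)
print(score)
```

A perfect prediction gives 1.0, and always predicting the mean of y gives 0.0.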
cuml.metrics.accuracy.
accuracy_score
(ground_truth, predictions, handle=None, convert_dtype=True)[source]¶Calculates the accuracy score of a classification model.
 Parameters
 handlecuml.Handle
 predictionsNumPy ndarray or Numba device
The labels predicted by the model for the test dataset
 ground_truthNumPy ndarray, Numba device
The ground truth labels of the test dataset
 Returns
 float
The accuracy of the model used for prediction
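As a minimal sketch, assuming the usual definition of accuracy as the fraction of predictions that match the ground truth, the computation looks like this in plain Python (hypothetical reference function, list inputs):

```python
def accuracy_score(ground_truth, predictions):
    """Reference sketch: fraction of positions where the labels agree."""
    matches = sum(1 for g, p in zip(ground_truth, predictions) if g == p)
    return matches / len(ground_truth)


acc = accuracy_score([0, 1, 2, 3], [0, 2, 1, 3])
print(acc)
```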
cuml.metrics.
confusion_matrix
(y_true, y_pred, labels=None, sample_weight=None, normalize=None) → cuml.common.array.CumlArray[source]¶Compute confusion matrix to evaluate the accuracy of a classification.
 Parameters
 y_truearray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Ground truth (correct) target values.
 y_predarray-like (device or host) shape = (n_samples,) or (n_samples, n_outputs)
Estimated target values.
 labelsarray-like (device or host) shape = (n_classes,), optional
List of labels to index the matrix. This may be used to reorder or select a subset of labels. If None is given, those that appear at least once in y_true or y_pred are used in sorted order.
 sample_weightarray-like (device or host) shape = (n_samples,), optional
Sample weights.
 normalizestring in [‘true’, ‘pred’, ‘all’]
Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population. If None, confusion matrix will not be normalized.
 Returns
 Carray-like (device or host) shape = (n_classes, n_classes)
Confusion matrix.
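The labels and normalize options can be illustrated with a plain-Python sketch. It is a hypothetical reference that assumes list inputs, unweighted samples, and rows indexed by true label and columns by predicted label, as described above.

```python
def confusion_matrix(y_true, y_pred, labels=None, normalize=None):
    """Reference sketch: rows are true labels, columns are predicted labels."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)
    cm = [[0.0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        if t in index and p in index:  # samples outside `labels` are skipped
            cm[index[t]][index[p]] += 1
    if normalize == "true":  # each row sums to 1
        cm = [[v / (sum(row) or 1.0) for v in row] for row in cm]
    elif normalize == "pred":  # each column sums to 1
        col = [sum(row[j] for row in cm) or 1.0 for j in range(n)]
        cm = [[row[j] / col[j] for j in range(n)] for row in cm]
    elif normalize == "all":  # whole matrix sums to 1
        total = sum(map(sum, cm)) or 1.0
        cm = [[v / total for v in row] for row in cm]
    return cm


y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred)
cm_norm = confusion_matrix(y_true, y_pred, normalize="true")
print(cm)
```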
cuml.metrics.
log_loss
(y_true, y_pred, eps=1e-15, normalize=True, sample_weight=None) → float[source]¶Log loss, aka logistic loss or cross-entropy loss. This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of a logistic model that returns y_pred probabilities for its training data y_true. The log loss is only defined for two or more labels.
 Parameters
 y_truearray-like, shape = (n_samples,)
 y_predarray-like of float, shape = (n_samples, n_classes) or (n_samples,)
 epsfloat
Log loss is undefined for p=0 or p=1, so probabilities are clipped to max(eps, min(1 - eps, p)).
 normalizebool, optional (default=True)
If true, return the mean loss per sample. Otherwise, return the sum of the per-sample losses.
 sample_weightarray-like of shape (n_samples,), default=None
Sample weights.
 Returns
 lossfloat
Notes
The logarithm used is the natural logarithm (base-e).
References
C.M. Bishop (2006). Pattern Recognition and Machine Learning. Springer, p. 209.
Examples
>>> from cuml.metrics import log_loss
>>> import numpy as np
>>> log_loss(np.array([1, 0, 0, 1]),
...          np.array([[.1, .9], [.9, .1], [.8, .2], [.35, .65]]))
0.21616...
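The value in the example above can be reproduced by hand. The sketch below is a plain-Python reference, not cuML's code; it assumes integer class labels that index the columns of y_pred and rows that already sum to 1.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15, normalize=True):
    """Reference sketch: negative log-likelihood of the true class."""
    losses = []
    for label, probs in zip(y_true, y_pred):
        # Clip the probability of the true class away from 0 and 1.
        p = max(eps, min(1 - eps, probs[label]))
        losses.append(-math.log(p))
    total = sum(losses)
    return total / len(losses) if normalize else total


loss = log_loss([1, 0, 0, 1],
                [[.1, .9], [.9, .1], [.8, .2], [.35, .65]])
print(loss)
```

The four true-class probabilities are 0.9, 0.9, 0.8 and 0.65, so the mean of their negative natural logarithms gives the documented 0.21616...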
cuml.metrics.
roc_auc_score
(y_true, y_score)[source]¶Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.
Note
this implementation can only be used with binary classification.
 Parameters
 y_truearray-like of shape (n_samples,)
True labels. The binary cases expect labels with shape (n_samples,)
 y_scorearray-like of shape (n_samples,)
Target scores. In the binary cases, these can be either probability estimates or non-thresholded decision values (as returned by decision_function on some classifiers). The binary case expects a shape (n_samples,), and the scores must be the scores of the class with the greater label.
 Returns
 aucfloat
Examples
>>> import numpy as np
>>> from cuml.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> print(roc_auc_score(y_true, y_scores))
0.75
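One way to check the 0.75 in the example above is the pairwise interpretation of ROC AUC: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The plain-Python sketch below uses that equivalent formulation; it is not how cuML computes the value internally.

```python
def roc_auc_score(y_true, y_score):
    """Reference sketch: share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


auc = roc_auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Three of the four positive/negative pairs are ordered correctly (only 0.35 versus 0.4 is not), which gives 3 / 4 = 0.75.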
cuml.metrics.
precision_recall_curve
(y_true, probs_pred) → Tuple[cuml.common.array.CumlArray, cuml.common.array.CumlArray, cuml.common.array.CumlArray][source]¶Compute precision-recall pairs for different probability thresholds
Note
this implementation is restricted to the binary classification task.
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
The last precision and recall values are 1. and 0. respectively and do not have a corresponding threshold. This ensures that the graph starts on the y axis.
Read more in scikit-learn’s User Guide.
 Parameters
 y_truearray, shape = [n_samples]
True binary labels, {0, 1}.
 probs_predarray, shape = [n_samples]
Estimated probabilities or decision function.
 Returns
 precisionarray, shape = [n_thresholds + 1]
Precision values such that element i is the precision of predictions with score >= thresholds[i] and the last element is 1.
 recallarray, shape = [n_thresholds + 1]
Decreasing recall values such that element i is the recall of predictions with score >= thresholds[i] and the last element is 0.
 thresholdsarray, shape = [n_thresholds <= len(np.unique(probs_pred))]
Increasing thresholds on the decision function used to compute precision and recall.
Examples
import numpy as np
from cuml.metrics import precision_recall_curve
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall, thresholds = precision_recall_curve(
    y_true, y_scores)
print(precision)
print(recall)
print(thresholds)
Output:
array([0.66666667, 0.5 , 1. , 1. ])
array([1. , 0.5, 0.5, 0. ])
array([0.35, 0.4 , 0.8 ])
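To make the shapes of the three returned arrays concrete, here is a plain-Python sketch that reproduces the output above. It is a hypothetical reference, not cuML's algorithm. In particular, dropping every threshold below the highest one that still reaches full recall is an assumption made to match the documented example.

```python
def precision_recall_curve(y_true, scores):
    """Reference sketch for binary labels in {0, 1}."""
    n_pos = sum(y_true)
    points = []
    for t in sorted(set(scores)):
        # A sample is predicted positive when its score is >= t.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((t, tp / (tp + fp), tp / n_pos))
    # Assumption: start at the highest threshold that still has recall 1.
    start = max(i for i, (_, _, r) in enumerate(points) if r == 1.0)
    points = points[start:]
    thresholds = [t for t, _, _ in points]
    # Append the final (precision=1, recall=0) point, which has no threshold.
    precision = [p for _, p, _ in points] + [1.0]
    recall = [r for _, _, r in points] + [0.0]
    return precision, recall, thresholds


precision, recall, thresholds = precision_recall_curve(
    [0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(precision, recall, thresholds)
```

The precision and recall lists each have one more entry than thresholds, matching the [n_thresholds + 1] shapes listed under Returns.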
cuml.metrics.pairwise_distances.
pairwise_distances
(X, Y=None, metric='euclidean', handle=None, convert_dtype=True, output_type=None, **kwds)[source]¶Compute the distance matrix from a vector array X and optional Y.
This method takes either one or two vector arrays, and returns a distance matrix.
If Y is given (default is None), then the returned matrix is the pairwise distance between the arrays from both X and Y.
Valid values for metric are:
 From scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’].
Sparse matrices are supported, see ‘sparse_pairwise_distances’.
 From scipy.spatial.distance: [‘sqeuclidean’]
See the documentation for scipy.spatial.distance for details on this metric. Sparse matrices are supported.
 Parameters
 XDense or sparse matrix (device or host) of shape (n_samples_x, n_features)
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy, or cupyx.scipy.sparse for sparse input
 Yarray-like (device or host) of shape (n_samples_y, n_features), optional
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
 metric{“cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, “sqeuclidean”}
The metric to use when calculating distance between instances in a feature array.
 convert_dtypebool, optional (default = True)
When set to True, the method will, when necessary, convert Y to be the same data type as X if they differ. This will increase memory used for the method.
 output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
Deprecated since version 0.17: output_type is deprecated in 0.17 and will be removed in 0.18. Please use the module level output type control, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
 Returns
 Darray [n_samples_x, n_samples_x] or [n_samples_x, n_samples_y]
A distance matrix D such that D_{i, j} is the distance between the ith and jth vectors of the given matrix X, if Y is None. If Y is not None, then D_{i, j} is the distance between the ith array from X and the jth array from Y.
Examples
>>> import cupy as cp
>>> from cuml.metrics import pairwise_distances
>>>
>>> X = cp.array([[2.0, 3.0], [3.0, 5.0], [5.0, 8.0]])
>>> Y = cp.array([[1.0, 0.0], [2.0, 1.0]])
>>>
>>> # Euclidean Pairwise Distance, Single Input:
>>> pairwise_distances(X, metric='euclidean')
array([[0.        , 2.23606798, 5.83095189],
       [2.23606798, 0.        , 3.60555128],
       [5.83095189, 3.60555128, 0.        ]])
>>>
>>> # Cosine Pairwise Distance, Multi-Input:
>>> pairwise_distances(X, Y, metric='cosine')
array([[0.4452998 , 0.13175686],
       [0.48550424, 0.15633851],
       [0.47000106, 0.14671817]])
>>>
>>> # Manhattan Pairwise Distance, Multi-Input:
>>> pairwise_distances(X, Y, metric='manhattan')
array([[ 4.,  2.],
       [ 7.,  5.],
       [12., 10.]])
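The numbers in the example above can be cross-checked on the CPU. The following plain-Python sketch is a hypothetical reference for a few of the listed metrics, using their standard textbook definitions on nested lists; it does not use cuML.

```python
import math


def pairwise_distances(X, Y=None, metric="euclidean"):
    """Reference sketch: D[i][j] is the distance between X[i] and Y[j]."""
    Y = X if Y is None else Y

    def dist(a, b):
        if metric in ("euclidean", "l2"):
            return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
        if metric == "sqeuclidean":
            return sum((u - v) ** 2 for u, v in zip(a, b))
        if metric in ("manhattan", "cityblock", "l1"):
            return sum(abs(u - v) for u, v in zip(a, b))
        if metric == "cosine":
            dot = sum(u * v for u, v in zip(a, b))
            norm_a = math.sqrt(sum(u * u for u in a))
            norm_b = math.sqrt(sum(v * v for v in b))
            return 1.0 - dot / (norm_a * norm_b)
        raise ValueError("unsupported metric: " + metric)

    return [[dist(a, b) for b in Y] for a in X]


X = [[2.0, 3.0], [3.0, 5.0], [5.0, 8.0]]
Y = [[1.0, 0.0], [2.0, 1.0]]
d_euclidean = pairwise_distances(X, metric="euclidean")
d_cosine = pairwise_distances(X, Y, metric="cosine")
d_manhattan = pairwise_distances(X, Y, metric="manhattan")
print(d_manhattan)
```

With a single input the result is square and symmetric with a zero diagonal; with two inputs it has one row per row of X and one column per row of Y.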
cuml.metrics.pairwise_distances.
sparse_pairwise_distances
(X, Y=None, metric='euclidean', handle=None, convert_dtype=True, metric_arg=2, **kwds)[source]¶Compute the distance matrix from a vector array X and optional Y.
This method takes either one or two sparse vector arrays, and returns a dense distance matrix.
If Y is given (default is None), then the returned matrix is the pairwise distance between the arrays from both X and Y.
Valid values for metric are:
 From scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’].
 From scipy.spatial.distance: [‘sqeuclidean’, ‘canberra’, ‘minkowski’, ‘jaccard’, ‘chebyshev’, ‘dice’]
See the documentation for scipy.spatial.distance for details on these metrics.
 [‘inner_product’, ‘hellinger’]
 Parameters
 Xarray-like (device or host) of shape (n_samples_x, n_features)
Acceptable formats: SciPy or Cupy sparse array
 Yarray-like (device or host) of shape (n_samples_y, n_features), optional
Acceptable formats: SciPy or Cupy sparse array
 metric{“cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, “sqeuclidean”, “canberra”, “lp”, “inner_product”, “minkowski”, “jaccard”, “hellinger”, “chebyshev”, “linf”, “dice”}
The metric to use when calculating distance between instances in a feature array.
 convert_dtypebool, optional (default = True)
When set to True, the method will, when necessary, convert Y to be the same data type as X if they differ. This will increase memory used for the method.
 metric_argfloat, optional (default = 2)
Additional metric-specific argument. For Minkowski it’s the p-norm to apply.
 Returns
 Darray [n_samples_x, n_samples_x] or [n_samples_x, n_samples_y]
A dense distance matrix D such that D_{i, j} is the distance between the ith and jth vectors of the given matrix X, if Y is None. If Y is not None, then D_{i, j} is the distance between the ith array from X and the jth array from Y.
Examples
>>> import cupyx
>>> from cuml.metrics import sparse_pairwise_distances
>>>
>>> X = cupyx.scipy.sparse.random(2, 3, density=0.5)
>>> Y = cupyx.scipy.sparse.random(1, 3, density=0.5)
>>> X.todense()
array([[0.02797998, 0.        , 0.66309184],
       [0.        , 0.        , 0.92316052]])
>>> Y.todense()
array([[0.        , 0.        , 0.32750517]])
>>> # Cosine Pairwise Distance, Single Input:
>>> sparse_pairwise_distances(X, metric='cosine')
array([[0.        , 0.00088907],
       [0.00088907, 0.        ]])
>>>
>>> # Squared euclidean Pairwise Distance, Multi-Input:
>>> sparse_pairwise_distances(X, Y, metric='sqeuclidean')
array([[0.11340129],
       [0.3548053]])
>>>
>>> # Canberra Pairwise Distance, Multi-Input:
>>> sparse_pairwise_distances(X, Y, metric='canberra')
array([[1.33877214],
       [0.47627064]])
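The role of metric_arg, and two of the values printed above, can be checked with a small plain-Python sketch. It is a hypothetical reference only: sparse rows are modeled as {column: value} dictionaries, and the metrics follow their standard scipy.spatial.distance definitions.

```python
def sparse_distance(a, b, metric="minkowski", metric_arg=2):
    """Reference sketch: a and b are sparse rows stored as {column: value}."""
    # Only columns that are non-zero in at least one row can contribute.
    pairs = [(a.get(k, 0.0), b.get(k, 0.0)) for k in set(a) | set(b)]
    if metric == "sqeuclidean":
        return sum((x - y) ** 2 for x, y in pairs)
    if metric == "minkowski":
        # metric_arg is the order p of the norm.
        return sum(abs(x - y) ** metric_arg for x, y in pairs) ** (1.0 / metric_arg)
    if metric == "canberra":
        return sum(abs(x - y) / (abs(x) + abs(y)) for x, y in pairs if x or y)
    raise ValueError("unsupported metric: " + metric)


# First row of X and the single row of Y from the example above.
x0 = {0: 0.02797998, 2: 0.66309184}
y0 = {2: 0.32750517}
sq = sparse_distance(x0, y0, metric="sqeuclidean")
canberra = sparse_distance(x0, y0, metric="canberra")
mink2 = sparse_distance(x0, y0, metric="minkowski", metric_arg=2)
mink1 = sparse_distance(x0, y0, metric="minkowski", metric_arg=1)
print(sq, canberra, mink2, mink1)
```

With metric_arg=2 the Minkowski distance equals the euclidean distance, and with metric_arg=1 it equals the manhattan distance.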
Metrics (clustering and trustworthiness)¶
cuml.metrics.trustworthiness.
trustworthiness
(X, X_embedded, handle=None, n_neighbors=5, metric='euclidean', should_downcast=True, convert_dtype=False, batch_size=512) → double[source]¶Expresses to what extent the local structure is retained in embedding. The score is defined in the range [0, 1].
 Parameters
 Xarray-like (device or host) shape = (n_samples, n_features)
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
 X_embeddedarray-like (device or host) shape = (n_samples, n_features)
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
 n_neighborsint, optional (default: 5)
Number of neighbors considered
 convert_dtypebool, optional (default = False)
When set to True, the trustworthiness method will automatically convert the inputs to np.float32.
 Returns
 trustworthiness scoredouble
Trustworthiness of the low-dimensional embedding
cuml.metrics.cluster.adjusted_rand_index.
adjusted_rand_score
(labels_true, labels_pred, handle=None, convert_dtype=True) → float[source]¶Adjusted_rand_score is a clustering similarity metric based on the Rand index and is corrected for chance.
 Parameters
 labels_true : Ground truth labels to be used as a reference
 labels_pred : Array of predicted labels used to evaluate the model
 handle : cuml.Handle
 Returns
 float
The adjusted rand index value between -1.0 and 1.0
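To make "corrected for chance" concrete, the following plain-Python sketch computes the adjusted Rand index from pair counts in the contingency table. It is an illustration of the standard formula only; the helper name adjusted_rand_index is hypothetical and this is not cuML's GPU implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index for two equal-length label sequences (illustrative helper)."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treat as perfect agreement
    # Count pairs of samples that fall in the same cell / row / column
    # of the contingency table.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_cells - expected) / (max_index - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partial_match = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
worse_than_chance = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Relabeling the clusters leaves the score unchanged, and a labeling that agrees less than chance would predict yields a negative value.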
cuml.metrics.cluster.entropy.
cython_entropy
(clustering, base=None, handle=None) → float[source]¶Computes the entropy of a distribution for given probability values.
 Parameters
 clusteringarraylike (device or host) shape = (n_samples,)
Clustering of labels. Probabilities are computed based on occurrences of labels. For instance, to represent a fair coin (2 equally possible outcomes), the clustering could be [0,1]. For a biased coin with 2/3 probability for tail, the clustering could be [0, 0, 1].
 base: float, optional
The logarithmic base to use, defaults to e (natural logarithm).
 handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 Returns
 Sfloat
The calculated entropy.
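The coin examples above can be checked with a small plain-Python sketch that derives probabilities from label occurrences and applies the Shannon entropy formula. The helper name label_entropy is hypothetical; it mirrors the documented clustering and base parameters but is not cuML code.

```python
from collections import Counter
from math import log


def label_entropy(clustering, base=None):
    """Entropy of the label distribution implied by occurrence counts (illustrative helper)."""
    n = len(clustering)
    counts = Counter(clustering)
    # Natural-log entropy: -sum(p * ln p) over the observed labels.
    entropy = -sum((c / n) * log(c / n) for c in counts.values())
    if base is not None:
        entropy /= log(base)  # change of logarithmic base
    return entropy


fair_coin = label_entropy([0, 1])            # two equally likely outcomes
fair_coin_bits = label_entropy([0, 1], base=2)
biased_coin = label_entropy([0, 0, 1])       # probabilities 2/3 and 1/3
```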
cuml.metrics.cluster.homogeneity_score.
cython_homogeneity_score
(labels_true, labels_pred, handle=None) → float[source]¶Computes the homogeneity metric of a cluster labeling given a ground truth.
A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class.
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.
This metric is not symmetric: switching labels_true with labels_pred will return the completeness_score, which will be different in general.
The labels in labels_pred and labels_true are assumed to be drawn from a contiguous set (Ex: drawn from {2, 3, 4}, but not from {2, 4}). If your set of labels looks like {2, 4}, convert them to something like {0, 1}.
 Parameters
 labels_predarraylike (device or host) shape = (n_samples,)
The labels predicted by the model for the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
 labels_truearraylike (device or host) shape = (n_samples,)
The ground truth labels (ints) of the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
 handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 Returns
 float
The homogeneity of the predicted labeling given the ground truth. Score between 0.0 and 1.0. 1.0 stands for perfectly homogeneous labeling.
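As a rough illustration of the definition, homogeneity can be written as 1 - H(C|K) / H(C), where C are the true classes and K the predicted clusters. The sketch below uses that standard formulation in plain Python; the helper names are hypothetical and the code is not cuML's implementation. Swapping the arguments gives completeness, which shows the asymmetry described above.

```python
from collections import Counter
from math import log


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given), computed from joint label counts."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (_, g), c in joint.items())


def homogeneity(labels_true, labels_pred):
    h_true = _entropy(labels_true)
    if h_true == 0.0:
        return 1.0  # a single true class is trivially homogeneous
    return 1.0 - _conditional_entropy(labels_true, labels_pred) / h_true


def completeness(labels_true, labels_pred):
    # Completeness is homogeneity with the roles of the two labelings swapped.
    return homogeneity(labels_pred, labels_true)


truth = [0, 0, 1, 1]
over_split = [0, 1, 2, 3]  # every cluster is pure, but classes are split apart
h_score = homogeneity(truth, over_split)
c_score = completeness(truth, over_split)
```

Here every predicted cluster contains a single class, so homogeneity is perfect, while completeness is penalized because each class is spread over two clusters.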
cuml.metrics.cluster.silhouette_score.
cython_silhouette_samples
(X, labels, metric='euclidean', chunksize=None, handle=None)[source]¶Calculate the silhouette coefficient for each sample in the provided data
Given a set of cluster labels for every sample in the provided data, compute the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The silhouette coefficient for a sample is then (b - a) / max(a, b).
 Parameters
 Xarraylike, shape = (n_samples, n_features)
The feature vectors for all samples.
 labelsarraylike, shape = (n_samples,)
The assigned cluster labels for each sample.
 metricstring
A string representation of the distance metric to use for evaluating the silhouette score. Available options are “cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, and “sqeuclidean”.
 chunksizeinteger (default = None)
An integer, 1 <= chunksize <= n_rows to tile the pairwise distance matrix computations, so as to reduce the quadratic memory usage of having the entire pairwise distance matrix in GPU memory. If None, chunksize will automatically be set to 40000, which through experiments has proved to be a safe number for the computation to run on a GPU with 16 GB VRAM.
 handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
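The (b - a) / max(a, b) definition can be traced by hand on a tiny dataset. The plain-Python sketch below uses Euclidean distance and excludes the sample itself from its own intra-cluster mean; that exclusion and the score of 0 for singleton clusters follow the common scikit-learn convention and are assumptions here, not statements about cuML's kernels. It expects at least two distinct clusters.

```python
from math import dist


def silhouette_samples(X, labels):
    """Per-sample silhouette coefficients using Euclidean distance (illustrative helper)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = sum(dist(X[i], X[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(X[i], X[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


points = [[0.0], [1.0], [10.0], [11.0]]
cluster_labels = [0, 0, 1, 1]
sample_scores = silhouette_samples(points, cluster_labels)
mean_score = sum(sample_scores) / len(sample_scores)
```

Averaging the per-sample values gives the single number that a mean silhouette score reports.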
cuml.metrics.cluster.silhouette_score.
cython_silhouette_score
(X, labels, metric='euclidean', chunksize=None, handle=None)[source]¶Calculate the mean silhouette coefficient for the provided data
Given a set of cluster labels for every sample in the provided data, compute the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The silhouette coefficient for a sample is then (b - a) / max(a, b).
 Parameters
 Xarraylike, shape = (n_samples, n_features)
The feature vectors for all samples.
 labelsarraylike, shape = (n_samples,)
The assigned cluster labels for each sample.
 metricstring
A string representation of the distance metric to use for evaluating the silhouette score. Available options are “cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, and “sqeuclidean”.
 chunksizeinteger (default = None)
An integer, 1 <= chunksize <= n_rows to tile the pairwise distance matrix computations, so as to reduce the quadratic memory usage of having the entire pairwise distance matrix in GPU memory. If None, chunksize will automatically be set to 40000, which through experiments has proved to be a safe number for the computation to run on a GPU with 16 GB VRAM.
 handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
cuml.metrics.cluster.completeness_score.
cython_completeness_score
(labels_true, labels_pred, handle=None) → float[source]¶Completeness metric of a cluster labeling given a ground truth.
A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.
This metric is not symmetric: switching labels_true with labels_pred will return the homogeneity_score, which will be different in general.
The labels in labels_pred and labels_true are assumed to be drawn from a contiguous set (Ex: drawn from {2, 3, 4}, but not from {2, 4}). If your set of labels looks like {2, 4}, convert them to something like {0, 1}.
 Parameters
 labels_predarraylike (device or host) shape = (n_samples,)
The labels predicted by the model for the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
 labels_truearraylike (device or host) shape = (n_samples,)
The ground truth labels (ints) of the test dataset. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
 handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 Returns
 float
The completeness of the predicted labeling given the ground truth. Score between 0.0 and 1.0. 1.0 stands for perfectly complete labeling.
cuml.metrics.cluster.mutual_info_score.
cython_mutual_info_score
(labels_true, labels_pred, handle=None) → float[source]¶Computes the Mutual Information between two clusterings.
The Mutual Information is a measure of the similarity between two labels of the same data.
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.
This metric is furthermore symmetric: switching labels_true with labels_pred will return the same score value. This can be useful to measure the agreement of two independent label assignment strategies on the same dataset when the real ground truth is not known.
The labels in labels_pred and labels_true are assumed to be drawn from a contiguous set (Ex: drawn from {2, 3, 4}, but not from {2, 4}). If your set of labels looks like {2, 4}, convert them to something like {0, 1}.
 Parameters
 handlecuml.Handle
 labels_predarraylike (device or host) shape = (n_samples,)
A clustering of the data (ints) into disjoint subsets. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
 labels_truearraylike (device or host) shape = (n_samples,)
A clustering of the data (ints) into disjoint subsets. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy
 Returns
 float
Mutual information, a non-negative value
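The symmetry and label-permutation properties can be seen in a short plain-Python sketch of the usual mutual information formula over the contingency table (natural logarithm). The helper name mutual_info is hypothetical and this is not cuML's implementation.

```python
from collections import Counter
from math import log


def mutual_info(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same samples."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # sum over non-empty cells of p(a, b) * ln( p(a, b) / (p(a) * p(b)) )
    return sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )


first = [0, 0, 1, 1]
permuted = [1, 1, 0, 0]    # same partition, different label values
unrelated = [0, 1, 0, 1]   # statistically independent of `first`

mi_same = mutual_info(first, permuted)
mi_independent = mutual_info(first, unrelated)
mi_swapped = mutual_info(unrelated, first)
```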
Benchmarking¶
 class
cuml.benchmark.algorithms.
AlgorithmPair
(cpu_class, cuml_class, shared_args, cuml_args={}, cpu_args={}, name=None, accepts_labels=True, cpu_data_prep_hook=None, cuml_data_prep_hook=None, accuracy_function=None, bench_func=<function fit>, setup_cpu_func=None, setup_cuml_func=None)[source]¶Wraps a cuML algorithm and (optionally) a CPU-based algorithm (typically scikit-learn, but it does not need to be, as long as it offers fit and predict or transform methods). Provides mechanisms to run each version with default arguments. If no CPU-based version of the algorithm is available, pass None for the cpu_class when instantiating.
 Parameters
 cpu_classclass
Class for CPU version of algorithm. Set to None if not available.
 cuml_classclass
Class for cuML algorithm
 shared_argsdict
Arguments passed to both implementations’ initializers
 cuml_argsdict
Arguments only passed to cuml’s initializer
 cpu_args dict
Arguments only passed to sklearn’s initializer
 accepts_labelsboolean
If True, the fit method expects both X and y inputs. Otherwise, it expects only an X input.
 data_prep_hookfunction (data -> data)
Optional function to run on input data before passing to fit
 accuracy_functionfunction (y_test, y_pred)
Function that returns a scalar representing accuracy
 bench_funcfunction
Custom function to perform fit/predict/transform calls.
Methods
run_cpu
(data, **override_args)Runs the CPU-based algorithm’s fit method on specified data
run_cuml
(data, **override_args)Runs the cuML-based algorithm’s fit method on specified data
setup_cpu
setup_cuml
cuml.benchmark.algorithms.
algorithm_by_name
(name)[source]¶Returns the algorithm pair with the name ‘name’ (case-insensitive)
Wrappers to run ML benchmarks
 class
cuml.benchmark.runners.
AccuracyComparisonRunner
(bench_rows, bench_dims, dataset_name='blobs', input_type='numpy', test_fraction=0.1, n_reps=1)[source]¶Wrapper to run an algorithm with multiple dataset sizes and compute accuracy and speedup of cuml relative to sklearn baseline.
 class
cuml.benchmark.runners.
BenchmarkTimer
(reps=1)[source]¶Provides a context manager that runs a code block reps times and records results to the instance variable timings. Use like:
timer = BenchmarkTimer(reps=5)
for _ in timer.benchmark_runs():
    ... do something ...
print(np.min(timer.timings))
Methods
benchmark_runs
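One way such a loop-and-record timer can be built is with a generator that times the work done between iterations. The class below, SimpleBenchmarkTimer, is a hypothetical stand-in written for illustration with the standard library only; cuML's actual BenchmarkTimer may be implemented differently.

```python
import time


class SimpleBenchmarkTimer:
    """Hypothetical stand-in showing the pattern; not the cuML class."""

    def __init__(self, reps=1):
        self.reps = reps
        self.timings = []

    def benchmark_runs(self):
        for rep in range(self.reps):
            start = time.perf_counter()
            yield rep  # the caller's loop body runs here
            # Execution resumes when the loop asks for the next item,
            # so the elapsed time covers exactly one loop body.
            self.timings.append(time.perf_counter() - start)


timer = SimpleBenchmarkTimer(reps=3)
for _ in timer.benchmark_runs():
    sum(range(10_000))  # placeholder workload

fastest = min(timer.timings)
```

Note that in this sketch a loop that exits early with break would not record the final, interrupted repetition.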
 class
cuml.benchmark.runners.
SpeedupComparisonRunner
(bench_rows, bench_dims, dataset_name='blobs', input_type='numpy', n_reps=1)[source]¶Wrapper to run an algorithm with multiple dataset sizes and compute speedup of cuml relative to sklearn baseline.
Methods
run
cuml.benchmark.runners.
run_variations
(algos, dataset_name, bench_rows, bench_dims, param_override_list=[{}], cuml_param_override_list=[{}], cpu_param_override_list=[{}], dataset_param_override_list=[{}], input_type='numpy', test_fraction=0.1, run_cpu=True, raise_on_error=False, n_reps=1)[source]¶Runs each algo in algos once per bench_rows X bench_dims X param_override_list X cuml_param_override_list combination and returns a dataframe containing timing and accuracy data.
 Parameters
 algosstr or list
Name of algorithms to run and evaluate
 dataset_namestr
Name of dataset to use
 bench_rowslist of int
Dataset row counts to test
 bench_dimslist of int
Dataset column counts to test
 param_override_listlist of dict
Dicts containing parameters to pass to __init__. Each dict specifies parameters to override in one run of the algorithm.
 cuml_param_override_listlist of dict
Dicts containing parameters to pass to __init__ of the cuml algo only.
 cpu_param_override_listlist of dict
Dicts containing parameters to pass to __init__ of the cpu algo only.
 dataset_param_override_listlist of dict
Dicts containing parameters to pass to dataset generator function
 test_fractionfloat
The fraction of data to use for testing.
 run_cpuboolean
If True, run the CPU-based algorithm for comparison
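The "once per combination" behaviour described for run_variations amounts to a Cartesian product over the list arguments. The sketch below only enumerates such combinations with itertools.product; the function enumerate_runs and the dictionary keys it returns are hypothetical and do not reflect how cuML builds or executes its runs.

```python
from itertools import product


def enumerate_runs(bench_rows, bench_dims,
                   param_override_list=({},),
                   cuml_param_override_list=({},)):
    """List one configuration per combination of the inputs (illustrative helper)."""
    runs = []
    for n_rows, n_dims, shared, cuml_only in product(
        bench_rows, bench_dims, param_override_list, cuml_param_override_list
    ):
        runs.append({
            "n_samples": n_rows,
            "n_features": n_dims,
            # Shared overrides apply first; cuML-only overrides win on conflict.
            "cuml_params": {**shared, **cuml_only},
        })
    return runs


runs = enumerate_runs(
    bench_rows=[1_000, 10_000],
    bench_dims=[16, 64],
    param_override_list=[{"n_clusters": 4}, {"n_clusters": 8}],
)
```

With two row counts, two column counts and two parameter overrides, this yields eight configurations, which is why the override lists should be kept short for large sweeps.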
Data generators for cuML benchmarks
The main entry point for consumers is gen_data, which wraps the underlying data generators.
Notes when writing new generators:
 Each generator is a function that accepts:
n_samples (set to 0 for ‘default’)
n_features (set to 0 for ‘default’)
random_state
(and optional generator-specific parameters)
The function should return a 2-tuple (X, y), where X is a Pandas dataframe and y is a Pandas series. If the generator does not produce labels, it can return (X, None)
A set of helper functions (convert_*) can convert these to alternative formats. Future revisions may support generating cudf dataframes or GPU arrays directly instead.
cuml.benchmark.datagen.
gen_data
(dataset_name, dataset_format, n_samples=0, n_features=0, random_state=42, test_fraction=0.0, **kwargs)[source]¶Returns a tuple of data from the specified generator.
 Parameters
 dataset_namestr
Dataset to use. Can be a synthetic generator (blobs or regression) or a specified dataset (higgs currently, others coming soon)
 dataset_formatstr
Type of data to return. (One of cudf, numpy, pandas, gpuarray)
 n_samplesint
Number of samples to include in training set (regardless of test split)
 test_fractionfloat
Fraction of the dataset to partition randomly into the test set. If this is 0.0, no test set will be created.
Regression and Classification¶
Linear Regression¶

class
cuml.
LinearRegression
(*, algorithm='eig', fit_intercept=True, normalize=False, handle=None, verbose=False, output_type=None)¶ LinearRegression is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.
cuML’s LinearRegression expects either a cuDF DataFrame or a NumPy matrix and provides two algorithms, SVD and Eig, to fit a linear model. SVD is more stable, but Eig (default) is much faster.
 Parameters
 algorithm‘eig’ or ‘svd’ (default = ‘eig’)
Eig uses an eigendecomposition of the covariance matrix, and is much faster. SVD is slower, but guaranteed to be stable.
 fit_interceptboolean (default = True)
If True, LinearRegression tries to correct for the global mean of y. If False, the model expects that you have centered the data.
 normalizeboolean (default = False)
This parameter is ignored when fit_intercept is set to False. If True, the predictors in X will be normalized by dividing by their L2 norm. If False, no scaling will be done.
 handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
Notes
LinearRegression suffers from multicollinearity (when columns are correlated with each other) and from variance explosions caused by outliers. Consider using Ridge Regression to address the multicollinearity problem, and consider first running DBSCAN, or a statistical analysis, to filter out possible outliers.
Applications of LinearRegression
LinearRegression is used in regression tasks where one wants to predict, say, sales or house prices. It is also used in extrapolation or time series tasks, dynamic systems modelling and many other machine learning tasks. This model should be tried first if the machine learning problem is a regression task (predicting a continuous variable).
For additional information, see scikit-learn’s OLS documentation.
For an additional example see the OLS notebook.
Examples
import numpy as np
import cudf

# Both import methods supported
from cuml import LinearRegression
from cuml.linear_model import LinearRegression

lr = LinearRegression(fit_intercept = True, normalize = False, algorithm = "eig")

X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype = np.float32)
X['col2'] = np.array([1,2,2,3], dtype = np.float32)

y = cudf.Series( np.array([6.0, 8.0, 9.0, 11.0], dtype = np.float32) )

reg = lr.fit(X,y)
print("Coefficients:")
print(reg.coef_)
print("Intercept:")
print(reg.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3,2], dtype = np.float32)
X_new['col2'] = np.array([5,5], dtype = np.float32)
preds = lr.predict(X_new)

print("Predictions:")
print(preds)
Output:
Coefficients:
0 1.0000001
1 1.9999998
Intercept:
3.0
Predictions:
0 15.999999
1 14.999999
 Attributes
 coef_array, shape (n_features)
The estimated coefficients for the linear regression model.
 intercept_array
The independent term. If
fit_intercept
is False, will be 0.
Methods
fit
(self, X, y[, convert_dtype])Fit the model with X and y.
get_param_names
(self)
fit
(self, X, y, convert_dtype=True) → u’LinearRegression’[source]¶ Fit the model with X and y.
 Parameters
 Xarraylike (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 yarraylike (device or host) shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtypebool, optional (default = True)
When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.
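The numbers in the example output can be reproduced by hand. The plain-Python sketch below fits the same four samples by centering the data (the role of fit_intercept) and solving the normal equations with Gaussian elimination. It is a CPU illustration only: it uses neither the eigendecomposition nor the SVD path, and the helper names are hypothetical.

```python
def solve_linear_system(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit_ols(X, y):
    """Ordinary least squares with an intercept, via centered normal equations."""
    n, d = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    yc = [v - y_mean for v in y]
    gram = [[sum(r[i] * r[j] for r in Xc) for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(d)]
    coef = solve_linear_system(gram, rhs)
    intercept = y_mean - sum(c * m for c, m in zip(coef, x_mean))
    return coef, intercept


X = [[1.0, 1.0], [1.0, 2.0], [2.0, 2.0], [2.0, 3.0]]
y = [6.0, 8.0, 9.0, 11.0]
coef, intercept = fit_ols(X, y)
preds = [sum(c * v for c, v in zip(coef, row)) + intercept
         for row in [[3.0, 5.0], [2.0, 5.0]]]
```

The data satisfy y = 1 * col1 + 2 * col2 + 3 exactly, so the fit recovers those values up to floating-point rounding, in line with the float32 output shown above.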
Logistic Regression¶

class
cuml.
LogisticRegression
(*, penalty='l2', tol=0.0001, C=1.0, fit_intercept=True, class_weight=None, max_iter=1000, linesearch_max_iter=50, verbose=False, l1_ratio=None, solver='qn', handle=None, output_type=None)¶ LogisticRegression is a linear model that is used to model probability of occurrence of certain events, for example probability of success or fail of an event.
cuML’s LogisticRegression can take array-like objects, either in host as NumPy arrays or in device (as Numba or __cuda_array_interface__ compliant), in addition to cuDF objects. It provides both single-class (using sigmoid loss) and multiple-class (using softmax loss) variants, depending on the input variables.
Only one solver option is currently available: Quasi-Newton (QN) algorithms. Even though it is presented as a single option, this solver resolves to two different algorithms underneath:
Orthant-Wise Limited Memory Quasi-Newton (OWL-QN) if there is l1 regularization
Limited Memory BFGS (L-BFGS) otherwise.
Note that, just like in scikit-learn, the bias will not be regularized.
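The choice between the two underlying algorithms follows the rules given for the penalty parameter. The sketch below restates those rules as a small pure-Python function; select_qn_algorithm is a hypothetical helper for illustration, and treating a missing l1_ratio as 0 under 'elasticnet' is an assumption rather than documented behaviour.

```python
def select_qn_algorithm(penalty="l2", l1_ratio=None):
    """Return which quasi-Newton variant the documented rules imply (illustrative helper)."""
    if penalty in ("none", "l2"):
        return "L-BFGS"
    if penalty == "l1":
        return "OWL-QN"
    if penalty == "elasticnet":
        # Assumption: l1_ratio=None is treated like 0.0 (no l1 component).
        return "OWL-QN" if (l1_ratio or 0.0) > 0 else "L-BFGS"
    raise ValueError(f"unsupported penalty: {penalty!r}")


default_choice = select_qn_algorithm()
l1_choice = select_qn_algorithm("l1")
mixed_choice = select_qn_algorithm("elasticnet", l1_ratio=0.3)
pure_l2_mix = select_qn_algorithm("elasticnet", l1_ratio=0.0)
```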
 Parameters
 penalty: ‘none’, ‘l1’, ‘l2’, ‘elasticnet’ (default = ‘l2’)
Used to specify the norm used in the penalization. If ‘none’ or ‘l2’ are selected, then the L-BFGS solver will be used. If ‘l1’ is selected, the OWL-QN solver will be used. If ‘elasticnet’ is selected, OWL-QN will be used if l1_ratio > 0, otherwise L-BFGS will be used.
 tol: float (default = 1e-4)
The training process will stop if current_loss > previous_loss - tol
 C: float (default = 1.0)
Inverse of regularization strength; must be a positive float.
 fit_intercept: boolean (default = True)
If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.
 class_weight: None
Custom class weights are currently not supported.
 class_weight: dict or ‘balanced’, default=None
By default all classes have a weight of one. However, a dictionary can be provided with weights associated with classes in the form {class_label: weight}. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
 max_iter: int (default = 1000)
Maximum number of iterations taken for the solvers to converge.
 linesearch_max_iter: int (default = 50)
Max number of linesearch iterations per outer iteration used in the lbfgs and owl QN solvers.
 verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 l1_ratio: float or None, optional (default=None)
The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1
 solver: ‘qn’, ‘lbfgs’, ‘owl’ (default=’qn’).
Algorithm to use in the optimization problem. Currently only qn is supported, which automatically selects either L-BFGS or OWL-QN depending on the conditions of the l1 regularization described above. Options ‘lbfgs’ and ‘owl’ are just convenience values that end up using the same solver following the same rules.
 handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
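The “balanced” formula for class_weight, n_samples / (n_classes * np.bincount(y)), can be worked through on a small label vector. The sketch below uses collections.Counter instead of NumPy so that it runs with the standard library alone; balanced_class_weights is a hypothetical helper, not part of cuML.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weights inversely proportional to class frequency (illustrative helper)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


labels = [0, 0, 0, 1]  # class 0 is three times as frequent as class 1
weights = balanced_class_weights(labels)
```

The rare class receives the larger weight (2.0 versus about 0.67 here), so both classes contribute equally to the total weighted loss.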
Notes
cuML’s LogisticRegression uses a different solver than the equivalent scikit-learn estimator, except when there is no penalty and solver=lbfgs is used in scikit-learn. This can cause (small) differences in the coefficients and predictions of the model, similar to using different solvers in scikit-learn.
For additional information, see scikit-learn’s LogisticRegression documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Examples
import cudf
import numpy as np

# Both import methods supported
# from cuml import LogisticRegression
from cuml.linear_model import LogisticRegression

X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype = np.float32)
X['col2'] = np.array([1,2,2,3], dtype = np.float32)
y = cudf.Series( np.array([0.0, 0.0, 1.0, 1.0], dtype = np.float32) )

reg = LogisticRegression()
reg.fit(X,y)

print("Coefficients:")
print(reg.coef_)
print("Intercept:")
print(reg.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([1,5], dtype = np.float32)
X_new['col2'] = np.array([2,5], dtype = np.float32)

preds = reg.predict(X_new)

print("Predictions:")
print(preds)
Output:
Coefficients:
0.22309814
0.21012752
Intercept:
-0.7548761
Predictions:
0 0.0
1 1.0
 Attributes
 coef_: dev array, dim (n_classes, n_features) or (n_classes, n_features+1)
The estimated coefficients for the logistic regression model.
 intercept_: device array (n_classes, 1)
The independent term. If
fit_intercept
is False, will be 0.
Methods
decision_function
(self, X[, convert_dtype])Gives confidence score for X
fit
(self, X, y[, sample_weight, convert_dtype])Fit the model with X and y.
get_param_names
(self)
predict
(self, X[, convert_dtype])Predicts the y for X.
predict_log_proba
(self, X[, convert_dtype])Predicts the log class probabilities for each class in X
predict_proba
(self, X[, convert_dtype])Predicts the class probabilities for each class in X

decision_function
(self, X, convert_dtype=False) → CumlArray[source]¶ Gives confidence score for X
 Parameters
 Xarraylike (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtypebool, optional (default = False)
When set to True, the decision_function method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 Returns
 scorecuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)
Confidence score
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

fit
(self, X, y, sample_weight=None, convert_dtype=True) → u’LogisticRegression’[source]¶ Fit the model with X and y.
 Parameters
 Xarraylike (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 yarraylike (device or host) shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 sample_weightarraylike (device or host) shape = (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtypebool, optional (default = True)
When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

predict
(self, X, convert_dtype=True) → CumlArray[source]¶ Predicts the y for X.
 Parameters
 Xarraylike (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtypebool, optional (default = True)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 Returns
 predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Predicted values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

predict_log_proba
(self, X, convert_dtype=True) → CumlArray[source]¶ Predicts the log class probabilities for each class in X
 Parameters
 Xarraylike (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtypebool, optional (default = True)
When set to True, the predict_log_proba method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 Returns
 predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)
Logarithm of predicted class probabilities
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

predict_proba
(self, X, convert_dtype=True) → CumlArray[source]¶ Predicts the class probabilities for each class in X
 Parameters
 Xarraylike (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtypebool, optional (default = True)
When set to True, the predict_proba method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 Returns
 predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)
Predicted class probabilities
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
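For the binary case, the relationship between decision_function, predict_proba, predict_log_proba and predict can be sketched with the sigmoid function. The example below plugs in the coefficients printed in the class example and assumes a negative intercept (-0.7548761), which is the sign consistent with the printed predictions. It is plain Python for illustration, not cuML's code path, and the multi-class softmax case is not covered.

```python
from math import exp, log


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


coef = [0.22309814, 0.21012752]
intercept = -0.7548761  # assumed sign, see the note above
X_new = [[1.0, 2.0], [5.0, 5.0]]

# decision_function: signed distance-like score, w . x + b
scores = [sum(c * v for c, v in zip(coef, row)) + intercept for row in X_new]
# predict_proba: [P(class 0), P(class 1)] for each sample
proba = [[1.0 - sigmoid(s), sigmoid(s)] for s in scores]
# predict_log_proba: element-wise logarithm of the probabilities
log_proba = [[log(p) for p in row] for row in proba]
# predict: class 1 when the score is positive, i.e. P(class 1) > 0.5
preds = [1 if s > 0 else 0 for s in scores]
```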
Ridge Regression¶

class
cuml.
Ridge
(*, alpha=1.0, solver='eig', fit_intercept=True, normalize=False, handle=None, output_type=None, verbose=False)¶ Ridge extends LinearRegression by providing L2 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors, and improves the conditioning of the problem.
cuML’s Ridge can take array-like objects, either in host as NumPy arrays or in device (as Numba or __cuda_array_interface__ compliant), in addition to cuDF objects. It provides 3 algorithms: SVD, Eig and CD to fit a linear model. In general SVD uses significantly more memory and is slower than Eig. If using CUDA 10.1, the memory difference is even bigger than in the other supported CUDA versions. However, SVD is more stable than Eig (default). CD uses Coordinate Descent and can be faster when data is large.
 Parameters
 alphafloat (default = 1.0)
Regularization strength; must be a positive float. Larger values specify stronger regularization. Array input will be supported later.
 solver{‘eig’, ‘svd’, ‘cd’} (default = ‘eig’)
Eig uses an eigendecomposition of the covariance matrix, and is much faster. SVD is slower, but guaranteed to be stable. CD or Coordinate Descent is very fast and is suitable for large problems.
 fit_interceptboolean (default = True)
If True, Ridge tries to correct for the global mean of y. If False, the model expects that you have centered the data.
 normalizeboolean (default = False)
If True, the predictors in X will be normalized by dividing by their L2 norm. If False, no scaling will be done.
 handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
 verboseint or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
Notes
Ridge provides L2 regularization. This means that the coefficients can shrink to become very small, but not zero. This can cause issues of interpretability on the coefficients. Consider using Lasso, or thresholding small coefficients to zero.
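The shrinking behaviour is easiest to see with a single feature, where the ridge coefficient has a closed form: the centered cross-product divided by the centered sum of squares plus alpha. The sketch below assumes the common convention of penalizing the coefficient but not the intercept; ridge_1d is a hypothetical helper and not one of cuML's solvers.

```python
def ridge_1d(x, y, alpha):
    """Single-feature ridge fit with an unpenalized intercept (illustrative helper)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    coef = sxy / (sxx + alpha)  # alpha inflates the denominator and shrinks coef
    intercept = y_mean - coef * x_mean
    return coef, intercept


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]  # exact slope of 2 without regularization
alphas = (0.0, 1.0, 10.0, 100.0)
coefs = [ridge_1d(x, y, alpha)[0] for alpha in alphas]
```

The coefficient falls from 2.0 towards zero as alpha grows, yet it never becomes exactly zero for any finite alpha.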
Applications of Ridge
Ridge Regression is used in the same way as LinearRegression, but does not suffer from multicollinearity issues. Ridge is used in insurance premium prediction, stock market analysis and much more.
For additional docs, see scikit-learn’s Ridge Regression.
Examples
import numpy as np
import cudf

# Both import methods are supported
from cuml import Ridge
from cuml.linear_model import Ridge

# Very weak regularization, so the fit is close to ordinary least squares
alpha = 1e-5
ridge = Ridge(alpha=alpha, fit_intercept=True, normalize=False, solver="eig")

X = cudf.DataFrame()
X['col1'] = np.array([1, 1, 2, 2], dtype=np.float32)
X['col2'] = np.array([1, 2, 2, 3], dtype=np.float32)

y = cudf.Series(np.array([6.0, 8.0, 9.0, 11.0], dtype=np.float32))

result_ridge = ridge.fit(X, y)
print("Coefficients:")
print(result_ridge.coef_)
print("Intercept:")
print(result_ridge.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3, 2], dtype=np.float32)
X_new['col2'] = np.array([5, 5], dtype=np.float32)
preds = result_ridge.predict(X_new)
print("Predictions:")
print(preds)
Output:
Coefficients:
0    1.0000001
1    1.9999998
Intercept:
3.0
Predictions:
0    15.999999
1    14.999999
 Attributes
 coef_: array, shape (n_features)
The estimated coefficients for the linear regression model.
 intercept_: array
The independent term. If fit_intercept is False, will be 0.
Methods
fit(self, X, y[, convert_dtype]): Fit the model with X and y.
get_param_names(self)

fit(self, X, y, convert_dtype=True) → ‘Ridge’
Fit the model with X and y.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y: array-like (device or host), shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = True)
When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase the memory used by the method.
Lasso Regression

class cuml.Lasso(*, alpha=1.0, fit_intercept=True, normalize=False, max_iter=1000, tol=0.001, selection='cyclic', handle=None, output_type=None, verbose=False)
Lasso extends LinearRegression by providing L1 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can zero some of the coefficients for feature selection and improves the conditioning of the problem.
cuML’s Lasso can take array-like objects, either in host as NumPy arrays or in device (as Numba or __cuda_array_interface__ compliant), in addition to cuDF objects. It uses coordinate descent to fit a linear model.
 Parameters
 alpha: float (default = 1.0)
Constant that multiplies the L1 term. alpha = 0 is equivalent to ordinary least squares, solved by the LinearRegression class. For numerical reasons, using alpha = 0 with the Lasso class is not advised; use the LinearRegression class instead.
 fit_intercept: boolean (default = True)
If True, Lasso tries to correct for the global mean of y. If False, the model expects that you have centered the data.
 normalize: boolean (default = False)
If True, the predictors in X will be normalized by dividing by their L2 norm. If False, no scaling will be done.
 max_iter: int (default = 1000)
The maximum number of iterations.
 tol: float (default = 1e-3)
The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
 selection: {‘cyclic’, ‘random’} (default = ‘cyclic’)
If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially (the default). Setting this to ‘random’ often leads to significantly faster convergence, especially when tol is higher than 1e-4.
 handle: cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 output_type: {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control the output type of the results and attributes of the estimator. If None, it inherits the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
 verbose: int or boolean, default=False
Sets the logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
Notes
For additional docs, see scikit-learn’s Lasso.
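To make the coordinate descent idea concrete, here is a minimal pure-Python sketch. It is not cuML's GPU implementation. It assumes the scikit-learn style objective (1 / (2 * n_samples)) * ||y - Xw - b||^2 + alpha * ||w||_1, uses cyclic selection, and stops on the size of the largest coefficient update instead of checking the dual gap. The names `soft_threshold` and `lasso_cd` are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_cd(X, y, alpha, max_iter=1000, tol=1e-3):
    """Cyclic coordinate descent for Lasso on centered data (illustrative only)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    w = [0.0] * p
    for _ in range(max_iter):
        max_update = 0.0
        for j in range(p):
            rho, norm = 0.0, 0.0
            for i in range(n):
                # Prediction from every feature except j (partial residual).
                others = sum(Xc[i][k] * w[k] for k in range(p) if k != j)
                rho += Xc[i][j] * (yc[i] - others)
                norm += Xc[i][j] ** 2
            rho /= n
            norm /= n
            new_wj = soft_threshold(rho, alpha) / norm if norm > 0 else 0.0
            max_update = max(max_update, abs(new_wj - w[j]))
            w[j] = new_wj
        if max_update < tol:
            break
    intercept = y_mean - sum(w[j] * x_mean[j] for j in range(p))
    return w, intercept


# Two identical columns: the L1 penalty keeps one and zeroes the other.
X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
y = [0.0, 1.0, 2.0]
coef, intercept = lasso_cd(X, y, alpha=0.1)
print(coef, intercept)
```

With two duplicated columns the sketch assigns the whole weight to the first feature and sets the second to zero, which is the feature-selection effect of the L1 penalty.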
Examples
import numpy as np
import cudf
from cuml.linear_model import Lasso

ls = Lasso(alpha=0.1)

X = cudf.DataFrame()
X['col1'] = np.array([0, 1, 2], dtype=np.float32)
X['col2'] = np.array([0, 1, 2], dtype=np.float32)

y = cudf.Series(np.array([0.0, 1.0, 2.0], dtype=np.float32))

result_lasso = ls.fit(X, y)
print("Coefficients:")
print(result_lasso.coef_)
print("Intercept:")
print(result_lasso.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3, 2], dtype=np.float32)
X_new['col2'] = np.array([5, 5], dtype=np.float32)
preds = result_lasso.predict(X_new)
print("Preds:")
print(preds)
Output:
Coefficients:
0    0.85
1    0.0
Intercept:
0.149999
Preds:
0    2.7
1    1.85
 Attributes
 coef_: array, shape (n_features)
The estimated coefficients for the linear regression model.
 intercept_: array
The independent term. If fit_intercept is False, will be 0.
Methods
fit(self, X, y[, convert_dtype]): Fit the model with X and y.
get_param_names(self)

fit(self, X, y, convert_dtype=True) → ‘Lasso’
Fit the model with X and y.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y: array-like (device or host), shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = True)
When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase the memory used by the method.
ElasticNet Regression

class cuml.ElasticNet(*, alpha=1.0, l1_ratio=0.5, fit_intercept=True, normalize=False, max_iter=1000, tol=0.001, selection='cyclic', handle=None, output_type=None, verbose=False)
ElasticNet extends LinearRegression with combined L1 and L2 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors, force some coefficients to be small, and improve the conditioning of the problem.
cuML’s ElasticNet expects an array-like object or cuDF DataFrame and uses coordinate descent to fit a linear model.
 Parameters
 alpha: float (default = 1.0)
Constant that multiplies the penalty terms. alpha = 0 is equivalent to ordinary least squares, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the ElasticNet object is not advised; use the LinearRegression object instead.
 l1_ratio: float (default = 0.5)
The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
 fit_intercept: boolean (default = True)
If True, ElasticNet tries to correct for the global mean of y. If False, the model expects that you have centered the data.
 normalize: boolean (default = False)
If True, the predictors in X will be normalized by dividing by their L2 norm. If False, no scaling will be done.
 max_iter: int (default = 1000)
The maximum number of iterations.
 tol: float (default = 1e-3)
The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
 selection: {‘cyclic’, ‘random’} (default = ‘cyclic’)
If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially (the default). Setting this to ‘random’ often leads to significantly faster convergence, especially when tol is higher than 1e-4.
 handle: cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 output_type: {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control the output type of the results and attributes of the estimator. If None, it inherits the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
 verbose: int or boolean, default=False
Sets the logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
Notes
For additional docs, see scikit-learn’s ElasticNet.
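How alpha and l1_ratio combine can be shown with a small pure-Python sketch. It assumes the scikit-learn convention for the penalty, alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2. Whether cuML scales the terms identically is an assumption here, and `elastic_net_penalty` is a hypothetical helper.

```python
def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Elastic net penalty under the scikit-learn convention (illustrative only)."""
    l1 = sum(abs(w) for w in coefs)
    l2_squared = sum(w * w for w in coefs)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared


coefs = [0.5, -2.0, 0.0]
pure_l1 = elastic_net_penalty(coefs, alpha=0.1, l1_ratio=1.0)  # L1 term only
pure_l2 = elastic_net_penalty(coefs, alpha=0.1, l1_ratio=0.0)  # L2 term only
mixed = elastic_net_penalty(coefs, alpha=0.1, l1_ratio=0.5)    # blend of both
print(pure_l1, pure_l2, mixed)
```

The two extreme values of l1_ratio recover the Lasso and Ridge style penalties, and any value in between is a weighted blend of the two.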
Examples
import numpy as np
import cudf
from cuml.linear_model import ElasticNet

enet = ElasticNet(alpha=0.1, l1_ratio=0.5)

X = cudf.DataFrame()
X['col1'] = np.array([0, 1, 2], dtype=np.float32)
X['col2'] = np.array([0, 1, 2], dtype=np.float32)

y = cudf.Series(np.array([0.0, 1.0, 2.0], dtype=np.float32))

result_enet = enet.fit(X, y)
print("Coefficients:")
print(result_enet.coef_)
print("Intercept:")
print(result_enet.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3, 2], dtype=np.float32)
X_new['col2'] = np.array([5, 5], dtype=np.float32)
preds = result_enet.predict(X_new)
print("Preds:")
print(preds)
Output:
Coefficients:
0    0.448408
1    0.443341
Intercept:
0.1082506
Preds:
0    3.67018
1    3.22177
 Attributes
 coef_: array, shape (n_features)
The estimated coefficients for the linear regression model.
 intercept_: array
The independent term. If fit_intercept is False, will be 0.
Methods
fit(self, X, y[, convert_dtype]): Fit the model with X and y.
get_param_names(self)

fit(self, X, y, convert_dtype=True) → ‘ElasticNet’
Fit the model with X and y.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y: array-like (device or host), shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = True)
When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase the memory used by the method.
Mini Batch SGD Classifier

class cuml.MBSGDClassifier(*, loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, epochs=1000, tol=0.001, shuffle=True, learning_rate='constant', eta0=0.001, power_t=0.5, batch_size=32, n_iter_no_change=5, handle=None, verbose=False, output_type=None)
Linear models (linear SVM, logistic regression, or linear regression) fitted by minimizing a regularized empirical loss with mini-batch SGD. The MBSGDClassifier implementation is experimental and uses a different algorithm than sklearn’s SGDClassifier. To improve the results obtained from cuML’s MBSGDClassifier:
* Reduce the batch size
* Increase eta0
* Increase the number of iterations
Since cuML analyzes the data in batches, a small eta0 might not let the model learn as much as scikit-learn does. Furthermore, decreasing the batch size might increase the time required to fit the model.
 Parameters
 loss: {‘hinge’, ‘log’, ‘squared_loss’} (default = ‘hinge’)
‘hinge’ uses linear SVM
‘log’ uses logistic regression
‘squared_loss’ uses linear regression
 penalty: {‘none’, ‘l1’, ‘l2’, ‘elasticnet’} (default = ‘l2’)
‘none’ does not perform any regularization
‘l1’ performs L1 norm (Lasso) regularization, which penalizes the sum of the absolute values of the coefficients
‘l2’ performs L2 norm (Ridge) regularization, which penalizes the sum of the squares of the coefficients
‘elasticnet’ performs Elastic Net regularization, which is a weighted average of the L1 and L2 norms
 alpha: float (default = 0.0001)
The constant value which decides the degree of regularization
 l1_ratio: float (default = 0.15)
The l1_ratio is used only when penalty = 'elasticnet'. The value of l1_ratio should satisfy 0 <= l1_ratio <= 1. When l1_ratio = 0 the penalty is equivalent to 'l2', and when l1_ratio = 1 it is equivalent to 'l1'.
 batch_size: int (default = 32)
It sets the number of samples that will be included in each batch.
 fit_intercept: boolean (default = True)
If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.
 epochs: int (default = 1000)
The number of times the model should iterate through the entire dataset during training
 tol: float (default = 1e-3)
The training process will stop if current_loss > previous_loss - tol
 shuffle: boolean (default = True)
If True, shuffles the training data after each epoch. If False, does not shuffle the training data after each epoch.
 eta0: float (default = 0.001)
Initial learning rate
 power_t: float (default = 0.5)
The exponent used for calculating the invscaling learning rate
 learning_rate: {‘optimal’, ‘constant’, ‘invscaling’, ‘adaptive’} (default = ‘constant’)
‘optimal’ will be supported in a future version.
‘constant’ keeps the learning rate constant.
‘adaptive’ changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs. The old learning rate is generally divided by 5.
 n_iter_no_change: int (default = 5)
The number of epochs to train without any improvement in the model
 handle: cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 verbose: int or boolean, default=False
Sets the logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type: {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control the output type of the results and attributes of the estimator. If None, it inherits the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
Notes
For additional docs, see scikit-learn’s SGDClassifier.
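The learning_rate options can be sketched in plain Python. This is an illustration, not cuML's code. The invscaling formula eta0 / step ** power_t follows the scikit-learn convention and is an assumption for cuML. The adaptive rule follows the description above (divide by 5 after n_iter_no_change epochs without improvement). The names `constant_rate`, `invscaling_rate` and `AdaptiveRate` are hypothetical.

```python
def constant_rate(eta0, step):
    """'constant': the rate never changes."""
    return eta0


def invscaling_rate(eta0, power_t, step):
    """'invscaling' (assumed scikit-learn form), with step counted from 1."""
    return eta0 / (step ** power_t)


class AdaptiveRate:
    """'adaptive': divide the rate by 5 after repeated epochs without improvement."""

    def __init__(self, eta0, tol=1e-3, n_iter_no_change=5):
        self.eta = eta0
        self.tol = tol
        self.n_iter_no_change = n_iter_no_change
        self.best_loss = float("inf")
        self.stalled_epochs = 0

    def update(self, epoch_loss):
        # An epoch counts as stalled when the loss did not drop by at least tol.
        if epoch_loss > self.best_loss - self.tol:
            self.stalled_epochs += 1
        else:
            self.stalled_epochs = 0
        self.best_loss = min(self.best_loss, epoch_loss)
        if self.stalled_epochs >= self.n_iter_no_change:
            self.eta /= 5.0
            self.stalled_epochs = 0
        return self.eta


schedule = AdaptiveRate(eta0=0.05, n_iter_no_change=2)
rates = [schedule.update(loss) for loss in (1.0, 0.5, 0.5, 0.5, 0.5)]
print(rates)
print(invscaling_rate(0.1, 0.5, 4))
```

In this run the loss stops improving after the second epoch, so the rate is cut once two stalled epochs have accumulated.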
Examples
import numpy as np
import cudf
from cuml.linear_model import MBSGDClassifier as cumlMBSGDClassifier

X = cudf.DataFrame()
X['col1'] = np.array([1, 1, 2, 2], dtype=np.float32)
X['col2'] = np.array([1, 2, 2, 3], dtype=np.float32)
y = cudf.Series(np.array([1, 1, 2, 2], dtype=np.float32))

pred_data = cudf.DataFrame()
pred_data['col1'] = np.asarray([3, 2], dtype=np.float32)
pred_data['col2'] = np.asarray([5, 5], dtype=np.float32)

cu_mbsgd_classifier = cumlMBSGDClassifier(learning_rate='constant',
                                          eta0=0.05, epochs=2000,
                                          fit_intercept=True,
                                          batch_size=1, tol=0.0,
                                          penalty='l2',
                                          loss='squared_loss',
                                          alpha=0.5)
cu_mbsgd_classifier.fit(X, y)
cu_pred = cu_mbsgd_classifier.predict(pred_data).to_array()
print(" cuML intercept : ", cu_mbsgd_classifier.intercept_)
print(" cuML coef : ", cu_mbsgd_classifier.coef_)
print("cuML predictions : ", cu_pred)
Output:
cuML intercept :  0.7150013446807861
cuML coef :  0    0.27320495
1    0.1875956
dtype: float32
cuML predictions :  [1. 1.]
Methods
fit(self, X, y[, convert_dtype]): Fit the model with X and y.
get_param_names(self)
predict(self, X[, convert_dtype]): Predicts the y for X.

fit(self, X, y, convert_dtype=True) → ‘MBSGDClassifier’
Fit the model with X and y.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y: array-like (device or host), shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = True)
When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase the memory used by the method.

predict(self, X, convert_dtype=False) → CumlArray
Predicts the y for X.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = False)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase the memory used by the method.
 Returns
 preds: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Predicted values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
Mini Batch SGD Regressor

class cuml.MBSGDRegressor(*, loss='squared_loss', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, epochs=1000, tol=0.001, shuffle=True, learning_rate='constant', eta0=0.001, power_t=0.5, batch_size=32, n_iter_no_change=5, handle=None, verbose=False, output_type=None)
Linear regression model fitted by minimizing a regularized empirical loss with mini-batch SGD. The MBSGDRegressor implementation is experimental and uses a different algorithm than sklearn’s SGDRegressor. To improve the results obtained from cuML’s MBSGDRegressor:
* Reduce the batch size
* Increase eta0
* Increase the number of iterations
Since cuML analyzes the data in batches, a small eta0 might not let the model learn as much as scikit-learn does. Furthermore, decreasing the batch size might increase the time required to fit the model.
 Parameters
 loss: ‘squared_loss’ (default = ‘squared_loss’)
‘squared_loss’ uses linear regression
 penalty: {‘none’, ‘l1’, ‘l2’, ‘elasticnet’} (default = ‘l2’)
‘none’ does not perform any regularization
‘l1’ performs L1 norm (Lasso) regularization, which penalizes the sum of the absolute values of the coefficients
‘l2’ performs L2 norm (Ridge) regularization, which penalizes the sum of the squares of the coefficients
‘elasticnet’ performs Elastic Net regularization, which is a weighted average of the L1 and L2 norms
 alpha: float (default = 0.0001)
The constant value which decides the degree of regularization
 fit_intercept: boolean (default = True)
If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.
 l1_ratio: float (default = 0.15)
The l1_ratio is used only when penalty = 'elasticnet'. The value of l1_ratio should satisfy 0 <= l1_ratio <= 1. When l1_ratio = 0 the penalty is equivalent to 'l2', and when l1_ratio = 1 it is equivalent to 'l1'.
 batch_size: int (default = 32)
It sets the number of samples that will be included in each batch.
 epochs: int (default = 1000)
The number of times the model should iterate through the entire dataset during training
 tol: float (default = 1e-3)
The training process will stop if current_loss > previous_loss - tol
 shuffle: boolean (default = True)
If True, shuffles the training data after each epoch. If False, does not shuffle the training data after each epoch.
 eta0: float (default = 0.001)
Initial learning rate
 power_t: float (default = 0.5)
The exponent used for calculating the invscaling learning rate
 learning_rate: {‘optimal’, ‘constant’, ‘invscaling’, ‘adaptive’} (default = ‘constant’)
‘optimal’ will be supported in a future version.
‘constant’ keeps the learning rate constant.
‘adaptive’ changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs. The old learning rate is generally divided by 5.
 n_iter_no_change: int (default = 5)
The number of epochs to train without any improvement in the model
 handle: cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 verbose: int or boolean, default=False
Sets the logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type: {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control the output type of the results and attributes of the estimator. If None, it inherits the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
Notes
For additional docs, see scikit-learn’s SGDRegressor.
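The interplay of batch_size, eta0, epochs and shuffle can be seen in a small pure-Python sketch of mini-batch SGD for squared loss with an optional L2 penalty. It is not cuML's algorithm: it uses a constant learning rate, the loss 0.5 * mean(error^2) per batch, and omits the tol and n_iter_no_change early stopping. The name `minibatch_sgd` and its defaults are hypothetical.

```python
import random


def minibatch_sgd(X, y, eta0=0.05, alpha=0.0, batch_size=2, epochs=2000, seed=0):
    """Mini-batch SGD for linear regression (illustrative only)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    coef, intercept = [0.0] * p, 0.0
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)  # plays the role of shuffle=True
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            grad_w, grad_b = [0.0] * p, 0.0
            for i in batch:
                err = sum(c * v for c, v in zip(coef, X[i])) + intercept - y[i]
                for j in range(p):
                    grad_w[j] += err * X[i][j]
                grad_b += err
            m = len(batch)
            for j in range(p):
                # Mean squared-loss gradient plus the L2 penalty gradient alpha * w_j.
                coef[j] -= eta0 * (grad_w[j] / m + alpha * coef[j])
            intercept -= eta0 * grad_b / m
    return coef, intercept


X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]  # y = 2 * x + 1, noise free
coef, intercept = minibatch_sgd(X, y)
print(coef, intercept)
```

On this noise-free data the sketch recovers a slope close to 2 and an intercept close to 1. A smaller eta0 or fewer epochs leaves it further from that solution, which matches the tuning advice given for this estimator.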
Examples
import numpy as np
import cudf
from cuml.linear_model import MBSGDRegressor as cumlMBSGDRegressor

X = cudf.DataFrame()
X['col1'] = np.array([1, 1, 2, 2], dtype=np.float32)
X['col2'] = np.array([1, 2, 2, 3], dtype=np.float32)
y = cudf.Series(np.array([1, 1, 2, 2], dtype=np.float32))

pred_data = cudf.DataFrame()
pred_data['col1'] = np.asarray([3, 2], dtype=np.float32)
pred_data['col2'] = np.asarray([5, 5], dtype=np.float32)

cu_mbsgd_regressor = cumlMBSGDRegressor(learning_rate='constant',
                                        eta0=0.05, epochs=2000,
                                        fit_intercept=True,
                                        batch_size=1, tol=0.0,
                                        penalty='l2',
                                        loss='squared_loss',
                                        alpha=0.5)
cu_mbsgd_regressor.fit(X, y)
cu_pred = cu_mbsgd_regressor.predict(pred_data).to_array()
print(" cuML intercept : ", cu_mbsgd_regressor.intercept_)
print(" cuML coef : ", cu_mbsgd_regressor.coef_)
print("cuML predictions : ", cu_pred)
Output:
cuML intercept :  0.7150013446807861
cuML coef :  0    0.27320495
1    0.1875956
dtype: float32
cuML predictions :  [2.4725943 2.1993892]
Methods
fit(self, X, y[, convert_dtype]): Fit the model with X and y.
get_param_names(self)
predict(self, X[, convert_dtype]): Predicts the y for X.

fit(self, X, y, convert_dtype=True) → ‘MBSGDRegressor’
Fit the model with X and y.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y: array-like (device or host), shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = True)
When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase the memory used by the method.

predict(self, X, convert_dtype=False) → CumlArray
Predicts the y for X.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = False)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase the memory used by the method.
 Returns
 preds: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Predicted values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
Multiclass Classification

class cuml.multiclass.MulticlassClassifier(estimator, *, handle=None, verbose=False, output_type=None, strategy='ovr')
Wrapper around scikit-learn multiclass classifiers that allows choosing between different multiclass strategies.
The input can be any kind of cuML compatible array, and the output type follows cuML’s output type configuration rules.
Before passing the data to scikit-learn, it is converted to a host (NumPy) array. Under the hood the data is partitioned for binary classification, and it is transformed back to the device by the cuML estimator. These copies back and forth between the device and the host have some overhead. For more details see issue https://github.com/rapidsai/cuml/issues/2876.
 Parameters
 estimator: cuML estimator
 handle: cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 verbose: int or boolean, default=False
Sets the logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type: {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control the output type of the results and attributes of the estimator. If None, it inherits the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
 strategy: string {‘ovr’, ‘ovo’}, default=’ovr’
Multiclass classification strategy: ‘ovr’ (one vs. rest) or ‘ovo’ (one vs. one)
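The difference between the two strategies is mostly a matter of how the multiclass problem is partitioned into binary problems. The pure-Python sketch below only enumerates those binary problems; it does not train anything and is not how cuML or scikit-learn implement the wrappers. The helper `binary_problems` is hypothetical.

```python
from itertools import combinations


def binary_problems(classes, strategy="ovr"):
    """List the binary sub-problems implied by a multiclass strategy."""
    classes = sorted(set(classes))
    if strategy == "ovr":
        # One classifier per class: that class against all the others.
        return [(c, "rest") for c in classes]
    if strategy == "ovo":
        # One classifier per unordered pair of classes.
        return list(combinations(classes, 2))
    raise ValueError(f"unknown strategy: {strategy!r}")


labels = [0, 1, 2, 3]
ovr = binary_problems(labels, "ovr")
ovo = binary_problems(labels, "ovo")
print(len(ovr), ovr)
print(len(ovo), ovo)
```

One vs. rest needs n_classes binary classifiers, while one vs. one needs n_classes * (n_classes - 1) / 2, each trained on the samples of just two classes.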
Examples
>>> from cuml.linear_model import LogisticRegression
>>> from cuml.multiclass import MulticlassClassifier
>>> from cuml.datasets.classification import make_classification
>>>
>>> X, y = make_classification(n_samples=10, n_features=6, n_informative=4,
...                            n_classes=3, random_state=137)
>>>
>>> cls = MulticlassClassifier(LogisticRegression(), strategy='ovo')
>>> cls.fit(X,y)
>>> cls.predict(X)
array([1, 1, 1, 0, 0, 2, 2, 2, 0, 1])
 Attributes
 classes_: float, shape (n_classes_)
Array of class labels.
 n_classes_: int
Number of classes.
Methods
decision_function(X): Calculate the decision function.
fit(X, y): Fit a multiclass classifier.
get_param_names(self): Returns a list of hyperparameter names owned by this class.
predict(X): Predict using the multiclass classifier.

decision_function(X) → cuml.common.array.CumlArray
Calculate the decision function.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 Returns
 results: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Decision function values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

fit(X, y) → cuml.multiclass.multiclass.MulticlassClassifier
Fit a multiclass classifier.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y: array-like (device or host), shape = (n_samples, 1)
Dense matrix of any dtype. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

get_param_names(self)
Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends the extra set of parameters that it in turn owns. This simplifies the implementation of the get_params and set_params methods.

predict(X) → cuml.common.array.CumlArray
Predict using the multiclass classifier.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 Returns
 preds: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Predicted values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

class cuml.multiclass.OneVsOneClassifier(estimator, *args, handle=None, verbose=False, output_type=None)
Wrapper around scikit-learn’s class with the same name. The input can be any kind of cuML compatible array, and the output type follows cuML’s output type configuration rules.
Before passing the data to scikit-learn, it is converted to a host (NumPy) array. Under the hood the data is partitioned for binary classification, and it is transformed back to the device by the cuML estimator. These copies back and forth between the device and the host have some overhead. For more details see issue https://github.com/rapidsai/cuml/issues/2876.
For documentation see scikit-learn’s OneVsOneClassifier.
 Parameters
 estimator: cuML estimator
 handle: cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 verbose: int or boolean, default=False
Sets the logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type: {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control the output type of the results and attributes of the estimator. If None, it inherits the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
Examples
>>> from cuml.linear_model import LogisticRegression
>>> from cuml.multiclass import OneVsOneClassifier
>>> from cuml.datasets.classification import make_classification
>>>
>>> X, y = make_classification(n_samples=10, n_features=6, n_informative=4,
...                            n_classes=3, random_state=137)
>>>
>>> cls = OneVsOneClassifier(LogisticRegression())
>>> cls.fit(X,y)
>>> cls.predict(X)
array([1, 1, 1, 0, 0, 2, 2, 2, 0, 1])
Methods
get_param_names(self): Returns a list of hyperparameter names owned by this class.

class cuml.multiclass.OneVsRestClassifier(estimator, *args, handle=None, verbose=False, output_type=None)
Wrapper around scikit-learn’s class with the same name. The input can be any kind of cuML compatible array, and the output type follows cuML’s output type configuration rules.
Before passing the data to scikit-learn, it is converted to a host (NumPy) array. Under the hood the data is partitioned for binary classification, and it is transformed back to the device by the cuML estimator. These copies back and forth between the device and the host have some overhead. For more details see issue https://github.com/rapidsai/cuml/issues/2876.
For documentation see scikit-learn’s OneVsRestClassifier.
 Parameters
 estimator: cuML estimator
 handle: cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 verbose: int or boolean, default=False
Sets the logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type: {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control the output type of the results and attributes of the estimator. If None, it inherits the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
Examples
>>> from cuml.linear_model import LogisticRegression
>>> from cuml.multiclass import OneVsRestClassifier
>>> from cuml.datasets.classification import make_classification
>>>
>>> X, y = make_classification(n_samples=10, n_features=6, n_informative=4,
...                            n_classes=3, random_state=137)
>>>
>>> cls = OneVsRestClassifier(LogisticRegression())
>>> cls.fit(X,y)
>>> cls.predict(X)
array([1, 1, 1, 0, 1, 2, 2, 2, 0, 1])
Methods
get_param_names
(self)Returns a list of hyperparameter names owned by this class.
Multinomial Naive Bayes¶

class
cuml.
MultinomialNB
(*, alpha=1.0, fit_prior=True, class_prior=None, output_type=None, handle=None, verbose=False)[source]¶ Naive Bayes classifier for multinomial models
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification).
The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
 Parameters
 alphafloat
Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
 fit_priorboolean
Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
 class_priorarray-like, size (n_classes)
Prior probabilities of the classes. If specified, the priors are not adjusted according to the data.
 output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level,
cuml.global_settings.output_type
. See Output Data Type Configuration for more info.
 handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 verboseint or boolean, default=False
Sets logging level. It must be one of
cuml.common.logger.level_*
. See Verbosity Levels for more info.
Notes
cuML currently provides only the multinomial version; the other variants are planned to be included soon. Refer to the corresponding GitHub issue for updates.
Examples
Load the 20 newsgroups dataset from scikit-learn and train a Naive Bayes classifier.
import cupy as cp
import cupyx

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

from cuml.naive_bayes import MultinomialNB

# Load corpus
twenty_train = fetch_20newsgroups(subset='train',
                                  shuffle=True, random_state=42)

# Turn documents into term frequency vectors
count_vect = CountVectorizer()
features = count_vect.fit_transform(twenty_train.data)

# Put feature vectors and labels on the GPU
X = cupyx.scipy.sparse.csr_matrix(features.tocsr(), dtype=cp.float32)
y = cp.asarray(twenty_train.target, dtype=cp.int32)

# Train model
model = MultinomialNB()
model.fit(X, y)

# Compute accuracy on training set
model.score(X, y)
Output:
0.9244298934936523
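To make the role of the `alpha` smoothing parameter concrete, here is a minimal CPU-only sketch of multinomial Naive Bayes on dense count rows. The functions `fit_multinomial_nb` and `predict` are hypothetical helpers written for illustration; they follow the textbook formulation and are an assumption about the general technique, not a description of cuML's GPU implementation.

```python
import math


def fit_multinomial_nb(X, y, alpha=1.0):
    """Return (classes, class_log_prior, feature_log_prob) from dense count rows."""
    classes = sorted(set(y))
    n_features = len(X[0])
    class_count = {c: 0 for c in classes}
    feature_count = {c: [0.0] * n_features for c in classes}
    for row, label in zip(X, y):
        class_count[label] += 1
        for j, value in enumerate(row):
            feature_count[label][j] += value
    total = sum(class_count.values())
    class_log_prior = {c: math.log(class_count[c] / total) for c in classes}
    feature_log_prob = {}
    for c in classes:
        # Additive smoothing: add alpha to every feature count of the class.
        denom = sum(feature_count[c]) + alpha * n_features
        feature_log_prob[c] = [math.log((n + alpha) / denom)
                               for n in feature_count[c]]
    return classes, class_log_prior, feature_log_prob


def predict(row, classes, class_log_prior, feature_log_prob):
    """Return the class with the highest joint log-likelihood for one row."""
    def joint(c):
        return class_log_prior[c] + sum(
            v * lp for v, lp in zip(row, feature_log_prob[c]))
    return max(classes, key=joint)


X = [[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]]
y = [0, 0, 1, 1]
model = fit_multinomial_nb(X, y)
print(predict([1, 0, 0], *model))  # 0
print(predict([0, 2, 1], *model))  # 1
```

With `alpha=0` this sketch would fail on features that never occur in a class, because it would take the logarithm of zero; smoothing avoids that case.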
 Attributes
 class_count_
Python descriptor object to control getting/setting CumlArray attributes on Base objects. See the Estimator Guide for an in-depth guide.
 class_log_prior_
Python descriptor object to control getting/setting CumlArray attributes on Base objects. See the Estimator Guide for an in-depth guide.
 classes_
Python descriptor object to control getting/setting CumlArray attributes on Base objects. See the Estimator Guide for an in-depth guide.
 feature_count_
Python descriptor object to control getting/setting CumlArray attributes on Base objects. See the Estimator Guide for an in-depth guide.
 feature_log_prob_
Python descriptor object to control getting/setting CumlArray attributes on Base objects. See the Estimator Guide for an in-depth guide.
Methods
fit
(X, y[, sample_weight, convert_dtype])Fit Naive Bayes classifier according to X, y
get_param_names
(self)Returns a list of hyperparameter names owned by this class.
partial_fit
(X, y[, classes, sample_weight, …])Incremental fit on a batch of samples.
predict
(X)Perform classification on an array of test vectors X.
predict_log_proba
(X)Return log-probability estimates for the test vector X.
predict_proba
(X)Return probability estimates for the test vector X.
Updates the log probabilities.

fit
(X, y, sample_weight=None, convert_dtype=True) → cuml.naive_bayes.naive_bayes.MultinomialNB[source]¶ Fit Naive Bayes classifier according to X, y
 Parameters
 Xarraylike (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 yarraylike (device or host) shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 sample_weightarraylike (device or host) shape = (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtypebool, optional (default = True)
When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_param_names
(self)[source]¶ Returns a list of hyperparameter names owned by this class. It is expected that every child class overrides this method and appends the extra set of parameters that it in turn owns. This is to simplify the implementation of the get_params and set_params methods.

partial_fit
(X, y, classes=None, sample_weight=None, convert_dtype=True) → cuml.naive_bayes.naive_bayes.MultinomialNB[source]¶ Incremental fit on a batch of samples.
This method is expected to be called several times consecutively on different chunks of a dataset so as to implement out-of-core or online learning.
This is especially useful when the whole dataset is too big to fit in memory at once.
This method has some performance overhead, so it is better to call partial_fit on chunks of data that are as large as possible (as long as they fit in the memory budget) to hide the overhead.
 Parameters
 X{arraylike, cupy sparse matrix} of shape (n_samples, n_features)
Training vectors, where n_samples is the number of samples and n_features is the number of features
 yarraylike of int32 or int64, shape (n_samples)
Target values.
 classesarraylike of shape (n_classes)
List of all the classes that can possibly appear in the y vector. Must be provided at the first call to partial_fit, can be omitted in subsequent calls.
 sample_weightarraylike of shape (n_samples)
Weights applied to individual samples (1. for unweighted). Currently, sample weight is ignored.
 convert_dtypebool
If True, convert y to the appropriate dtype (int)
 Returns
 selfobject
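The calling pattern described for partial_fit can be sketched without a GPU. The class `CountAccumulator` and the generator `chunks` below are hypothetical stand-ins: they only accumulate the per-class counts that an incremental Naive Bayes fit would keep, under the assumption that the real estimator maintains equivalent sufficient statistics.

```python
class CountAccumulator:
    """Toy stand-in for the statistics an incremental fit keeps between calls."""

    def __init__(self):
        self.classes_ = None
        self.class_count_ = {}
        self.feature_count_ = {}

    def partial_fit(self, X, y, classes=None):
        if self.classes_ is None:
            # The full class list is needed on the first call only.
            if classes is None:
                raise ValueError("classes must be passed on the first call")
            self.classes_ = list(classes)
            n_features = len(X[0])
            self.class_count_ = {c: 0 for c in self.classes_}
            self.feature_count_ = {c: [0] * n_features for c in self.classes_}
        for row, label in zip(X, y):
            self.class_count_[label] += 1
            counts = self.feature_count_[label]
            for j, value in enumerate(row):
                counts[j] += value
        return self


def chunks(X, y, size):
    """Yield consecutive (X, y) slices of at most `size` rows."""
    for start in range(0, len(X), size):
        yield X[start:start + size], y[start:start + size]


X = [[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2], [1, 1, 1]]
y = [0, 0, 1, 1, 2]

incremental = CountAccumulator()
for X_chunk, y_chunk in chunks(X, y, size=2):
    incremental.partial_fit(X_chunk, y_chunk, classes=[0, 1, 2])

full = CountAccumulator().partial_fit(X, y, classes=[0, 1, 2])
print(incremental.feature_count_ == full.feature_count_)  # True
```

Because the counts simply add up, feeding the data in chunks gives the same statistics as a single pass over the whole dataset.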

predict
(X) → cuml.common.array.CumlArray[source]¶ Perform classification on an array of test vectors X.
 Parameters
 Xarraylike (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 Returns
 y_hatcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_rows, 1)
Predicted values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

predict_log_proba
(X) → cuml.common.array.CumlArray[source]¶ Return log-probability estimates for the test vector X.
 Parameters
 Xarraylike (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 Returns
 CcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_rows, 1)
Returns the log-probability of the samples for each class in the model. The columns correspond to the classes in sorted order, as they appear in the attribute classes_.
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

predict_proba
(X) → cuml.common.array.CumlArray[source]¶ Return probability estimates for the test vector X.
 Parameters
 Xarraylike (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 Returns
 CcuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_rows, 1)
Returns the probability of the samples for each class in the model. The columns correspond to the classes in sorted order, as they appear in the attribute classes_.
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
Stochastic Gradient Descent¶

class
cuml.
SGD
(*, loss='squared_loss', penalty='none', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, epochs=1000, tol=0.001, shuffle=True, learning_rate='constant', eta0=0.001, power_t=0.5, batch_size=32, n_iter_no_change=5, handle=None, output_type=None, verbose=False)¶ Stochastic Gradient Descent is a very common machine learning algorithm where one optimizes some cost function via gradient steps. This makes SGD very attractive for large problems when the exact solution is hard or even impossible to find.
cuML’s SGD algorithm accepts a numpy matrix or a cuDF DataFrame as the input dataset. The SGD algorithm currently works with linear regression, ridge regression and SVM models.
 Parameters
 loss‘hinge’, ‘log’, ‘squared_loss’ (default = ‘squared_loss’)
‘hinge’ uses linear SVM ‘log’ uses logistic regression ‘squared_loss’ uses linear regression
 penalty: ‘none’, ‘l1’, ‘l2’, ‘elasticnet’ (default = ‘none’)
‘none’ does not perform any regularization ‘l1’ performs L1 norm (Lasso) which minimizes the sum of the abs value of coefficients ‘l2’ performs L2 norm (Ridge) which minimizes the sum of the square of the coefficients ‘elasticnet’ performs Elastic Net regularization which is a weighted average of L1 and L2 norms
 alpha: float (default = 0.0001)
The constant value which decides the degree of regularization
 fit_interceptboolean (default = True)
If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.
 epochsint (default = 1000)
The number of times the model should iterate through the entire dataset during training (default = 1000)
 tolfloat (default = 1e-3)
The training process will stop if current_loss > previous_loss - tol
 shuffleboolean (default = True)
True, shuffles the training data after each epoch False, does not shuffle the training data after each epoch
 eta0float (default = 0.001)
Initial learning rate
 power_tfloat (default = 0.5)
The exponent used for calculating the invscaling learning rate
 learning_rate‘optimal’, ‘constant’, ‘invscaling’, ‘adaptive’ (default = ‘constant’)
‘optimal’ will be supported in the next version. ‘constant’ keeps the learning rate constant. ‘adaptive’ changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs; the old learning rate is generally divided by 5.
 n_iter_no_changeint (default = 5)
The number of epochs to train without any improvement in the model.
 handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level,
cuml.global_settings.output_type
. See Output Data Type Configuration for more info.
 verboseint or boolean, default=False
Sets logging level. It must be one of
cuml.common.logger.level_*
. See Verbosity Levels for more info.
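The learning-rate options listed above can be illustrated with a small sketch. The formula used for `invscaling` (`eta0 / t ** power_t`) is borrowed from scikit-learn's convention and is an assumption here, as is the exact bookkeeping in the hypothetical `AdaptiveRate` class; only the division by 5 after `n_iter_no_change` epochs without improvement comes from the parameter description.

```python
def constant_rate(eta0, t, power_t=0.5):
    """'constant': the step size never changes."""
    return eta0


def invscaling_rate(eta0, t, power_t=0.5):
    """'invscaling' (assumed scikit-learn convention): eta0 / t ** power_t, t >= 1."""
    return eta0 / (t ** power_t)


class AdaptiveRate:
    """'adaptive': divide the rate by 5 after n_iter_no_change epochs without improvement."""

    def __init__(self, eta0, n_iter_no_change=5, factor=5.0):
        self.eta = eta0
        self.n_iter_no_change = n_iter_no_change
        self.factor = factor
        self.best_loss = float("inf")
        self.stalled = 0

    def step(self, epoch_loss):
        if epoch_loss < self.best_loss:
            self.best_loss = epoch_loss
            self.stalled = 0
        else:
            self.stalled += 1
            if self.stalled >= self.n_iter_no_change:
                self.eta /= self.factor
                self.stalled = 0
        return self.eta


print(invscaling_rate(0.1, t=4))

schedule = AdaptiveRate(eta0=0.5, n_iter_no_change=2)
rates = [schedule.step(loss) for loss in [1.0, 1.0, 1.0]]
print(rates)
```

In the adaptive run above, the loss stops improving after the first epoch, so the rate is reduced once two stalled epochs have been counted.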
Examples
import numpy as np
import cudf
from cuml.solvers import SGD as cumlSGD

X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype=np.float32)
X['col2'] = np.array([1,2,2,3], dtype=np.float32)
y = cudf.Series(np.array([1, 1, 2, 2], dtype=np.float32))

pred_data = cudf.DataFrame()
pred_data['col1'] = np.asarray([3, 2], dtype=np.float32)
pred_data['col2'] = np.asarray([5, 5], dtype=np.float32)

cu_sgd = cumlSGD(learning_rate='constant', eta0=0.005, epochs=2000,
                 fit_intercept=True, batch_size=2,
                 tol=0.0, penalty='none', loss='squared_loss')
cu_sgd.fit(X, y)
cu_pred = cu_sgd.predict(pred_data).to_array()

print(" cuML intercept : ", cu_sgd.intercept_)
print(" cuML coef : ", cu_sgd.coef_)
print("cuML predictions : ", cu_pred)
Output:
cuML intercept : 0.0041877031326293945
cuML coef : 0    0.984174
1    0.009776
dtype: float32
cuML predictions : [3.005588 2.0214138]
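For readers who want to see what mini-batch SGD with the squared loss does step by step, the following CPU-only sketch may help. The function `sgd_fit` is hypothetical, and the way it combines `tol` with `n_iter_no_change` (stop after that many consecutive epochs satisfying the documented rule `current_loss > previous_loss - tol`) is an assumption, not a statement about cuML's solver. The toy data is centered so that the sketch converges quickly.

```python
import random


def mean_squared_error(X, y, coef, intercept):
    errors = [sum(w * x for w, x in zip(coef, row)) + intercept - t
              for row, t in zip(X, y)]
    return sum(e * e for e in errors) / len(errors)


def sgd_fit(X, y, eta0=0.05, epochs=2000, batch_size=2, tol=0.0,
            n_iter_no_change=5, seed=0):
    """Mini-batch SGD for linear regression with a constant learning rate."""
    rng = random.Random(seed)
    n_features = len(X[0])
    coef, intercept = [0.0] * n_features, 0.0
    order = list(range(len(X)))
    previous_loss, stalled = float("inf"), 0
    for _ in range(epochs):
        rng.shuffle(order)  # shuffle=True: new sample order every epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            grad_w, grad_b = [0.0] * n_features, 0.0
            for i in batch:
                err = sum(w * x for w, x in zip(coef, X[i])) + intercept - y[i]
                grad_w = [g + err * x for g, x in zip(grad_w, X[i])]
                grad_b += err
            step = eta0 / len(batch)
            coef = [w - step * g for w, g in zip(coef, grad_w)]
            intercept -= step * grad_b
        loss = mean_squared_error(X, y, coef, intercept)
        # Documented rule: stop when current_loss > previous_loss - tol.
        if loss > previous_loss - tol:
            stalled += 1
            if stalled >= n_iter_no_change:
                break
        else:
            stalled = 0
        previous_loss = loss
    return coef, intercept


# Noise-free toy data generated from y = 2 * x + 1.
X = [[-1.5], [-0.5], [0.5], [1.5]]
y = [2.0 * row[0] + 1.0 for row in X]
coef, intercept = sgd_fit(X, y)
print(coef, intercept)
```

On this noise-free data the fitted slope and intercept should approach 2 and 1 respectively.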
 Attributes
 classes_
Python descriptor object to control getting/setting CumlArray attributes on Base objects. See the Estimator Guide for an in-depth guide.
 coef_
Python descriptor object to control getting/setting CumlArray attributes on Base objects. See the Estimator Guide for an in-depth guide.
Methods
fit
(self, X, y[, convert_dtype])Fit the model with X and y.
get_param_names
(self)
predict
(self, X[, convert_dtype])Predicts the y for X.
predictClass
(self, X[, convert_dtype])Predicts the y for X.

fit
(self, X, y, convert_dtype=False) → u’SGD’[source]¶ Fit the model with X and y.
 Parameters
 Xarraylike (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 yarraylike (device or host) shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtypebool, optional (default = False)
When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

predict
(self, X, convert_dtype=False) → CumlArray[source]¶ Predicts the y for X.
 Parameters
 Xarraylike (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtypebool, optional (default = False)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 Returns
 predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Predicted values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

predictClass
(self, X, convert_dtype=False) → CumlArray[source]¶ Predicts the y for X.
 Parameters
 Xarraylike (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtypebool, optional (default = False)
When set to True, the predictClass method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 Returns
 predscuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Predicted values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
Random Forest¶

class
cuml.ensemble.
RandomForestClassifier
(*, split_criterion=0, handle=None, verbose=False, output_type=None, n_bins=32, use_experimental_backend=True, **kwargs)¶ Implements a Random Forest classifier model which fits multiple decision tree classifiers in an ensemble.
Note
Note that the underlying algorithm for tree node splits differs from that used in scikit-learn. By default, the cuML Random Forest uses a histogram-based algorithm to determine splits, rather than an exact count. You can tune the size of the histograms with the n_bins parameter.
Known Limitations: This is an early release of the cuML Random Forest code. It contains a few known limitations:
GPU-based inference is only supported if the model was trained with 32-bit (float32) datatypes. CPU-based inference may be used in this case as a slower fallback.
Very deep / very wide models may exhaust available GPU memory. Future versions of cuML will provide an alternative algorithm to reduce memory consumption.
While training the model for multi class classification problems, using deep trees or
max_features=1.0
provides better performance.
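The idea of evaluating splits only at a limited number of binned thresholds, rather than at every distinct feature value, can be sketched as follows. The helpers `gini`, `candidate_thresholds` and `best_split` are hypothetical, and the way the thresholds are chosen (equally spaced quantiles of the sorted values) is an assumption made for illustration; it is not a description of cuML's GPU kernels.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def candidate_thresholds(values, n_bins):
    """Pick up to n_bins - 1 quantile cut points from the sorted values."""
    ordered = sorted(values)
    cuts = {ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)}
    return sorted(cuts)


def best_split(values, labels, n_bins=4):
    """Return (threshold, weighted impurity) of the best binned split, if any."""
    best = (None, gini(labels))
    n = len(values)
    for threshold in candidate_thresholds(values, n_bins):
        left = [l for v, l in zip(values, labels) if v < threshold]
        right = [l for v, l in zip(values, labels) if v >= threshold]
        if not left or not right:
            continue
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


values = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(candidate_thresholds(values, 4))  # [0.3, 0.6, 0.8]
print(best_split(values, labels, 4))    # (0.6, 0.0)
```

With more bins, more candidate thresholds are tested, which is why increasing `n_bins` can improve accuracy at the cost of extra work.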
 Parameters
 n_estimatorsint (default = 100)
Number of trees in the forest. (Default changed to 100 in cuML 0.11)
 split_criterionThe criterion used to split nodes.
0 for GINI, 1 for ENTROPY. 2 and 3 are not valid for classification (default = 0).
 split_algoint (default = 1)
The algorithm to determine how nodes are split in the tree. 0 for HIST and 1 for GLOBAL_QUANTILE. HIST currently uses a slower tree-building algorithm so GLOBAL_QUANTILE is recommended for most cases.
 bootstrapboolean (default = True)
Control bootstrapping. If True, each tree in the forest is built on a bootstrapped sample with replacement. If False, the whole dataset is used to build each tree.
 bootstrap_featuresboolean (default = False)
Controls bootstrapping for features, i.e. whether features are drawn with or without replacement.
 max_samplesfloat (default = 1.0)
Ratio of dataset rows used while fitting each tree.
 max_depthint (default = 16)
Maximum tree depth. Unlimited (i.e., until leaves are pure), if -1. Unlimited depth is not supported. Note that this default differs from scikit-learn’s random forest, which defaults to unlimited depth.
 max_leavesint (default = -1)
Maximum leaf nodes per tree. Soft constraint. Unlimited, if -1.
 max_featuresint, float, or string (default = ‘auto’)
Ratio of number of features (columns) to consider per node split. If int then max_features/n_features. If float then max_features is used as a fraction. If ‘auto’ then max_features=1/sqrt(n_features). If ‘sqrt’ then max_features=1/sqrt(n_features). If ‘log2’ then max_features=log2(n_features)/n_features.
 n_binsint (default = 32)
Number of bins used by the split algorithm. For large problems, particularly those with highly-skewed input data, increasing the number of bins may improve accuracy.
 min_samples_leafint or float (default = 1)
The minimum number of samples (rows) in each leaf node. If int, then min_samples_leaf represents the minimum number. If float, then min_samples_leaf represents a fraction and ceil(min_samples_leaf * n_rows) is the minimum number of samples for each leaf node.
 min_samples_splitint or float (default = 2)
The minimum number of samples required to split an internal node. If int, then min_samples_split represents the minimum number. If float, then min_samples_split represents a fraction and ceil(min_samples_split * n_rows) is the minimum number of samples for each split.
 min_impurity_decreasefloat (default = 0.0)
Minimum decrease in impurity required for a node to be split.
 quantile_per_treeboolean (default = False)
Whether quantile is computed for individual trees in RF. Only relevant when split_algo = GLOBAL_QUANTILE.
Deprecated since version 0.19: Parameter ‘quantile_per_tree’ is deprecated and will be removed in a subsequent release.
 use_experimental_backendboolean (default = True)
If set to true and the following conditions are also met, a new experimental backend for decision tree training will be used. The new backend is available only if split_algo = 1 (GLOBAL_QUANTILE) and quantile_per_tree = False (no per-tree quantile computation). The new backend is considered stable for classification tasks but not yet for regression tasks. The RAPIDS team is continuing optimization and evaluation of the new backend for regression tasks.
 max_batch_size: int (default = 128)
Maximum number of nodes that can be processed in a given batch. This is used only when ‘use_experimental_backend’ is true. Does not currently fully guarantee the exact same results.
 random_stateint (default = None)
Seed for the random number generator. Unseeded by default. Does not currently fully guarantee the exact same results. Note: Parameter `seed` is removed since release 0.19.
 handlecuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 verboseint or boolean, default=False
Sets logging level. It must be one of
cuml.common.logger.level_*
. See Verbosity Levels for more info.
 output_type{‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level,
cuml.global_settings.output_type
. See Output Data Type Configuration for more info.
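The rules stated above for `max_features`, `min_samples_leaf` and `min_samples_split` can be written out as a short sketch. The helper names `resolve_max_features` and `resolve_min_samples` are hypothetical; they simply transcribe the documented rules and are not cuML functions.

```python
import math


def resolve_max_features(max_features, n_features):
    """Return the fraction of features considered per node split."""
    if isinstance(max_features, bool):
        raise TypeError("max_features must be int, float or str")
    if isinstance(max_features, int):
        return max_features / n_features
    if isinstance(max_features, float):
        return max_features
    if max_features in ("auto", "sqrt"):
        return 1.0 / math.sqrt(n_features)
    if max_features == "log2":
        return math.log2(n_features) / n_features
    raise ValueError(f"unsupported max_features: {max_features!r}")


def resolve_min_samples(value, n_rows):
    """An int is an absolute row count; a float is a fraction of n_rows, rounded up."""
    if isinstance(value, int):
        return value
    return math.ceil(value * n_rows)


print(resolve_max_features("auto", 16))  # 0.25
print(resolve_max_features(4, 16))       # 0.25
print(resolve_max_features("log2", 16))  # 0.25
print(resolve_min_samples(0.1, 95))      # 10
```

With 16 features, `'auto'`, the integer 4 and `'log2'` all resolve to the same fraction, which makes the three spellings easy to compare.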
Examples
import numpy as np
from cuml.ensemble import RandomForestClassifier as cuRFC

X = np.random.normal(size=(10,4)).astype(np.float32)
y = np.asarray([0,1]*5, dtype=np.int32)

cuml_model = cuRFC(max_features=1.0,
                   n_bins=8,
                   n_estimators=40)
cuml_model.fit(X,y)
cuml_predict = cuml_model.predict(X)

print("Predicted labels : ", cuml_predict)
Output:
Predicted labels : [0 1 0 1 0 1 0 1 0 1]
Methods
convert_to_fil_model
(self[, output_class, …])Create a Forest Inference (FIL) model from the trained cuML Random Forest model.
convert_to_treelite_model
(self)Converts the cuML RF model to a Treelite model
fit
(self, X, y[, convert_dtype])Perform Random Forest Classification on the input data
get_detailed_text
(self)Obtain the detailed information for the random forest model, as text
get_json
(self)Export the Random Forest model as a JSON string
get_summary_text
(self)Obtain the text summary of the random forest model
predict
(self, X[, predict_model, threshold, …])Predicts the labels for X.
predict_proba
(self, X[, algo, num_classes, …])Predicts class probabilities for X.
score
(self, X, y[, threshold, algo, …])Calculates the accuracy metric score of the model for X.

convert_to_fil_model
(self, output_class=True, threshold=0.5, algo='auto', fil_sparse_format='auto')[source]¶ Create a Forest Inference (FIL) model from the trained cuML Random Forest model.
 Parameters
 output_classboolean (default = True)
This is optional and required only while performing the predict operation on the GPU. If true, return a 1 or 0 depending on whether the raw prediction exceeds the threshold. If False, just return the raw prediction.
 algostring (default = ‘auto’)
This is optional and required only while performing the predict operation on the GPU.
'naive' - simple inference using shared memory
'tree_reorg' - similar to naive but trees rearranged to be more coalescing-friendly
'batch_tree_reorg' - similar to tree_reorg but predicting multiple rows per thread block
'auto' - choose the algorithm automatically. Currently 'batch_tree_reorg' is used for dense storage and 'naive' for sparse storage
 thresholdfloat (default = 0.5)
Threshold used for classification. Optional and required only while performing the predict operation on the GPU. It is applied if output_class == True, else it is ignored
 fil_sparse_formatboolean or string (default = auto)
This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.
'auto' - choose the storage type automatically (currently True is chosen by auto)
False - create a dense forest
True - create a sparse forest, requires algo='naive' or algo='auto'
 Returns
 fil_model
A Forest Inference model which can be used to perform inferencing on the random forest model.
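The interaction between `output_class` and `threshold` described above amounts to a simple post-processing step on the raw predictions. The function `apply_output_class` below is a hypothetical illustration of that rule for a binary problem; it is not part of cuML or the Forest Inference Library.

```python
def apply_output_class(raw_predictions, output_class=True, threshold=0.5):
    """Turn raw scores into 0/1 labels when output_class is True."""
    if not output_class:
        # The threshold is ignored and raw predictions are returned unchanged.
        return list(raw_predictions)
    return [1 if p > threshold else 0 for p in raw_predictions]


raw = [0.2, 0.5, 0.7, 0.95]
print(apply_output_class(raw))                      # [0, 0, 1, 1]
print(apply_output_class(raw, threshold=0.9))       # [0, 0, 0, 1]
print(apply_output_class(raw, output_class=False))  # [0.2, 0.5, 0.7, 0.95]
```

The sketch treats "exceeds the threshold" as a strict comparison, so a raw prediction equal to the threshold maps to 0.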

convert_to_treelite_model
(self)[source]¶ Converts the cuML RF model to a Treelite model
 Returns
 tl_to_fil_modelTreelite version of this model

fit
(self, X, y, convert_dtype=True)[source]¶ Perform Random Forest Classification on the input data
 Parameters
 Xarraylike (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 yarraylike (device or host) shape = (n_samples, 1)
Dense matrix of type np.int32. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtypebool, optional (default = True)
When set to True, the method will automatically convert the inputs to np.float32.
 convert_dtypebool, optional (default = True)
When set to True, the fit method will, when necessary, convert y to be of dtype int32. This will increase memory used for the method.

get_detailed_text
(self)[source]¶ Obtain the detailed information for the random forest model, as text

predict
(self, X, predict_model='GPU', threshold=0.5, algo='auto', num_classes=None, convert_dtype=True, fil_sparse_format='auto') → CumlArray[source]¶ Predicts the labels for X.
 Parameters
 Xarraylike (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 predict_modelString (default = ‘GPU’)
‘GPU’ to predict using the GPU, ‘CPU’ otherwise. The ‘GPU’ can only be used if the model was trained on float32 data and
X
is float32 or convert_dtype is set to True. Also the ‘GPU’ should only be used for classification problems.
 algostring (default = 'auto')
This is optional and required only while performing the predict operation on the GPU.
'naive' - simple inference using shared memory
'tree_reorg' - similar to naive but trees rearranged to be more coalescing-friendly
'batch_tree_reorg' - similar to tree_reorg but predicting multiple rows per thread block
'auto' - choose the algorithm automatically. Currently 'batch_tree_reorg' is used for dense storage and 'naive' for sparse storage
 thresholdfloat (default = 0.5)
Threshold used for classification. Optional and required only while performing the predict operation on the GPU.
 num_classesint (default = None)
number of different classes present in the dataset.
Deprecated since version 0.16: Parameter ‘num_classes’ is deprecated and will be removed in an upcoming version. The number of classes passed must match the number of classes the model was trained on.
 convert_dtypebool, optional (default = True)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 fil_sparse_formatboolean or string (default = 'auto')
This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model='CPU'.
'auto' - choose the storage type automatically (currently True is chosen by auto)
False - create a dense forest
True - create a sparse forest, requires algo='naive' or algo='auto'
 Returns
 ycuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

predict_proba
(self, X, algo='auto', num_classes=None, convert_dtype=True, fil_sparse_format='auto') → CumlArray[source]¶ Predicts class probabilities for X. This function uses the GPU implementation of predict. Therefore, data with ‘dtype = np.float32’ should be used with this function.
 Parameters
 Xarraylike (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 algostring (default = ‘auto’)
This is optional and required only while performing the predict operation on the GPU.
'naive' - simple inference using shared memory
'tree_reorg' - similar to naive but trees rearranged to be more coalescing-friendly
'batch_tree_reorg' - similar to tree_reorg but predicting multiple rows per thread block
'auto' - choose the algorithm automatically. Currently 'batch_tree_reorg' is used for dense storage and 'naive' for sparse storage
 num_classesint (default = None)
number of different classes present in the dataset.
Deprecated since version 0.16: Parameter ‘num_classes’ is deprecated and will be removed in an upcoming version. The number of classes passed must match the number of classes the model was trained on.
 convert_dtypebool, optional (default = True)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 fil_sparse_formatboolean or string (default = auto)
This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.
'auto' - choose the storage type automatically (currently True is chosen by auto)
False - create a dense forest
True - create a sparse forest, requires algo='naive' or algo='auto'
 Returns
 ycuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

score
(self, X, y, threshold=0.5, algo='auto', num_classes=None, predict_model='GPU', convert_dtype=True, fil_sparse_format='auto')[source]¶ Calculates the accuracy metric score of the model for X.
 Parameters
 Xarraylike (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 yarraylike (device or host) shape = (n_samples, 1)
Dense matrix of type np.int32. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 algostring (default = ‘auto’)
This is optional and required only while performing the predict operation on the GPU.
'naive' - simple inference using shared memory
'tree_reorg' - similar to naive but trees rearranged to be more coalescing-friendly
'batch_tree_reorg' - similar to tree_reorg but predicting multiple rows per thread block
'auto' - choose the algorithm automatically. Currently 'batch_tree_reorg' is used for dense storage and 'naive' for sparse storage
 thresholdfloat
Threshold used for classification. This is optional and required only while performing the predict operation on the GPU.
 num_classesint (default = None)
number of different classes present in the dataset.
Deprecated since version 0.16: Parameter ‘num_classes’ is deprecated and will be removed in an upcoming version. The number of classes passed must match the number of classes the model was trained on.
 convert_dtype: boolean (default = True)
Whether to automatically convert the input data to the correct dtype.
 predict_model: string (default = ‘GPU’)
‘GPU’ to predict using the GPU, ‘CPU’ otherwise. ‘GPU’ can only be used if the model was trained on float32 data and X is float32 or convert_dtype is set to True. Also, ‘GPU’ should only be used for classification problems.
 fil_sparse_format: boolean or string (default = ‘auto’)
This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.
'auto': choose the storage type automatically (currently True is chosen by auto)
False: create a dense forest
True: create a sparse forest, requires algo=’naive’ or algo=’auto’
 Returns
 accuracy: float
Accuracy of the model, in the range [0.0, 1.0].

class
cuml.ensemble.
RandomForestRegressor
(*, split_criterion=2, accuracy_metric='r2', handle=None, verbose=False, output_type=None, **kwargs)¶ Implements a Random Forest regressor model which fits multiple decision trees in an ensemble.
Note
The underlying algorithm for tree node splits differs from that used in scikit-learn. By default, the cuML Random Forest uses a histogram-based algorithm to determine splits, rather than an exact count. You can tune the size of the histograms with the n_bins parameter.
Known Limitations: This is an early release of the cuML Random Forest code. It contains a few known limitations:
GPU-based inference is only supported if the model was trained with 32-bit (float32) datatypes. For models trained with other datatypes, CPU-based inference may be used as a slower fallback.
Very deep / very wide models may exhaust available GPU memory. Future versions of cuML will provide an alternative algorithm to reduce memory consumption.
 Parameters
 n_estimators: int (default = 100)
Number of trees in the forest. (Default changed to 100 in cuML 0.11)
 split_algo: int (default = 1)
The algorithm to determine how nodes are split in the tree. 0 for HIST and 1 for GLOBAL_QUANTILE. HIST currently uses a slower tree-building algorithm, so GLOBAL_QUANTILE is recommended for most cases.
 split_criterion: int (default = 2)
The criterion used to split nodes. 0 for GINI, 1 for ENTROPY, 2 for MSE, or 3 for MAE. 0 and 1 are not valid for regression.
 bootstrap: boolean (default = True)
Control bootstrapping. If True, each tree in the forest is built on a bootstrapped sample with replacement. If False, the whole dataset is used to build each tree.
 bootstrap_features: boolean (default = False)
Control bootstrapping for features. Determines whether features are drawn with or without replacement.
 max_samples: float (default = 1.0)
Ratio of dataset rows used while fitting each tree.
 max_depth: int (default = 16)
Maximum tree depth. Unlimited (i.e., until leaves are pure) if -1. Unlimited depth is not supported with split_algo=1. Note that this default differs from scikit-learn’s random forest, which defaults to unlimited depth.
 max_leaves: int (default = -1)
Maximum leaf nodes per tree. Soft constraint. Unlimited if -1.
 max_features: int, float, or string (default = ‘auto’)
Ratio of number of features (columns) to consider per node split. If int then max_features/n_features. If float then max_features is used as a fraction. If ‘auto’ then max_features=1.0. If ‘sqrt’ then max_features=1/sqrt(n_features). If ‘log2’ then max_features=log2(n_features)/n_features.
 n_bins: int (default = 8)
Number of bins used by the split algorithm. For large problems, particularly those with highly-skewed input data, increasing the number of bins may improve accuracy.
 min_samples_leaf: int or float (default = 1)
The minimum number of samples (rows) in each leaf node. If int, then min_samples_leaf represents the minimum number. If float, then min_samples_leaf represents a fraction and ceil(min_samples_leaf * n_rows) is the minimum number of samples for each leaf node.
 min_samples_split: int or float (default = 2)
The minimum number of samples required to split an internal node. If int, then min_samples_split represents the minimum number. If float, then min_samples_split represents a fraction and ceil(min_samples_split * n_rows) is the minimum number of samples for each split.
 min_impurity_decrease: float (default = 0.0)
The minimum decrease in impurity required for a node to be split.
 accuracy_metric: string (default = ‘r2’)
Decides the metric used to evaluate the performance of the model. In the 0.16 release, the default scoring metric was changed from mean squared error to r-squared.
‘r2’: r-squared
‘median_ae’: median of absolute error
‘mean_ae’: mean of absolute error
‘mse’: mean squared error
 quantile_per_tree: boolean (default = False)
Whether quantile is computed for individual trees in RF. Only relevant when split_algo = GLOBAL_QUANTILE.
Deprecated since version 0.19: Parameter ‘quantile_per_tree’ is deprecated and will be removed in a subsequent release.
 use_experimental_backend: boolean (default = False)
If set to True and the following conditions are also met, a new experimental backend for decision tree training will be used. The new backend is available only if split_algo = 1 (GLOBAL_QUANTILE) and quantile_per_tree = False (no per-tree quantile computation). The new backend is considered stable for classification tasks but not yet for regression tasks. The RAPIDS team is continuing optimization and evaluation of the new backend for regression tasks.
 max_batch_size: int (default = 128)
Maximum number of nodes that can be processed in a given batch. This is used only when ‘use_experimental_backend’ is True.
 random_state: int (default = None)
Seed for the random number generator. Unseeded by default. Does not currently fully guarantee the exact same results. Note: Parameter `seed` has been removed as of release 0.19.
 handle: cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 verbose: int or boolean (default = False)
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type: {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’} (default = None)
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
Examples
import numpy as np
from cuml.ensemble import RandomForestRegressor as curfc

X = np.asarray([[0, 10], [0, 20], [0, 30], [0, 40]], dtype=np.float32)
y = np.asarray([0.0, 1.0, 2.0, 3.0], dtype=np.float32)

cuml_model = curfc(max_features=1.0, n_bins=8, split_algo=0,
                   min_samples_leaf=1, min_samples_split=2,
                   n_estimators=40, accuracy_metric='r2')
cuml_model.fit(X, y)

# score() returns the metric selected by accuracy_metric
cuml_score = cuml_model.score(X, y)
print("MSE score of cuml : ", cuml_score)
Output:
MSE score of cuml : 0.1123437201231765
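The max_features, min_samples_leaf and min_samples_split rules from the parameter list above can be sketched in plain Python. This is only an illustration of the documented arithmetic, not cuML code: the helper names resolve_max_features and resolve_min_samples are hypothetical.

```python
import math


def resolve_max_features(max_features, n_features):
    """Return the fraction of features considered per node split."""
    if isinstance(max_features, str):
        if max_features == "auto":
            return 1.0
        if max_features == "sqrt":
            return 1.0 / math.sqrt(n_features)
        if max_features == "log2":
            return math.log2(n_features) / n_features
        raise ValueError(f"unsupported max_features: {max_features!r}")
    if isinstance(max_features, int):
        # An int is interpreted as a feature count and turned into a ratio.
        return max_features / n_features
    # A float is already a fraction.
    return float(max_features)


def resolve_min_samples(value, n_rows):
    """Return the minimum sample count for an int or fractional setting."""
    if isinstance(value, int):
        return value
    return math.ceil(value * n_rows)


print(resolve_max_features("sqrt", 16))   # 0.25
print(resolve_max_features(4, 16))        # 0.25
print(resolve_min_samples(0.1, 95))       # 10
```

Note that with 16 features, ‘sqrt’, ‘log2’ and the integer 4 all resolve to the same ratio of 0.25.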
Methods
convert_to_fil_model
(self[, output_class, …]) Create a Forest Inference (FIL) model from the trained cuML Random Forest model.
convert_to_treelite_model
(self) Converts the cuML RF model to a Treelite model.
fit
(self, X, y[, convert_dtype]) Perform Random Forest Regression on the input data.
get_detailed_text
(self) Obtain the detailed information for the random forest model, as text.
get_json
(self) Export the Random Forest model as a JSON string.
get_summary_text
(self) Obtain the text summary of the random forest model.
predict
(self, X[, predict_model, algo, …]) Predicts the labels for X.
score
(self, X, y[, algo, convert_dtype, …]) Calculates the accuracy metric score of the model for X.

convert_to_fil_model
(self, output_class=False, algo='auto', fil_sparse_format='auto')[source]¶ Create a Forest Inference (FIL) model from the trained cuML Random Forest model.
 Parameters
 output_class: boolean (default = False)
This is optional and required only while performing the predict operation on the GPU. If True, return a 1 or 0 depending on whether the raw prediction exceeds the threshold. If False, just return the raw prediction.
 algo: string (default = ‘auto’)
This is optional and required only while performing the predict operation on the GPU.
'naive': simple inference using shared memory
'tree_reorg': similar to naive but trees rearranged to be more coalescing-friendly
'batch_tree_reorg': similar to tree_reorg but predicting multiple rows per thread block
'auto': choose the algorithm automatically. Currently 'batch_tree_reorg' is used for dense storage and 'naive' for sparse storage
 fil_sparse_format: boolean or string (default = ‘auto’)
This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.
'auto': choose the storage type automatically (currently True is chosen by auto)
False: create a dense forest
True: create a sparse forest, requires algo=’naive’ or algo=’auto’
 Returns
 fil_model
A Forest Inference model which can be used to perform inferencing on the random forest model.

convert_to_treelite_model
(self)[source]¶ Converts the cuML RF model to a Treelite model
 Returns
 tl_to_fil_model: Treelite version of this model

fit
(self, X, y, convert_dtype=True)[source]¶ Perform Random Forest Regression on the input data
 Parameters
 X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y: array-like (device or host) shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = True)
When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

get_detailed_text
(self)[source]¶ Obtain the detailed information for the random forest model, as text

predict
(self, X, predict_model='GPU', algo='auto', convert_dtype=True, fil_sparse_format='auto') → CumlArray[source]¶ Predicts the labels for X.
 Parameters
 X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 predict_model: string (default = ‘GPU’)
‘GPU’ to predict using the GPU, ‘CPU’ otherwise. The GPU can only be used if the model was trained on float32 data and X is float32 or convert_dtype is set to True.
 algo: string (default = ‘auto’)
This is optional and required only while performing the predict operation on the GPU.
'naive': simple inference using shared memory
'tree_reorg': similar to naive but trees rearranged to be more coalescing-friendly
'batch_tree_reorg': similar to tree_reorg but predicting multiple rows per thread block
'auto': choose the algorithm automatically. Currently 'batch_tree_reorg' is used for dense storage and 'naive' for sparse storage
 convert_dtype: bool, optional (default = True)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 fil_sparse_format: boolean or string (default = ‘auto’)
This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.
'auto': choose the storage type automatically (currently True is chosen by auto)
False: create a dense forest
True: create a sparse forest, requires algo=’naive’ or algo=’auto’
 Returns
 y: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)

score
(self, X, y, algo='auto', convert_dtype=True, fil_sparse_format='auto', predict_model='GPU')[source]¶ Calculates the accuracy metric score of the model for X. In the 0.16 release, the default scoring metric was changed from mean squared error to r-squared.
 Parameters
 X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y: array-like (device or host) shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 algo: string (default = ‘auto’)
This is optional and required only while performing the predict operation on the GPU.
'naive': simple inference using shared memory
'tree_reorg': similar to naive but trees rearranged to be more coalescing-friendly
'batch_tree_reorg': similar to tree_reorg but predicting multiple rows per thread block
'auto': choose the algorithm automatically. Currently 'batch_tree_reorg' is used for dense storage and 'naive' for sparse storage
 convert_dtype: boolean (default = True)
Whether to automatically convert the input data to the correct dtype.
 predict_model: string (default = ‘GPU’)
‘GPU’ to predict using the GPU, ‘CPU’ otherwise. The GPU can only be used if the model was trained on float32 data and X is float32 or convert_dtype is set to True.
 fil_sparse_format: boolean or string (default = ‘auto’)
This variable is used to choose the type of forest that will be created in the Forest Inference Library. It is not required while using predict_model=’CPU’.
'auto': choose the storage type automatically (currently True is chosen by auto)
False: create a dense forest
True: create a sparse forest, requires algo=’naive’ or algo=’auto’
 Returns
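 mean_square_error: float or
 median_abs_error: float or
 mean_abs_error: float
The value returned depends on accuracy_metric. With the default ‘r2’, it is the r-squared score.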
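As a rough illustration of what the four accuracy_metric options measure, the sketch below computes them with their textbook definitions in plain Python. It is not the cuML implementation; the function name regression_scores is hypothetical, and the r-squared formula assumes y_true is not constant.

```python
from statistics import median


def regression_scores(y_true, y_pred):
    """Compute the four regression metrics named by accuracy_metric."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mse": ss_res / n,
        "mean_ae": sum(abs_errors) / n,
        "median_ae": median(abs_errors),
    }


scores = regression_scores([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 5.0])
print(scores)
```

A single large error (the last prediction) raises ‘mse’ and ‘mean_ae’ and lowers ‘r2’, while ‘median_ae’ stays at zero because most predictions are exact.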
Forest Inferencing¶

class
cuml.
ForestInference
(*, handle=None, output_type=None, verbose=False)¶ ForestInference provides GPU-accelerated inference (prediction) for random forest and boosted decision tree models.
This module does not support training models. Rather, users should train a model in another package and save it in a treelite-compatible format. (See https://github.com/dmlc/treelite) Currently, LightGBM, XGBoost and SKLearn GBDT and random forest models are supported.
Users typically create a ForestInference object by loading a saved model file with ForestInference.load. It is also possible to create it from an SKLearn model using ForestInference.load_from_sklearn. The resulting object provides a predict method for carrying out inference.
Known limitations:
A single row of data should fit into the shared memory of a thread block, which means that more than 12288 features are not supported.
From sklearn.ensemble, only {RandomForest,GradientBoosting,ExtraTrees}{Classifier,Regressor} models are supported. Other sklearn.ensemble models are currently not supported.
Importing large SKLearn models can be slow, as it is done in Python.
LightGBM categorical features are not supported.
Inference uses a dense matrix format, which is efficient for many problems but can be suboptimal for sparse datasets.
Only classification and regression are supported.
Many other random forest implementations, including LightGBM and SKLearn GBDTs, make use of 64-bit floating point parameters, but the underlying library for ForestInference uses only 32-bit parameters. Because of the truncation that will occur when loading such models into ForestInference, you may observe a slight degradation in accuracy.
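Two of the limitations above come down to simple arithmetic, sketched here in plain Python. The helper to_float32 is hypothetical and only mimics storing a 64-bit value in 32 bits. Reading the 12288-feature limit as one row of 4-byte floats filling 48 KiB of shared memory is an assumption, not something the list above states.

```python
import struct


def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


# A split threshold stored in 64 bits changes slightly when kept in 32 bits.
threshold_64 = 0.1
threshold_32 = to_float32(threshold_64)
print(threshold_64 == threshold_32)        # False
print(abs(threshold_64 - threshold_32))    # small but non-zero difference

# Values that are exactly representable in 32 bits are unaffected.
print(to_float32(0.5) == 0.5)              # True

# Assumption: one row of float32 features must fit in shared memory.
BYTES_PER_FLOAT32 = 4
MAX_FEATURES = 12288
print(MAX_FEATURES * BYTES_PER_FLOAT32 / 1024)  # 48.0 (KiB)
```

The difference per parameter is tiny, which is consistent with the "slight degradation" wording: only samples that fall very close to a split threshold can be routed differently.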
 Parameters
 handle: cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 verbose: int or boolean (default = False)
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type: {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’} (default = None)
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
Notes
For additional usage examples, see the sample notebook at https://github.com/rapidsai/cuml/blob/branch0.15/notebooks/forest_inference_demo.ipynb
Examples
In the example below, synthetic data is copied to the device (GPU) before inference. ForestInference can also accept a NumPy array directly, at the cost of a slight performance overhead.
# Assume that the file 'xgb.model' contains a classifier model that was
# previously saved by XGBoost's save_model function.
import numpy as np
import sklearn.datasets
import sklearn.metrics
from numba import cuda
from cuml import ForestInference

model_path = 'xgb.model'
X_test, y_test = sklearn.datasets.make_classification()
# Copy the test data to the GPU as a contiguous float32 array
X_gpu = cuda.to_device(np.ascontiguousarray(X_test.astype(np.float32)))

fm = ForestInference.load(model_path, output_class=True)
fil_preds_gpu = fm.predict(X_gpu)
accuracy_score = sklearn.metrics.accuracy_score(y_test, np.asarray(fil_preds_gpu))
Methods
load
(filename[, output_class, threshold, …]) Returns a FIL instance containing the forest saved in filename. This uses Treelite to load the saved model.
load_from_sklearn
(skl_model[, output_class, …]) Creates a FIL model using the scikit-learn model passed to the function.
load_from_treelite_model
(self, model[, …]) Creates a FIL model using the treelite model passed to the function.
load_using_treelite_handle
(self, model_handle) Returns a FIL instance by converting a treelite model to a FIL model using the treelite ModelHandle passed.
predict
(self, X[, preds]) Predicts the labels for X with the loaded forest model.
predict_proba
(self, X[, preds]) Predicts the class probabilities for X with the loaded forest model.

static
load
(filename, output_class=False, threshold=0.5, algo='auto', storage_type='auto', blocks_per_sm=0, model_type='xgboost', handle=None)[source]¶ Returns a FIL instance containing the forest saved in filename. This uses Treelite to load the saved model.
 Parameters
 filename: string
Path to saved model file in a treelite-compatible format (See https://treelite.readthedocs.io/en/latest/treeliteapi.html for more information)
 output_class: boolean (default=False)
For a classification model, output_class must be True. For a regression model, output_class must be False.
 threshold: float (default=0.5)
Cutoff value above which a prediction is set to 1.0. Only used if the model is a classification model and output_class is True.
 algo: string (default=’auto’)
Which inference algorithm to use. See documentation in FIL.load_from_treelite_model.
 storage_type: string (default=’auto’)
In-memory storage format to be used for the FIL model. See documentation in FIL.load_from_treelite_model.
 blocks_per_sm: integer (default=0)
(experimental) Indicates how the number of thread blocks to launch for the inference kernel is determined.
0 (default): Launches a number of blocks proportional to the number of data rows.
>= 1: Attempts to launch blocks_per_sm blocks per SM. This will fail if blocks_per_sm blocks result in more threads than the maximum supported number of threads per GPU. Even if successful, it is not guaranteed that blocks_per_sm blocks will run on an SM concurrently.
 model_type: string (default=”xgboost”)
Format of the saved treelite model to be loaded. It can be ‘xgboost’, ‘xgboost_json’, or ‘lightgbm’.
 Returns
 fil_model
A Forest Inference model which can be used to perform inferencing on the model read from the file.
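The effect of output_class and threshold described above can be sketched as follows. This is a minimal illustration, not the FIL kernel: apply_threshold is a hypothetical helper, and mapping a prediction exactly equal to the threshold to 0.0 is an assumption based on the "above which" wording.

```python
def apply_threshold(raw_predictions, threshold=0.5):
    """Turn raw model outputs into 0.0/1.0 class labels."""
    # Only values strictly above the cutoff are set to 1.0.
    return [1.0 if p > threshold else 0.0 for p in raw_predictions]


raw = [0.2, 0.5, 0.9]
print(apply_threshold(raw))                 # [0.0, 0.0, 1.0]
print(apply_threshold(raw, threshold=0.1))  # [1.0, 1.0, 1.0]
```

With output_class=False no such step is applied and the raw floating point outputs are returned unchanged.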

static
load_from_sklearn
(skl_model, output_class=False, threshold=0.5, algo='auto', storage_type='auto', blocks_per_sm=0, handle=None)[source]¶ Creates a FIL model using the scikit-learn model passed to the function. This function requires Treelite 1.0.0+ to be installed.
 Parameters
 skl_model
The scikit-learn model from which to build the FIL version.
 output_class: boolean (default=False)
For a classification model, output_class must be True. For a regression model, output_class must be False.
 algo: string (default=’auto’)
Name of the algo (from the algo_t enum):
'AUTO' or 'auto': Choose the algorithm automatically. Currently ‘BATCH_TREE_REORG’ is used for dense storage, and ‘NAIVE’ for sparse storage
'NAIVE' or 'naive': Simple inference using shared memory
'TREE_REORG' or 'tree_reorg': Similar to naive but trees rearranged to be more coalescing-friendly
'BATCH_TREE_REORG' or 'batch_tree_reorg': Similar to TREE_REORG but predicting multiple rows per thread block
 threshold: float (default=0.5)
Threshold used for classification. It is applied only if output_class == True, else it is ignored.
 storage_type: string or boolean (default=’auto’)
In-memory storage format to be used for the FIL model:
'auto': Choose the storage type automatically (currently DENSE is always used)
False: Create a dense forest
True: Create a sparse forest. Requires algo=’NAIVE’ or algo=’AUTO’
 blocks_per_sm: integer (default=0)
(experimental) Indicates how the number of thread blocks to launch for the inference kernel is determined.
0 (default): Launches a number of blocks proportional to the number of data rows.
>= 1: Attempts to launch blocks_per_sm blocks per SM. This will fail if blocks_per_sm blocks result in more threads than the maximum supported number of threads per GPU. Even if successful, it is not guaranteed that blocks_per_sm blocks will run on an SM concurrently.
 Returns
 fil_model
A Forest Inference model created from the scikit-learn model passed.

load_from_treelite_model
(self, model, output_class=False, algo='auto', threshold=0.5, storage_type='auto', blocks_per_sm=0)[source]¶ Creates a FIL model using the treelite model passed to the function.
 Parameters
 model
the trained model information in the treelite format loaded from a saved model using the treelite API https://treelite.readthedocs.io/en/latest/treeliteapi.html
 output_class: boolean (default=False)
For a classification model, output_class must be True. For a regression model, output_class must be False.
 algo: string (default=’auto’)
Name of the algo (from the algo_t enum):
'AUTO' or 'auto': Choose the algorithm automatically. Currently ‘BATCH_TREE_REORG’ is used for dense storage, and ‘NAIVE’ for sparse storage
'NAIVE' or 'naive': Simple inference using shared memory
'TREE_REORG' or 'tree_reorg': Similar to naive but trees rearranged to be more coalescing-friendly
'BATCH_TREE_REORG' or 'batch_tree_reorg': Similar to TREE_REORG but predicting multiple rows per thread block
 threshold: float (default=0.5)
Threshold used for classification. It is applied only if output_class == True, else it is ignored.
 storage_type: string or boolean (default=’auto’)
In-memory storage format to be used for the FIL model:
'auto': Choose the storage type automatically (currently DENSE is always used)
False: Create a dense forest
True: Create a sparse forest. Requires algo=’NAIVE’ or algo=’AUTO’
'sparse8': (experimental) Create a sparse forest with 8-byte nodes. Requires algo=’NAIVE’ or algo=’AUTO’. Can fail if 8-byte nodes are not enough to store the forest, e.g. if there are too many nodes in a tree or too many features
 blocks_per_sm: integer (default=0)
(experimental) Indicates how the number of thread blocks to launch for the inference kernel is determined.
0 (default): Launches a number of blocks proportional to the number of data rows.
>= 1: Attempts to launch blocks_per_sm blocks per SM. This will fail if blocks_per_sm blocks result in more threads than the maximum supported number of threads per GPU. Even if successful, it is not guaranteed that blocks_per_sm blocks will run on an SM concurrently.
 Returns
 fil_model
A Forest Inference model which can be used to perform inferencing on the random forest / XGBoost model.
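The interaction between algo and storage_type described in the parameter list above can be summarized in a small sketch. The function resolve_algo is hypothetical and only encodes the documented rules (sparse storage requires ‘naive’ or ‘auto’; ‘auto’ picks ‘batch_tree_reorg’ for dense and ‘naive’ for sparse storage; storage_type ‘auto’ currently means dense). How FIL actually validates these arguments is not shown in this reference.

```python
VALID_ALGOS = {"auto", "naive", "tree_reorg", "batch_tree_reorg"}


def resolve_algo(algo="auto", storage_type="auto"):
    """Return the inference algorithm implied by the documented rules."""
    algo = algo.lower()  # upper- and lower-case names are both accepted
    if algo not in VALID_ALGOS:
        raise ValueError(f"unknown algo: {algo!r}")

    # True and 'sparse8' select sparse storage; False and 'auto' select dense.
    is_sparse = storage_type is True or storage_type == "sparse8"
    if is_sparse:
        if algo not in ("auto", "naive"):
            raise ValueError("sparse storage requires algo='naive' or 'auto'")
        return "naive"
    return "batch_tree_reorg" if algo == "auto" else algo


print(resolve_algo("AUTO", False))      # batch_tree_reorg
print(resolve_algo("auto", True))       # naive
print(resolve_algo("tree_reorg"))       # tree_reorg
```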

load_using_treelite_handle
(self, model_handle, output_class=False, algo='auto', storage_type='auto', threshold=0.5, blocks_per_sm=0)[source]¶ Returns a FIL instance by converting a treelite model to FIL model by using the treelite ModelHandle passed.
 Parameters
 model_handle: ModelHandle to the treelite forest model
(See https://treelite.readthedocs.io/en/latest/treeliteapi.html for more information)
 output_class: boolean (default=False)
For a classification model, output_class must be True. For a regression model, output_class must be False.
 threshold: float (default=0.5)
Cutoff value above which a prediction is set to 1.0. Only used if the model is a classification model and output_class is True.
 algo: string (default=’auto’)
Which inference algorithm to use. See documentation in FIL.load_from_treelite_model.
 storage_type: string (default=’auto’)
In-memory storage format to be used for the FIL model. See documentation in FIL.load_from_treelite_model.
 blocks_per_sm: integer (default=0)
(experimental) Indicates how the number of thread blocks to launch for the inference kernel is determined.
0 (default): Launches a number of blocks proportional to the number of data rows.
>= 1: Attempts to launch blocks_per_sm blocks per SM. This will fail if blocks_per_sm blocks result in more threads than the maximum supported number of threads per GPU. Even if successful, it is not guaranteed that blocks_per_sm blocks will run on an SM concurrently.
 Returns
 fil_model
A Forest Inference model which can be used to perform inferencing on the random forest model.

predict
(self, X, preds=None) → CumlArray[source]¶ Predicts the labels for X with the loaded forest model. By default, the result is the raw floating point output from the model, unless output_class was set to True during model loading. See the documentation of ForestInference.load for details.
 Parameters
 X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix (floats) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, CUDA array interface compliant arrays like CuPy. For optimal performance, pass a device array with C-style layout.
 preds: gpuarray or cudf.Series, shape = (n_samples,)
Optional ‘out’ location to store inference results
 Returns
 GPU array of length n_samples with inference results (or ‘preds’ filled with inference results if preds was specified)

predict_proba
(self, X, preds=None) → CumlArray[source]¶ Predicts the class probabilities for X with the loaded forest model. The result is the raw floating point output from the model.
 Parameters
 X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix (floats) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, CUDA array interface compliant arrays like CuPy. For optimal performance, pass a device array with C-style layout.
 preds: gpuarray or cudf.Series, shape = (n_samples,2)
Binary probability output. Optional ‘out’ location to store inference results.
 Returns
 GPU array of shape (n_samples,2) with inference results (or ‘preds’ filled with inference results if preds was specified)
Coordinate Descent¶

class
cuml.
CD
(*, loss='squared_loss', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, normalize=False, max_iter=1000, tol=0.001, shuffle=True, handle=None, output_type=None, verbose=False)¶ Coordinate Descent (CD) is a very common optimization algorithm that minimizes along coordinate directions to find the minimum of a function.
cuML’s CD algorithm accepts a NumPy matrix or a cuDF DataFrame as the input dataset. The CD algorithm currently works with linear regression and ridge, lasso, and elastic-net penalties.
 Parameters
 loss: ‘squared_loss’ (only ‘squared_loss’ is supported right now)
‘squared_loss’ uses linear regression
 alpha: float (default = 0.0001)
The constant value which decides the degree of regularization. ‘alpha = 0’ is equivalent to ordinary least squares, solved by the LinearRegression object.
 l1_ratio: float (default = 0.15)
The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
 fit_intercept: boolean (default = True)
If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.
 max_iter: int (default = 1000)
The number of times the model should iterate through the entire dataset during training.
 tol: float (default = 1e-3)
The tolerance for the optimization: if the updates are smaller than tol, the solver stops.
 shuffle: boolean (default = True)
If set to True, a random coefficient is updated every iteration rather than looping over features sequentially. Setting this to True often leads to significantly faster convergence, especially when tol is higher than 1e-4.
 handle: cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 verbose: int or boolean (default = False)
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type: {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’} (default = None)
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
Examples
import numpy as np
import cudf
from cuml.solvers import CD as cumlCD

cd = cumlCD(alpha=0.0)

X = cudf.DataFrame()
X['col1'] = np.array([1, 1, 2, 2], dtype=np.float32)
X['col2'] = np.array([1, 2, 2, 3], dtype=np.float32)
y = cudf.Series(np.array([6.0, 8.0, 9.0, 11.0], dtype=np.float32))

reg = cd.fit(X, y)
print("Coefficients:")
print(reg.coef_)
print("Intercept:")
print(reg.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3, 2], dtype=np.float32)
X_new['col2'] = np.array([5, 5], dtype=np.float32)
preds = cd.predict(X_new)
print("Preds:")
print(preds)
Output:
Coefficients:
0 1.0019531
1 1.9980469
Intercept:
3.0
Preds:
0 15.997
1 14.995
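To make the roles of alpha, l1_ratio, max_iter and tol more concrete, here is a minimal pure-Python sketch of cyclic coordinate descent with an elastic-net penalty. It is not the cuML solver: the function names are hypothetical, and the objective scaling (mean squared loss divided by two, plus alpha * l1_ratio * L1 and 0.5 * alpha * (1 - l1_ratio) * L2) is an assumption borrowed from the common elastic-net formulation. It also omits the intercept and the shuffle option.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def coordinate_descent(X, y, alpha=1e-4, l1_ratio=0.15, max_iter=1000, tol=1e-3):
    """Fit linear coefficients by updating one coordinate at a time."""
    n_samples, n_features = len(X), len(X[0])
    coef = [0.0] * n_features
    residual = list(y)                      # y - X @ coef, with coef = 0
    l1 = alpha * l1_ratio
    l2 = alpha * (1.0 - l1_ratio)
    for _ in range(max_iter):
        max_update = 0.0
        for j in range(n_features):
            column = [row[j] for row in X]
            z = sum(v * v for v in column) / n_samples
            if z == 0.0:
                continue                    # all-zero feature: nothing to fit
            # Correlation of feature j with the residual that ignores
            # feature j's own current contribution.
            rho = sum(v * (r + coef[j] * v)
                      for v, r in zip(column, residual)) / n_samples
            new_coef = soft_threshold(rho, l1) / (z + l2)
            delta = new_coef - coef[j]
            if delta != 0.0:
                residual = [r - delta * v for r, v in zip(residual, column)]
                coef[j] = new_coef
            max_update = max(max_update, abs(delta))
        if max_update < tol:                # updates smaller than tol: stop
            break
    return coef


X = [[1.0, 1.0], [1.0, 2.0], [2.0, 2.0], [2.0, 3.0]]
y = [3.0, 5.0, 6.0, 8.0]                    # y = 1 * col1 + 2 * col2, no intercept
print(coordinate_descent(X, y, alpha=0.0, max_iter=5000, tol=1e-12))
print(coordinate_descent(X, y, alpha=100.0, l1_ratio=1.0))
```

With alpha=0.0 the sketch approaches the least-squares coefficients (close to 1 and 2), and with a very large pure-L1 penalty every coefficient is shrunk to exactly zero.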
 Attributes
 coef_
Python descriptor object to control getting/setting CumlArray attributes on Base objects. See the Estimator Guide for an in-depth guide.
Methods
fit
(self, X, y[, convert_dtype]) Fit the model with X and y.
get_param_names
(self)
predict
(self, X[, convert_dtype]) Predicts the y for X.

fit
(self, X, y, convert_dtype=False) → u’CD’[source]¶ Fit the model with X and y.
 Parameters
 X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y: array-like (device or host) shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = False)
When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

predict
(self, X, convert_dtype=False) → CumlArray[source]¶ Predicts the y for X.
 Parameters
 X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = False)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 Returns
 preds: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Predicted values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
Quasi-Newton¶

class
cuml.
QN
(*, loss='sigmoid', fit_intercept=True, l1_strength=0.0, l2_strength=0.0, max_iter=1000, tol=0.001, linesearch_max_iter=50, lbfgs_memory=5, verbose=False, handle=None, output_type=None)¶ Quasi-Newton methods are used to find either zeroes or local maxima and minima of functions. This class uses them to optimize a cost function.
Two algorithms are implemented underneath cuML’s QN class, and which one is executed depends on the following rule:
Orthant-Wise Limited Memory Quasi-Newton (OWL-QN) if there is L1 regularization
Limited Memory BFGS (L-BFGS) otherwise.
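The selection rule above can be sketched in a few lines. This is only an illustration: select_qn_algorithm is a hypothetical helper, not part of cuML, and it assumes the choice depends solely on whether l1_strength is nonzero.

```python
def select_qn_algorithm(l1_strength: float) -> str:
    """Return the solver name implied by the rule above (hypothetical helper)."""
    # A nonzero L1 penalty selects OWL-QN; otherwise L-BFGS is used.
    return "OWL-QN" if l1_strength != 0.0 else "L-BFGS"


for l1 in (0.0, 0.5):
    print(f"l1_strength={l1}: {select_qn_algorithm(l1)}")
```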
cuML’s QN class can take array-like objects, either on the host as NumPy arrays or on the device (as Numba or __cuda_array_interface__-compliant objects).
 Parameters
 loss: ‘sigmoid’, ‘softmax’, ‘squared_loss’ (default = ‘sigmoid’)
‘sigmoid’ loss is used for single-class logistic regression, ‘softmax’ loss is used for multiclass logistic regression, and ‘squared_loss’ is used for normal/squared loss.
 fit_intercept: boolean (default = True)
If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.
 l1_strength: float (default = 0.0)
L1 regularization strength (if nonzero, OWL-QN is run, otherwise L-BFGS). Note that, as in scikit-learn, the bias will not be regularized.
 l2_strength: float (default = 0.0)
L2 regularization strength. Note that, as in scikit-learn, the bias will not be regularized.
 max_iter: int (default = 1000)
Maximum number of iterations taken for the solvers to converge.
 tol: float (default = 1e-3)
The training process will stop if current_loss > previous_loss - tol.
 linesearch_max_iter: int (default = 50)
Max number of linesearch iterations per outer iteration of the algorithm.
 lbfgs_memory: int (default = 5)
Rank of the L-BFGS inverse-Hessian approximation. The method will use O(lbfgs_memory * D) memory.
 handle: cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 verbose: int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type: {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
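As a rough illustration of the tol stopping rule, the sketch below reads the criterion as current_loss > previous_loss - tol. should_stop is a hypothetical helper and the loss values are made up. cuML's solver may apply further checks (for example max_iter).

```python
def should_stop(previous_loss: float, current_loss: float, tol: float = 1e-3) -> bool:
    """True once the loss has improved by less than tol (or has increased)."""
    return current_loss > previous_loss - tol


# Made-up loss values for three consecutive iterations.
losses = [1.0, 0.5, 0.3, 0.2995]
for prev, cur in zip(losses, losses[1:]):
    print(prev, cur, should_stop(prev, cur))
```

Only the last pair triggers the stop, because the improvement (0.0005) is smaller than tol.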
Notes
This class contains implementations of two popular Quasi-Newton methods:
Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) [Nocedal, Wright - Numerical Optimization (1999)]
Orthant-wise limited-memory quasi-Newton (OWL-QN) [Andrew, Gao - ICML 2007] <https://www.microsoft.com/en-us/research/publication/scalable-training-of-l1-regularized-log-linear-models/>
Examples
import cudf
import numpy as np

# Both import methods supported
# from cuml import QN
from cuml.solvers import QN

X = cudf.DataFrame()
X['col1'] = np.array([1, 1, 2, 2], dtype=np.float32)
X['col2'] = np.array([1, 2, 2, 3], dtype=np.float32)
y = cudf.Series(np.array([0.0, 0.0, 1.0, 1.0], dtype=np.float32))

solver = QN()
solver.fit(X, y)

# Note: for now, the coefficients also include the intercept in the
# last position if fit_intercept=True
print("Coefficients:")
print(solver.coef_)
print("Intercept:")
print(solver.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([1, 5], dtype=np.float32)
X_new['col2'] = np.array([2, 5], dtype=np.float32)

preds = solver.predict(X_new)
print("Predictions:")
print(preds)
Output:
Coefficients:
10.647417
0.3267412
-17.158297
Intercept:
-17.158297
Predictions:
0    0.0
1    1.0
 Attributes
 coef_: array, shape (n_classes, n_features)
 intercept_: array (n_classes, 1)
The independent term. If fit_intercept is False, will be 0.
Methods
fit(self, X, y[, sample_weight, convert_dtype])
Fit the model with X and y.
get_param_names(self)
predict(self, X[, convert_dtype])
Predicts the y for X.
score(self, X, y)
property
coef_
¶

fit
(self, X, y, sample_weight=None, convert_dtype=False) → u’QN’[source]¶ Fit the model with X and y.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y: array-like (device or host), shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 sample_weight: array-like (device or host), shape = (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = False)
When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

predict
(self, X, convert_dtype=False) → CumlArray[source]¶ Predicts the y for X.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = False)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 Returns
 preds: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Predicted values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
Support Vector Machines¶

class
cuml.svm.
SVC
(C-Support Vector Classification)¶ Construct an SVC classifier for training and predictions.
 Parameters
 handle: cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 C: float (default = 1.0)
Penalty parameter C.
 kernel: string (default = ‘rbf’)
Specifies the kernel function. Possible options: ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’. Precomputed kernels are currently not supported.
 degree: int (default = 3)
Degree of the polynomial kernel function.
 gamma: float or string (default = ‘scale’)
Coefficient for the rbf, poly, and sigmoid kernels. You can specify a numeric value, or use one of the following options:
‘auto’: gamma will be set to 1 / n_features
‘scale’: gamma will be set to 1 / (n_features * X.var())
 coef0: float (default = 0.0)
Independent term in the kernel function, only significant for poly and sigmoid.
 tol: float (default = 1e-3)
Tolerance for the stopping criterion.
 cache_size: float (default = 1024.0)
Size of the kernel cache during training in MiB. Increase it to improve the training time, at the cost of a higher memory footprint. After training, the kernel cache is deallocated. During prediction, we also need temporary space to store kernel matrix elements (this can be significant if n_support is large). The cache_size variable sets an upper limit on the prediction buffer as well.
 class_weight: dict or string (default = None)
Weights to modify the parameter C for class i to class_weight[i]*C. The string ‘balanced’ is also accepted, in which case class_weight[i] = n_samples / (n_classes * n_samples_of_class[i])
 max_iter: int (default = 100*n_samples)
Limit the number of outer iterations in the solver.
 multiclass_strategy: str (‘ovo’ or ‘ovr’, default ‘ovo’)
Multiclass classification strategy. 'ovo' uses OneVsOneClassifier while 'ovr' selects OneVsRestClassifier.
 nochange_steps: int (default = 1000)
We monitor how much our stopping criterion changes during outer iterations. If it does not change (changes less than 1e-3*tol) for nochange_steps consecutive steps, then we stop training.
 output_type: {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
 probability: bool (default = False)
Enable or disable probability estimates.
 random_state: int (default = None)
Seed for random number generator (used only when probability = True). Currently this argument is not used and a warning will be printed if the user provides it.
 verbose: int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
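The two formulas in the parameter list (the string options for gamma and the ‘balanced’ option for class_weight) can be worked through by hand. The sketch below uses plain Python lists. resolve_gamma and balanced_class_weight are hypothetical helpers, and X.var() is assumed to mean the population variance over all elements of X, as NumPy's ndarray.var() computes by default.

```python
from collections import Counter
from statistics import pvariance


def resolve_gamma(gamma, X):
    """Turn 'auto' or 'scale' into a number; X is a list of equal-length rows."""
    n_features = len(X[0])
    if gamma == "auto":
        return 1.0 / n_features
    if gamma == "scale":
        # Assumption: variance over every element of X (population variance).
        flat = [value for row in X for value in row]
        return 1.0 / (n_features * pvariance(flat))
    return float(gamma)


def balanced_class_weight(y):
    """class_weight[i] = n_samples / (n_classes * n_samples_of_class[i])."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


X = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
y = [0, 0, 0, 1]
print(resolve_gamma("auto", X))
print(resolve_gamma("scale", X))
print(balanced_class_weight(y))
```

With three samples of class 0 and one of class 1, the rarer class receives the larger weight, so its effective penalty class_weight[i]*C is higher.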
Notes
The solver uses the SMO method to fit the classifier. We use the Optimized Hierarchical Decomposition [1] variant of the SMO algorithm, similar to [2].
For additional docs, see scikit-learn’s SVC.
References
 1
J. Vanek et al. A GPU-Architecture Optimized Hierarchical Decomposition Algorithm for Support Vector Machine Training, IEEE Transactions on Parallel and Distributed Systems, vol 28, no 12, 3330, (2017)
 2
Examples
import numpy as np
from cuml.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [2, 2], [1, 3], [2, 3]], dtype=np.float32)
y = np.array([-1, -1, 1, -1, 1, 1], dtype=np.float32)

clf = SVC(kernel='poly', degree=2, gamma='auto', C=1)
clf.fit(X, y)

print("Predicted labels:", clf.predict(X))
Output:
Predicted labels: [-1. -1.  1. -1.  1.  1.]
 Attributes
 n_support_: int
The total number of support vectors. Note: this will change in the future to represent the number of support vectors for each class (as in scikit-learn, see https://github.com/rapidsai/cuml/issues/956 )
 support_: int, shape = (n_support)
Device array of support vector indices
 support_vectors_: float, shape (n_support, n_cols)
Device array of support vectors
 dual_coef_: float, shape = (1, n_support)
Device array of coefficients for support vectors
 intercept_: float
 fit_status_: int
0 if SVM is correctly fitted
 coef_: float, shape (1, n_cols)
 classes_: shape (n_classes_,)
Array of class labels
 n_classes_: int
Number of classes
Methods
decision_function(self, X)
Calculates the decision function values for X.
fit(self, X, y[, sample_weight, convert_dtype])
Fit the model with X and y.
get_param_names(self)
predict(self, X[, convert_dtype])
Predicts the class labels for X. The returned y values are the class labels associated with sign(decision_function(X)).
predict_log_proba(self, X)
Predicts the log probabilities for X (returns log(predict_proba(x))).
predict_proba(self, X[, log])
Predicts the class probabilities for X.

property
classes_
¶

decision_function
(self, X) → CumlArray[source]¶ Calculates the decision function values for X.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 Returns
 results: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Decision function values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

fit
(self, X, y, sample_weight=None, convert_dtype=True) → u’SVC’[source]¶ Fit the model with X and y.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y: array-like (device or host), shape = (n_samples, 1)
Dense matrix of any dtype. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 sample_weight: array-like (device or host), shape = (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = True)
When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

property
intercept_
¶

predict
(self, X, convert_dtype=True) → CumlArray[source]¶ Predicts the class labels for X. The returned y values are the class labels associated to sign(decision_function(X)).
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = True)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 Returns
 preds: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Predicted values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
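For the binary case, the relationship between decision_function and predict described above can be sketched as follows. labels_from_decision is a hypothetical helper, not cuML's implementation, and the handling of a decision value of exactly zero is an assumption made here for illustration.

```python
def labels_from_decision(values, classes):
    """Map binary decision values to labels; classes = (negative, positive)."""
    negative, positive = classes
    # Assumption: a value of exactly 0 is assigned to the negative class.
    return [positive if value > 0 else negative for value in values]


print(labels_from_decision([-1.3, 0.2, 2.5], classes=(-1, 1)))
```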

predict_log_proba
(self, X) → CumlArray[source]¶ Predicts the log probabilities for X (returns log(predict_proba(x))).
The model has to be trained with probability=True to use this method.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 Returns
 preds: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)
Log of predicted probabilities
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
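The relationship log(predict_proba(x)) is an element-wise natural logarithm of the probability matrix. A minimal sketch on made-up probabilities follows. log_proba is a hypothetical helper that operates on plain lists rather than on cuML output.

```python
import math


def log_proba(proba_rows):
    """Element-wise natural log of an (n_samples, n_classes) probability table."""
    return [[math.log(p) for p in row] for row in proba_rows]


# Made-up probabilities for two samples and two classes.
print(log_proba([[0.5, 0.5], [0.25, 0.75]]))
```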

predict_proba
(self, X, log=False) → CumlArray[source]¶ Predicts the class probabilities for X.
The model has to be trained with probability=True to use this method.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 log: boolean (default = False)
Whether to return log probabilities.
 Returns
 preds: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)
Predicted probabilities
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

class
cuml.svm.
SVR
(Epsilon Support Vector Regression)¶ Construct an SVR regressor for training and predictions.
 Parameters
 handle: cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 C: float (default = 1.0)
Penalty parameter C.
 kernel: string (default = ‘rbf’)
Specifies the kernel function. Possible options: ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’. Precomputed kernels are currently not supported.
 degree: int (default = 3)
Degree of the polynomial kernel function.
 gamma: float or string (default = ‘scale’)
Coefficient for the rbf, poly, and sigmoid kernels. You can specify a numeric value, or use one of the following options:
‘auto’: gamma will be set to 1 / n_features
‘scale’: gamma will be set to 1 / (n_features * X.var())
 coef0: float (default = 0.0)
Independent term in the kernel function, only significant for poly and sigmoid.
 tol: float (default = 1e-3)
Tolerance for the stopping criterion.
 epsilon: float (default = 0.1)
Epsilon parameter of the epsilon-SVR model. There is no penalty associated with points that are predicted within the epsilon-tube around the target values.
 cache_size: float (default = 1024.0)
Size of the kernel cache during training in MiB. Increase it to improve the training time, at the cost of a higher memory footprint. After training, the kernel cache is deallocated. During prediction, we also need temporary space to store kernel matrix elements (this can be significant if n_support is large). The cache_size variable sets an upper limit on the prediction buffer as well.
 max_iter: int (default = 100*n_samples)
Limit the number of outer iterations in the solver.
 nochange_steps: int (default = 1000)
We monitor how much our stopping criterion changes during outer iterations. If it does not change (changes less than 1e-3*tol) for nochange_steps consecutive steps, then we stop training.
 verbose: int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type: {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
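The epsilon parameter is commonly described through the epsilon-insensitive loss, in which errors smaller than epsilon cost nothing. The sketch below illustrates that textbook definition on made-up values. epsilon_insensitive_loss is a hypothetical helper and says nothing about how cuML's solver is implemented.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: zero inside the epsilon-tube, linear outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


# The first and third predictions fall inside the tube, the second does not.
print(epsilon_insensitive_loss([1.0, 4.0, 5.0], [1.05, 4.5, 5.0]))
```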
Notes
For additional docs, see scikit-learn’s SVR.
The solver uses the SMO method to fit the regressor. We use the Optimized Hierarchical Decomposition [1] variant of the SMO algorithm, similar to [2].
References
 1
J. Vanek et al. A GPU-Architecture Optimized Hierarchical Decomposition Algorithm for Support Vector Machine Training, IEEE Transactions on Parallel and Distributed Systems, vol 28, no 12, 3330, (2017)
 2
Examples
import numpy as np
from cuml.svm import SVR

X = np.array([[1], [2], [3], [4], [5]], dtype=np.float32)
y = np.array([1.1, 4, 5, 3.9, 1.], dtype=np.float32)

reg = SVR(kernel='rbf', gamma='scale', C=10, epsilon=0.1)
reg.fit(X, y)

print("Predicted values:", reg.predict(X))
Output:
Predicted values: [1.200474 3.8999617 5.100488 3.7995374 1.0995375]
 Attributes
 n_support_: int
The total number of support vectors. Note: this will change in the future to represent the number of support vectors for each class (as in scikit-learn, see Issue #956)
 support_: int, shape = [n_support]
Device array of support vector indices
 support_vectors_: float, shape [n_support, n_cols]
Device array of support vectors
 dual_coef_: float, shape = [1, n_support]
Device array of coefficients for support vectors
 intercept_: int
 fit_status_: int
0 if SVM is correctly fitted
 coef_: float, shape [1, n_cols]
Methods
fit(self, X, y[, sample_weight, convert_dtype])
Fit the model with X and y.
predict(self, X[, convert_dtype])
Predicts the values for X.

fit
(self, X, y, sample_weight=None, convert_dtype=True) → u’SVR’[source]¶ Fit the model with X and y.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y: array-like (device or host), shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 sample_weight: array-like (device or host), shape = (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = True)
When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

predict
(self, X, convert_dtype=True) → CumlArray[source]¶ Predicts the values for X.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = True)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 Returns
 preds: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Predicted values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
Nearest Neighbors Classification¶

class
cuml.neighbors.
KNeighborsClassifier
(*, weights='uniform', handle=None, verbose=False, output_type=None, **kwargs) K-Nearest Neighbors Classifier is an instance-based learning technique that keeps training samples around for prediction, rather than trying to learn a generalizable set of model parameters.
 Parameters
 n_neighbors: int (default = 5)
Default number of neighbors to query.
 algorithm: string (default = ‘brute’)
The query algorithm to use. Currently, only ‘brute’ is supported.
 metric: string (default = ‘euclidean’)
Distance metric to use.
 weights: string (default = ‘uniform’)
Sample weights to use. Currently, only the uniform strategy is supported.
 handle: cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 verbose: int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type: {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
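To make the ‘brute’ algorithm, ‘euclidean’ metric and ‘uniform’ weights concrete, here is a pure-Python sketch of brute-force k-nearest-neighbors classification by majority vote. knn_predict is a hypothetical helper written for illustration. It runs on the CPU with lists and is not how cuML implements the search.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, x, n_neighbors=5):
    """Label x by majority vote among its n_neighbors closest training samples."""
    # Brute force: compute the Euclidean distance to every stored sample.
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x))
    # Uniform weights: each of the nearest neighbors contributes one vote.
    votes = Counter(y_train[i] for i in order[:n_neighbors])
    return votes.most_common(1)[0][0]


X_train = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
y_train = [0, 0, 0, 1, 1, 1]
print(knn_predict(X_train, y_train, [0.3, 0.3], n_neighbors=3))
print(knn_predict(X_train, y_train, [4.8, 5.0], n_neighbors=3))
```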
Notes
For additional docs, see scikit-learn’s KNeighborsClassifier.
Examples
from cuml.neighbors import KNeighborsClassifier

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=100, centers=5, n_features=10)

knn = KNeighborsClassifier(n_neighbors=10)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80)

knn.fit(X_train, y_train)
knn.predict(X_test)
Output:
array([3, 1, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 1, 0, 0, 0, 2, 3, 3, 0, 3, 0, 0, 0, 0, 3, 2, 0, 0, 0], dtype=int32)
 Attributes
 classes_
Python descriptor object to control getting/setting CumlArray attributes on Base objects. See the Estimator Guide for an in-depth guide.
 y
Python descriptor object to control getting/setting CumlArray attributes on Base objects. See the Estimator Guide for an in-depth guide.
Methods
fit(self, X, y[, convert_dtype])
Fit a GPU index for k-nearest neighbors classifier model.
get_param_names(self)
predict(self, X[, convert_dtype])
Use the trained k-nearest neighbors classifier to predict the labels for X.
predict_proba(self, X[, convert_dtype])
Use the trained k-nearest neighbors classifier to predict the label probabilities for X.

fit
(self, X, y, convert_dtype=True) → u’KNeighborsClassifier’[source] Fit a GPU index for k-nearest neighbors classifier model.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y: array-like (device or host), shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = True)
When set to True, the method will automatically convert the inputs to np.float32.

get_param_names
(self)[source]

predict
(self, X, convert_dtype=True) → CumlArray[source] Use the trained k-nearest neighbors classifier to predict the labels for X
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = True)
When set to True, the method will automatically convert the inputs to np.float32.
 Returns
 X_new: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Labels predicted
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

predict_proba
(self, X, convert_dtype=True) → typing.Union[CumlArray, typing.Tuple][source] Use the trained k-nearest neighbors classifier to predict the label probabilities for X
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = True)
When set to True, the method will automatically convert the inputs to np.float32.
 Returns
 X_new: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Labels probabilities
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
Nearest Neighbors Regression¶

class
cuml.neighbors.
KNeighborsRegressor
(*, weights='uniform', handle=None, verbose=False, output_type=None, **kwargs) K-Nearest Neighbors Regressor is an instance-based learning technique that keeps training samples around for prediction, rather than trying to learn a generalizable set of model parameters.
The K-Nearest Neighbors Regressor will compute the average of the labels for the k closest neighbors and use it as the predicted value.
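The averaging step can be sketched in pure Python. knn_regress is a hypothetical helper that uses brute-force Euclidean distances on lists. It illustrates the idea only and is not cuML's GPU implementation.

```python
import math
from statistics import fmean


def knn_regress(X_train, y_train, x, n_neighbors=5):
    """Predict the mean target of the n_neighbors training samples closest to x."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x))
    return fmean([y_train[i] for i in order[:n_neighbors]])


X_train = [[1.0], [2.0], [3.0], [10.0]]
y_train = [1.0, 2.0, 3.0, 10.0]
# The three nearest neighbors of 2.1 are 2.0, 3.0 and 1.0, so the mean is 2.0.
print(knn_regress(X_train, y_train, [2.1], n_neighbors=3))
```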
 Parameters
 n_neighbors: int (default = 5)
Default number of neighbors to query.
 algorithm: string (default = ‘brute’)
The query algorithm to use. Currently, only ‘brute’ is supported.
 metric: string (default = ‘euclidean’)
Distance metric to use.
 weights: string (default = ‘uniform’)
Sample weights to use. Currently, only the uniform strategy is supported.
 handle: cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 verbose: int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type: {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
Notes
For additional docs, see scikit-learn’s KNeighborsRegressor.
Examples
from cuml.neighbors import KNeighborsRegressor

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=100, centers=5, n_features=10)

knn = KNeighborsRegressor(n_neighbors=10)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80)

knn.fit(X_train, y_train)
knn.predict(X_test)
Output:
array([3. , 1. , 1. , 3.79999995, 2. , 0. , 3.79999995, 3.79999995, 3.79999995, 0. , 3.79999995, 0. , 1. , 2. , 3. , 1. , 0. , 0. , 0. , 2. , 3. , 3. , 0. , 3. , 3.79999995, 3.79999995, 3.79999995, 3.79999995, 3. , 2. , 3.79999995, 3.79999995, 0. ])
 Attributes
 y
Python descriptor object to control getting/setting CumlArray attributes on Base objects. See the Estimator Guide for an in-depth guide.
Methods
fit(self, X, y[, convert_dtype])
Fit a GPU index for k-nearest neighbors regression model.
get_param_names(self)
predict(self, X[, convert_dtype])
Use the trained k-nearest neighbors regression model to predict the labels for X.

fit
(self, X, y, convert_dtype=True) → u’KNeighborsRegressor’[source] Fit a GPU index for k-nearest neighbors regression model.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y: array-like (device or host), shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = True)
When set to True, the method will automatically convert the inputs to np.float32.

get_param_names
(self)[source]

predict
(self, X, convert_dtype=True) → CumlArray[source] Use the trained k-nearest neighbors regression model to predict the labels for X
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype: bool, optional (default = True)
When set to True, the method will automatically convert the inputs to np.float32.
 Returns
 X_new: cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_features)
Predicted values
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
Clustering¶
KMeans Clustering¶

class
cuml.
KMeans
(*, handle=None, n_clusters=8, max_iter=300, tol=0.0001, verbose=False, random_state=1, init='scalable-k-means++', n_init=1, oversampling_factor=2.0, max_samples_per_batch=32768, output_type=None)¶ KMeans is a basic but powerful clustering method which is optimized via Expectation Maximization. It randomly selects K data points in X, and computes which samples are close to these points. For every cluster of points, a mean is computed (hence the name), and this becomes the new centroid.
cuML’s KMeans expects an array-like object or cuDF DataFrame, and supports the scalable KMeans++ initialization method. This method is more stable than randomly selecting K points.
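The assign-then-average loop described above can be sketched in pure Python. kmeans_step is a hypothetical helper. The initial centroids are simply two of the input points (not the scalable KMeans++ initialization), and the iteration count is fixed for brevity, so this illustrates the idea rather than cuML's implementation.

```python
import math


def kmeans_step(points, centroids):
    """One assignment and update step; returns (labels, new_centroids)."""
    # Assignment: each point joins the cluster of its nearest centroid.
    labels = [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]
    # Update: each centroid becomes the mean of the points assigned to it.
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(list(old))  # keep the centroid of an empty cluster
            continue
        new_centroids.append([sum(col) / len(members) for col in zip(*members)])
    return labels, new_centroids


points = [[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]]
centroids = [points[0], points[3]]  # simplistic initialization for illustration
for _ in range(10):
    labels, centroids = kmeans_step(points, centroids)

print(labels)
print(centroids)
```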
 Parameters
 handle: cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 n_clusters: int (default = 8)
The number of centroids or clusters you want.
 max_iter: int (default = 300)
The more iterations of EM, the more accurate, but slower.
 tol: float64 (default = 1e-4)
Stopping criterion when centroid means do not change much.
 verbose: int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 random_state: int (default = 1)
If you want results to be the same when you restart Python, select a state.
 init : ‘scalable-k-means++’, ‘k-means||’, ‘random’ or an ndarray (default = ‘scalable-k-means++’)
‘scalable-k-means++’ or ‘k-means||’: Uses fast and stable scalable kmeans++ initialization. ‘random’: Chooses ‘n_clusters’ observations (rows) at random from the data for the initial centroids. If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
 n_init : int (default = 1)
Number of times the k-means algorithm will be run with different seeds. The final result comes from the run that produces the lowest inertia out of the n_init runs.
 oversampling_factor : float64 (default = 2.0)
The amount of points to sample in scalable kmeans++ initialization for potential centroids. Increasing this value can lead to better initial centroids at the cost of memory. The total number of centroids sampled in scalable kmeans++ is oversampling_factor * n_clusters * 8.
 max_samples_per_batch : int (default = 32768)
The number of data samples to use for batches of the pairwise distance computation. This computation is done throughout both fit and predict. The default should suit most cases. The total number of elements in the batched pairwise distance computation is max_samples_per_batch * n_clusters. It might become necessary to lower this number when n_clusters becomes prohibitively large.
 output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
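The two sizing formulas quoted for oversampling_factor and max_samples_per_batch can be checked with simple arithmetic. The helper below is a hypothetical illustration, not part of cuML; the byte estimate assumes one float32 (4 bytes) per distance element, which the text above does not state.

```python
def kmeans_sizing(n_clusters=8, oversampling_factor=2.0,
                  max_samples_per_batch=32768, bytes_per_element=4):
    """Return (candidate centroids, batch elements, batch bytes)."""
    # Candidates sampled by scalable k-means++ initialization.
    n_candidates = int(oversampling_factor * n_clusters * 8)
    # Elements in one batch of the pairwise distance computation.
    batch_elements = max_samples_per_batch * n_clusters
    # Assumption: one float32 (4 bytes) per distance element.
    return n_candidates, batch_elements, batch_elements * bytes_per_element


print(kmeans_sizing())                 # (128, 262144, 1048576)
print(kmeans_sizing(n_clusters=4096))  # (65536, 134217728, 536870912)
```

With the defaults a batch holds about 1 MiB of distances; at n_clusters=4096 the same batch size needs about 512 MiB, which is the situation where lowering max_samples_per_batch may help.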
Notes
KMeans requires n_clusters to be specified. This means one needs to approximately guess or know how many clusters a dataset has. If one is not sure, one can start with a small number of clusters, and visualize the resulting clusters with PCA, UMAP or TSNE, and verify that they look appropriate.
Applications of KMeans
The biggest advantage of KMeans is its speed and simplicity. That is why KMeans is many practitioners’ first choice of a clustering algorithm. KMeans has been extensively used when the number of clusters is approximately known, such as in big data clustering tasks, image segmentation and medical clustering.
For additional docs, see scikit-learn’s KMeans.
Examples
# Both import methods supported
from cuml import KMeans
from cuml.cluster import KMeans

import cudf
import numpy as np
import pandas as pd

def np2cudf(df):
    # convert numpy array to cuDF dataframe
    df = pd.DataFrame({'fea%d'%i:df[:,i] for i in range(df.shape[1])})
    pdf = cudf.DataFrame()
    for c,column in enumerate(df):
        pdf[str(c)] = df[column]
    return pdf

a = np.asarray([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]],
               dtype=np.float32)
b = np2cudf(a)
print("input:")
print(b)

print("Calling fit")
kmeans_float = KMeans(n_clusters=2)
kmeans_float.fit(b)

print("labels:")
print(kmeans_float.labels_)
print("cluster_centers:")
print(kmeans_float.cluster_centers_)
Output:
input:

     0    1
0  1.0  1.0
1  1.0  2.0
2  3.0  2.0
3  4.0  3.0

Calling fit

labels:

0    0
1    0
2    1
3    1

cluster_centers:

     0    1
0  1.0  1.5
1  3.5  2.5
 Attributes
 cluster_centers_ : array
The coordinates of the final clusters. This represents the “mean” of each data cluster.
 labels_ : array
Which cluster each datapoint belongs to.
Methods
fit(self, X[, sample_weight]): Compute k-means clustering with X.
fit_predict(self, X[, sample_weight]): Compute cluster centers and predict cluster index for each sample.
fit_transform(self, X[, convert_dtype]): Compute clustering and transform X to cluster-distance space.
get_param_names(self)
predict(self, X[, convert_dtype, sample_weight]): Predict the closest cluster each sample in X belongs to.
score(self, X[, y, sample_weight, convert_dtype]): Opposite of the value of X on the K-means objective.
transform(self, X[, convert_dtype]): Transform X to a cluster-distance space.

fit(self, X, sample_weight=None) → KMeans
Compute k-means clustering with X.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 sample_weight : array-like (device or host) shape = (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

fit_predict(self, X, sample_weight=None) → CumlArray
Compute cluster centers and predict cluster index for each sample.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 sample_weight : array-like (device or host) shape = (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 Returns
 preds : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Cluster indexes
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

fit_transform(self, X, convert_dtype=False) → CumlArray
Compute clustering and transform X to cluster-distance space.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype : bool, optional (default = False)
When set to True, the fit_transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 Returns
 X_new : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_clusters)
Transformed data
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

predict(self, X, convert_dtype=False, sample_weight=None) → CumlArray
Predict the closest cluster each sample in X belongs to.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype : bool, optional (default = False)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 sample_weight : array-like (device or host) shape = (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 Returns
 preds : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Cluster indexes
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

score(self, X, y=None, sample_weight=None, convert_dtype=True)
Opposite of the value of X on the K-means objective.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y : array-like (device or host) shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 sample_weight : array-like (device or host) shape = (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype : bool, optional (default = True)
When set to True, the score method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 Returns
 score : float
Opposite of the value of X on the K-means objective.
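Read literally, "opposite of the value of X on the K-means objective" is the negated sum of squared distances from each sample to its nearest centroid (the negative inertia). The sketch below shows that reading in plain Python; the function name kmeans_score is hypothetical, and sample_weight is ignored for brevity.

```python
def kmeans_score(points, centroids):
    """Negative sum of squared distances to each point's nearest centroid."""
    inertia = sum(
        min(sum((p - c) ** 2 for p, c in zip(point, centroid))
            for centroid in centroids)
        for point in points
    )
    return -inertia


points = [[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]]
centers = [[1.0, 1.5], [3.5, 2.5]]  # cluster centers from the KMeans example
print(kmeans_score(points, centers))  # -1.5
```

A score closer to zero means the samples sit closer to their assigned centroids.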

transform(self, X, convert_dtype=False) → CumlArray
Transform X to a cluster-distance space.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype : bool, optional (default = False)
When set to True, the transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 Returns
 X_new : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_clusters)
Transformed data
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
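As a rough picture of what a "cluster-distance space" looks like, the sketch below maps every sample to its distance from each centroid, giving the (n_samples, n_clusters) shape documented for transform. It assumes plain Euclidean distance, and the helper name to_cluster_distance_space is hypothetical.

```python
import math


def to_cluster_distance_space(points, centroids):
    """One row per sample, one column per centroid (Euclidean distance)."""
    return [[math.dist(point, centroid) for centroid in centroids]
            for point in points]


points = [[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]]
centers = [[1.0, 1.5], [3.5, 2.5]]  # cluster centers from the KMeans example
X_new = to_cluster_distance_space(points, centers)
print(len(X_new), len(X_new[0]))  # 4 2
print(X_new[0][0])                # 0.5
```

Under this reading, the column holding the smallest value in a row is the cluster that predict would report for that sample.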
DBSCAN

class cuml.DBSCAN(*, eps=0.5, handle=None, min_samples=5, metric='euclidean', verbose=False, max_mbytes_per_batch=None, output_type=None, calc_core_sample_indices=True)
DBSCAN is a very powerful yet fast clustering technique that finds clusters where data is concentrated. This allows DBSCAN to generalize to many problems if the datapoints tend to congregate in larger groups.
cuML’s DBSCAN expects an array-like object or cuDF DataFrame, and constructs an adjacency graph to compute the distances between close neighbours.
 Parameters
 eps : float (default = 0.5)
The maximum distance between 2 points such that they reside in the same neighborhood.
 handle : cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 min_samples : int (default = 5)
The number of samples in a neighborhood (including the point itself) required for a point to be considered a core point.
 metric : {‘euclidean’, ‘precomputed’}, default = ‘euclidean’
The metric to use when calculating distances between points. If metric is ‘precomputed’, X is assumed to be a distance matrix and must be square.
 verbose : int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 max_mbytes_per_batch : (optional) int64
Calculate batch size using no more than this number of megabytes for the pairwise distance computation. This enables the tradeoff between runtime and memory usage for making the N^2 pairwise distance computations more tractable for large numbers of samples. If you are experiencing out of memory errors when running DBSCAN, you can set this value based on the memory size of your device. Note: this option does not set the maximum total memory used in the DBSCAN computation and so this value will not be able to be set to the total memory available on the device.
 output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
 calc_core_sample_indices : (optional) boolean (default = True)
Indicates whether the indices of the core samples should be calculated. If the attribute core_sample_indices_ will not be used, setting this to False will avoid unnecessary kernel launches.
Notes
DBSCAN is very sensitive to the distance metric it is used with, and a large assumption is that datapoints need to be concentrated in groups for clusters to be constructed.
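To make the roles of eps and min_samples concrete, here is a brute-force sketch of the textbook DBSCAN procedure in plain Python. It is not cuML's batched GPU implementation: the function name dbscan is hypothetical, and treating the neighborhood as inclusive (distance <= eps) is an assumption.

```python
import math


def dbscan(points, eps=0.5, min_samples=5):
    """Label each point with a cluster id, or -1 for noise (brute force)."""
    n = len(points)
    # eps-neighborhoods; every point counts as its own neighbor.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from an unlabeled core point.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:
                        frontier.append(q)  # only core points keep expanding
        cluster += 1
    return labels


rows = [[1.0, 4.0, 4.0], [2.0, 2.0, 2.0], [5.0, 1.0, 1.0]]
print(dbscan(rows, eps=1.0, min_samples=1))  # [0, 1, 2]
print(dbscan(rows, eps=1.0, min_samples=2))  # [-1, -1, -1]
```

The rows are the three samples from the DBSCAN example in this section. With min_samples=1 every point is its own core point, so each forms a cluster; with min_samples=2 no point has a neighbor within eps, so all three are noise.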
Applications of DBSCAN
DBSCAN’s main benefit is that the number of clusters is not a hyperparameter, and that it can find non-linearly shaped clusters. This also allows DBSCAN to be robust to noise. DBSCAN has been applied to analyzing particle collisions in the Large Hadron Collider, customer segmentation in marketing analyses, and much more.
For additional docs, see scikit-learn’s DBSCAN.
Examples
# Both import methods supported
from cuml import DBSCAN
from cuml.cluster import DBSCAN

import cudf
import numpy as np

gdf_float = cudf.DataFrame()
gdf_float['0'] = np.asarray([1.0,2.0,5.0], dtype = np.float32)
gdf_float['1'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
gdf_float['2'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)

dbscan_float = DBSCAN(eps = 1.0, min_samples = 1)
dbscan_float.fit(gdf_float)
print(dbscan_float.labels_)
Output:
0    0
1    1
2    2
 Attributes
 labels_ : array-like or cuDF series
Which cluster each datapoint belongs to. Noisy samples are labeled as -1. Format depends on cuml global output type and estimator output_type.
 core_sample_indices_ : array-like or cuDF series
The indices of the core samples. Only calculated if calc_core_sample_indices==True
Methods
fit(self, X[, out_dtype]): Perform DBSCAN clustering from features.
fit_predict(self, X[, out_dtype]): Performs clustering on X and returns cluster labels.
get_param_names(self)

fit(self, X, out_dtype='int32') → DBSCAN
Perform DBSCAN clustering from features.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 out_dtype : dtype (default = “int32”)
Determines the precision of the output labels array. Valid values are {“int32”, np.int32, “int64”, np.int64}.

fit_predict(self, X, out_dtype='int32') → CumlArray
Performs clustering on X and returns cluster labels.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 out_dtype : dtype (default = “int32”)
Determines the precision of the output labels array. Valid values are {“int32”, np.int32, “int64”, np.int64}.
 Returns
 preds : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Cluster labels
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
Agglomerative Clustering

class cuml.AgglomerativeClustering(*, n_clusters=2, affinity='euclidean', linkage='single', handle=None, verbose=False, connectivity='knn', n_neighbors=10, output_type=None)
Agglomerative Clustering
Recursively merges the pair of clusters that minimally increases a given linkage distance.
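The merge rule can be illustrated with a naive single-linkage loop in plain Python. This is a cubic-time sketch for intuition only, assuming Euclidean distance over the full pairwise graph; it does not reflect how cuML builds or sparsifies its connectivity graph, and the function name single_linkage is hypothetical.

```python
import math


def single_linkage(points, n_clusters=2):
    """Naive O(n^3) single-linkage clustering; returns one label per point."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        # Find the two clusters whose closest members are nearest.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [0.5, 0.5]]
print(single_linkage(points, n_clusters=2))  # [0, 0, 1, 1, 0]
```

The label numbering here follows merge order in the sketch and need not match the numbering cuML assigns.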
 Parameters
 handle : cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 verbose : int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 n_clusters : int (default = 2)
The number of clusters to find.
 affinity : str, default=’euclidean’
Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, or “cosine”. If connectivity is “knn” only “euclidean” is accepted.
 linkage : {“single”}, default=”single”
Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pairs of clusters that minimize this criterion.
‘single’ uses the minimum of the distances between all observations of the two sets.
 n_neighbors : int (default = 10)
The number of neighbors to compute when connectivity = “knn”
 connectivity : {“pairwise”, “knn”}, (default = “knn”)
The type of connectivity matrix to compute.
‘pairwise’ will compute the entire fully-connected graph of pairwise distances between each set of points. This is the fastest to compute and can be very fast for smaller datasets but requires O(n^2) space.
‘knn’ will sparsify the fully-connected connectivity matrix to save memory and enable much larger inputs. “n_neighbors” will control the amount of memory used and the graph will be connected automatically in the event “n_neighbors” was not large enough to connect it.
 output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
 Attributes
 children_
Python descriptor object to control getting/setting CumlArray attributes on Base objects. See the Estimator Guide for an in-depth guide.
 labels_
Python descriptor object to control getting/setting CumlArray attributes on Base objects. See the Estimator Guide for an in-depth guide.
Methods
fit(X[, y, convert_dtype]): Fit the hierarchical clustering from features.
fit_predict(X[, y]): Fit the hierarchical clustering from features and return cluster labels.
get_param_names(self)

fit(X, y=None, convert_dtype=True) → cuml.cluster.agglomerative.AgglomerativeClustering
Fit the hierarchical clustering from features.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y : array-like (device or host) shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype : bool, optional (default = True)
When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

fit_predict(X, y=None) → cuml.common.array.CumlArray
Fit the hierarchical clustering from features and return cluster labels.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y : array-like (device or host) shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 Returns
 preds : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
Cluster indexes
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
Dimensionality Reduction and Manifold Learning
Principal Component Analysis

class cuml.PCA(*, copy=True, handle=None, iterated_power=15, n_components=None, random_state=None, svd_solver='auto', tol=1e-07, verbose=False, whiten=False, output_type=None)
PCA (Principal Component Analysis) is a fundamental dimensionality reduction technique used to combine features in X in linear combinations such that each new component captures the most information or variance of the data. n_components is usually small, for example 3, which makes the result suitable for data visualization, data compression and exploratory analysis.
cuML’s PCA expects an array-like object or cuDF DataFrame, and provides two algorithms: Full and Jacobi. Full (default) uses a full eigendecomposition then selects the top K eigenvectors. The Jacobi algorithm is much faster as it iteratively tries to correct the top K eigenvectors, but might be less accurate.
 Parameters
 copy : boolean (default = True)
If True, then copies data then removes mean from data. False might cause data to be overwritten with its mean centered version.
 handle : cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 iterated_power : int (default = 15)
Used in Jacobi solver. The more iterations, the more accurate, but slower.
 n_components : int (default = None)
The number of top K singular vectors / values you want. Must be <= number(columns). If n_components is not set, then all components are kept: n_components = min(n_samples, n_features)
 random_state : int / None (default = None)
If you want results to be the same when you restart Python, select a state.
 svd_solver : ‘full’ or ‘jacobi’ or ‘auto’ (default = ‘auto’)
Full uses an eigendecomposition of the covariance matrix then discards components. Jacobi is much faster as it iteratively corrects, but is less accurate.
 tol : float (default = 1e-7)
Used if algorithm = “jacobi”. A smaller tolerance can increase accuracy, but will slow down the algorithm’s convergence.
 verbose : int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 whiten : boolean (default = False)
If True, decorrelates the components. This is done by dividing them by the corresponding singular values then multiplying by sqrt(n_samples). Whitening allows each component to have unit variance and removes multicollinearity. It might be beneficial for downstream tasks like LinearRegression where correlated features cause problems.
 output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
Notes
PCA considers linear combinations of features, specifically those that maximize global variance structure. This means PCA is fantastic for global structure analyses, but weak for local relationships. Consider UMAP or TSNE for a locally important embedding.
Applications of PCA
PCA is used extensively in practice for data visualization and data compression. It has been used to visualize extremely large word embeddings like Word2Vec and GloVe in 2 or 3 dimensions, to visualize large datasets of everyday objects and images, and to distinguish cancerous cells from healthy cells.
For additional docs, see scikit-learn’s PCA.
Examples
# Both import methods supported
from cuml import PCA
from cuml.decomposition import PCA

import cudf
import numpy as np

gdf_float = cudf.DataFrame()
gdf_float['0'] = np.asarray([1.0,2.0,5.0], dtype = np.float32)
gdf_float['1'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
gdf_float['2'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)

pca_float = PCA(n_components = 2)
pca_float.fit(gdf_float)

print(f'components: {pca_float.components_}')
print(f'explained variance: {pca_float.explained_variance_}')
exp_var = pca_float.explained_variance_ratio_
print(f'explained variance ratio: {exp_var}')

print(f'singular values: {pca_float.singular_values_}')
print(f'mean: {pca_float.mean_}')
print(f'noise variance: {pca_float.noise_variance_}')

# project the data onto the fitted components
trans_gdf_float = pca_float.transform(gdf_float)
print(f'transformed matrix: {trans_gdf_float}')

# map the projection back to the original feature space
input_gdf_float = pca_float.inverse_transform(trans_gdf_float)
print(f'Input Matrix: {input_gdf_float}')
Output:
components:
            0           1           2
0  0.69225764  -0.5102837 -0.51028395
1 -0.72165036 -0.48949987  -0.4895003

explained variance:

0   8.510402
1 0.48959687

explained variance ratio:

0   0.9456003
1 0.054399658

singular values:

0 4.1256275
1 0.9895422

mean:

0 2.6666667
1 2.3333333
2 2.3333333

noise variance:

0  0.0

transformed matrix:
             0           1
0   -2.8547091 -0.42891636
1 -0.121316016  0.80743366
2    2.9760244 -0.37851727

Input Matrix:
          0         1         2
0 1.0000001 3.9999993       4.0
1       2.0 2.0000002 1.9999999
2 4.9999995 1.0000006       1.0
 Attributes
 components_ : array
The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X)
 explained_variance_ : array
How much each component explains the variance in the data given by S**2
 explained_variance_ratio_ : array
How much in % the variance is explained given by S**2/sum(S**2)
 singular_values_ : array
The top K singular values. Remember all singular values >= 0
 mean_ : array
The column-wise mean of X. Used to mean-center the data first.
 noise_variance_ : float
From Bishop 1999’s Textbook. Used in later tasks like calculating the estimated covariance of X.
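The relationship between singular_values_, explained_variance_ and explained_variance_ratio_ can be checked by hand against the example output above. The sketch assumes the common convention that the variance is S**2 divided by (n_samples - 1), which is what the printed numbers are consistent with; both helper names are hypothetical.

```python
def explained_variance(singular_values, n_samples):
    """Variance per component, assuming the S**2 / (n_samples - 1) convention."""
    return [s ** 2 / (n_samples - 1) for s in singular_values]


def explained_variance_ratio(singular_values):
    """Share of variance per component among the given singular values."""
    total = sum(s ** 2 for s in singular_values)
    return [s ** 2 / total for s in singular_values]


singular_values = [4.1256275, 0.9895422]  # from the example output above
variance = explained_variance(singular_values, n_samples=3)
ratio = explained_variance_ratio(singular_values)
print(variance)  # close to [8.5104, 0.4896]
print(ratio)     # close to [0.9456, 0.0544]
```

The ratio here is taken over the listed singular values only. That matches the example because the dropped third component carries no variance (two of the three input columns are identical); in general the denominator should cover all components.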
Methods
fit(self, X[, y]): Fit the model with X. y is currently ignored.
fit_transform(self, X[, y]): Fit the model with X and apply the dimensionality reduction on X.
get_param_names(self)
inverse_transform(self, X[, convert_dtype, …]): Transform data back to its original space.
transform(self, X[, convert_dtype]): Apply dimensionality reduction to X.

fit(self, X, y=None) → PCA
Fit the model with X. y is currently ignored.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y : array-like (device or host) shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

fit_transform(self, X, y=None) → CumlArray
Fit the model with X and apply the dimensionality reduction on X.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y : array-like (device or host) shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 Returns
 trans : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, cupyx.scipy.sparse for sparse output, shape = (n_samples, n_components)
Transformed values
For more information on how to configure cuML’s dense output type, refer to: Output Data Type Configuration.

inverse_transform(self, X, convert_dtype=False, return_sparse=False, sparse_tol=1e-10) → CumlArray
Transform data back to its original space.
In other words, return an input X_original whose transform would be X.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype : bool, optional (default = False)
When set to True, the inverse_transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 return_sparse : bool, optional (default = False)
Ignored when the model is not fit on a sparse matrix. If True, the method will convert the result to a cupyx.scipy.sparse.csr_matrix object. NOTE: Currently, there is a loss of information when converting to csr matrix (cusolver bug). Default will be switched to True once this is solved.
 sparse_tol : float, optional (default = 1e-10)
Ignored when return_sparse=False. When return_sparse=True, values in the inverse transform below this parameter are clipped to 0.
 Returns
 X_inv : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, cupyx.scipy.sparse for sparse output, shape = (n_samples, n_features)
Transformed values
For more information on how to configure cuML’s dense output type, refer to: Output Data Type Configuration.

transform(self, X, convert_dtype=False) → CumlArray
Apply dimensionality reduction to X.
X is projected on the first principal components previously extracted from a training set.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype : bool, optional (default = False)
When set to True, the transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 Returns
 trans : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, cupyx.scipy.sparse for sparse output, shape = (n_samples, n_components)
Transformed values
For more information on how to configure cuML’s dense output type, refer to: Output Data Type Configuration.
Incremental PCA

class cuml.IncrementalPCA(*, handle=None, n_components=None, whiten=False, copy=True, batch_size=None, verbose=False, output_type=None)
Based on sklearn.decomposition.IncrementalPCA from scikit-learn 0.23.1
Incremental principal components analysis (IPCA). Linear dimensionality reduction using Singular Value Decomposition of the data, keeping only the most significant singular vectors to project the data to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD. Depending on the size of the input data, this algorithm can be much more memory efficient than a PCA, and allows sparse input. This algorithm has constant memory complexity, on the order of batch_size * n_features, enabling use of np.memmap files without loading the entire file into memory. For sparse matrices, the input is converted to dense in batches (in order to be able to subtract the mean) which avoids storing the entire dense matrix at any one time. The computational overhead of each SVD is O(batch_size * n_features ** 2), but only 2 * batch_size samples remain in memory at a time. There will be n_samples / batch_size SVD computations to get the principal components, versus 1 large SVD of complexity O(n_samples * n_features ** 2) for PCA.
 Parameters
 handle : cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 n_components : int or None, (default=None)
Number of components to keep. If n_components is None, then n_components is set to min(n_samples, n_features).
 whiten : bool, optional
If True, decorrelates the components. This is done by dividing them by the corresponding singular values then multiplying by sqrt(n_samples). Whitening allows each component to have unit variance and removes multicollinearity. It might be beneficial for downstream tasks like LinearRegression where correlated features cause problems.
 copy : bool, (default=True)
If False, X will be overwritten. copy=False can be used to save memory but is unsafe for general use.
 batch_size : int or None, (default=None)
The number of samples to use for each batch. Only used when calling fit. If batch_size is None, then batch_size is inferred from the data and set to 5 * n_features, to provide a balance between approximation accuracy and memory consumption.
 verbose : int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
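The batch-size trade-off described for IncrementalPCA can be made concrete with some arithmetic. The sketch below is plain Python, not cuML code: the helper name `ipca_batch_plan` is hypothetical, the batch count is rounded up as an assumption about how a trailing partial batch is handled, and the cost figures simply evaluate the big-O expressions quoted in this section with constants ignored.

```python
import math

def ipca_batch_plan(n_samples, n_features, batch_size=None):
    """Evaluate the batch-count and cost expressions quoted in the docs.

    Hypothetical helper for illustration only; it does not call cuML.
    """
    if batch_size is None:
        # Documented default: batch_size is inferred as 5 * n_features.
        batch_size = 5 * n_features
    # Assumption: a trailing partial batch counts as one more SVD.
    n_batches = math.ceil(n_samples / batch_size)
    return {
        "batch_size": batch_size,
        "n_batches": n_batches,
        # O(batch_size * n_features ** 2) per SVD, constants dropped.
        "cost_per_svd": batch_size * n_features ** 2,
        # O(n_samples * n_features ** 2) for one full PCA SVD.
        "cost_full_pca": n_samples * n_features ** 2,
        # Only about 2 * batch_size samples are held in memory at a time.
        "samples_in_memory": 2 * batch_size,
    }

plan = ipca_batch_plan(n_samples=1_000_000, n_features=50)
print(plan)
```

Summed over all batches, the SVD work is of the same order as the single large SVD; what changes is the peak memory, which depends on batch_size rather than n_samples.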
Notes
Implements the incremental PCA model from [1]. This model is an extension of the Sequential Karhunen-Loeve Transform from [2]. We have specifically abstained from an optimization used by authors of both papers, a QR decomposition used in specific situations to reduce the algorithmic complexity of the SVD. The source for this technique is [3]. This technique has been omitted because it is advantageous only when decomposing a matrix with n_samples >= 5/3 * n_features, where n_samples and n_features are the matrix rows and columns, respectively. In addition, it hurts the readability of the implemented algorithm. This would be a good opportunity for future optimization, if it is deemed necessary.
References
 1
 2
 3
G. Golub and C. Van Loan. Matrix Computations, Third Edition, Chapter 5, Section 5.4.4, pp. 252-253.
 4
C. Bishop, 1999. “Pattern Recognition and Machine Learning”, Section 12.2.1, pp. 574
Examples
>>> from cuml.decomposition import IncrementalPCA
>>> import cupy as cp
>>> import cupyx
>>>
>>> X = cupyx.scipy.sparse.random(1000, 4, format='csr', density=0.07)
>>> ipca = IncrementalPCA(n_components=2, batch_size=200)
>>> ipca.fit(X)
>>>
>>> # Components:
>>> ipca.components_
array([[0.02362926, 0.87328851, 0.15971988, 0.45967206],
       [0.14643883, 0.11414225, 0.97589354, 0.11471273]])
>>>
>>> # Singular Values:
>>> ipca.singular_values_
array([4.90298662, 4.54498226])
>>>
>>> # Explained Variance:
>>> ipca.explained_variance_
array([0.02406334, 0.02067754])
>>>
>>> # Explained Variance Ratio:
>>> ipca.explained_variance_ratio_
array([0.28018011, 0.24075775])
>>>
>>> # Mean:
>>> ipca.mean_
array([0.03249896, 0.03629852, 0.03268694, 0.03216601])
>>>
>>> # Noise Variance:
>>> ipca.noise_variance_.item()
0.003474966583315544
 Attributes
 components_ : array, shape (n_components, n_features)
Components with maximum variance.
 explained_variance_ : array, shape (n_components,)
Variance explained by each of the selected components.
 explained_variance_ratio_ : array, shape (n_components,)
Percentage of variance explained by each of the selected components. If all components are stored, the sum of explained variances is equal to 1.0.
 singular_values_ : array, shape (n_components,)
The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space.
 mean_ : array, shape (n_features,)
Per-feature empirical mean, aggregated over calls to partial_fit.
 var_ : array, shape (n_features,)
Per-feature empirical variance, aggregated over calls to partial_fit.
 noise_variance_ : float
The estimated noise covariance following the Probabilistic PCA model from [4].
 n_components_ : int
The estimated number of components. Relevant when n_components=None.
 n_samples_seen_ : int
The number of samples processed by the estimator. Will be reset on new calls to fit, but increments across partial_fit calls.
 batch_size_ : int
Inferred batch size from batch_size.
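The link between singular_values_ and explained_variance_ can be checked by hand against the IncrementalPCA example output in this section. The sketch below assumes the usual PCA convention, explained variance = singular value squared divided by (n_samples - 1); that convention is an assumption here, not something this page states, but it reproduces the printed numbers.

```python
# Values copied from the IncrementalPCA example output in this section.
n_samples = 1000
singular_values = [4.90298662, 4.54498226]
reported_explained_variance = [0.02406334, 0.02067754]

# Assumption: explained variance = singular_value ** 2 / (n_samples - 1).
explained_variance = [s ** 2 / (n_samples - 1) for s in singular_values]

for computed, reported in zip(explained_variance, reported_explained_variance):
    print(f"computed={computed:.8f} reported={reported:.8f}")
```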
Methods
fit(X[, y]) Fit the model with X, using minibatches of size batch_size.
get_param_names(self)
partial_fit(X[, y, check_input]) Incremental fit with X.
transform(X[, convert_dtype]) Apply dimensionality reduction to X.

fit(X, y=None) → cuml.decomposition.incremental_pca.IncrementalPCA
Fit the model with X, using minibatches of size batch_size.
 Parameters
 X : array-like or sparse matrix, shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number of features.
 y : Ignored
 Returns
 self : object
Returns the instance itself.

partial_fit(X, y=None, check_input=True) → cuml.decomposition.incremental_pca.IncrementalPCA
Incremental fit with X. All of X is processed as a single batch.
 Parameters
 X : array-like or sparse matrix, shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number of features.
 check_input : bool
Run check_array on X.
 y : Ignored
 Returns
 self : object
Returns the instance itself.
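The attributes mean_, var_ and n_samples_seen_ are described as aggregates over partial_fit calls. One common way to maintain such running statistics without revisiting earlier batches is the pairwise update of Chan, Golub and LeVeque, sketched below for a single feature. This is an illustration of the idea only: `merge_batch` is a hypothetical helper, and nothing here claims to be cuML's internal implementation.

```python
def merge_batch(n_seen, mean, var, batch):
    """Fold one batch of scalars into running (count, mean, variance).

    Variance here is the population variance (divide by n).
    Hypothetical helper, not cuML's internal code.
    """
    m = len(batch)
    batch_mean = sum(batch) / m
    batch_var = sum((x - batch_mean) ** 2 for x in batch) / m
    total = n_seen + m
    delta = batch_mean - mean
    new_mean = mean + delta * m / total
    # Combine sums of squared deviations, correcting for the mean shift.
    m2 = var * n_seen + batch_var * m + delta ** 2 * n_seen * m / total
    return total, new_mean, m2 / total

batches = ([1.0, 2.0, 5.0], [4.0, 2.0], [1.0, 4.0, 2.0, 1.0])
n_seen, mean, var = 0, 0.0, 0.0
for batch in batches:
    n_seen, mean, var = merge_batch(n_seen, mean, var, batch)
print(n_seen, mean, var)
```

The final triple matches what a single pass over all nine values would give, which is the property an incremental fit relies on.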

transform(X, convert_dtype=False) → cuml.common.array.CumlArray
Apply dimensionality reduction to X.
X is projected on the first principal components previously extracted from a training set, using minibatches of size batch_size if X is sparse.
 Parameters
 X : array-like or sparse matrix, shape (n_samples, n_features)
New data, where n_samples is the number of samples and n_features is the number of features.
 convert_dtype : bool, optional (default = False)
When set to True, the transform method will automatically convert the input to the data type which was used to train the model. This will increase memory used for the method.
 Returns
 X_new : array-like, shape (n_samples, n_components)
Truncated SVD¶

class cuml.TruncatedSVD(*, algorithm='full', handle=None, n_components=1, n_iter=15, random_state=None, tol=1e-07, verbose=False, output_type=None)
TruncatedSVD is used to compute the top K singular values and vectors of a large matrix X. It is much faster when n_components is small, such as when PCA with 3 components is used for 3D visualization.
cuML’s TruncatedSVD expects an array-like object or cuDF DataFrame, and provides 2 algorithms: Full and Jacobi. Full (default) uses a full eigendecomposition then selects the top K singular vectors. The Jacobi algorithm is much faster as it iteratively tries to correct the top K singular vectors, but might be less accurate.
 Parameters
 algorithm : ‘full’ or ‘jacobi’ or ‘auto’ (default = ‘full’)
Full uses an eigendecomposition of the covariance matrix then discards components. Jacobi is much faster as it iteratively corrects, but is less accurate.
 handle : cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 n_components : int (default = 1)
The number of top K singular vectors / values you want. Must be <= number(columns).
 n_iter : int (default = 15)
Used in Jacobi solver. The more iterations, the more accurate, but slower.
 random_state : int / None (default = None)
If you want results to be the same when you restart Python, select a state.
 tol : float (default = 1e-7)
Used if algorithm = “jacobi”. Smaller tolerance can increase accuracy, but will slow down the algorithm’s convergence.
 verbose : int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
Notes
TruncatedSVD (the randomized version [Jacobi]) works well when the number of components you want is much smaller than the number of features. The approximation to the largest singular values and vectors is very robust; however, this method loses a lot of accuracy when many components are requested.
Applications of TruncatedSVD
TruncatedSVD is also known as Latent Semantic Indexing (LSI), which tries to find topics of a word count matrix. If X was previously centered with mean removal, TruncatedSVD is the same as TruncatedPCA. TruncatedSVD is also used in information retrieval tasks, recommendation systems and data compression.
For additional documentation, see scikit-learn’s TruncatedSVD docs.
Examples
# Both import methods supported
from cuml import TruncatedSVD
from cuml.decomposition import TruncatedSVD

import cudf
import numpy as np

# Build a 3 x 3 float32 cuDF DataFrame, one column at a time
gdf_float = cudf.DataFrame()
gdf_float['0'] = np.asarray([1.0,2.0,5.0], dtype = np.float32)
gdf_float['1'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
gdf_float['2'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)

tsvd_float = TruncatedSVD(n_components = 2, algorithm = "jacobi", n_iter = 20, tol = 1e-9)
tsvd_float.fit(gdf_float)

# Fitted attributes, using the public names listed under Attributes
print(f'components: {tsvd_float.components_}')
print(f'explained variance: {tsvd_float.explained_variance_}')
exp_var = tsvd_float.explained_variance_ratio_
print(f'explained variance ratio: {exp_var}')
print(f'singular values: {tsvd_float.singular_values_}')

trans_gdf_float = tsvd_float.transform(gdf_float)
print(f'Transformed matrix: {trans_gdf_float}')

input_gdf_float = tsvd_float.inverse_transform(trans_gdf_float)
print(f'Input matrix: {input_gdf_float}')
Output:
components:
            0           1          2
0  0.58725953  0.57233137  0.5723314
1  0.80939883  0.41525528  0.4152552
explained variance:
0  55.33908
1  16.660923
explained variance ratio:
0  0.7685983
1  0.23140171
singular values:
0  7.439024
1  4.0817795
Transformed matrix:
           0            1
0  5.1659107     2.512643
1  3.4638448  0.042223275
2  4.0809603    3.2164836
Input matrix:
           0          1          2
0        1.0   4.000001   4.000001
1  2.0000005  2.0000005  2.0000007
2   5.000001  0.9999999  1.0000004
 Attributes
 components_ : array
The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X)
 explained_variance_ : array
How much each component explains the variance in the data, given by S**2
 explained_variance_ratio_ : array
The fraction of the variance explained by each component, given by S**2/sum(S**2)
 singular_values_ : array
The top K singular values. Remember all singular values >= 0
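The two formulas given for the TruncatedSVD attributes, S**2 and S**2/sum(S**2), can be checked in plain Python against the example output in this section. The singular values are copied from that output; the example matrix has two identical columns and therefore rank 2, so the two kept values account for all of the variance.

```python
# Singular values copied from the TruncatedSVD example output in this section.
singular_values = [7.439024, 4.0817795]

# explained_variance_ is described as S**2.
explained_variance = [s ** 2 for s in singular_values]

# explained_variance_ratio_ is described as S**2 / sum(S**2).
total = sum(explained_variance)
explained_variance_ratio = [v / total for v in explained_variance]

print(explained_variance)
print(explained_variance_ratio)
```

Unlike the IncrementalPCA attribute of the same name, no division by the sample count appears in the formula quoted here.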
Methods
fit(self, X[, y]) Fit LSI model on training cudf DataFrame X. y is currently ignored.
fit_transform(self, X[, y]) Fit LSI model to X and perform dimensionality reduction on X.
get_param_names(self)
inverse_transform(self, X[, convert_dtype]) Transform X back to its original space.
transform(self, X[, convert_dtype]) Perform dimensionality reduction on X.

fit(self, X, y=None) → TruncatedSVD
Fit LSI model on training cudf DataFrame X. y is currently ignored.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y : array-like (device or host) shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

fit_transform(self, X, y=None) → CumlArray
Fit LSI model to X and perform dimensionality reduction on X. y is currently ignored.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y : array-like (device or host) shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 Returns
 trans : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_components)
Reduced version of X
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

inverse_transform(self, X, convert_dtype=False) → CumlArray
Transform X back to its original space. Returns X_original whose transform would be X.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype : bool, optional (default = False)
When set to True, the inverse_transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 Returns
 X_original : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_features)
X in original space
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

transform(self, X, convert_dtype=False) → CumlArray
Perform dimensionality reduction on X.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype : bool, optional (default = False)
When set to True, the transform method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
 Returns
 X_new : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_components)
Reduced version of X
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
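What transform and inverse_transform compute can be illustrated on the CPU with NumPy. The sketch below is not the cuML code path: it runs a full SVD on the same 3 x 3 matrix used in the TruncatedSVD example, keeps the top-k rows of VT as the components, and treats projection onto them (and back) as stand-ins for the two methods. The variable names are illustrative.

```python
import numpy as np

# Same values as the TruncatedSVD example: columns 1 and 2 are equal,
# so the matrix has rank 2.
X = np.array([[1.0, 4.0, 4.0],
              [2.0, 2.0, 2.0],
              [5.0, 1.0, 1.0]])
k = 2

# Full SVD on the CPU; keep the top-k right singular vectors as rows.
U, S, VT = np.linalg.svd(X, full_matrices=False)
components = VT[:k]                    # shape (k, n_features)

X_reduced = X @ components.T           # analogue of transform(X)
X_restored = X_reduced @ components    # analogue of inverse_transform(X_reduced)

print(S[:k])
print(np.allclose(X_restored, X))
```

Because the matrix has rank 2, keeping two components loses nothing and the round trip recovers X up to floating-point error. With fewer components than the rank, X_restored would only approximate X. The signs of individual components may differ between SVD implementations.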
UMAP¶

class cuml.UMAP(*, n_neighbors=15, n_components=2, n_epochs=None, learning_rate=1.0, min_dist=0.1, spread=1.0, set_op_mix_ratio=1.0, local_connectivity=1.0, repulsion_strength=1.0, negative_sample_rate=5, transform_queue_size=4.0, init='spectral', verbose=False, a=None, b=None, target_n_neighbors=-1, target_weights=0.5, target_metric='categorical', handle=None, hash_input=False, random_state=None, optim_batch_size=0, callback=None, output_type=None)
Uniform Manifold Approximation and Projection
Finds a low dimensional embedding of the data that approximates an underlying manifold.
Adapted from https://github.com/lmcinnes/umap/blob/master/umap/umap.py
The UMAP algorithm is outlined in [1]. This implementation follows the GPU-accelerated version as described in [2].
 Parameters
 n_neighbors: float (optional, default 15)
The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.
 n_components: int (optional, default 2)
The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any
 n_epochs: int (optional, default None)
The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).
 learning_rate: float (optional, default 1.0)
The initial learning rate for the embedding optimization.
 init: string (optional, default ‘spectral’)
How to initialize the low dimensional embedding. Options are:
‘spectral’: use a spectral embedding of the fuzzy 1-skeleton
‘random’: assign initial embedding positions at random.
 min_dist: float (optional, default 0.1)
The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.
 spread: float (optional, default 1.0)
The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.
 set_op_mix_ratio: float (optional, default 1.0)
Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial set. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
 local_connectivity: int (optional, default 1)
The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.
 repulsion_strength: float (optional, default 1.0)
Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.
 negative_sample_rate: int (optional, default 5)
The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.
 transform_queue_size: float (optional, default 4.0)
For transform operations (embedding new points using a trained model) this will control how aggressively to search for nearest neighbors. Larger values will result in slower performance but more accurate nearest neighbor evaluation.
 a: float (optional, default None)
More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.
 b: float (optional, default None)
More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.
 handle: cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 hash_input: bool, optional (default = False)
UMAP can hash the training input so that exact embeddings are returned when transform is called on the same data upon which the model was trained. This enables consistent behavior between calling model.fit_transform(X) and calling model.fit(X).transform(X). Note that the CPU-based UMAP reference implementation does this by default. This feature is made optional in the GPU version due to the significant overhead in copying memory to the host for computing the hash.
 random_state: int, RandomState instance or None, optional (default=None)
random_state is the seed used by the random number generator during embedding initialization and during sampling used by the optimizer. Note: Unfortunately, achieving a high amount of parallelism during the optimization stage often comes at the expense of determinism, since many floating-point additions are being made in parallel without a deterministic ordering. This causes slightly different results across training sessions, even when the same seed is used for random number generation. Setting a random_state will enable consistency of trained embeddings, allowing for reproducible results to 3 digits of precision, but will do so at the expense of potentially slower training and increased memory usage.
 optim_batch_size: int (optional, default 100000 / n_components)
Used to maintain the consistency of embeddings for large datasets. The optimization step will be processed with at most optim_batch_size edges at once preventing inconsistencies. A lower batch size will yield more consistently repeatable embeddings at the cost of speed.
 callback: An instance of GraphBasedDimRedCallback class
Used to intercept the internal state of embeddings while they are being trained. Example of callback usage:
from cuml.internals import GraphBasedDimRedCallback

class CustomCallback(GraphBasedDimRedCallback):
    def on_preprocess_end(self, embeddings):
        print(embeddings.copy_to_host())

    def on_epoch_end(self, embeddings):
        print(embeddings.copy_to_host())

    def on_train_end(self, embeddings):
        print(embeddings.copy_to_host())
 verbose: int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type: {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
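The set_op_mix_ratio parameter described above interpolates between a fuzzy union and a fuzzy intersection. The sketch below shows that blend for a single pair of directed edge weights, using the product t-norm for the intersection and the matching probabilistic sum for the union, which is how the reference umap-learn package combines its graph with its transpose. The function name is hypothetical, and whether cuML evaluates exactly this expression internally is an assumption.

```python
def combine_memberships(w_ij, w_ji, set_op_mix_ratio=1.0):
    """Blend fuzzy union and intersection of two directed edge weights.

    ratio * union + (1 - ratio) * intersection, with the product t-norm.
    Hypothetical helper for illustration.
    """
    intersection = w_ij * w_ji
    union = w_ij + w_ji - intersection   # probabilistic sum
    return set_op_mix_ratio * union + (1.0 - set_op_mix_ratio) * intersection

for ratio in (1.0, 0.5, 0.0):
    print(ratio, combine_memberships(0.8, 0.3, ratio))
```

With a ratio of 1.0 an edge survives if either point considers the other a neighbor; with 0.0 both must agree, which tends to produce a sparser graph.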
Notes
This module is heavily based on Leland McInnes’ reference UMAP package. However, there are a number of differences and features that are not yet implemented in cuml.umap:
Using a precomputed pairwise distance matrix (under consideration for future releases)
Manual initialization of initial embedding positions
In addition to these missing features, you should expect to see the final embeddings differing between cuml.umap and the reference UMAP. In particular, the reference UMAP uses an approximate kNN algorithm for large data sizes while cuml.umap always uses exact kNN.
References
 1
 2
 Attributes
 X_m
Python descriptor object to control getting/setting CumlArray attributes on Base objects. See the Estimator Guide for an in depth guide.
 embedding_
Python descriptor object to control getting/setting CumlArray attributes on Base objects. See the Estimator Guide for an in depth guide.
Methods
find_ab_params(spread, min_dist) Function taken from UMAP-learn : https://github.com/lmcinnes/umap Fit a, b params for the differentiable curve used in lower dimensional fuzzy simplicial complex construction.
fit(self, X[, y, convert_dtype, knn_graph]) Fit X into an embedded space.
fit_transform(self, X[, y, convert_dtype, …]) Fit X into an embedded space and return that transformed output.
get_param_names(self)
transform(self, X[, convert_dtype, knn_graph]) Transform X into the existing embedded space and return that transformed output.
validate_hyperparams(self)
static find_ab_params(spread, min_dist)
Function taken from UMAP-learn : https://github.com/lmcinnes/umap Fit a, b params for the differentiable curve used in lower dimensional fuzzy simplicial complex construction. We want the smooth curve (from a predefined family with simple gradient) that best matches an offset exponential decay.
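The fitting problem described for find_ab_params can be sketched without any numerical library. The target below is the offset exponential decay (equal to 1 up to min_dist, then decaying with scale spread), and the smooth family is 1 / (1 + a * d ** (2 * b)), following the reference umap-learn package. The reference package solves this with a nonlinear least-squares routine; the coarse grid search here is a deliberately simple stand-in, and the grid ranges and the function name `fit_ab_grid` are assumptions chosen for the default spread and min_dist.

```python
import math

def target(d, spread, min_dist):
    """Offset exponential decay: 1 up to min_dist, then exp(-(d - min_dist) / spread)."""
    return 1.0 if d < min_dist else math.exp(-(d - min_dist) / spread)

def curve(d, a, b):
    """Smooth family with a simple gradient: 1 / (1 + a * d ** (2 * b))."""
    return 1.0 / (1.0 + a * d ** (2.0 * b))

def fit_ab_grid(spread=1.0, min_dist=0.1, n_points=300):
    """Coarse grid search for (a, b); a stand-in for a least-squares solver."""
    xs = [3.0 * spread * i / (n_points - 1) for i in range(n_points)]
    ys = [target(x, spread, min_dist) for x in xs]
    best = None
    for i in range(51):                 # a in [0.5, 3.0], step 0.05
        a = 0.5 + 0.05 * i
        for j in range(41):             # b in [0.5, 1.5], step 0.025
            b = 0.5 + 0.025 * j
            err = sum((curve(x, a, b) - y) ** 2 for x, y in zip(xs, ys))
            if best is None or err < best[0]:
                best = (err, a, b)
    return best[1], best[2]

a, b = fit_ab_grid()
print(a, b)
```

The result is only as fine as the grid, so it should be read as a rough neighborhood for a and b rather than the values cuML would compute.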

fit(self, X, y=None, convert_dtype=True, knn_graph=None) → UMAP
Fit X into an embedded space.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense or sparse matrix containing floats or doubles. Acceptable dense formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y : array-like (device or host) shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype : bool, optional (default = True)
When set to True, the method will automatically convert the inputs to np.float32.
 knn_graph : sparse array-like (device or host), shape=(n_samples, n_samples)
A sparse array containing the k-nearest neighbors of X, where the columns are the nearest neighbor indices for each row and the values are their distances. It’s important that k>=n_neighbors, so that UMAP can model the neighbors from this graph, instead of building its own internally. Users using the knn_graph parameter provide UMAP with their own run of the KNN algorithm. This allows the user to pick a custom distance function (sometimes useful on certain datasets) whereas UMAP uses euclidean by default. The custom distance function should match the metric used to train UMAP embeddings. Storing and reusing a knn_graph will also provide a speedup to the UMAP algorithm when performing a grid search. Acceptable formats: sparse SciPy ndarray, CuPy device ndarray. CSR/COO are preferred; other formats will go through conversion to CSR.
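The layout described for knn_graph (one row per sample, column indices naming its neighbors, stored values holding the distances) maps directly onto the three CSR arrays. The sketch below builds them with a brute-force Euclidean search in plain Python. It is an illustration of the data layout only: the helper `knn_csr` is hypothetical, and excluding each point from its own neighbor list is an assumption, not something this page specifies.

```python
import math

def knn_csr(points, k):
    """Brute-force Euclidean kNN returned as CSR arrays (data, indices, indptr).

    Row i holds the k nearest neighbours of point i (excluding i itself);
    the stored values are the distances. Illustration only.
    """
    data, indices, indptr = [], [], [0]
    for i, p in enumerate(points):
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        # CSR conventionally keeps column indices sorted within each row.
        for dist, j in sorted(nearest, key=lambda t: t[1]):
            indices.append(j)
            data.append(dist)
        indptr.append(len(indices))
    return data, indices, indptr

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
data, indices, indptr = knn_csr(points, k=2)
print(indptr)
print(indices)
print(data)
```

The three lists can be wrapped as `scipy.sparse.csr_matrix((data, indices, indptr), shape=(n, n))` to obtain one of the formats listed as acceptable. In real use k would need to be at least n_neighbors, and the distance function should match the metric intended for the embedding.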

fit_transform(self, X, y=None, convert_dtype=True, knn_graph=None) → CumlArray
Fit X into an embedded space and return that transformed output.
There is a subtle difference between calling fit_transform(X) and calling fit().transform(). Calling fit_transform(X) will train the embeddings on X and return the embeddings. Calling fit(X).transform(X) will train the embeddings on X and then run a second optimization.
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 y : array-like (device or host) shape = (n_samples, 1)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype : bool, optional (default = True)
When set to True, the method will automatically convert the inputs to np.float32.
 knn_graph : sparse array-like (device or host), shape=(n_samples, n_samples)
A sparse array containing the k-nearest neighbors of X, where the columns are the nearest neighbor indices for each row and the values are their distances. It’s important that k>=n_neighbors, so that UMAP can model the neighbors from this graph, instead of building its own internally. Users using the knn_graph parameter provide UMAP with their own run of the KNN algorithm. This allows the user to pick a custom distance function (sometimes useful on certain datasets) whereas UMAP uses euclidean by default. The custom distance function should match the metric used to train UMAP embeddings. Storing and reusing a knn_graph will also provide a speedup to the UMAP algorithm when performing a grid search. Acceptable formats: sparse SciPy ndarray, CuPy device ndarray. CSR/COO are preferred; other formats will go through conversion to CSR.
 Returns
 X_new : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_components)
Embedding of the data in low-dimensional space.
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.

transform(self, X, convert_dtype=True, knn_graph=None) → CumlArray
Transform X into the existing embedded space and return that transformed output.
Please refer to the reference UMAP implementation for information on the differences between fit_transform() and running fit().transform().
Specifically, the transform() function is stochastic: https://github.com/lmcinnes/umap/issues/158
 Parameters
 X : array-like (device or host) shape = (n_samples, n_features)
Dense matrix containing floats or doubles. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
 convert_dtype : bool, optional (default = True)
When set to True, the method will automatically convert the inputs to np.float32.
 knn_graph : sparse array-like (device or host), shape=(n_samples, n_samples)
A sparse array containing the k-nearest neighbors of X, where the columns are the nearest neighbor indices for each row and the values are their distances. It’s important that k>=n_neighbors, so that UMAP can model the neighbors from this graph, instead of building its own internally. Users using the knn_graph parameter provide UMAP with their own run of the KNN algorithm. This allows the user to pick a custom distance function (sometimes useful on certain datasets) whereas UMAP uses euclidean by default. The custom distance function should match the metric used to train UMAP embeddings. Storing and reusing a knn_graph will also provide a speedup to the UMAP algorithm when performing a grid search. Acceptable formats: sparse SciPy ndarray, CuPy device ndarray. CSR/COO are preferred; other formats will go through conversion to CSR.
 Returns
 X_new : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_components)
Embedding of the data in low-dimensional space.
For more information on how to configure cuML’s output type, refer to: Output Data Type Configuration.
Random Projections¶

class cuml.random_projection.GaussianRandomProjection(*, handle=None, n_components='auto', eps=0.1, random_state=None, verbose=False, output_type=None)
Gaussian Random Projection method derived from the BaseRandomProjection class.
Random projection is a dimensionality reduction technique. Random projection methods are known for their simplicity, computational efficiency and restricted model size. This algorithm also has the advantage of preserving distances well between any two samples and is thus suitable for methods having this requirement.
The components of the random matrix are drawn from N(0, 1 / n_components).
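The distance-preserving behavior follows from the choice of variance: with entries drawn from N(0, 1 / n_components), the squared length of a projected vector equals the original squared length in expectation. The sketch below demonstrates this in plain Python on two random vectors. It does not use cuML; the helper names, sizes and seeds are arbitrary choices for illustration.

```python
import math
import random

def gaussian_projection_matrix(n_components, n_features, seed=0):
    """Matrix with entries drawn from N(0, 1 / n_components)."""
    rng = random.Random(seed)
    std = math.sqrt(1.0 / n_components)   # variance 1 / n_components
    return [[rng.gauss(0.0, std) for _ in range(n_features)]
            for _ in range(n_components)]

def project(matrix, x):
    """Multiply the (n_components x n_features) matrix by vector x."""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]

rng = random.Random(1)
n_features, n_components = 500, 200
u = [rng.random() for _ in range(n_features)]
v = [rng.random() for _ in range(n_features)]

R = gaussian_projection_matrix(n_components, n_features, seed=42)
ratio = math.dist(project(R, u), project(R, v)) / math.dist(u, v)
print(ratio)   # expected to be close to 1
```

The spread of this ratio around 1 shrinks as n_components grows, which is the trade-off the eps parameter controls.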
 Parameters
 handle : cuml.Handle
Specifies the cuml.handle that holds internal CUDA state for computations in this model. Most importantly, this specifies the CUDA stream that will be used for the model’s computations, so users can run different models concurrently in different streams by creating handles in several streams. If it is None, a new one is created.
 n_components : int (default = ‘auto’)
Dimensionality of the target projection space. If set to ‘auto’, the parameter is deduced using the Johnson–Lindenstrauss lemma. The automatic deduction makes use of the number of samples and the eps parameter.
The Johnson–Lindenstrauss lemma can produce very conservative n_components parameter as it makes no assumption on dataset structure.
 eps : float (default = 0.1)
Error tolerance during projection. Used by Johnson–Lindenstrauss automatic deduction when n_components is set to ‘auto’.
 random_state : int (default = None)
Seed used to initialize the random generator
 verbose : int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
 output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
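How conservative the ‘auto’ setting of n_components can be is easy to see by evaluating the Johnson–Lindenstrauss bound directly. The sketch below uses the standard form of the bound, the same one scikit-learn exposes as `johnson_lindenstrauss_min_dim`; that cuML's automatic deduction follows this exact expression is an assumption, and `jl_min_dim` is a hypothetical name.

```python
import math

def jl_min_dim(n_samples, eps):
    """Minimum n_components from the Johnson-Lindenstrauss bound.

    n_components >= 4 * ln(n_samples) / (eps**2 / 2 - eps**3 / 3)
    Assumption: cuML's 'auto' mode follows this same bound.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must be in (0, 1)")
    denominator = eps ** 2 / 2.0 - eps ** 3 / 3.0
    return int(4.0 * math.log(n_samples) / denominator)

for eps in (0.5, 0.1):
    print(eps, jl_min_dim(800, eps))
```

The bound depends only on the number of samples and eps, never on the number of features. For 800 samples and the default eps of 0.1 it asks for more than five thousand components, which is why an explicit n_components is often the practical choice for small datasets.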
Notes
This class cannot be used with sklearn.base.clone() and will raise an exception when it is called.
Inspired by scikit-learn’s implementation: https://scikit-learn.org/stable/modules/random_projection.html
Examples
from cuml.random_projection import GaussianRandomProjection
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# dataset generation
data, target = make_blobs(n_samples=800, centers=400, n_features=3000, random_state=42)

# model fitting
model = GaussianRandomProjection(n_components=5, random_state=42).fit(data)

# dataset transformation
transformed_data = model.t