RandomForestClassifier#

class cuml.ensemble.RandomForestClassifier(*, split_criterion='gini', verbose=False, output_type=None, **kwargs)[source]#

Implements a Random Forest classifier model which fits multiple decision tree classifiers in an ensemble.

Note

The underlying algorithm for tree node splits differs from that used in scikit-learn. By default, the cuML Random Forest uses a quantile-based algorithm to determine splits, rather than an exact count. You can tune the size of the quantiles with the n_bins parameter.

Note

You can export cuML Random Forest models and run predictions with them on machines without an NVIDIA GPU. See https://docs.rapids.ai/api/cuml/nightly/pickling_cuml_models.html for more details.

Parameters:
n_estimatorsint (default = 100)

Number of trees in the forest. (Default changed to 100 in version 0.11.)

split_criterionstr or int (default = 'gini')

The criterion used to split nodes.

  • 'gini' or 0 for gini impurity

  • 'entropy' or 1 for information gain (entropy)

bootstrapboolean (default = True)

Control bootstrapping.

  • If True, each tree in the forest is built on a bootstrapped sample with replacement.

  • If False, the whole dataset is used to build each tree.

max_samplesfloat (default = 1.0)

Ratio of dataset rows used while fitting each tree.
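Taken together, bootstrap and max_samples determine how many rows each tree sees. A minimal NumPy sketch of the idea (illustrative only, not cuML's internal sampling code):

```python
import numpy as np

# Illustration only (not cuML's internal sampling): with bootstrap=True
# and max_samples=0.8, each tree is fit on int(0.8 * n_rows) rows drawn
# with replacement from the training set.
rng = np.random.default_rng(0)
n_rows = 1000
max_samples = 0.8

sample_size = int(max_samples * n_rows)
bootstrap_idx = rng.integers(0, n_rows, size=sample_size)  # with replacement

print(sample_size)  # 800
# Because sampling is with replacement, duplicates are expected:
print(len(np.unique(bootstrap_idx)) < sample_size)  # True
```

With bootstrap=False, every tree would instead see all n_rows rows, and max_samples has no sampling to scale.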

max_depthint or None (default = 16)

Maximum tree depth. Use None for unlimited depth (trees grow until all leaves are pure). Must be a positive integer or None.

Note

This default differs from scikit-learn’s random forest, which defaults to unlimited depth.

max_leavesint (default = -1)

Maximum leaf nodes per tree. This is a soft constraint; unlimited if -1.

max_features{‘sqrt’, ‘log2’, None}, int or float (default = ‘sqrt’)

The number of features to consider per node split:

  • If an int then max_features is the absolute count of features to be used.

  • If a float then max_features is used as a fraction.

  • If 'sqrt' then max_features=1/sqrt(n_features).

  • If 'log2' then max_features=log2(n_features)/n_features.

  • If None then max_features=n_features

Changed in version 24.06: The default of max_features changed from "auto" to "sqrt".
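How these options translate into a per-split feature count can be seen with a quick calculation (the values and rounding here are illustrative; cuML's exact rounding may differ):

```python
import math

# Per-split feature counts implied by each max_features option, for a
# hypothetical n_features = 64.
n_features = 64

count_sqrt = int(n_features * (1 / math.sqrt(n_features)))           # 8
count_log2 = int(n_features * (math.log2(n_features) / n_features))  # 6
count_frac = int(n_features * 0.25)  # a float is used as a fraction -> 16
count_none = n_features              # None -> all features -> 64

print(count_sqrt, count_log2, count_frac, count_none)  # 8 6 16 64
```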

n_binsint (default = 128)

Maximum number of bins used by the split algorithm per feature. For large problems, particularly those with highly-skewed input data, increasing the number of bins may improve accuracy.
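The idea behind quantile-based splitting can be sketched with NumPy: candidate split thresholds come from feature quantiles rather than from every distinct value. This is a concept sketch, not cuML's actual implementation:

```python
import numpy as np

# Concept sketch of quantile-based split candidates (not cuML's actual
# implementation): with n_bins bins, candidate thresholds for a feature
# come from its quantiles rather than from every distinct value.
rng = np.random.default_rng(42)
feature = rng.lognormal(size=10_000)  # a highly skewed feature

n_bins = 8
# n_bins - 1 interior quantiles serve as candidate split thresholds:
thresholds = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])

print(len(thresholds))  # 7
```

For skewed data like the lognormal feature above, a small n_bins can place most thresholds in the dense region and miss useful splits in the tail, which is why increasing n_bins may improve accuracy.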

n_streamsint (default = 4)

Number of parallel streams used for forest building.

min_samples_leafint or float (default = 1)

The minimum number of samples (rows) in each leaf node.

  • If type int, then min_samples_leaf represents the minimum number.

  • If float, then min_samples_leaf represents a fraction and ceil(min_samples_leaf * n_rows) is the minimum number of samples for each leaf node.

min_samples_splitint or float (default = 2)

The minimum number of samples required to split an internal node.

  • If type int, then min_samples_split represents the minimum number.

  • If type float, then min_samples_split represents a fraction and max(2, ceil(min_samples_split * n_rows)) is the minimum number of samples for each split.
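A quick check of how float values translate into row counts using the formulas above (hypothetical values for illustration):

```python
import math

# How float values of min_samples_leaf / min_samples_split translate
# into row counts, for a hypothetical dataset with n_rows = 1000.
n_rows = 1000

min_samples_leaf = 0.03125  # fraction of rows
leaf_min = math.ceil(min_samples_leaf * n_rows)  # ceil(31.25) = 32

min_samples_split = 0.001   # fraction of rows
split_min = max(2, math.ceil(min_samples_split * n_rows))  # never below 2

print(leaf_min, split_min)  # 32 2
```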

min_impurity_decreasefloat (default = 0.0)

Minimum decrease in impurity required for a node to be split.

max_batch_sizeint (default = 4096)

Maximum number of nodes that can be processed in a given batch.

random_stateint (default = None)

Seed for the random number generator. Unseeded by default.

oob_scorebool (default = False)

Whether to use out-of-bag samples to estimate the generalization accuracy. Only available if bootstrap=True. The out-of-bag estimate provides a way to evaluate the model without requiring a separate validation set. The OOB score is computed using accuracy.
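cuML's OOB options mirror scikit-learn's, so the behavior can be sketched with the scikit-learn equivalent, which runs without a GPU (this assumes scikit-learn is installed); with cuML the usage is the same: pass bootstrap=True and oob_score=True, then read oob_score_ after fitting.

```python
# Sketch using the scikit-learn equivalent (assumes scikit-learn is
# installed); cuML's OOB options follow the same semantics.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

model = RandomForestClassifier(
    n_estimators=50, bootstrap=True, oob_score=True, random_state=0
)
model.fit(X, y)

# Accuracy-based estimate computed from out-of-bag samples:
print(0.0 <= model.oob_score_ <= 1.0)      # True
print(model.oob_decision_function_.shape)  # (500, 2)
```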

verboseint or boolean, default=False

Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.

output_type{‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None

Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.

Attributes:
classes_np.ndarray, shape=(n_classes,)

A sorted array of the class labels.

oob_score_float

Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when oob_score is True.

oob_decision_function_ndarray of shape (n_samples, n_classes)

Decision function computed with out-of-bag estimate on the training set. If n_estimators is small it might be possible that a data point was never left out during the bootstrap. In this case, oob_decision_function_ might contain NaN. This attribute exists only when oob_score is True.

feature_importances_ndarray of shape (n_features,)

The impurity-based feature importances.

Methods

fit(X, y, *[, convert_dtype])

Perform Random Forest Classification on the input data

predict(X, *[, threshold, convert_dtype, ...])

Predicts the labels for X.

predict_proba(X, *[, convert_dtype, layout, ...])

Predicts class probabilities for X.

score(X, y, *[, threshold, convert_dtype, ...])

Calculates the accuracy score of the model on test data.

Notes

When training the model on multi-class classification problems, using deep trees or max_features=1.0 provides better performance.

For additional docs, see scikit-learn’s RandomForestClassifier.

When converting to sklearn using as_sklearn(), the feature_importances_ attribute will return NaN values. If you need feature importances, save them before conversion: importances = cuml_model.feature_importances_

Examples

>>> import cupy as cp
>>> from cuml.ensemble import RandomForestClassifier as cuRFC

>>> X = cp.random.normal(size=(10,4)).astype(cp.float32)
>>> y = cp.asarray([0,1]*5, dtype=cp.int32)

>>> cuml_model = cuRFC(max_features=1.0,
...                    n_bins=8,
...                    n_estimators=40)
>>> cuml_model.fit(X,y)
RandomForestClassifier()
>>> cuml_predict = cuml_model.predict(X)

>>> print("Predicted labels : ", cuml_predict)
Predicted labels :  [0. 1. 0. 1. 0. 1. 0. 1. 0. 1.]
fit(X, y, *, convert_dtype=True) RandomForestClassifier[source]#

Perform Random Forest Classification on the input data.

Parameters:
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix. If the datatype is anything other than float or double, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then throw an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix of type np.int32. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool, optional (default = True)

When set to True, the method will automatically convert X to np.float32 and, when necessary, y to np.int32. This will increase memory used for the method.

predict(X, *, threshold=0.5, convert_dtype=True, layout='depth_first', default_chunk_size=None, align_bytes=None)[source]#

Predicts the labels for X.

Parameters:
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix. If the datatype is anything other than float or double, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then throw an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

thresholdfloat (default = 0.5)

Threshold used for classification.

convert_dtypebool (default = True)

When True, automatically convert the input to the data type used to train the model. This may increase memory usage.

layoutstring (default = ‘depth_first’)

Specifies the in-memory layout of nodes in FIL forests. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.

default_chunk_sizeint, optional (default = None)

Controls batch subdivision for parallel processing. Optimal value depends on hardware, model and batch size. If None, determined automatically.

align_bytesint, optional (default = None)

If specified, trees will be padded such that their in-memory size is a multiple of this value. This can improve performance by guaranteeing that memory reads from trees begin on a cache line boundary. Typical values are 0 or 128 on GPU and 0 or 64 on CPU.
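In the binary case, the threshold parameter acts as a cutoff on the predicted probability of the positive class. A NumPy sketch of that logic (whether cuML's comparison is strict or inclusive is an implementation detail):

```python
import numpy as np

# Sketch of binary thresholding (hypothetical probabilities; whether
# the comparison is strict or inclusive is an implementation detail):
proba_class1 = np.array([0.2, 0.5, 0.7, 0.9])  # P(class 1) per row
threshold = 0.6

labels = (proba_class1 > threshold).astype(np.int32)
print(labels)  # [0 0 1 1]
```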

Returns:
ycuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
predict_proba(X, *, convert_dtype=True, layout='depth_first', default_chunk_size=None, align_bytes=None) CumlArray[source]#

Predicts class probabilities for X. This function uses the GPU implementation of predict.

Parameters:
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix. If the datatype is anything other than float or double, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then throw an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

convert_dtypebool (default = True)

When True, automatically convert the input to the data type used to train the model. This may increase memory usage.

layoutstring (default = ‘depth_first’)

Specifies the in-memory layout of nodes in FIL forests. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.

default_chunk_sizeint, optional (default = None)

Determines how batches are further subdivided for parallel processing. The optimal value depends on hardware, model, and batch size. If None, will be automatically determined.

align_bytesint, optional (default = None)

If specified, trees will be padded such that their in-memory size is a multiple of this value. This can improve performance by guaranteeing that memory reads from trees begin on a cache line boundary. Typical values are 0 or 128 on GPU and 0 or 64 on CPU.

Returns:
ycuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
score(X, y, *, threshold=0.5, convert_dtype=True, layout='depth_first', default_chunk_size=None, align_bytes=None)[source]#

Calculates the accuracy score of the model on test data.

Parameters:
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix. If the datatype is anything other than float or double, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then throw an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

yarray-like (device or host) shape = (n_samples, 1)

Dense matrix of type np.int32. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.

thresholdfloat (default = 0.5)

Threshold used for classification predictions.

convert_dtypebool (default = True)

When True, automatically convert the input to the data type used to train the model. This may increase memory usage.

layoutstring (default = ‘depth_first’)

Specifies the in-memory layout of nodes in FIL forests. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.

default_chunk_sizeint, optional (default = None)

Determines how batches are further subdivided for parallel processing. The optimal value depends on hardware, model, and batch size. If None, will be automatically determined.

align_bytesint, optional (default = None)

If specified, trees will be padded such that their in-memory size is a multiple of this value. This can improve performance by guaranteeing that memory reads from trees begin on a cache line boundary. Typical values are 0 or 128 on GPU and 0 or 64 on CPU.

Returns:
accuracyfloat

Accuracy of the model [0.0 - 1.0]
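The returned value is plain accuracy: the fraction of rows where the predicted label matches y. An equivalent hand computation with NumPy (hypothetical labels for illustration):

```python
import numpy as np

# score returns plain accuracy: the fraction of rows where the
# predicted label matches y. Equivalent hand computation:
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

accuracy = float((y_true == y_pred).mean())
print(accuracy)  # 0.8
```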