RandomForestRegressor#
- class cuml.ensemble.RandomForestRegressor(*, split_criterion='mse', max_features=1.0, verbose=False, output_type=None, **kwargs)[source]#
Implements a Random Forest regressor model which fits multiple decision trees in an ensemble.
Note
Note that the underlying algorithm for tree node splits differs from that used in scikit-learn. By default, the cuML Random Forest uses a quantile-based algorithm to determine splits, rather than an exact count. You can tune the size of the quantiles with the n_bins parameter.
Note
You can export cuML Random Forest models and run predictions with them on machines without an NVIDIA GPU. See https://docs.rapids.ai/api/cuml/nightly/pickling_cuml_models.html for more details.
- Parameters:
- n_estimatorsint (default = 100)
Number of trees in the forest. (Default changed to 100 in cuML 0.11)
- split_criterion str or int (default = 'mse')
The criterion used to split nodes:
'mse' or 2 for mean squared error
'poisson' or 4 for Poisson half deviance
'gamma' or 5 for gamma half deviance
'inverse_gaussian' or 6 for inverse Gaussian deviance
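As an illustration of the default 'mse' criterion, a candidate split can be scored by the weighted reduction in mean squared error (variance around the node mean) it achieves. The following is a minimal pure-Python sketch of that idea, not cuML's GPU implementation:

```python
# Sketch of the 'mse' split criterion: score a split by the weighted
# impurity decrease it achieves. Illustrative only, not cuML source.

def node_mse(values):
    """Mean squared error of a node's targets around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def mse_decrease(parent, left, right):
    """Weighted impurity decrease for splitting `parent` into `left`/`right`."""
    n = len(parent)
    weighted_child = (len(left) * node_mse(left) + len(right) * node_mse(right)) / n
    return node_mse(parent) - weighted_child

parent = [0.0, 1.0, 2.0, 3.0]
print(mse_decrease(parent, [0.0, 1.0], [2.0, 3.0]))  # prints 1.0
```

A split is accepted only if this decrease exceeds min_impurity_decrease (see below).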
- bootstrapboolean (default = True)
Control bootstrapping.
If True, each tree in the forest is built on a bootstrapped sample with replacement.
If False, the whole dataset is used to build each tree.
- max_samplesfloat (default = 1.0)
Ratio of dataset rows used while fitting each tree.
- max_depthint or None (default = 16)
Maximum tree depth. Use None for unlimited depth (trees grow until all leaves are pure). Must be a positive integer or None.
Note
This default differs from scikit-learn’s random forest, which defaults to unlimited depth.
- max_leavesint (default = -1)
Maximum leaf nodes per tree. This is a soft constraint; unlimited if -1.
- max_features {‘sqrt’, ‘log2’, None}, int or float (default = 1.0)
The number of features to consider per node split:
If an int, then max_features is the absolute count of features to be used.
If a float, then max_features is used as a fraction.
If ‘sqrt’, then max_features = 1/sqrt(n_features).
If ‘log2’, then max_features = log2(n_features)/n_features.
If None, then max_features = n_features.
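These rules can be sketched with a small helper that resolves a max_features setting to a per-split feature count. The final rounding of the fraction to an integer count is an assumption for illustration; cuML's internal handling may differ:

```python
import math

def resolve_max_features(max_features, n_features):
    """Map a max_features setting to a per-split feature count, following the
    rules documented above. Rounding via ceil is an assumption, not cuML source."""
    if max_features is None:
        frac = 1.0
    elif max_features == 'sqrt':
        frac = 1.0 / math.sqrt(n_features)
    elif max_features == 'log2':
        frac = math.log2(n_features) / n_features
    elif isinstance(max_features, int):
        return max_features          # absolute count, used as-is
    else:
        frac = max_features          # float: fraction of n_features
    return max(1, math.ceil(frac * n_features))

print(resolve_max_features('sqrt', 16))  # ceil(16 * 1/4) = 4
print(resolve_max_features('log2', 64))  # ceil(64 * 6/64) = 6
```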
Changed in version 24.06: The default of max_features changed from "auto" to 1.0.
- n_bins int (default = 128)
Maximum number of bins used by the split algorithm per feature. For large problems, particularly those with highly-skewed input data, increasing the number of bins may improve accuracy.
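To see why quantile-based bins help with skewed data, note that equal-probability bin edges concentrate candidate split thresholds where the values are dense, instead of spacing them evenly across the raw range. A minimal standard-library sketch (illustrative only; cuML's GPU histogram builder is more involved):

```python
import statistics

# Split a skewed feature into n_bins equal-probability buckets: the bin
# edges (candidate split thresholds) follow the data distribution.
n_bins = 4
skewed = [1, 1, 2, 2, 3, 3, 4, 5, 20, 100]  # highly skewed feature values
edges = statistics.quantiles(skewed, n=n_bins, method='inclusive')
print(edges)  # all edges fall in the dense low-value region, far below 100
```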
- n_streams int (default = 4)
Number of parallel streams used for forest building.
- min_samples_leafint or float (default = 1)
The minimum number of samples (rows) in each leaf node.
If an int, then min_samples_leaf represents the minimum number.
If a float, then min_samples_leaf represents a fraction and ceil(min_samples_leaf * n_rows) is the minimum number of samples for each leaf node.
- min_samples_splitint or float (default = 2)
The minimum number of samples required to split an internal node.
If an int, then min_samples_split represents the minimum number.
If a float, then min_samples_split represents a fraction and max(2, ceil(min_samples_split * n_rows)) is the minimum number of samples for each split.
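The int/float handling for both thresholds can be sketched as follows (a direct illustration of the formulas documented above, not cuML source):

```python
import math

def min_leaf_count(min_samples_leaf, n_rows):
    """Resolve min_samples_leaf to a row count per the documented rule."""
    if isinstance(min_samples_leaf, int):
        return min_samples_leaf
    return math.ceil(min_samples_leaf * n_rows)

def min_split_count(min_samples_split, n_rows):
    """Resolve min_samples_split to a row count per the documented rule."""
    if isinstance(min_samples_split, int):
        return min_samples_split
    return max(2, math.ceil(min_samples_split * n_rows))

print(min_leaf_count(0.125, 40))  # ceil(0.125 * 40) = 5
print(min_split_count(0.5, 7))    # max(2, ceil(3.5)) = 4
```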
- min_impurity_decreasefloat (default = 0.0)
The minimum decrease in impurity required for a node to be split.
- max_batch_sizeint (default = 4096)
Maximum number of nodes that can be processed in a given batch.
- random_stateint (default = None)
Seed for the random number generator. Unseeded by default.
- oob_scorebool (default = False)
Whether to use out-of-bag samples to estimate the generalization performance. Only available if bootstrap=True. The out-of-bag estimate provides a way to evaluate the model without requiring a separate validation set. The OOB score is computed using R² (coefficient of determination).
- verbose int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- output_type {‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
- Attributes:
- oob_score_ float
Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when oob_score is True.
- oob_prediction_ ndarray of shape (n_samples,) or (n_samples, n_outputs)
Prediction computed with out-of-bag estimate on the training set. This attribute exists only when oob_score is True.
- feature_importances_ ndarray of shape (n_features,)
The impurity-based feature importances.
Methods
fit(X, y, *[, convert_dtype]): Perform Random Forest Regression on the input data.
predict(X, *[, convert_dtype, layout, ...]): Predicts the values for X.
score(X, y, *[, convert_dtype, layout, ...]): Calculates the R² score of the model on test data.
Notes
For additional docs, see scikit-learn’s RandomForestRegressor.
When converting to sklearn using as_sklearn(), the feature_importances_ attribute will return NaN values. If you need feature importances, save them before conversion: importances = cuml_model.feature_importances_
Examples
>>> import cupy as cp
>>> from cuml.ensemble import RandomForestRegressor as curfr
>>> X = cp.asarray([[0,10],[0,20],[0,30],[0,40]], dtype=cp.float32)
>>> y = cp.asarray([0.0,1.0,2.0,3.0], dtype=cp.float32)
>>> cuml_model = curfr(max_features=1.0, n_bins=128,
...                    min_samples_leaf=1,
...                    min_samples_split=2,
...                    n_estimators=40)
>>> cuml_model.fit(X,y)
RandomForestRegressor()
>>> cuml_score = cuml_model.score(X,y)
>>> print("R2 score of cuml : ", cuml_score)
R2 score of cuml : 0.9076250195503235
- fit(X, y, *, convert_dtype=True) RandomForestRegressor[source]#
Perform Random Forest Regression on the input data.
- Parameters:
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.
- predict(X, *, convert_dtype=True, layout='depth_first', default_chunk_size=None, align_bytes=None) CumlArray[source]#
Predicts the values for X.
- Parameters:
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
- layoutstring (default = ‘depth_first’)
Specifies the in-memory layout of nodes in FIL forests. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.
- default_chunk_sizeint, optional (default = None)
Determines how batches are further subdivided for parallel processing. The optimal value depends on hardware, model, and batch size. If None, will be automatically determined.
- align_bytesint, optional (default = None)
If specified, trees will be padded such that their in-memory size is a multiple of this value. This can improve performance by guaranteeing that memory reads from trees begin on a cache line boundary. Typical values are 0 or 128.
- Returns:
- y cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
- score(X, y, *, convert_dtype=True, layout='depth_first', default_chunk_size=None, align_bytes=None)[source]#
Calculates the R² score of the model on test data.
- Parameters:
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool (default = True)
When True, automatically convert the input to the data type used to train the model. This may increase memory usage.
- layoutstring (default = ‘depth_first’)
Specifies the in-memory layout of nodes in FIL forests. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.
- default_chunk_sizeint, optional (default = None)
Determines how batches are further subdivided for parallel processing. The optimal value depends on hardware, model, and batch size. If None, will be automatically determined.
- align_bytesint, optional (default = None)
If specified, trees will be padded such that their in-memory size is a multiple of this value. This can improve performance by guaranteeing that memory reads from trees begin on a cache line boundary. Typical values are 0 or 128.
- Returns:
- r2_score float
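The R² value returned by score (and used for the out-of-bag estimate when oob_score=True) is the standard coefficient of determination, 1 - SS_res/SS_tot. A minimal pure-Python sketch of the formula:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)          # total variation
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual error
    return 1.0 - ss_res / ss_tot

print(r2_score([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]))  # perfect fit -> 1.0
print(r2_score([0.0, 1.0, 2.0, 3.0], [0.5, 1.0, 2.0, 2.5]))  # 1 - 0.5/5 = 0.9
```

A score of 1.0 means perfect prediction; 0.0 means the model does no better than predicting the mean of y_true.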