RandomForestRegressor#
- class cuml.ensemble.RandomForestRegressor(*, split_criterion='mse', max_features=1.0, verbose=False, output_type=None, **kwargs)[source]#
Implements a Random Forest regressor model which fits multiple decision trees in an ensemble.
Note
Note that the underlying algorithm for tree node splits differs from that used in scikit-learn. By default, the cuML Random Forest uses a quantile-based algorithm to determine splits, rather than an exact count. You can tune the size of the quantiles with the n_bins parameter.
Note
You can export cuML Random Forest models and run predictions with them on machines without an NVIDIA GPU. See https://docs.rapids.ai/api/cuml/nightly/pickling_cuml_models.html for more details.
- Parameters:
- n_estimatorsint (default = 100)
Number of trees in the forest. (Default changed to 100 in cuML 0.11)
- split_criterion str or int (default = 'mse')
The criterion used to split nodes:
'mse' or 2 for mean squared error
'poisson' or 4 for Poisson half deviance
'gamma' or 5 for gamma half deviance
'inverse_gaussian' or 6 for inverse Gaussian deviance
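As an illustration of the default 'mse' criterion, a candidate split can be scored by the weighted reduction in mean squared error (variance around the node mean) it achieves. The following is a minimal pure-Python sketch of that idea, not cuML's GPU implementation:

```python
# Sketch of the 'mse' split criterion: score a split by the weighted
# impurity decrease it achieves. Illustrative only, not cuML source.

def node_mse(values):
    """Mean squared error of a node's targets around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def mse_decrease(parent, left, right):
    """Weighted impurity decrease for splitting `parent` into `left`/`right`."""
    n = len(parent)
    weighted_child = (len(left) * node_mse(left) + len(right) * node_mse(right)) / n
    return node_mse(parent) - weighted_child

parent = [0.0, 1.0, 2.0, 3.0]
print(mse_decrease(parent, [0.0, 1.0], [2.0, 3.0]))  # prints 1.0
```

A split is accepted only if this decrease exceeds min_impurity_decrease (see below).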
- bootstrapboolean (default = True)
Control bootstrapping.
If True, each tree in the forest is built on a bootstrapped sample with replacement.
If False, the whole dataset is used to build each tree.
- max_samplesfloat (default = 1.0)
Ratio of dataset rows used while fitting each tree.
- max_depthint or None (default = 16)
Maximum tree depth. Use None for unlimited depth (trees grow until all leaves are pure). Must be a positive integer or None.
Note
This default differs from scikit-learn’s random forest, which defaults to unlimited depth.
- max_leavesint (default = -1)
Maximum leaf nodes per tree. This is a soft constraint; unlimited if -1.
- max_features {‘sqrt’, ‘log2’, None}, int or float (default = 1.0)
The number of features to consider per node split:
If an int, then max_features is the absolute count of features to be used.
If a float, then max_features is used as a fraction.
If ‘sqrt’, then max_features = 1/sqrt(n_features).
If ‘log2’, then max_features = log2(n_features)/n_features.
If None, then max_features = n_features.
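These rules can be sketched with a small helper that resolves a max_features setting to a per-split feature count. The final rounding of the fraction to an integer count is an assumption for illustration; cuML's internal handling may differ:

```python
import math

def resolve_max_features(max_features, n_features):
    """Map a max_features setting to a per-split feature count, following the
    rules documented above. Rounding via ceil is an assumption, not cuML source."""
    if max_features is None:
        frac = 1.0
    elif max_features == 'sqrt':
        frac = 1.0 / math.sqrt(n_features)
    elif max_features == 'log2':
        frac = math.log2(n_features) / n_features
    elif isinstance(max_features, int):
        return max_features          # absolute count, used as-is
    else:
        frac = max_features          # float: fraction of n_features
    return max(1, math.ceil(frac * n_features))

print(resolve_max_features('sqrt', 16))  # ceil(16 * 1/4) = 4
print(resolve_max_features('log2', 64))  # ceil(64 * 6/64) = 6
```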
Changed in version 24.06: The default of max_features changed from "auto" to 1.0.
- n_bins int (default = 128)
Maximum number of bins used by the split algorithm per feature. For large problems, particularly those with highly-skewed input data, increasing the number of bins may improve accuracy.
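To see why quantile-based bins help with skewed data, note that equal-probability bin edges concentrate candidate split thresholds where the values are dense, instead of spacing them evenly across the raw range. A minimal standard-library sketch (illustrative only; cuML's GPU histogram builder is more involved):

```python
import statistics

# Split a skewed feature into n_bins equal-probability buckets: the bin
# edges (candidate split thresholds) follow the data distribution.
n_bins = 4
skewed = [1, 1, 2, 2, 3, 3, 4, 5, 20, 100]  # highly skewed feature values
edges = statistics.quantiles(skewed, n=n_bins, method='inclusive')
print(edges)  # all edges fall in the dense low-value region, far below 100
```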
- n_streams int (default = 4)
Number of parallel streams used for forest building.
- min_samples_leafint or float (default = 1)
The minimum number of samples (rows) in each leaf node.
If an int, then min_samples_leaf represents the minimum number.
If a float, then min_samples_leaf represents a fraction and ceil(min_samples_leaf * n_rows) is the minimum number of samples for each leaf node.
- min_samples_splitint or float (default = 2)
The minimum number of samples required to split an internal node.
If an int, then min_samples_split represents the minimum number.
If a float, then min_samples_split represents a fraction and max(2, ceil(min_samples_split * n_rows)) is the minimum number of samples for each split.
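The int/float handling for both thresholds can be sketched as follows (a direct illustration of the formulas documented above, not cuML source):

```python
import math

def min_leaf_count(min_samples_leaf, n_rows):
    """Resolve min_samples_leaf to a row count per the documented rule."""
    if isinstance(min_samples_leaf, int):
        return min_samples_leaf
    return math.ceil(min_samples_leaf * n_rows)

def min_split_count(min_samples_split, n_rows):
    """Resolve min_samples_split to a row count per the documented rule."""
    if isinstance(min_samples_split, int):
        return min_samples_split
    return max(2, math.ceil(min_samples_split * n_rows))

print(min_leaf_count(0.125, 40))  # ceil(0.125 * 40) = 5
print(min_split_count(0.5, 7))    # max(2, ceil(3.5)) = 4
```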
- min_impurity_decreasefloat (default = 0.0)
The minimum decrease in impurity required for a node to be split.
- max_batch_sizeint (default = 4096)
Maximum number of nodes that can be processed in a given batch.
- random_stateint (default = None)
Seed for the random number generator. Unseeded by default.
- oob_scorebool (default = False)
Whether to use out-of-bag samples to estimate the generalization performance. Only available if bootstrap=True. The out-of-bag estimate provides a way to evaluate the model without requiring a separate validation set. The OOB score is computed using R² (coefficient of determination).
- verbose int or boolean, default=False
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- output_type {‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’}, default=None
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
- Attributes:
- oob_score_ float
Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when oob_score is True.
- oob_prediction_ ndarray of shape (n_samples,) or (n_samples, n_outputs)
Prediction computed with out-of-bag estimate on the training set. This attribute exists only when oob_score is True.
- feature_importances_ ndarray of shape (n_features,)
The impurity-based feature importances.
Methods
fit(X, y, *[, convert_dtype]): Perform Random Forest Regression on the input data.
predict(X, *[, convert_dtype, layout, ...]): Predicts the values for X.
score(X, y, *[, convert_dtype, layout, ...]): Calculates the R² score of the model on test data.
Notes
For additional docs, see scikit-learn’s RandomForestRegressor.
When converting to sklearn using as_sklearn(), the feature_importances_ attribute will return NaN values. If you need feature importances, save them before conversion: importances = cuml_model.feature_importances_
Examples
>>> import cupy as cp
>>> from cuml.ensemble import RandomForestRegressor as curfr
>>> X = cp.asarray([[0,10],[0,20],[0,30],[0,40]], dtype=cp.float32)
>>> y = cp.asarray([0.0,1.0,2.0,3.0], dtype=cp.float32)
>>> cuml_model = curfr(max_features=1.0, n_bins=128,
...                    min_samples_leaf=1,
...                    min_samples_split=2,
...                    n_estimators=40)
>>> cuml_model.fit(X,y)
RandomForestRegressor()
>>> cuml_score = cuml_model.score(X,y)
>>> print("R2 score of cuml : ", cuml_score)
R2 score of cuml : 0.9076250195503235
- fit(X, y, *, convert_dtype=True) RandomForestRegressor[source]#
Perform Random Forest Regression on the input data.
- Parameters:
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the train method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.
- predict(X, *, convert_dtype=True, layout='depth_first', default_chunk_size=None, align_bytes=None) CumlArray[source]#
Predicts the values for X.
- Parameters:
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool, optional (default = True)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
- layoutstring (default = ‘depth_first’)
Specifies the in-memory layout of nodes in FIL forests. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.
- default_chunk_sizeint, optional (default = None)
Determines how batches are further subdivided for parallel processing. The optimal value depends on hardware, model, and batch size. If None, will be automatically determined.
- align_bytesint, optional (default = None)
If specified, trees will be padded such that their in-memory size is a multiple of this value. This can improve performance by guaranteeing that memory reads from trees begin on a cache line boundary. Typical values are 0 or 128.
- Returns:
- y cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
- score(X, y, *, convert_dtype=True, layout='depth_first', default_chunk_size=None, align_bytes=None)[source]#
Calculates the R² score of the model on test data.
- Parameters:
- Xarray-like (device or host) shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- yarray-like (device or host) shape = (n_samples, 1)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set the parameter convert_dtype to False to avoid this; the method will then throw an error instead of converting. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtypebool (default = True)
When True, automatically convert the input to the data type used to train the model. This may increase memory usage.
- layoutstring (default = ‘depth_first’)
Specifies the in-memory layout of nodes in FIL forests. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.
- default_chunk_sizeint, optional (default = None)
Determines how batches are further subdivided for parallel processing. The optimal value depends on hardware, model, and batch size. If None, will be automatically determined.
- align_bytesint, optional (default = None)
If specified, trees will be padded such that their in-memory size is a multiple of this value. This can improve performance by guaranteeing that memory reads from trees begin on a cache line boundary. Typical values are 0 or 128.
- Returns:
- r2_score float
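The R² value returned by score (and used for the out-of-bag estimate when oob_score=True) is the standard coefficient of determination, 1 - SS_res/SS_tot. A minimal pure-Python sketch of the formula:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)          # total variation
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual error
    return 1.0 - ss_res / ss_tot

print(r2_score([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]))  # perfect fit -> 1.0
print(r2_score([0.0, 1.0, 2.0, 3.0], [0.5, 1.0, 2.0, 2.5]))  # 1 - 0.5/5 = 0.9
```

A score of 1.0 means perfect prediction; 0.0 means the model does no better than predicting the mean of y_true.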