RandomForestClassifier#
- class cuml.ensemble.RandomForestClassifier(*, split_criterion='gini', verbose=False, output_type=None, **kwargs)[source]#
Implements a Random Forest classifier model which fits multiple decision tree classifiers in an ensemble.
Note
The underlying algorithm for tree node splits differs from that used in scikit-learn. By default, the cuML Random Forest uses a quantile-based algorithm to determine splits, rather than an exact count. You can tune the size of the quantiles with the n_bins parameter.
Note
You can export cuML Random Forest models and run predictions with them on machines without an NVIDIA GPU. See https://docs.rapids.ai/api/cuml/nightly/pickling_cuml_models.html for more details.
- Parameters:
- n_estimators : int (default = 100)
Number of trees in the forest. (Default changed to 100 in 0.11.)
- split_criterion : str or int (default = 'gini')
The criterion used to split nodes:
'gini' or 0 for Gini impurity
'entropy' or 1 for information gain (entropy)
- bootstrap : boolean (default = True)
Control bootstrapping:
If True, each tree in the forest is built on a bootstrapped sample with replacement.
If False, the whole dataset is used to build each tree.
- max_samples : float (default = 1.0)
Ratio of dataset rows used while fitting each tree.
- max_depth : int or None (default = 16)
Maximum tree depth. Use None for unlimited depth (trees grow until all leaves are pure). Must be a positive integer or None.
Note
This default differs from scikit-learn’s random forest, which defaults to unlimited depth.
- max_leaves : int (default = -1)
Maximum leaf nodes per tree. This is a soft constraint; unlimited if -1.
- max_features : {‘sqrt’, ‘log2’, None}, int or float (default = ‘sqrt’)
The number of features to consider per node split:
If an int, then max_features is the absolute count of features to be used.
If a float, then max_features is used as a fraction.
If 'sqrt', then max_features = 1/sqrt(n_features).
If 'log2', then max_features = log2(n_features)/n_features.
If None, then max_features = n_features.
Changed in version 24.06: The default of max_features changed from "auto" to "sqrt".
- n_bins : int (default = 128)
Maximum number of bins used by the split algorithm per feature. For large problems, particularly those with highly-skewed input data, increasing the number of bins may improve accuracy.
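The max_features options above resolve to a per-split feature count. A minimal plain-Python sketch of that resolution, following the fractional interpretation documented above (the helper name resolve_max_features is hypothetical, not cuML's internal code):

```python
import math

def resolve_max_features(max_features, n_features):
    """Illustrative resolver for max_features, following the
    fractional formulas documented above (not cuML internals)."""
    if max_features is None:
        frac = 1.0                            # use all features
    elif max_features == 'sqrt':
        frac = 1.0 / math.sqrt(n_features)
    elif max_features == 'log2':
        frac = math.log2(n_features) / n_features
    elif isinstance(max_features, int):
        return max_features                   # absolute count
    else:
        frac = float(max_features)            # already a fraction
    return max(1, int(frac * n_features))

# With 16 features, 'sqrt' gives a 1/4 fraction, i.e. 4 features per split
print(resolve_max_features('sqrt', 16))   # 4
print(resolve_max_features('log2', 16))   # 4
print(resolve_max_features(None, 16))     # 16
```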
- n_streams : int (default = 4)
Number of parallel streams used for forest building.
- min_samples_leaf : int or float (default = 1)
The minimum number of samples (rows) in each leaf node.
If int, then min_samples_leaf represents the minimum number.
If float, then min_samples_leaf represents a fraction and ceil(min_samples_leaf * n_rows) is the minimum number of samples for each leaf node.
- min_samples_split : int or float (default = 2)
The minimum number of samples required to split an internal node.
If int, then min_samples_split represents the minimum number.
If float, then min_samples_split represents a fraction and max(2, ceil(min_samples_split * n_rows)) is the minimum number of samples for each split.
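As a quick illustration of the fractional forms above (plain Python arithmetic, not cuML code):

```python
import math

n_rows = 1000

# A float min_samples_leaf is a fraction of the training rows:
# ceil(min_samples_leaf * n_rows)
leaf_min = math.ceil(0.25 * n_rows)
print(leaf_min)    # 250

# A float min_samples_split is likewise a fraction, floored at 2:
# max(2, ceil(min_samples_split * n_rows))
split_min = max(2, math.ceil(0.0625 * n_rows))
print(split_min)   # 63
```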
- min_impurity_decrease : float (default = 0.0)
Minimum decrease in impurity required for a node to be split.
- max_batch_size : int (default = 4096)
Maximum number of nodes that can be processed in a given batch.
- random_state : int (default = None)
Seed for the random number generator. Unseeded by default.
- oob_score : bool (default = False)
Whether to use out-of-bag samples to estimate the generalization accuracy. Only available if bootstrap=True. The out-of-bag estimate provides a way to evaluate the model without requiring a separate validation set. The OOB score is computed using accuracy.
- verbose : int or boolean (default = False)
Sets logging level. It must be one of cuml.common.logger.level_*. See Verbosity Levels for more info.
- output_type : {‘input’, ‘array’, ‘dataframe’, ‘series’, ‘df_obj’, ‘numba’, ‘cupy’, ‘numpy’, ‘cudf’, ‘pandas’} (default = None)
Return results and set estimator attributes to the indicated output type. If None, the output type set at the module level (cuml.global_settings.output_type) will be used. See Output Data Type Configuration for more info.
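As background for the oob_score option above: each bootstrap sample drawn with replacement leaves out roughly a third of the rows, which is what makes the out-of-bag estimate possible. A quick check of that fraction in plain Python (an illustration of the general bootstrap property, not cuML code):

```python
# Probability that a given row is NOT drawn in n draws with replacement:
# (1 - 1/n)**n, which approaches 1/e (about 0.368) as n grows.
n = 10_000
oob_fraction = (1 - 1 / n) ** n
print(round(oob_fraction, 3))   # 0.368
```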
- Attributes:
- classes_ : np.ndarray, shape = (n_classes,)
A sorted array of the class labels.
- oob_score_ : float
Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when oob_score is True.
- oob_decision_function_ : ndarray of shape (n_samples, n_classes)
Decision function computed with out-of-bag estimate on the training set. If n_estimators is small, it might be possible that a data point was never left out during the bootstrap. In this case, oob_decision_function_ might contain NaN. This attribute exists only when oob_score is True.
- feature_importances_ : ndarray of shape (n_features,)
The impurity-based feature importances.
Methods
fit(X, y, *[, convert_dtype]): Perform Random Forest Classification on the input data.
predict(X, *[, threshold, convert_dtype, ...]): Predicts the labels for X.
predict_proba(X, *[, convert_dtype, layout, ...]): Predicts class probabilities for X.
score(X, y, *[, threshold, convert_dtype, ...]): Calculates the accuracy score of the model on test data.
Notes
When training the model for multi-class classification problems, using deep trees or max_features=1.0 provides better performance.
For additional docs, see scikit-learn’s RandomForestClassifier.
When converting to sklearn using as_sklearn(), the feature_importances_ attribute will return NaN values. If you need feature importances, save them before conversion: importances = cuml_model.feature_importances_
Examples
>>> import cupy as cp
>>> from cuml.ensemble import RandomForestClassifier as cuRFC
>>> X = cp.random.normal(size=(10,4)).astype(cp.float32)
>>> y = cp.asarray([0,1]*5, dtype=cp.int32)
>>> cuml_model = cuRFC(max_features=1.0,
...                    n_bins=8,
...                    n_estimators=40)
>>> cuml_model.fit(X,y)
RandomForestClassifier()
>>> cuml_predict = cuml_model.predict(X)
>>> print("Predicted labels : ", cuml_predict)
Predicted labels :  [0. 1. 0. 1. 0. 1. 0. 1. 0. 1.]
- fit(X, y, *, convert_dtype=True) -> RandomForestClassifier [source]#
Perform Random Forest Classification on the input data.
- Parameters:
- X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- y : array-like (device or host), shape = (n_samples, 1)
Dense matrix of type np.int32. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtype : bool, optional (default = True)
When set to True, the method will automatically convert the inputs to np.float32 and, when necessary, convert y to be of dtype int32. This will increase memory used for the method.
- predict(X, *, threshold=0.5, convert_dtype=True, layout='depth_first', default_chunk_size=None, align_bytes=None)[source]#
Predicts the labels for X.
- Parameters:
- X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- threshold : float (default = 0.5)
Threshold used for classification.
- convert_dtype : bool (default = True)
When True, automatically convert the input to the data type used to train the model. This may increase memory usage.
- layout : string (default = ‘depth_first’)
Forest layout for GPU inference. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.
- default_chunk_size : int, optional (default = None)
Controls batch subdivision for parallel processing. The optimal value depends on hardware, model, and batch size. If None, it is determined automatically.
- align_bytes : int, optional (default = None)
If specified, trees will be padded to this byte alignment, which can improve performance. Typical values are 0 or 128 on GPU.
- Returns:
- y : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, 1)
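The role of threshold above can be illustrated without a GPU: for a binary problem, a row is assigned the positive class when its predicted probability reaches the threshold. A plain-Python sketch of that mapping (an illustration, not cuML's implementation; the exact tie-breaking at the threshold may differ):

```python
# Hypothetical class-1 probabilities for five rows
probs = [0.2, 0.45, 0.5, 0.7, 0.95]

threshold = 0.5
labels = [1 if p >= threshold else 0 for p in probs]
print(labels)          # [0, 0, 1, 1, 1]

# A stricter threshold flips borderline rows to class 0
labels_strict = [1 if p >= 0.8 else 0 for p in probs]
print(labels_strict)   # [0, 0, 0, 0, 1]
```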
- predict_proba(X, *, convert_dtype=True, layout='depth_first', default_chunk_size=None, align_bytes=None) -> CumlArray [source]#
Predicts class probabilities for X. This function uses the GPU implementation of predict.
- Parameters:
- X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- convert_dtype : bool (default = True)
When True, automatically convert the input to the data type used to train the model. This may increase memory usage.
- layout : string (default = ‘depth_first’)
Specifies the in-memory layout of nodes in FIL forests. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.
- default_chunk_size : int, optional (default = None)
Determines how batches are further subdivided for parallel processing. The optimal value depends on hardware, model, and batch size. If None, it is determined automatically.
- align_bytes : int, optional (default = None)
If specified, trees will be padded such that their in-memory size is a multiple of this value. This can improve performance by guaranteeing that memory reads from trees begin on a cache line boundary. Typical values are 0 or 128 on GPU and 0 or 64 on CPU.
- Returns:
- y : cuDF, CuPy or NumPy object depending on cuML’s output type configuration, shape = (n_samples, n_classes)
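The probabilities returned by predict_proba form a distribution over classes for each row. A minimal plain-Python illustration of how class labels follow from such an output (hypothetical probability values, not cuML code):

```python
# Hypothetical 3-class probability rows, one per sample
proba = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]

# Each row sums to 1 (up to floating-point rounding)
for row in proba:
    assert abs(sum(row) - 1.0) < 1e-9

# argmax over each row recovers the predicted class index
preds = [max(range(len(row)), key=row.__getitem__) for row in proba]
print(preds)   # [0, 2]
```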
- score(X, y, *, threshold=0.5, convert_dtype=True, layout='depth_first', default_chunk_size=None, align_bytes=None)[source]#
Calculates the accuracy score of the model on test data.
- Parameters:
- X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix. If the datatype is other than floats or doubles, the data will be converted to float, which increases memory utilization. Set convert_dtype to False to avoid this; the method will then throw an error instead. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- y : array-like (device or host), shape = (n_samples, 1)
Dense matrix of type np.int32. Acceptable formats: CUDA array interface compliant objects like CuPy, cuDF DataFrame/Series, NumPy ndarray and Pandas DataFrame/Series.
- threshold : float (default = 0.5)
Threshold used for classification predictions.
- convert_dtype : bool (default = True)
When True, automatically convert the input to the data type used to train the model. This may increase memory usage.
- layout : string (default = ‘depth_first’)
Specifies the in-memory layout of nodes in FIL forests. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.
- default_chunk_size : int, optional (default = None)
Determines how batches are further subdivided for parallel processing. The optimal value depends on hardware, model, and batch size. If None, it is determined automatically.
- align_bytes : int, optional (default = None)
If specified, trees will be padded such that their in-memory size is a multiple of this value. This can improve performance by guaranteeing that memory reads from trees begin on a cache line boundary. Typical values are 0 or 128 on GPU and 0 or 64 on CPU.
- Returns:
- accuracy : float
Accuracy of the model [0.0 - 1.0].
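The accuracy returned by score is the fraction of rows whose predicted label matches y. A plain-Python sketch of that computation with hypothetical labels (an illustration, not cuML's implementation):

```python
# Hypothetical true and predicted labels for five rows
y_true = [0, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 1]

# Fraction of rows where the prediction matches the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)   # 0.8
```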