RandomForestRegressor
- class cuml.dask.ensemble.RandomForestRegressor(*, workers=None, client=None, verbose=False, n_estimators=100, random_state=None, ignore_empty_partitions=False, **kwargs)
Multi-GPU Random Forest regressor model which fits multiple decision tree regressors in an ensemble. This uses Dask to partition data over multiple GPUs (possibly on different nodes).
- This implementation makes the following assumptions:
The set of Dask workers is consistent across instantiation, fit, and predict.
Training data comes in the form of cuDF dataframes or Dask Arrays distributed so that each worker has at least one partition.
The distributed algorithm uses an embarrassingly parallel approach. For a forest with N trees being built on w workers, each worker simply builds N/w trees on the data it has available locally. In many cases, partitioning the data so that each worker builds trees on a subset of the total dataset works well, but it generally requires the data to be well-shuffled in advance. Alternatively, callers can replicate all of the data across workers so that rf.fit receives w partitions, each containing the same data. This would produce results approximately identical to single-GPU fitting.
Please check the single-GPU implementation of Random Forest regressor for more information about the underlying algorithm.
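The N/w division described above can be sketched in plain Python. This is only an illustration: the helper name `trees_per_worker` is hypothetical, and handing the remainder to the first workers when N is not divisible by w is an assumption, since this page does not say how that case is handled.

```python
def trees_per_worker(n_estimators, n_workers):
    """Split n_estimators trees as evenly as possible over n_workers.

    Hypothetical helper; the remainder handling is an assumption.
    """
    if n_workers < 1:
        raise ValueError("need at least one worker")
    base, remainder = divmod(n_estimators, n_workers)
    # The first `remainder` workers each build one extra tree.
    return [base + (1 if i < remainder else 0) for i in range(n_workers)]


print(trees_per_worker(100, 4))  # [25, 25, 25, 25]
print(trees_per_worker(100, 8))  # [13, 13, 13, 13, 12, 12, 12, 12]
```

Whatever the split, the per-worker counts sum to n_estimators, which is why that parameter is documented as a forest-wide total rather than a per-worker value.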
- Parameters:
- n_estimators : int (default = 100)
Total number of trees in the forest (not per-worker).
- split_criterion : int or string (default = 2 ('mse'))
The criterion used to split nodes.
0 or 'gini' for gini impurity
1 or 'entropy' for information gain (entropy)
2 or 'mse' for mean squared error
4 or 'poisson' for poisson half deviance
5 or 'gamma' for gamma half deviance
6 or 'inverse_gaussian' for inverse gaussian deviance
0, 'gini', 1, and 'entropy' are not valid for regression.
- bootstrap : boolean (default = True)
Control bootstrapping.
If True, each tree in the forest is built on a bootstrapped sample with replacement.
If False, the whole dataset is used to build each tree.
- max_samples : float (default = 1.0)
Ratio of dataset rows used while fitting each tree.
- max_depth : int or None (default = 16)
Maximum tree depth. Use None for unlimited depth (trees grow until all leaves are pure). Must be a positive integer or None.
Note: This default differs from scikit-learn's random forest, which defaults to unlimited depth.
Changed in version 26.08: The default of max_depth will change from 16 to None.
- max_leaves : int (default = -1)
Maximum leaf nodes per tree. Soft constraint. Unlimited if -1.
- max_features : int, float, string, or None (default = 'auto')
Ratio of number of features (columns) to consider per node split.
If type int, then max_features is the absolute count of features to be used.
If type float, then max_features is a fraction.
If 'auto', then max_features = 1.0 (all n_features).
If 'sqrt', then max_features = 1/sqrt(n_features).
If 'log2', then max_features = log2(n_features)/n_features.
If None, then max_features = 1.0.
- n_bins : int (default = 128)
Maximum number of bins used by the split algorithm per feature.
- min_samples_leaf : int or float (default = 1)
The minimum number of samples (rows) in each leaf node.
If type int, then min_samples_leaf represents the minimum number.
If type float, then min_samples_leaf represents a fraction and ceil(min_samples_leaf * n_rows) is the minimum number of samples for each leaf node.
- min_samples_split : int or float (default = 2)
The minimum number of samples required to split an internal node.
If type int, then min_samples_split represents the minimum number.
If type float, then min_samples_split represents a fraction and ceil(min_samples_split * n_rows) is the minimum number of samples for each split.
- n_streams : int (default = 4)
Number of parallel streams used for forest building.
- workers : list of strings, optional
Dask addresses of workers to use for computation. If None, all available Dask workers will be used.
- random_state : int (default = None)
Seed for the random number generator. Unseeded by default.
- ignore_empty_partitions : boolean (default = False)
Specifies the behavior when a worker does not hold any data while splitting. When True, results are returned from the workers that have data (the number of trained estimators will be less than n_estimators). When False, a RuntimeError is raised.
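The rules above for turning fractional parameters into concrete values can be sketched as follows. This is a minimal illustration of the documented formulas, not the library's code: both helper names are hypothetical, and expressing an integer max_features as a ratio of n_features is an assumption made so that every branch returns the same kind of value.

```python
import math


def resolve_min_samples(value, n_rows):
    """Turn min_samples_leaf / min_samples_split into a row count."""
    if isinstance(value, float):
        # A float is a fraction of the rows, rounded up.
        return math.ceil(value * n_rows)
    return value  # An int is already an absolute count.


def resolve_max_features_ratio(max_features, n_features):
    """Turn max_features into the fraction of columns tried per split."""
    if max_features is None or max_features == "auto":
        return 1.0
    if max_features == "sqrt":
        return 1.0 / math.sqrt(n_features)
    if max_features == "log2":
        return math.log2(n_features) / n_features
    if isinstance(max_features, int):
        # Assumption: an absolute count is expressed as a ratio here.
        return max_features / n_features
    return float(max_features)  # Already a fraction.


print(resolve_min_samples(0.01, 1050))         # 11
print(resolve_min_samples(5, 1050))            # 5
print(resolve_max_features_ratio("sqrt", 16))  # 0.25
print(resolve_max_features_ratio("log2", 16))  # 0.25
print(resolve_max_features_ratio(4, 16))       # 0.25
```

With 16 features, 'sqrt', 'log2', and an integer count of 4 all happen to select the same quarter of the columns; for other feature counts the three settings diverge.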
Methods
- fit(X, y[, convert_dtype, broadcast_data]): Fit the input data with a Random Forest regression model.
- get_params([deep]): Returns the value of all parameters required to configure this estimator as a dictionary.
- predict(X[, convert_dtype, layout, ...]): Predicts the regressor outputs for X.
- set_params(**params): Sets the value of parameters required to configure this estimator; it functions similarly to the sklearn set_params.
- partial_inference
- fit(X, y, convert_dtype=False, broadcast_data=False)
Fit the input data with a Random Forest regression model.
IMPORTANT: X is expected to be partitioned with at least one partition on each Dask worker being used by the forest (self.workers).
When persisting data, you can use cuml.dask.common.utils.persist_across_workers to simplify this:
    import dask_cudf
    from cuml.dask.common.utils import persist_across_workers

    X_dask_cudf = dask_cudf.from_cudf(X_cudf, npartitions=n_workers)
    y_dask_cudf = dask_cudf.from_cudf(y_cudf, npartitions=n_workers)
    X_dask_cudf, y_dask_cudf = persist_across_workers(dask_client, [X_dask_cudf, y_dask_cudf])
This is equivalent to calling persist with the data and workers:
    X_dask_cudf, y_dask_cudf = dask_client.persist(
        [X_dask_cudf, y_dask_cudf],
        workers={X_dask_cudf: workers, y_dask_cudf: workers},
    )
- Parameters:
- X : Dask cuDF DataFrame or CuPy-backed Dask Array (n_rows, n_features)
Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).
- y : Dask cuDF DataFrame or CuPy-backed Dask Array (n_rows, 1)
Labels of training examples. y must be partitioned the same way as X.
- convert_dtype : bool, optional (default = False)
When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.
- broadcast_data : bool, optional (default = False)
When set to True, the whole dataset is broadcast to every worker for training; otherwise, each worker trains on its own partition.
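The effect of broadcast_data on what each worker trains on can be pictured with a toy model. This sketch uses plain Python lists in place of Dask partitions; the function name and the worker labels are hypothetical, and the sketch says nothing about how the data is actually moved between GPUs.

```python
def training_rows_per_worker(partitions, broadcast_data=False):
    """Return the rows each worker would train on.

    `partitions` maps a worker name to the list of rows it holds.
    """
    if broadcast_data:
        # Every worker receives the concatenation of all partitions.
        all_rows = [row for rows in partitions.values() for row in rows]
        return {worker: list(all_rows) for worker in partitions}
    # Otherwise each worker only sees its local rows.
    return {worker: list(rows) for worker, rows in partitions.items()}


parts = {"worker-0": [0, 1, 2], "worker-1": [3, 4]}
print(training_rows_per_worker(parts))
# {'worker-0': [0, 1, 2], 'worker-1': [3, 4]}
print(training_rows_per_worker(parts, broadcast_data=True))
# {'worker-0': [0, 1, 2, 3, 4], 'worker-1': [0, 1, 2, 3, 4]}
```

Broadcasting trades memory and transfer time for trees that each see the full dataset, which matters most when the partitions are not well shuffled.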
- get_params(deep=True)
Returns the value of all parameters required to configure this estimator as a dictionary.
- Parameters:
- deep : boolean (default = True)
- predict(X, convert_dtype=True, layout='depth_first', default_chunk_size=None, align_bytes=None, delayed=True, broadcast_data=False)
Predicts the regressor outputs for X.
- Parameters:
- X : Dask cuDF DataFrame or CuPy-backed Dask Array (n_rows, n_features)
Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).
- convert_dtype : bool, optional (default = True)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
- layout : string (default = 'depth_first')
Specifies the in-memory layout of nodes in FIL forests. Options: 'depth_first', 'layered', 'breadth_first'.
- default_chunk_size : int, optional (default = None)
Determines how batches are further subdivided for parallel processing. The optimal value depends on hardware, model, and batch size. If None, will be automatically determined.
- align_bytes : int, optional (default = None)
If specified, trees will be padded such that their in-memory size is a multiple of this value. This can improve performance by guaranteeing that memory reads from trees begin on a cache line boundary. Typical values are 0 or 128.
- delayed : bool (default = True)
Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.
- broadcast_data : bool (default = False)
If False, the trees are merged into a single model before the workers perform inference on their share of the prediction workload. When True, trees are not merged. Instead, each worker infers on the whole prediction workload using its available trees, and the results are reduced on the client. May be advantageous when the model is larger than the data used for inference.
- Returns:
- y : Dask cuDF DataFrame or CuPy-backed Dask Array (n_rows, 1)
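The two broadcast_data modes of predict can give the same answer, which the sketch below shows for a single sample. It assumes that a regression forest outputs the mean of its trees' predictions and that the client-side reduction weights each worker by its tree count; neither detail is stated on this page, and both function names are hypothetical.

```python
def forest_mean(tree_predictions):
    """Average the per-tree predictions for one sample."""
    return sum(tree_predictions) / len(tree_predictions)


def reduce_worker_means(worker_tree_predictions):
    """Combine per-worker means, weighting each by its tree count."""
    total_trees = sum(len(p) for p in worker_tree_predictions)
    weighted = sum(forest_mean(p) * len(p) for p in worker_tree_predictions)
    return weighted / total_trees


# Per-tree predictions for a single sample, grouped by worker.
worker_a = [2.0, 4.0, 6.0]  # 3 trees
worker_b = [10.0]           # 1 tree

merged = forest_mean(worker_a + worker_b)             # trees merged first
reduced = reduce_worker_means([worker_a, worker_b])   # reduced afterwards
print(merged, reduced)  # 5.5 5.5
```

An unweighted average of the two worker means would give 7.0 here, so the weighting matters whenever workers hold different numbers of trees.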