RandomForestRegressor

class cuml.dask.ensemble.RandomForestRegressor(*, workers=None, client=None, verbose=False, n_estimators=100, random_state=None, ignore_empty_partitions=False, **kwargs)

Multi-GPU Random Forest regressor model which fits multiple decision tree regressors in an ensemble. This uses Dask to partition data over multiple GPUs (possibly on different nodes).

This implementation makes the following assumptions:
  • The same set of Dask workers is used for instantiation, fit, and predict.

  • Training data comes in the form of cuDF dataframes or Dask Arrays distributed so that each worker has at least one partition.

The distributed algorithm uses an embarrassingly-parallel approach. For a forest with N trees being built on w workers, each worker simply builds N/w trees on the data it has available locally. In many cases, partitioning the data so that each worker builds trees on a subset of the total dataset works well, but it generally requires the data to be well-shuffled in advance. Alternatively, callers can replicate all of the data across workers so that rf.fit receives w partitions, each containing the same data. This would produce results approximately identical to single-GPU fitting.
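The N/w split described above can be sketched in a few lines. The helper name `trees_per_worker` is hypothetical, and the remainder policy (the first workers each take one extra tree when N is not divisible by w) is an assumption made for illustration; the text above only states that each worker builds N/w trees.

```python
def trees_per_worker(n_estimators: int, n_workers: int) -> list[int]:
    """Split n_estimators trees across n_workers as evenly as possible.

    Hypothetical helper: the remainder policy is an assumption, not a
    documented cuML behavior.
    """
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    if n_estimators < n_workers:
        raise ValueError("need at least one tree per worker")
    base, extra = divmod(n_estimators, n_workers)
    # The first `extra` workers build one additional tree each.
    return [base + 1 if i < extra else base for i in range(n_workers)]


print(trees_per_worker(100, 4))
print(trees_per_worker(10, 4))
```

Because n_estimators is the total for the whole forest, adding workers reduces the number of trees each worker builds rather than growing the forest.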

Please check the single-GPU implementation of Random Forest regressor for more information about the underlying algorithm.

Parameters:
n_estimators: int (default = 100)

Total number of trees in the forest (not per-worker).

split_criterion: int or string (default = 2 ('mse'))

The criterion used to split nodes.

  • 0 or 'gini' for gini impurity

  • 1 or 'entropy' for information gain (entropy)

  • 2 or 'mse' for mean squared error

  • 4 or 'poisson' for poisson half deviance

  • 5 or 'gamma' for gamma half deviance

  • 6 or 'inverse_gaussian' for inverse gaussian deviance

0 ('gini') and 1 ('entropy') are not valid for regression.

bootstrap: boolean (default = True)

Control bootstrapping.

  • If True, each tree in the forest is built on a bootstrapped sample with replacement.

  • If False, the whole dataset is used to build each tree.

max_samples: float (default = 1.0)

Ratio of dataset rows used while fitting each tree.
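To make the interaction between bootstrap and max_samples concrete, here is a minimal pure-Python sketch of how the rows for one tree could be chosen. The function name `sample_rows_for_tree` and the truncating rounding of `max_samples * n_rows` are assumptions for illustration only; this is not cuML's sampling code.

```python
import random


def sample_rows_for_tree(n_rows, max_samples=1.0, bootstrap=True, seed=None):
    """Return the row indices one tree would train on (illustrative only)."""
    if not bootstrap:
        # Without bootstrapping, every tree sees the whole dataset.
        return list(range(n_rows))
    rng = random.Random(seed)
    # Assumed rounding: truncate, but always draw at least one row.
    n_draws = max(1, int(max_samples * n_rows))
    # Sampling with replacement: the same row may appear more than once.
    return [rng.randrange(n_rows) for _ in range(n_draws)]


rows = sample_rows_for_tree(n_rows=10, max_samples=0.5, seed=0)
print(len(rows))
```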

max_depth: int or None (default = 16)

Maximum tree depth. Use None for unlimited depth (trees grow until all leaves are pure). Must be a positive integer or None.

Note

This default differs from scikit-learn’s random forest, which defaults to unlimited depth.

Changed in version 26.08: The default of max_depth will change from 16 to None.

max_leaves: int (default = -1)

Maximum number of leaf nodes per tree. Soft constraint. Unlimited if -1.

max_features: int, float, string, or None (default = ‘auto’)

Ratio of number of features (columns) to consider per node split.

  • If type int then max_features is the absolute count of features to be used.

  • If type float then max_features is a fraction.

  • If 'auto' then all features are used (max_features = 1.0).

  • If 'sqrt' then max_features=1/sqrt(n_features).

  • If 'log2' then max_features=log2(n_features)/n_features.

  • If None, then max_features = 1.0.
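The rules listed above can be summarized as a small function that maps any accepted max_features value to a fraction of the columns. The helper name `max_features_fraction` is hypothetical and the code is a sketch of the documented rules, not the library's implementation.

```python
import math


def max_features_fraction(max_features, n_features):
    """Translate a max_features setting into a fraction of the columns.

    Hypothetical helper mirroring the rules listed above.
    """
    if max_features is None or max_features == "auto":
        return 1.0
    if max_features == "sqrt":
        return 1.0 / math.sqrt(n_features)
    if max_features == "log2":
        return math.log2(n_features) / n_features
    if isinstance(max_features, int):
        # An int is an absolute column count, so convert it to a ratio.
        return max_features / n_features
    if isinstance(max_features, float):
        # A float is already a fraction of the columns.
        return max_features
    raise ValueError(f"unsupported max_features: {max_features!r}")


for setting in ("auto", "sqrt", "log2", 4, 0.5, None):
    print(setting, max_features_fraction(setting, n_features=16))
```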

n_bins: int (default = 128)

Maximum number of bins used by the split algorithm per feature.

min_samples_leaf: int or float (default = 1)

The minimum number of samples (rows) in each leaf node.

  • If type int, then min_samples_leaf represents the minimum number.

  • If float, then min_samples_leaf represents a fraction and ceil(min_samples_leaf * n_rows) is the minimum number of samples for each leaf node.

min_samples_split: int or float (default = 2)

The minimum number of samples required to split an internal node.

  • If type int, then min_samples_split represents the minimum number.

  • If type float, then min_samples_split represents a fraction and ceil(min_samples_split * n_rows) is the minimum number of samples for each split.
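Both min_samples_leaf and min_samples_split follow the same int-or-fraction rule, which the short sketch below works through. The helper name `min_samples_count` is hypothetical; only the `ceil(value * n_rows)` formula comes from the descriptions above.

```python
import math


def min_samples_count(value, n_rows):
    """Resolve min_samples_leaf / min_samples_split to a row count."""
    if isinstance(value, int):
        # An int is used directly as the minimum number of rows.
        return value
    # A float is a fraction of the dataset, rounded up.
    return math.ceil(value * n_rows)


print(min_samples_count(2, n_rows=1000))
print(min_samples_count(0.25, n_rows=1000))
print(min_samples_count(0.001, n_rows=1500))
```

Rounding up means even a very small fraction still requires at least one row.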

n_streams: int (default = 4)

Number of parallel streams used for forest building.

workers: list of strings, optional

Dask addresses of workers to use for computation. If None, all available Dask workers will be used.

random_state: int (default = None)

Seed for the random number generator. Unseeded by default.

ignore_empty_partitions: boolean (default = False)

Specifies the behavior when a worker does not hold any data while splitting. When True, results are returned from the workers that hold data (the number of trained estimators will be less than n_estimators). When False, a RuntimeError is raised.
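The two behaviors of ignore_empty_partitions can be illustrated with plain dictionaries. Everything in this sketch is hypothetical (the function `plan_training`, the worker addresses, and the dictionaries keyed by worker); it only models the documented outcome, not how cuML detects empty workers.

```python
def plan_training(partitions_per_worker, trees_per_worker,
                  ignore_empty_partitions=False):
    """Return {worker: n_trees} for the workers that will actually train.

    Both arguments are hypothetical dicts keyed by worker address.
    """
    empty = [w for w, n in partitions_per_worker.items() if n == 0]
    if empty and not ignore_empty_partitions:
        raise RuntimeError(f"workers without data: {empty}")
    # Workers without data are skipped, so their trees are never built.
    return {w: trees_per_worker[w]
            for w, n in partitions_per_worker.items() if n > 0}


partitions = {"tcp://a:1": 2, "tcp://b:1": 0}
trees = {"tcp://a:1": 50, "tcp://b:1": 50}
plan = plan_training(partitions, trees, ignore_empty_partitions=True)
print(sum(plan.values()))  # fewer trees than the 100 requested
```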

Methods

fit(X, y[, convert_dtype, broadcast_data])

Fit the input data with a Random Forest regression model

get_params([deep])

Returns the value of all parameters required to configure this estimator as a dictionary.

predict(X[, convert_dtype, layout, ...])

Predicts the regressor outputs for X.

set_params(**params)

Sets the values of the parameters required to configure this estimator. It functions similarly to scikit-learn's set_params.

partial_inference

fit(X, y, convert_dtype=False, broadcast_data=False)

Fit the input data with a Random Forest regression model

IMPORTANT: X is expected to be partitioned with at least one partition on each Dask worker being used by the forest (self.workers).

When persisting data, you can use cuml.dask.common.utils.persist_across_workers to simplify this:

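import dask_cudf
from cuml.dask.common.utils import persist_across_workers

# Create one partition per worker, then persist them across the workers.
X_dask_cudf = dask_cudf.from_cudf(X_cudf, npartitions=n_workers)
y_dask_cudf = dask_cudf.from_cudf(y_cudf, npartitions=n_workers)
X_dask_cudf, y_dask_cudf = persist_across_workers(dask_client,
                                                  [X_dask_cudf,
                                                   y_dask_cudf])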

This is equivalent to calling persist with the data and workers:

X_dask_cudf, y_dask_cudf = dask_client.persist([X_dask_cudf,
                                                y_dask_cudf],
                                               workers={
                                                   X_dask_cudf: workers,
                                                   y_dask_cudf: workers
                                               })

Parameters:
X: Dask cuDF DataFrame or CuPy-backed Dask Array (n_rows, n_features)

Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).

y: Dask cuDF DataFrame or CuPy-backed Dask Array (n_rows, 1)

Labels of training examples. y must be partitioned the same way as X.

convert_dtype: bool, optional (default = False)

When set to True, the fit method will, when necessary, convert y to be the same data type as X if they differ. This will increase memory used for the method.

broadcast_data: bool, optional (default = False)

When set to True, the whole dataset is broadcast to every worker for training; otherwise, each worker trains on its own partitions.

get_params(deep=True)

Returns the value of all parameters required to configure this estimator as a dictionary.

Parameters:
deep: boolean (default = True)

predict(X, convert_dtype=True, layout='depth_first', default_chunk_size=None, align_bytes=None, delayed=True, broadcast_data=False)

Predicts the regressor outputs for X.

Parameters:
X: Dask cuDF DataFrame or CuPy-backed Dask Array (n_rows, n_features)

Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).

convert_dtype: bool, optional (default = True)

When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.

layout: string (default = ‘depth_first’)

Specifies the in-memory layout of nodes in FIL forests. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.

default_chunk_size: int, optional (default = None)

Determines how batches are further subdivided for parallel processing. The optimal value depends on hardware, model, and batch size. If None, will be automatically determined.

align_bytes: int, optional (default = None)

If specified, trees will be padded such that their in-memory size is a multiple of this value. This can improve performance by guaranteeing that memory reads from trees begin on a cache line boundary. Typical values are 0 or 128.
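The padding arithmetic behind align_bytes amounts to rounding a size up to the next multiple. In the sketch below, the helper name `padded_size` is hypothetical, and treating 0 as "no padding" is an assumption based on 0 being listed as a typical value.

```python
def padded_size(n_bytes: int, align_bytes: int) -> int:
    """Round n_bytes up to the next multiple of align_bytes."""
    if align_bytes <= 0:
        # Assumption: 0 disables padding.
        return n_bytes
    # Ceiling division, then scale back up to a byte count.
    return -(-n_bytes // align_bytes) * align_bytes


print(padded_size(200, 128))
print(padded_size(256, 128))
print(padded_size(200, 0))
```

With a 128-byte alignment, a 200-byte tree occupies 256 bytes, so the next tree starts on a 128-byte boundary.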

delayed: bool (default = True)

Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.

broadcast_data: bool (default = False)

If False, the trees are merged into a single model before the workers perform inference on their share of the prediction workload. When True, trees aren’t merged. Instead, each worker infers on the whole prediction workload using its available trees. The results are reduced on the client. May be advantageous when the model is larger than the data used for inference.
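The two inference strategies can be compared with stub trees in plain Python. This is a sketch under stated assumptions: the stub trees, the function names, and the reduction (summing per-worker partial sums and dividing by the total tree count) are all hypothetical, since the description above says only that results are reduced on the client.

```python
def make_stub_tree(offset):
    """Return a stand-in 'tree' that predicts x + offset (illustrative)."""
    return lambda x: x + offset


def predict_merged(trees_by_worker, rows):
    # broadcast_data=False: merge all trees, then average per row.
    all_trees = [t for trees in trees_by_worker for t in trees]
    return [sum(t(x) for t in all_trees) / len(all_trees) for x in rows]


def predict_broadcast(trees_by_worker, rows):
    # broadcast_data=True: every worker scores every row with its own
    # trees, and the client reduces the partial sums (assumed reduction).
    total = sum(len(trees) for trees in trees_by_worker)
    partial = [[sum(t(x) for t in trees) for x in rows]
               for trees in trees_by_worker]
    return [sum(column) / total for column in zip(*partial)]


workers = [[make_stub_tree(0.0), make_stub_tree(2.0)], [make_stub_tree(4.0)]]
rows = [1.0, 3.0]
print(predict_merged(workers, rows))
print(predict_broadcast(workers, rows))
```

In this sketch both strategies give the same predictions; they differ in what moves over the network (trees in one case, rows and partial results in the other).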

Returns:
y: Dask cuDF DataFrame or CuPy-backed Dask Array (n_rows, 1)

set_params(**params)

Sets the values of the parameters required to configure this estimator. It functions similarly to scikit-learn's set_params.

Parameters:
params: dict of new params.