RandomForestClassifier
- class cuml.dask.ensemble.RandomForestClassifier(*, workers=None, client=None, verbose=False, n_estimators=100, random_state=None, ignore_empty_partitions=False, **kwargs)
Multi-GPU Random Forest classifier model which fits multiple decision tree classifiers in an ensemble. This uses Dask to partition data over multiple GPUs (possibly on different nodes).
This implementation makes the following assumptions:
  - The set of Dask workers used between instantiation, fit, and predict is consistent.
  - Training data comes in the form of cuDF dataframes or Dask Arrays distributed so that each worker has at least one partition.
The distributed algorithm uses an embarrassingly-parallel approach. For a forest with N trees being built on w workers, each worker simply builds N/w trees on the data it has available locally. In many cases, partitioning the data so that each worker builds trees on a subset of the total dataset works well, but it generally requires the data to be well-shuffled in advance. Alternatively, callers can replicate all of the data across workers so that rf.fit receives w partitions, each containing the same data. This would produce results approximately identical to single-GPU fitting.

Please check the single-GPU implementation of Random Forest classifier for more information about the underlying algorithm.
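The N/w split described above can be sketched in plain Python. This is only an illustration: the helper name `trees_per_worker` is hypothetical, and the way the remainder is spread when N is not divisible by w is an assumption rather than something this page specifies.

```python
def trees_per_worker(n_estimators, n_workers):
    """Split n_estimators trees across n_workers as evenly as possible."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    base, remainder = divmod(n_estimators, n_workers)
    # Assumption: the first `remainder` workers each build one extra tree.
    return [base + (1 if i < remainder else 0) for i in range(n_workers)]


counts = trees_per_worker(100, 8)
print(counts)       # [13, 13, 13, 13, 12, 12, 12, 12]
print(sum(counts))  # 100, the forest-wide n_estimators
```

Each worker trains its share of trees only on the partitions it holds, which is why well-shuffled data matters when the data is not replicated.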
- Parameters:
- n_estimators : int (default = 100)
Total number of trees in the forest (not per worker).
- split_criterion : int or string (default = 0 ('gini'))
The criterion used to split nodes.
  - 0 or 'gini' for Gini impurity
  - 1 or 'entropy' for information gain (entropy)
  - 2 or 'mse' for mean squared error
  - 4 or 'poisson' for Poisson half deviance
  - 5 or 'gamma' for gamma half deviance
  - 6 or 'inverse_gaussian' for inverse Gaussian deviance
2, 'mse', 4, 'poisson', 5, 'gamma', 6, 'inverse_gaussian' are not valid for classification.
- bootstrap : boolean (default = True)
Control bootstrapping.
  - If True, each tree in the forest is built on a bootstrapped sample with replacement.
  - If False, the whole dataset is used to build each tree.
- max_samples : float (default = 1.0)
Ratio of dataset rows used while fitting each tree.
- max_depth : int or None (default = 16)
Maximum tree depth. Use None for unlimited depth (trees grow until all leaves are pure). Must be a positive integer or None.
Note: This default differs from scikit-learn’s random forest, which defaults to unlimited depth.
Changed in version 26.08: The default of max_depth will change from 16 to None.
- max_leaves : int (default = -1)
Maximum leaf nodes per tree. Soft constraint. Unlimited if -1.
- max_features : int, float, string, or None (default = 'auto')
Ratio of number of features (columns) to consider per node split.
  - If type int, then max_features is the absolute count of features to be used.
  - If type float, then max_features is a fraction.
  - If 'auto', then max_features = 1.0 (all n_features are considered).
  - If 'sqrt', then max_features = 1/sqrt(n_features).
  - If 'log2', then max_features = log2(n_features)/n_features.
  - If None, then max_features = 1.0.
- n_bins : int (default = 128)
Maximum number of bins used by the split algorithm per feature.
- min_samples_leaf : int or float (default = 1)
The minimum number of samples (rows) in each leaf node.
  - If type int, then min_samples_leaf represents the minimum number.
  - If type float, then min_samples_leaf represents a fraction and ceil(min_samples_leaf * n_rows) is the minimum number of samples for each leaf node.
- min_samples_split : int or float (default = 2)
The minimum number of samples required to split an internal node.
  - If type int, then min_samples_split represents the minimum number.
  - If type float, then min_samples_split represents a fraction and ceil(min_samples_split * n_rows) is the minimum number of samples for each split.
- n_streams : int (default = 4)
Number of parallel streams used for forest building.
- workers : list of strings, optional
Dask addresses of workers to use for computation. If None, all available Dask workers will be used.
- random_state : int (default = None)
Seed for the random number generator. Unseeded by default.
- ignore_empty_partitions : boolean (default = False)
Specifies the behavior when a worker does not hold any data while splitting. When True, results are returned from the workers that have data (the number of trained estimators will be less than n_estimators). When False, a RuntimeError is raised.
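The rules for max_features, min_samples_leaf, and min_samples_split listed above can be written out as a small plain-Python sketch. The helper names `resolve_max_features` and `resolve_min_samples` are hypothetical and exist only for illustration; this mirrors the documented rules and is not cuML's internal code.

```python
import math


def resolve_max_features(max_features, n_features):
    """Return the fraction of features considered per node split."""
    if max_features is None or max_features == "auto":
        return 1.0
    if max_features == "sqrt":
        return 1.0 / math.sqrt(n_features)
    if max_features == "log2":
        return math.log2(n_features) / n_features
    if isinstance(max_features, int):
        # An int is an absolute feature count; convert it to a fraction.
        return max_features / n_features
    if isinstance(max_features, float):
        return max_features
    raise ValueError(f"unsupported max_features: {max_features!r}")


def resolve_min_samples(value, n_rows):
    """Return the minimum sample count for min_samples_leaf or min_samples_split."""
    if isinstance(value, int):
        return value
    # A float is a fraction of the rows, rounded up.
    return math.ceil(value * n_rows)


print(resolve_max_features("sqrt", 16))  # 0.25
print(resolve_max_features("log2", 8))   # 0.375
print(resolve_min_samples(0.5, 7))       # 4
```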
Methods
- fit(X, y[, convert_dtype, broadcast_data]): Fit the input data with a Random Forest classifier.
- get_params([deep]): Returns the value of all parameters required to configure this estimator as a dictionary.
- predict(X[, threshold, convert_dtype, ...]): Predicts the labels for X.
- predict_proba(X[, delayed]): Predicts the probability of each class for X.
- set_params(**params): Sets the value of parameters required to configure this estimator; it functions similarly to the sklearn set_params.
- partial_inference
Examples
For usage examples, please see the RAPIDS notebooks repository: rapidsai/cuml
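A minimal end-to-end sketch follows, assuming a machine with at least one NVIDIA GPU and the cuml, cudf, cupy, dask_cudf, and dask_cuda packages installed. The synthetic data, the `run_example` wrapper, and the chosen hyperparameters are illustrative assumptions and are not taken from the RAPIDS notebooks. The GPU libraries are imported inside the function so that defining it does not itself require a GPU.

```python
def run_example():
    # GPU libraries are imported here so that defining this function
    # does not require a GPU or a RAPIDS installation.
    import cudf
    import cupy as cp
    import dask_cudf
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster
    from cuml.dask.common.utils import persist_across_workers
    from cuml.dask.ensemble import RandomForestClassifier

    cluster = LocalCUDACluster()  # starts one Dask worker per visible GPU
    client = Client(cluster)
    try:
        n_workers = len(cluster.workers)

        # Synthetic binary classification data: float32 features, int32 labels.
        X_cp = cp.random.rand(10_000, 20).astype(cp.float32)
        y_cp = (X_cp[:, 0] + X_cp[:, 1] > 1.0).astype(cp.int32)
        X_cudf = cudf.DataFrame(X_cp)
        y_cudf = cudf.Series(y_cp)

        # One partition per worker keeps fit() from concatenating partitions.
        X_dask = dask_cudf.from_cudf(X_cudf, npartitions=n_workers)
        y_dask = dask_cudf.from_cudf(y_cudf, npartitions=n_workers)
        X_dask, y_dask = persist_across_workers(client, [X_dask, y_dask])

        model = RandomForestClassifier(n_estimators=100, max_depth=16, random_state=0)
        model.fit(X_dask, y_dask)

        # predict() is lazy by default (delayed=True); compute() materializes it.
        predictions = model.predict(X_dask).compute()
        print(predictions)
    finally:
        client.close()
        cluster.close()
```

Call `run_example()` from a script guarded by `if __name__ == "__main__":`, which is the usual pattern when a script starts a local Dask cluster.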
- fit(X, y, convert_dtype=False, broadcast_data=False)
Fit the input data with a Random Forest classifier.
IMPORTANT: X is expected to be partitioned with at least one partition on each Dask worker being used by the forest (self.workers).
If a worker has multiple data partitions, they will be concatenated before fitting, which will lead to additional memory usage. To minimize memory consumption, ensure that each worker has exactly one partition.
When persisting data, you can use cuml.dask.common.utils.persist_across_workers to simplify this:

    import dask_cudf
    from cuml.dask.common.utils import persist_across_workers

    # One partition per worker for both features and labels.
    X_dask_cudf = dask_cudf.from_cudf(X_cudf, npartitions=n_workers)
    y_dask_cudf = dask_cudf.from_cudf(y_cudf, npartitions=n_workers)
    X_dask_cudf, y_dask_cudf = persist_across_workers(dask_client, [X_dask_cudf, y_dask_cudf])

This is equivalent to calling persist with the data and workers:

    X_dask_cudf, y_dask_cudf = dask_client.persist(
        [X_dask_cudf, y_dask_cudf],
        workers={X_dask_cudf: workers, y_dask_cudf: workers},
    )
- Parameters:
- X : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features)
Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).
- y : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)
Labels of training examples. y must be partitioned the same way as X.
- convert_dtype : bool, optional (default = False)
When set to True, the fit method will, when necessary, convert y to be of dtype int32. This will increase memory used for the method.
- broadcast_data : bool, optional (default = False)
When set to True, the whole dataset is broadcast to every worker for training; otherwise each worker is trained on its own partition.
- get_params(deep=True)
Returns the value of all parameters required to configure this estimator as a dictionary.
- Parameters:
- deep : boolean (default = True)
- predict(X, threshold=0.5, convert_dtype=True, layout='depth_first', default_chunk_size=None, align_bytes=None, delayed=True, broadcast_data=False)
Predicts the labels for X.
- Parameters:
- X : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features)
Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).
- threshold : float (default = 0.5)
Threshold used for classification.
- convert_dtype : bool, optional (default = True)
When set to True, the predict method will, when necessary, convert the input to the data type which was used to train the model. This will increase memory used for the method.
- layout : string (default = ‘depth_first’)
Specifies the in-memory layout of nodes in FIL forests. Options: ‘depth_first’, ‘layered’, ‘breadth_first’.
- default_chunk_size : int, optional (default = None)
Determines how batches are further subdivided for parallel processing. The optimal value depends on hardware, model, and batch size. If None, will be automatically determined.
- align_bytes : int, optional (default = None)
If specified, trees will be padded such that their in-memory size is a multiple of this value. This can improve performance by guaranteeing that memory reads from trees begin on a cache line boundary. Typical values are 0 or 128.
- delayed : bool (default = True)
Whether to do a lazy prediction (and return Delayed objects) or an eagerly executed one.
- broadcast_data : bool (default = False)
If False, the trees are merged into a single model before the workers perform inference on their share of the prediction workload. When True, trees aren’t merged. Instead, each worker infers on the whole prediction workload using its available trees. The results are reduced on the client. May be advantageous when the model is larger than the data used for inference.
- Returns:
- y : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, 1)
The predicted class labels.
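The reduction step used when broadcast_data is True can be illustrated with a plain-Python sketch. It assumes each worker returns class probabilities averaged over its own trees and that the client combines them with a tree-count-weighted average before picking the most likely class. Both that reduction rule and the helper name `reduce_worker_probabilities` are assumptions made for illustration, not a description of cuML's internals.

```python
def reduce_worker_probabilities(worker_probas, trees_per_worker):
    """Combine per-worker class probabilities for one row into a label.

    worker_probas: one list of class probabilities per worker.
    trees_per_worker: number of trees each worker holds (used as weights).
    """
    total_trees = sum(trees_per_worker)
    n_classes = len(worker_probas[0])
    combined = [0.0] * n_classes
    for probas, n_trees in zip(worker_probas, trees_per_worker):
        for c in range(n_classes):
            # Workers holding more trees contribute proportionally more.
            combined[c] += probas[c] * n_trees / total_trees
    label = max(range(n_classes), key=lambda c: combined[c])
    return combined, label


# Two workers, three trees on the first and one on the second.
combined, label = reduce_worker_probabilities([[1.0, 0.0], [0.0, 1.0]], [3, 1])
print(combined, label)  # [0.75, 0.25] 0
```

With broadcast_data set to False, no such reduction is needed, because the merged model already contains every tree.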
- predict_proba(X, delayed=True, **kwargs)
Predicts the probability of each class for X.
See documentation of predict for notes on performance.
- Parameters:
- X : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, n_features)
Distributed dense matrix (floats or doubles) of shape (n_samples, n_features).
- delayed : bool (default = True)
Whether to do a lazy prediction (True) or an eager prediction (False).
- **kwargs : dict
Additional predict parameters passed to the underlying model’s predict method. See RandomForestClassifier.predict_proba documentation for a full list.
- Returns:
- y : Dask cuDF dataframe or CuPy backed Dask Array (n_rows, n_classes)
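For a binary problem, the probabilities returned by predict_proba relate to the threshold parameter of predict roughly as in the sketch below. It works on plain Python lists rather than Dask collections, the helper name `labels_from_probabilities` is hypothetical, and whether the comparison against the threshold is inclusive is an assumption.

```python
def labels_from_probabilities(probabilities, threshold=0.5):
    """Map rows of [p_class0, p_class1] to 0/1 labels."""
    # Assumption: class 1 is chosen when its probability reaches the threshold.
    return [1 if row[1] >= threshold else 0 for row in probabilities]


probabilities = [[0.9, 0.1], [0.3, 0.7], [0.5, 0.5]]
print(labels_from_probabilities(probabilities))                 # [0, 1, 1]
print(labels_from_probabilities(probabilities, threshold=0.8))  # [0, 0, 0]
```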