Python API#

nvforest.load_model(model_file: str | Path, *, model_type: str | None = None, device: str = 'auto', layout: str = 'depth_first', default_chunk_size: int | None = None, align_bytes: int | None = None, precision: str | None = None, device_id: int | None = None, handle: Handle | None = None) ForestInference[source]#

Load a model into nvForest from a serialized model file.

Parameters:
model_file

The path to the serialized model file. This can be an XGBoost binary or JSON file, a LightGBM text file, or a Treelite checkpoint file. If the model_type parameter is not passed, an attempt will be made to load the file based on its extension.

model_type : {“xgboost_ubj”, “xgboost_json”, “xgboost_legacy”, “lightgbm”, “treelite_checkpoint”, None}, default=None

The serialization format for the model file. If None, a best-effort guess will be made based on the file extension.

device : {“auto”, “gpu”, “cpu”}, default=”auto”

Whether to use GPU or CPU for inferencing. If set to “auto”, GPU will be selected if it is available.

layout : {“breadth_first”, “depth_first”, “layered”}, default=”depth_first”

The in-memory layout to be used during inference for nodes of the forest model. This parameter is available purely for runtime optimization. For performance-critical applications, it is recommended that available layouts be tested with realistic batch sizes to determine the optimal value.

default_chunk_size : int or None, default=None

If set, predict calls without a specified chunk size will use this default value.

align_bytes : int or None, default=None

Pad each tree with empty nodes until its in-memory size is a multiple of the given value. If None, use 0 for GPU and 64 for CPU.

precision : {“single”, “double”, None}, default=None

Use the given floating-point precision for evaluating the model. If None, use the native precision of the model. Note that single-precision execution is substantially faster than double-precision execution, so double precision is recommended only for models trained in double precision, and only when exact conformance between nvForest results and those of the original training framework is of paramount importance.
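The size of the single-vs-double rounding gap can be seen directly in NumPy (an illustration of floating-point precision in general, not nvForest code):

```python
import numpy as np
from decimal import Decimal

# Neither precision stores 0.1 exactly. Converting the stored values to
# Decimal recovers them exactly, exposing the size of the rounding error.
err_single = abs(Decimal(float(np.float32(0.1))) - Decimal("0.1"))
err_double = abs(Decimal(float(np.float64(0.1))) - Decimal("0.1"))
# err_single is ~1.5e-9 while err_double is ~5.6e-18: per-value rounding
# in single precision is roughly nine orders of magnitude coarser, which
# is the source of the small result differences described above.
```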

device_id : int or None, default=None

For GPU execution, the device on which to load and execute this model. For CPU execution, this value is currently ignored.

handle : nvforest.Handle or None, default=None

For GPU execution, the nvForest handle containing the stream or stream pool to use during loading and inference. If not given, a new handle will be constructed.
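A typical call, together with a sketch of the extension-based format guess described above. The mapping below is a hypothetical illustration; the table nvforest.load_model actually consults may differ:

```python
from pathlib import Path

# Hypothetical extension-to-format mapping illustrating the documented
# "best-effort guess"; the real mapping inside nvForest may differ.
_EXT_GUESS = {
    ".ubj": "xgboost_ubj",
    ".json": "xgboost_json",
    ".model": "xgboost_legacy",
    ".txt": "lightgbm",
    ".tl": "treelite_checkpoint",
}

def guess_model_type(model_file):
    """Return a guessed model_type from the file extension, or None."""
    return _EXT_GUESS.get(Path(model_file).suffix)

# Passing model_type explicitly avoids relying on the guess entirely:
#
#   import nvforest
#   fm = nvforest.load_model(
#       "model.json",
#       model_type="xgboost_json",
#       device="auto",       # GPU if available, else CPU
#       precision=None,      # keep the model's native precision
#   )
```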

nvforest.load_from_sklearn(skl_model: Any, *, device: str = 'auto', layout: str = 'depth_first', default_chunk_size: int | None = None, align_bytes: int | None = None, precision: str | None = None, device_id: int | None = None, handle: Handle | None = None) ForestInference[source]#

Load a scikit-learn forest model into nvForest.

Parameters:
skl_model

The Scikit-Learn forest model to load.

device : {“auto”, “gpu”, “cpu”}, default=”auto”

Whether to use GPU or CPU for inferencing. If set to “auto”, GPU will be selected if it is available.

layout : {“breadth_first”, “depth_first”, “layered”}, default=”depth_first”

The in-memory layout to be used during inference for nodes of the forest model. This parameter is available purely for runtime optimization. For performance-critical applications, it is recommended that available layouts be tested with realistic batch sizes to determine the optimal value.

default_chunk_size : int or None, default=None

If set, predict calls without a specified chunk size will use this default value.

align_bytes : int or None, default=None

Pad each tree with empty nodes until its in-memory size is a multiple of the given value. If None, use 0 for GPU and 64 for CPU.

precision : {“single”, “double”, None}, default=None

Use the given floating-point precision for evaluating the model. If None, use the native precision of the model. Note that single-precision execution is substantially faster than double-precision execution, so double precision is recommended only for models trained in double precision, and only when exact conformance between nvForest results and those of the original training framework is of paramount importance.

device_id : int or None, default=None

For GPU execution, the device on which to load and execute this model. For CPU execution, this value is currently ignored.

handle : nvforest.Handle or None, default=None

For GPU execution, the nvForest handle containing the stream or stream pool to use during loading and inference. If not given, a new handle will be constructed.
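One way to picture the “auto” device selection is the sketch below. It mirrors the documented semantics only; the availability check nvForest actually performs is not specified here, so the cupy probe is an assumption:

```python
import importlib.util

def select_device(device="auto"):
    """Hypothetical mirror of the documented rule: 'auto' picks GPU
    when one is available, otherwise CPU."""
    if device not in ("auto", "gpu", "cpu"):
        raise ValueError(f"unknown device: {device!r}")
    if device == "auto":
        # Use an importable cupy as a stand-in for "a GPU is available";
        # the real check inside nvForest may differ.
        return "gpu" if importlib.util.find_spec("cupy") else "cpu"
    return device

# With a trained scikit-learn forest in hand, loading is a single call:
#
#   import nvforest
#   fm = nvforest.load_from_sklearn(trained_rf, device="auto")
```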

nvforest.load_from_treelite_model(tl_model: Model, *, device: str = 'auto', layout: str = 'depth_first', default_chunk_size: int | None = None, align_bytes: int | None = None, precision: str | None = None, device_id: int | None = None, handle: Handle | None = None) ForestInference[source]#

Load a Treelite forest model into nvForest.

Parameters:
tl_model

The Treelite model to load.

device : {“auto”, “gpu”, “cpu”}, default=”auto”

Whether to use GPU or CPU for inferencing. If set to “auto”, GPU will be selected if it is available.

layout : {“breadth_first”, “depth_first”, “layered”}, default=”depth_first”

The in-memory layout to be used during inference for nodes of the forest model. This parameter is available purely for runtime optimization. For performance-critical applications, it is recommended that available layouts be tested with realistic batch sizes to determine the optimal value.

default_chunk_size : int or None, default=None

If set, predict calls without a specified chunk size will use this default value.

align_bytes : int or None, default=None

Pad each tree with empty nodes until its in-memory size is a multiple of the given value. If None, use 0 for GPU and 64 for CPU.

precision : {“single”, “double”, None}, default=None

Use the given floating-point precision for evaluating the model. If None, use the native precision of the model. Note that single-precision execution is substantially faster than double-precision execution, so double precision is recommended only for models trained in double precision, and only when exact conformance between nvForest results and those of the original training framework is of paramount importance.

device_id : int or None, default=None

For GPU execution, the device on which to load and execute this model. For CPU execution, this value is currently ignored.

handle : nvforest.Handle or None, default=None

For GPU execution, the nvForest handle containing the stream or stream pool to use during loading and inference. If not given, a new handle will be constructed.
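The align_bytes rule shared by all three loaders (pad each tree's in-memory size to a multiple of the given value; None means 0 on GPU and 64 on CPU) can be sketched as a pure function. This is an illustration of the documented arithmetic, not nvForest's internal layout code:

```python
def padded_size(tree_bytes, align_bytes=None, device="cpu"):
    """Sketch of the documented align_bytes rule: round a tree's size
    up to a multiple of align_bytes; None means 0 on GPU (no padding)
    and 64 on CPU."""
    if align_bytes is None:
        align_bytes = 0 if device == "gpu" else 64
    if align_bytes == 0:
        return tree_bytes
    return -(-tree_bytes // align_bytes) * align_bytes  # ceil to multiple

# Loading a treelite.Model obtained elsewhere is then:
#
#   import nvforest
#   fm = nvforest.load_from_treelite_model(tl_model, align_bytes=None)
```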

class nvforest.CPUForestInferenceClassifier(*, treelite_model: Model, handle: Handle | None = None, layout: str = 'depth_first', default_chunk_size: int | None = None, align_bytes: int | None = None, precision: str | None = None)[source]#
Attributes:
align_bytes
default_chunk_size
device_id
layout
num_outputs
num_trees
precision

Methods

apply(X, *[, chunk_size])

Output the ID of the leaf node for each tree.

predict(X, *[, chunk_size, threshold])

Predict the class for each row.

predict_per_tree(X, *[, chunk_size])

Output prediction of each tree.

predict_proba(X, *[, chunk_size])

Predict the class probabilities for each row in X.

apply(X: ndarray | cupy.ndarray, *, chunk_size: int | None = None) ndarray | cupy.ndarray[source]#

Output the ID of the leaf node for each tree.

Parameters:
X

The input data of shape Rows X Features. This can be a numpy array or cupy array. nvForest is optimized for C-major arrays (e.g. numpy/cupy arrays). Inputs whose datatype does not match the precision of the loaded model (float/double) will be converted to the correct datatype before inference. If this input is in a memory location that is inaccessible to the current device type (as set with the ‘device’ parameter in the constructor), it will be copied to the correct location. This copy will be distributed across as many CUDA streams as are available in the stream pool of the model’s RAFT handle.

chunk_size

The number of rows to simultaneously process in one iteration of the inference algorithm. Batches are further broken down into “chunks” of this size when assigning available threads to tasks. The choice of chunk size can have a substantial impact on performance, but the optimal choice depends on model and hardware and is difficult to predict a priori. In general, larger batch sizes benefit from larger chunk sizes, and smaller batch sizes benefit from small chunk sizes. On GPU, valid values are powers of 2 from 1 to 32. On CPU, valid values are any power of 2, but little benefit is expected above a chunk size of 512.
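The documented chunk_size constraints (powers of 2 only; 1 through 32 on GPU, any power of 2 on CPU) amount to the following check, shown here as an illustrative helper rather than anything nvForest exports:

```python
def is_valid_chunk_size(chunk_size, device):
    """Hypothetical validity check matching the documented constraints:
    powers of 2 only; capped at 32 on GPU, uncapped on CPU."""
    power_of_two = chunk_size >= 1 and (chunk_size & (chunk_size - 1)) == 0
    if device == "gpu":
        return power_of_two and chunk_size <= 32
    return power_of_two
```

When tuning, this means the GPU search space is just {1, 2, 4, 8, 16, 32}, while on CPU there is rarely a reason to search beyond 512.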

predict(X: ndarray | cupy.ndarray, *, chunk_size: int | None = None, threshold: float | None = None) ndarray | cupy.ndarray[source]#

Predict the class for each row.

Parameters:
X

The input data of shape Rows X Features. This can be a numpy array or cupy array. nvForest is optimized for C-major arrays (e.g. numpy/cupy arrays). Inputs whose datatype does not match the precision of the loaded model (float/double) will be converted to the correct datatype before inference. If this input is in a memory location that is inaccessible to the current device type (as set with the ‘device’ parameter in the constructor), it will be copied to the correct location. This copy will be distributed across as many CUDA streams as are available in the stream pool of the model’s RAFT handle.

chunk_size

The number of rows to simultaneously process in one iteration of the inference algorithm. Batches are further broken down into “chunks” of this size when assigning available threads to tasks. The choice of chunk size can have a substantial impact on performance, but the optimal choice depends on model and hardware and is difficult to predict a priori. In general, larger batch sizes benefit from larger chunk sizes, and smaller batch sizes benefit from small chunk sizes. On GPU, valid values are powers of 2 from 1 to 32. On CPU, valid values are any power of 2, but little benefit is expected above a chunk size of 512.

threshold

For binary classifiers, output probabilities above this threshold will be considered positive detections. If None, a threshold of 0.5 will be used for binary classifiers. For multiclass classifiers, the highest probability class is chosen regardless of threshold.
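The threshold semantics can be sketched with NumPy. The helper below is hypothetical (it assumes binary probabilities arrive as two columns) and only mirrors the documented rule:

```python
import numpy as np

def apply_threshold(proba, threshold=None):
    """Sketch of the documented rule: binary classifiers compare the
    positive-class probability to the threshold (0.5 when None);
    multiclass classifiers ignore the threshold and take the argmax."""
    proba = np.asarray(proba)
    if proba.shape[1] == 2:  # binary: column 1 assumed positive class
        t = 0.5 if threshold is None else threshold
        return (proba[:, 1] > t).astype(np.int64)
    return np.argmax(proba, axis=1)  # multiclass: threshold ignored
```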

predict_per_tree(X: ndarray | cupy.ndarray, *, chunk_size: int | None = None) ndarray | cupy.ndarray[source]#

Output prediction of each tree. This function computes one or more margin scores per tree.

Parameters:
X

The input data of shape Rows X Features. This can be a numpy array or cupy array. nvForest is optimized for C-major arrays (e.g. numpy/cupy arrays). Inputs whose datatype does not match the precision of the loaded model (float/double) will be converted to the correct datatype before inference. If this input is in a memory location that is inaccessible to the current device type (as set with the ‘device’ parameter in the constructor), it will be copied to the correct location. This copy will be distributed across as many CUDA streams as are available in the stream pool of the model’s RAFT handle.

chunk_size

The number of rows to simultaneously process in one iteration of the inference algorithm. Batches are further broken down into “chunks” of this size when assigning available threads to tasks. The choice of chunk size can have a substantial impact on performance, but the optimal choice depends on model and hardware and is difficult to predict a priori. In general, larger batch sizes benefit from larger chunk sizes, and smaller batch sizes benefit from small chunk sizes. On GPU, valid values are powers of 2 from 1 to 32. On CPU, valid values are any power of 2, but little benefit is expected above a chunk size of 512.

predict_proba(X: ndarray | cupy.ndarray, *, chunk_size: int | None = None) ndarray | cupy.ndarray[source]#

Predict the class probabilities for each row in X.

Parameters:
X

The input data of shape Rows X Features. This can be a numpy array or cupy array. nvForest is optimized for C-major arrays (e.g. numpy/cupy arrays). Inputs whose datatype does not match the precision of the loaded model (float/double) will be converted to the correct datatype before inference. If this input is in a memory location that is inaccessible to the current device type (as set with the ‘device’ parameter in the constructor), it will be copied to the correct location. This copy will be distributed across as many CUDA streams as are available in the stream pool of the model’s RAFT handle.

chunk_size

The number of rows to simultaneously process in one iteration of the inference algorithm. Batches are further broken down into “chunks” of this size when assigning available threads to tasks. The choice of chunk size can have a substantial impact on performance, but the optimal choice depends on model and hardware and is difficult to predict a priori. In general, larger batch sizes benefit from larger chunk sizes, and smaller batch sizes benefit from small chunk sizes. On GPU, valid values are powers of 2 from 1 to 32. On CPU, valid values are any power of 2, but little benefit is expected above a chunk size of 512.
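Given the precision notes above, a deployment check on predict_proba output should use a floating-point tolerance rather than bit-for-bit equality. A minimal sketch with stand-in arrays (not real model output):

```python
import numpy as np

reference = np.array([[0.25, 0.75], [0.9, 0.1]])  # e.g. the training framework's probabilities
candidate = reference + 1e-7                       # stand-in for nvForest's predict_proba(X)

# Tolerate small single-precision rounding differences...
assert np.allclose(candidate, reference, rtol=1e-5, atol=1e-6)
# ...and sanity-check that each row is still a probability distribution.
assert np.allclose(candidate.sum(axis=1), 1.0, atol=1e-5)
```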

class nvforest.CPUForestInferenceRegressor(*, treelite_model: Model, handle: Handle | None = None, layout: str = 'depth_first', default_chunk_size: int | None = None, align_bytes: int | None = None, precision: str | None = None)[source]#
Attributes:
align_bytes
default_chunk_size
device_id
layout
num_outputs
num_trees
precision

Methods

apply(X, *[, chunk_size])

Output the ID of the leaf node for each tree.

predict(X, *[, chunk_size])

Predict the output for each row.

predict_per_tree(X, *[, chunk_size])

Output prediction of each tree.

apply(X: ndarray | cupy.ndarray, *, chunk_size: int | None = None) ndarray | cupy.ndarray[source]#

Output the ID of the leaf node for each tree.

Parameters:
X

The input data of shape Rows X Features. This can be a numpy array or cupy array. nvForest is optimized for C-major arrays (e.g. numpy/cupy arrays). Inputs whose datatype does not match the precision of the loaded model (float/double) will be converted to the correct datatype before inference. If this input is in a memory location that is inaccessible to the current device type (as set with the ‘device’ parameter in the constructor), it will be copied to the correct location. This copy will be distributed across as many CUDA streams as are available in the stream pool of the model’s RAFT handle.

chunk_size

The number of rows to simultaneously process in one iteration of the inference algorithm. Batches are further broken down into “chunks” of this size when assigning available threads to tasks. The choice of chunk size can have a substantial impact on performance, but the optimal choice depends on model and hardware and is difficult to predict a priori. In general, larger batch sizes benefit from larger chunk sizes, and smaller batch sizes benefit from small chunk sizes. On GPU, valid values are powers of 2 from 1 to 32. On CPU, valid values are any power of 2, but little benefit is expected above a chunk size of 512.

predict(X: ndarray | cupy.ndarray, *, chunk_size: int | None = None) ndarray | cupy.ndarray[source]#

Predict the output for each row.

Parameters:
X

The input data of shape Rows X Features. This can be a numpy array or cupy array. nvForest is optimized for C-major arrays (e.g. numpy/cupy arrays). Inputs whose datatype does not match the precision of the loaded model (float/double) will be converted to the correct datatype before inference. If this input is in a memory location that is inaccessible to the current device type (as set with the ‘device’ parameter in the constructor), it will be copied to the correct location. This copy will be distributed across as many CUDA streams as are available in the stream pool of the model’s RAFT handle.

chunk_size

The number of rows to simultaneously process in one iteration of the inference algorithm. Batches are further broken down into “chunks” of this size when assigning available threads to tasks. The choice of chunk size can have a substantial impact on performance, but the optimal choice depends on model and hardware and is difficult to predict a priori. In general, larger batch sizes benefit from larger chunk sizes, and smaller batch sizes benefit from small chunk sizes. On GPU, valid values are powers of 2 from 1 to 32. On CPU, valid values are any power of 2, but little benefit is expected above a chunk size of 512.

predict_per_tree(X: ndarray | cupy.ndarray, *, chunk_size: int | None = None) ndarray | cupy.ndarray[source]#

Output prediction of each tree. This function computes one or more margin scores per tree.

Parameters:
X

The input data of shape Rows X Features. This can be a numpy array or cupy array. nvForest is optimized for C-major arrays (e.g. numpy/cupy arrays). Inputs whose datatype does not match the precision of the loaded model (float/double) will be converted to the correct datatype before inference. If this input is in a memory location that is inaccessible to the current device type (as set with the ‘device’ parameter in the constructor), it will be copied to the correct location. This copy will be distributed across as many CUDA streams as are available in the stream pool of the model’s RAFT handle.

chunk_size

The number of rows to simultaneously process in one iteration of the inference algorithm. Batches are further broken down into “chunks” of this size when assigning available threads to tasks. The choice of chunk size can have a substantial impact on performance, but the optimal choice depends on model and hardware and is difficult to predict a priori. In general, larger batch sizes benefit from larger chunk sizes, and smaller batch sizes benefit from small chunk sizes. On GPU, valid values are powers of 2 from 1 to 32. On CPU, valid values are any power of 2, but little benefit is expected above a chunk size of 512.
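For a bagged ensemble such as a random forest regressor, the ensemble prediction is the mean of the per-tree predictions, so predict() output can be recovered from predict_per_tree() output (boosted models instead sum margins plus a base score). A sketch with made-up per-tree values:

```python
import numpy as np

per_tree = np.array([
    [1.0, 3.0, 2.0],   # row 0: predictions from 3 trees
    [4.0, 4.0, 1.0],   # row 1
])
# Averaging across the tree axis reproduces the bagged-ensemble output.
ensemble = per_tree.mean(axis=1)
```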

class nvforest.GPUForestInferenceClassifier(*, treelite_model: Model, handle: Handle | None = None, layout: str = 'depth_first', default_chunk_size: int | None = None, align_bytes: int | None = None, precision: str | None = None, device_id: int)[source]#
Attributes:
align_bytes
default_chunk_size
device_id
layout
num_outputs
num_trees
precision

Methods

apply(X, *[, chunk_size])

Output the ID of the leaf node for each tree.

predict(X, *[, chunk_size, threshold])

Predict the class for each row.

predict_per_tree(X, *[, chunk_size])

Output prediction of each tree.

predict_proba(X, *[, chunk_size])

Predict the class probabilities for each row in X.

apply(X: ndarray | cupy.ndarray, *, chunk_size: int | None = None) ndarray | cupy.ndarray[source]#

Output the ID of the leaf node for each tree.

Parameters:
X

The input data of shape Rows X Features. This can be a numpy array or cupy array. nvForest is optimized for C-major arrays (e.g. numpy/cupy arrays). Inputs whose datatype does not match the precision of the loaded model (float/double) will be converted to the correct datatype before inference. If this input is in a memory location that is inaccessible to the current device type (as set with the ‘device’ parameter in the constructor), it will be copied to the correct location. This copy will be distributed across as many CUDA streams as are available in the stream pool of the model’s RAFT handle.

chunk_size

The number of rows to simultaneously process in one iteration of the inference algorithm. Batches are further broken down into “chunks” of this size when assigning available threads to tasks. The choice of chunk size can have a substantial impact on performance, but the optimal choice depends on model and hardware and is difficult to predict a priori. In general, larger batch sizes benefit from larger chunk sizes, and smaller batch sizes benefit from small chunk sizes. On GPU, valid values are powers of 2 from 1 to 32. On CPU, valid values are any power of 2, but little benefit is expected above a chunk size of 512.

predict(X: ndarray | cupy.ndarray, *, chunk_size: int | None = None, threshold: float | None = None) ndarray | cupy.ndarray[source]#

Predict the class for each row.

Parameters:
X

The input data of shape Rows X Features. This can be a numpy array or cupy array. nvForest is optimized for C-major arrays (e.g. numpy/cupy arrays). Inputs whose datatype does not match the precision of the loaded model (float/double) will be converted to the correct datatype before inference. If this input is in a memory location that is inaccessible to the current device type (as set with the ‘device’ parameter in the constructor), it will be copied to the correct location. This copy will be distributed across as many CUDA streams as are available in the stream pool of the model’s RAFT handle.

chunk_size

The number of rows to simultaneously process in one iteration of the inference algorithm. Batches are further broken down into “chunks” of this size when assigning available threads to tasks. The choice of chunk size can have a substantial impact on performance, but the optimal choice depends on model and hardware and is difficult to predict a priori. In general, larger batch sizes benefit from larger chunk sizes, and smaller batch sizes benefit from small chunk sizes. On GPU, valid values are powers of 2 from 1 to 32. On CPU, valid values are any power of 2, but little benefit is expected above a chunk size of 512.

threshold

For binary classifiers, output probabilities above this threshold will be considered positive detections. If None, a threshold of 0.5 will be used for binary classifiers. For multiclass classifiers, the highest probability class is chosen regardless of threshold.

predict_per_tree(X: ndarray | cupy.ndarray, *, chunk_size: int | None = None) ndarray | cupy.ndarray[source]#

Output prediction of each tree. This function computes one or more margin scores per tree.

Parameters:
X

The input data of shape Rows X Features. This can be a numpy array or cupy array. nvForest is optimized for C-major arrays (e.g. numpy/cupy arrays). Inputs whose datatype does not match the precision of the loaded model (float/double) will be converted to the correct datatype before inference. If this input is in a memory location that is inaccessible to the current device type (as set with the ‘device’ parameter in the constructor), it will be copied to the correct location. This copy will be distributed across as many CUDA streams as are available in the stream pool of the model’s RAFT handle.

chunk_size

The number of rows to simultaneously process in one iteration of the inference algorithm. Batches are further broken down into “chunks” of this size when assigning available threads to tasks. The choice of chunk size can have a substantial impact on performance, but the optimal choice depends on model and hardware and is difficult to predict a priori. In general, larger batch sizes benefit from larger chunk sizes, and smaller batch sizes benefit from small chunk sizes. On GPU, valid values are powers of 2 from 1 to 32. On CPU, valid values are any power of 2, but little benefit is expected above a chunk size of 512.

predict_proba(X: ndarray | cupy.ndarray, *, chunk_size: int | None = None) ndarray | cupy.ndarray[source]#

Predict the class probabilities for each row in X.

Parameters:
X

The input data of shape Rows X Features. This can be a numpy array or cupy array. nvForest is optimized for C-major arrays (e.g. numpy/cupy arrays). Inputs whose datatype does not match the precision of the loaded model (float/double) will be converted to the correct datatype before inference. If this input is in a memory location that is inaccessible to the current device type (as set with the ‘device’ parameter in the constructor), it will be copied to the correct location. This copy will be distributed across as many CUDA streams as are available in the stream pool of the model’s RAFT handle.

chunk_size

The number of rows to simultaneously process in one iteration of the inference algorithm. Batches are further broken down into “chunks” of this size when assigning available threads to tasks. The choice of chunk size can have a substantial impact on performance, but the optimal choice depends on model and hardware and is difficult to predict a priori. In general, larger batch sizes benefit from larger chunk sizes, and smaller batch sizes benefit from small chunk sizes. On GPU, valid values are powers of 2 from 1 to 32. On CPU, valid values are any power of 2, but little benefit is expected above a chunk size of 512.

class nvforest.GPUForestInferenceRegressor(*, treelite_model: Model, handle: Handle | None = None, layout: str = 'depth_first', default_chunk_size: int | None = None, align_bytes: int | None = None, precision: str | None = None, device_id: int)[source]#
Attributes:
align_bytes
default_chunk_size
device_id
layout
num_outputs
num_trees
precision

Methods

apply(X, *[, chunk_size])

Output the ID of the leaf node for each tree.

predict(X, *[, chunk_size])

Predict the output for each row.

predict_per_tree(X, *[, chunk_size])

Output prediction of each tree.

apply(X: ndarray | cupy.ndarray, *, chunk_size: int | None = None) ndarray | cupy.ndarray[source]#

Output the ID of the leaf node for each tree.

Parameters:
X

The input data of shape Rows X Features. This can be a numpy array or cupy array. nvForest is optimized for C-major arrays (e.g. numpy/cupy arrays). Inputs whose datatype does not match the precision of the loaded model (float/double) will be converted to the correct datatype before inference. If this input is in a memory location that is inaccessible to the current device type (as set with the ‘device’ parameter in the constructor), it will be copied to the correct location. This copy will be distributed across as many CUDA streams as are available in the stream pool of the model’s RAFT handle.

chunk_size

The number of rows to simultaneously process in one iteration of the inference algorithm. Batches are further broken down into “chunks” of this size when assigning available threads to tasks. The choice of chunk size can have a substantial impact on performance, but the optimal choice depends on model and hardware and is difficult to predict a priori. In general, larger batch sizes benefit from larger chunk sizes, and smaller batch sizes benefit from small chunk sizes. On GPU, valid values are powers of 2 from 1 to 32. On CPU, valid values are any power of 2, but little benefit is expected above a chunk size of 512.

predict(X: ndarray | cupy.ndarray, *, chunk_size: int | None = None) ndarray | cupy.ndarray[source]#

Predict the output for each row.

Parameters:
X

The input data of shape Rows X Features. This can be a numpy array or cupy array. nvForest is optimized for C-major arrays (e.g. numpy/cupy arrays). Inputs whose datatype does not match the precision of the loaded model (float/double) will be converted to the correct datatype before inference. If this input is in a memory location that is inaccessible to the current device type (as set with the ‘device’ parameter in the constructor), it will be copied to the correct location. This copy will be distributed across as many CUDA streams as are available in the stream pool of the model’s RAFT handle.

chunk_size

The number of rows to simultaneously process in one iteration of the inference algorithm. Batches are further broken down into “chunks” of this size when assigning available threads to tasks. The choice of chunk size can have a substantial impact on performance, but the optimal choice depends on model and hardware and is difficult to predict a priori. In general, larger batch sizes benefit from larger chunk sizes, and smaller batch sizes benefit from small chunk sizes. On GPU, valid values are powers of 2 from 1 to 32. On CPU, valid values are any power of 2, but little benefit is expected above a chunk size of 512.

predict_per_tree(X: ndarray | cupy.ndarray, *, chunk_size: int | None = None) ndarray | cupy.ndarray[source]#

Output prediction of each tree. This function computes one or more margin scores per tree.

Parameters:
X

The input data of shape Rows X Features. This can be a numpy array or cupy array. nvForest is optimized for C-major arrays (e.g. numpy/cupy arrays). Inputs whose datatype does not match the precision of the loaded model (float/double) will be converted to the correct datatype before inference. If this input is in a memory location that is inaccessible to the current device type (as set with the ‘device’ parameter in the constructor), it will be copied to the correct location. This copy will be distributed across as many CUDA streams as are available in the stream pool of the model’s RAFT handle.

chunk_size

The number of rows to simultaneously process in one iteration of the inference algorithm. Batches are further broken down into “chunks” of this size when assigning available threads to tasks. The choice of chunk size can have a substantial impact on performance, but the optimal choice depends on model and hardware and is difficult to predict a priori. In general, larger batch sizes benefit from larger chunk sizes, and smaller batch sizes benefit from small chunk sizes. On GPU, valid values are powers of 2 from 1 to 32. On CPU, valid values are any power of 2, but little benefit is expected above a chunk size of 512.
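The leaf IDs returned by apply() (one column per tree) are commonly treated as categorical features, for example one-hot encoded as input to a downstream linear model. A minimal sketch with made-up leaf IDs rather than real apply() output:

```python
import numpy as np

leaf_ids = np.array([
    [3, 5],   # row 0: landed in leaf 3 of tree 0, leaf 5 of tree 1
    [3, 2],   # row 1
])
n_rows, n_trees = leaf_ids.shape

# One-hot encode each tree's leaf-ID column independently, then
# concatenate the per-tree encodings into one feature matrix.
encoded = []
for t in range(n_trees):
    ids, inverse = np.unique(leaf_ids[:, t], return_inverse=True)
    one_hot = np.zeros((n_rows, ids.size))
    one_hot[np.arange(n_rows), inverse] = 1.0
    encoded.append(one_hot)
features = np.hstack(encoded)
```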