Intro and key concepts for cuML
cuML accelerates machine learning on GPUs. The library follows a few key principles, and understanding them will help you take full advantage of cuML.
1. Where possible, match the scikit-learn API
cuML estimators look and feel just like scikit-learn estimators. You initialize them with key parameters, fit them with a fit method, then call predict or transform for inference.
import cuml

# Create and fit the model (X_train and y are existing training arrays)
model = cuml.LinearRegression()
model.fit(X_train, y)

# Predict on new data
y_prediction = model.predict(X_test)
You can find many more complete examples in the Introductory Notebook and in the cuML API documentation.
2. Accept flexible input types, return predictable output types
cuML estimators can accept NumPy arrays, cuDF DataFrames, cuPy arrays, 2D PyTorch tensors, and just about any other standards-based Python array input you can throw at them. This relies on the __array__ and __cuda_array_interface__ standards, which are widely used throughout the PyData community.
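As a minimal sketch of that flexibility (the random data and column names below are made up for illustration), the same estimator can be fit with host-side NumPy arrays or with GPU-side cuDF objects:

import numpy as np
import cudf
import cuml

X_np = np.random.rand(1000, 2).astype(np.float32)
y_np = np.random.rand(1000).astype(np.float32)

# NumPy (host) input works directly...
cuml.LinearRegression().fit(X_np, y_np)

# ...and so do cuDF DataFrame / Series objects holding the same data on the GPU.
X_gdf = cudf.DataFrame({"f0": X_np[:, 0], "f1": X_np[:, 1]})
y_gdf = cudf.Series(y_np)
cuml.LinearRegression().fit(X_gdf, y_gdf)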
By default, outputs will mirror the data type you provided. So, if you fit a model with a NumPy array, the model.coef_ property containing the fitted coefficients will also be a NumPy array. If you fit a model using cuDF's GPU-based DataFrame and Series objects, the model's output properties will be cuDF objects. You can always override this behavior and select a default output type with the memory_utils.set_global_output_type function.
The RAPIDS Configurable Input and Output Types blog post goes into much more detail explaining this approach.
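Here is a rough sketch of the output mirroring described above (assuming set_global_output_type is also exposed at the top level as cuml.set_global_output_type, as in recent releases):

import numpy as np
import cuml

X = np.random.rand(1000, 2).astype(np.float32)
y = np.random.rand(1000).astype(np.float32)

# NumPy in, NumPy out: coef_ mirrors the input type.
model = cuml.LinearRegression().fit(X, y)
print(type(model.coef_))  # numpy.ndarray

# Override the default so estimators return cuDF objects regardless of
# input type (coef_ should now come back as a cudf.Series).
cuml.set_global_output_type("cudf")
model = cuml.LinearRegression().fit(X, y)
print(type(model.coef_))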
3. Be fast!
cuML’s estimators rely on highly optimized CUDA primitives and algorithms within libcuml. On a modern GPU, these can exceed the performance of CPU-based equivalents by a factor of anything from 4x (for a medium-sized linear regression) to over 1000x (for large-scale tSNE dimensionality reduction). The cuml.benchmark module provides an easy interface to benchmark your own hardware.
To maximize performance, keep in mind that a modern GPU can have over 5,000 cores, so make sure you're providing enough data to keep it busy! In many cases, cuML's performance advantage grows with the size of the dataset.
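One hand-rolled way to see this on your own hardware is a simple timing comparison like the sketch below. It is illustrative only; the cuml.benchmark module provides a more systematic harness, and actual speedups depend heavily on your GPU and dataset size.

import time
import numpy as np
import cuml
from sklearn.linear_model import LinearRegression as skLinearRegression

# A reasonably large dataset, so the GPU has enough work to show its advantage.
X = np.random.rand(1_000_000, 50).astype(np.float32)
y = np.random.rand(1_000_000).astype(np.float32)

start = time.perf_counter()
skLinearRegression().fit(X, y)
print(f"scikit-learn fit: {time.perf_counter() - start:.3f} s")

start = time.perf_counter()
cuml.LinearRegression().fit(X, y)
print(f"cuML fit:         {time.perf_counter() - start:.3f} s")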
Learn more
To get started with cuML, walk through the Introductory Notebook. Then try some of the other examples in the notebooks directory of the repository. Finally, take a deeper dive with the cuML blogs.