Introduction

cuML accelerates machine learning on GPUs. The library follows a few key design principles, and understanding them will help you take full advantage of cuML.

1. Where possible, match the scikit-learn API

cuML estimators look and feel just like scikit-learn estimators. You initialize them with key parameters, fit them with a fit method, then call predict or transform for inference.

import cuml

model = cuml.LinearRegression()
model.fit(X_train, y_train)
y_prediction = model.predict(X_test)
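The estimator convention above can be sketched without a GPU. The class below is a hypothetical, NumPy-only stand-in (it is not part of cuML or scikit-learn) that follows the same fit/predict pattern, with fitted attributes exposed via a trailing underscore as in coef_:

```python
import numpy as np

class TinyLinearRegression:
    """Illustrative estimator following the scikit-learn-style
    fit/predict convention that cuML mirrors. Not a cuML class."""

    def fit(self, X, y):
        # Append a column of ones so the intercept is learned jointly
        # with the coefficients via ordinary least squares.
        X = np.asarray(X, dtype=np.float64)
        A = np.hstack([X, np.ones((X.shape[0], 1))])
        coef, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=np.float64), rcond=None)
        self.coef_ = coef[:-1]
        self.intercept_ = coef[-1]
        return self

    def predict(self, X):
        return np.asarray(X, dtype=np.float64) @ self.coef_ + self.intercept_

X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.array([2.0, 4.0, 6.0])

model = TinyLinearRegression()
model.fit(X_train, y_train)
y_prediction = model.predict(np.array([[4.0]]))  # ≈ [8.0] for this exact-fit data
```

Because cuML estimators follow this same shape, code written against one library can usually be ported to the other by changing only the import.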

You can find many more complete examples in the Introductory Notebook and in the cuML API documentation.

2. Accept flexible input types, return predictable output types

cuML estimators can accept NumPy arrays, cuDF DataFrames, cuPy arrays, 2D PyTorch tensors, and virtually any standards-based Python array input. This flexibility relies on the __array__ and __cuda_array_interface__ protocols, widely used throughout the PyData community.
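To see how such a protocol works in practice, here is a hypothetical container (not a cuML or cuDF type) that implements __array__, which is enough for any NumPy-aware consumer to convert it:

```python
import numpy as np

class ArrayLike:
    """Hypothetical container implementing the __array__ protocol.
    NumPy-aware code can convert it with np.asarray."""

    def __init__(self, data):
        self._data = list(data)

    def __array__(self, dtype=None, copy=None):
        # NumPy invokes this hook when converting the object to an ndarray.
        return np.array(self._data, dtype=dtype)

X = ArrayLike([[1.0, 2.0], [3.0, 4.0]])
arr = np.asarray(X)  # dispatches to X.__array__()
```

__cuda_array_interface__ plays the analogous role on the GPU: instead of producing a host array, it exposes a device pointer plus shape and dtype metadata, which is how cuPy arrays, cuDF columns, and PyTorch CUDA tensors interoperate without copies.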

By default, outputs will mirror the data type you provided. So, if you fit a model with a NumPy array, the model.coef_ property containing fitted coefficients will also be a NumPy array. If you fit a model using cuDF’s GPU-based DataFrame and Series objects, the model’s output properties will be cuDF objects. You can always override this behavior and select a default datatype with the memory_utils.set_global_output_type function.

The RAPIDS Configurable Input and Output Types blog post goes into much more detail explaining this approach.

3. Be fast!

cuML’s estimators rely on highly-optimized CUDA primitives and algorithms within libcuml. On a modern GPU, these can exceed the performance of CPU-based equivalents by a factor of anything from 4x (for a medium-sized linear regression) to over 1000x (for large-scale t-SNE dimensionality reduction). The cuml.benchmark module provides an easy interface to benchmark your own hardware.

To maximize performance, keep in mind that a modern GPU can have over 5000 cores, so make sure you’re providing enough data to keep it busy. In many cases, the performance advantage grows with the size of the dataset.
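Running cuml.benchmark requires a GPU, but the underlying idea is simple to sketch. The hypothetical harness below (plain Python, not the cuml.benchmark API) times a workload across growing input sizes, which is the shape of measurement you would use to find where your hardware's advantage kicks in:

```python
import time
import numpy as np

def benchmark(fn, sizes, repeats=3):
    """Time fn(X) on random inputs of each size, keeping the best of
    several runs to reduce noise. Illustrative only, not cuml.benchmark."""
    rng = np.random.default_rng(0)
    results = {}
    for n in sizes:
        X = rng.standard_normal((n, 8))
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn(X)
            timings.append(time.perf_counter() - start)
        results[n] = min(timings)
    return results

# Example workload: a matrix product whose cost grows with the row count.
timings = benchmark(lambda X: X.T @ X, sizes=[1_000, 10_000])
```

Comparing the resulting timings for a CPU and a GPU implementation of the same workload, size by size, shows the crossover point past which the GPU's parallelism pays off.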

Learn more

To get started learning cuML, walk through the Introductory Notebook. Then try out some of the other notebook examples in the notebooks directory of the repository. Finally, do a deeper dive with the cuML blogs.