Zero Code Change Acceleration: Getting Started with cuml.accel

The cuml.accel accelerator mode brings GPU-accelerated computing to existing machine learning workflows with zero code changes. By simply loading the cuml.accel extension, your existing scikit-learn, UMAP, and HDBSCAN code can automatically leverage GPU acceleration for supported algorithms.

This notebook demonstrates how to use cuml.accel with practical examples across different machine learning tasks.
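In a notebook, the extension is enabled with the %load_ext cuml.accel magic demonstrated below. For scripts, recent cuML releases also document a command-line entry point and a programmatic install hook; a minimal sketch (check the cuML documentation for your installed version):

# Option 1: enable the accelerator programmatically, before importing
# scikit-learn / UMAP / HDBSCAN, so the import hook can intercept them.
import cuml.accel
cuml.accel.install()

from sklearn.ensemble import RandomForestClassifier  # accelerated where supported

# Option 2: run an unmodified script under the accelerator from the shell:
#   python -m cuml.accel script.py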

Classical machine learning covers a wide range of interesting problems. In this brief introduction, we demonstrate a typical classification workflow.

Classification Example

Let’s load a dataset and see how we can use scikit-learn to classify that data. For this example we’ll use the Covertype dataset, which contains a number of features that can be used to predict forest cover type, such as elevation, aspect, slope, and soil type.

More information on this dataset can be found at https://archive.ics.uci.edu/dataset/31/covertype.
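As an aside, scikit-learn can fetch the same dataset directly with sklearn.datasets.fetch_covtype, which avoids hand-writing the column names; a minimal alternative sketch:

from sklearn.datasets import fetch_covtype

# Downloads and caches the Covertype dataset on first use.
covtype = fetch_covtype(as_frame=True)
X_alt, y_alt = covtype.data, covtype.target

Below we instead load the raw CSV so that the column names are explicit.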

[1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
[2]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz"

# Column names for the dataset (from UCI Covertype description)
columns = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology',
           'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
           'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area1', 'Wilderness_Area2', 'Wilderness_Area3',
           'Wilderness_Area4', 'Soil_Type1', 'Soil_Type2', 'Soil_Type3', 'Soil_Type4', 'Soil_Type5', 'Soil_Type6',
           'Soil_Type7', 'Soil_Type8', 'Soil_Type9', 'Soil_Type10', 'Soil_Type11', 'Soil_Type12', 'Soil_Type13',
           'Soil_Type14', 'Soil_Type15', 'Soil_Type16', 'Soil_Type17', 'Soil_Type18', 'Soil_Type19', 'Soil_Type20',
           'Soil_Type21', 'Soil_Type22', 'Soil_Type23', 'Soil_Type24', 'Soil_Type25', 'Soil_Type26', 'Soil_Type27',
           'Soil_Type28', 'Soil_Type29', 'Soil_Type30', 'Soil_Type31', 'Soil_Type32', 'Soil_Type33', 'Soil_Type34',
           'Soil_Type35', 'Soil_Type36', 'Soil_Type37', 'Soil_Type38', 'Soil_Type39', 'Soil_Type40', 'Cover_Type']

data = pd.read_csv(url, header=None)
data.columns = columns
[3]:
data.shape
[3]:
(581012, 55)

Next, we’ll separate out the classification variable (Cover_Type) from the rest of the data. This is what we will aim to predict with our classification model. We can also split our dataset into training and test data using the scikit-learn train_test_split function.

[4]:
X, y = data.drop('Cover_Type', axis=1), data['Cover_Type']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
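Because no random_state is passed, the split (and therefore the exact scores shown below) will vary slightly from run to run. If you want a reproducible, class-balanced split, one option is:

# Optional: seed the split and stratify on the imbalanced target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)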

Now that we have our dataset split, we’re ready to train a model. To start, we will run the model using the scikit-learn library with a starting max depth of 5 and all of the features. Note that we set n_jobs=-1 to utilize all available CPU cores for fitting the trees, which ensures we get the best performance possible on our system’s CPU.

[5]:
%%time

clf = RandomForestClassifier(n_estimators=100, max_depth=5, max_features=1.0, n_jobs=-1)
clf.fit(X_train, y_train)
CPU times: user 2min 42s, sys: 1.5 s, total: 2min 43s
Wall time: 14.5 s
[5]:
RandomForestClassifier(max_depth=5, max_features=1.0, n_jobs=-1)

The fit took about 15 seconds of wall time, but consumed nearly 3 minutes of total CPU time across all cores. This is not bad! Let’s use the model we just trained to predict cover types in our test dataset and take a look at its accuracy.

[6]:
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
[6]:
0.704663390790255

We can also print out a full classification report to better understand how we predicted different Cover_Type categories.

[7]:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           1       0.68      0.69      0.69     42381
           2       0.73      0.78      0.76     56719
           3       0.63      0.85      0.73      7153
           4       0.52      0.39      0.45       541
           5       0.48      0.05      0.08      1823
           6       0.71      0.03      0.06      3444
           7       0.78      0.43      0.55      4142

    accuracy                           0.70    116203
   macro avg       0.65      0.46      0.47    116203
weighted avg       0.70      0.70      0.69    116203

With scikit-learn, we built a model that trained in well under a minute of wall time. From the classification report, we can see that we predicted the correct class around 70% of the time, which is not bad but could certainly be improved.
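Note the gap between the macro and weighted averages in the report: the rare classes (4, 5, and 6) are predicted poorly, but they carry little weight in the support-weighted scores. You can compute both averages directly to see the difference:

from sklearn.metrics import f1_score

# Macro averaging weights all 7 classes equally; weighted averaging
# weights them by support, masking the weak rare-class performance.
print(f1_score(y_test, y_pred, average='macro'))
print(f1_score(y_test, y_pred, average='weighted'))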

Often we want to run several different random forest models in order to optimize our hyperparameters. For example, we may want to increase the number of estimators, or modify the maximum depth of our trees. When running dozens or hundreds of different hyperparameter combinations, iteration becomes slow and expensive.

We provide some sample code utilizing GridSearchCV below to show what this process might look like. The grid below contains 3 × 4 × 3 × 3 × 3 × 2 = 648 parameter combinations; with 5-fold cross-validation that is 3,240 fits, which would take many hours at roughly 15 seconds per fit.

[8]:
"""
from sklearn.model_selection import GridSearchCV

# Define the parameter grid to search over
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [1.0, 'sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3
    'bootstrap': [True, False]
}

grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)
"""

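If even a trimmed-down grid is too expensive, scikit-learn’s RandomizedSearchCV samples a fixed number of combinations instead of trying them all. Here is a minimal sketch, with the fit left commented out like the grid search above since it is still slow on CPU:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# Sample 10 combinations at random rather than exhaustively trying all 648.
param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'max_features': [1.0, 'sqrt', 'log2'],
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1),
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    random_state=42,
)
# random_search.fit(X_train, y_train)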
Now let’s load cuml.accel and try running the same code again to see what kind of acceleration we can get.

[9]:
%load_ext cuml.accel

After loading the extension, we need to re-import the scikit-learn estimators we wish to use so that the accelerated versions are picked up.

[10]:
from sklearn.ensemble import RandomForestClassifier
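If you want to confirm that the accelerator intercepted the import, one lightweight check is to look at the module the class now reports. With cuml.accel active it is typically a cuML proxy module rather than sklearn.ensemble; the exact name varies by cuML version:

# Usually no longer prints 'sklearn.ensemble' once cuml.accel is loaded.
print(RandomForestClassifier.__module__)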
[11]:
%%time

clf = RandomForestClassifier(n_estimators=100, max_depth=5, max_features=1.0, n_jobs=-1)
clf.fit(X_train, y_train)
CPU times: user 5.55 s, sys: 3.96 s, total: 9.5 s
Wall time: 6.54 s
[11]:
RandomForestClassifier(max_depth=5, max_features=1.0, n_jobs=-1)

That was much faster! With cuml.accel, the same fit took about 6.5 seconds of wall time instead of 14.5, and under 10 seconds of total CPU time instead of nearly 3 minutes. One thing to note is that cuML’s implementation of RandomForestClassifier doesn’t utilize the n_jobs parameter the way scikit-learn does, but the parameter is still accepted, which makes it easier to use this accelerator with zero code changes.

Let’s take a look at the same accuracy score and classification report to compare the model’s performance.

[12]:
y_pred = clf.predict(X_test)
cr = classification_report(y_test, y_pred)
print(cr)
              precision    recall  f1-score   support

           1       0.68      0.69      0.68     42381
           2       0.73      0.78      0.76     56719
           3       0.64      0.85      0.73      7153
           4       0.66      0.44      0.53       541
           5       0.49      0.05      0.08      1823
           6       0.71      0.03      0.05      3444
           7       0.74      0.47      0.57      4142

    accuracy                           0.71    116203
   macro avg       0.66      0.47      0.49    116203
weighted avg       0.70      0.71      0.69    116203

Out of the box, the model performed about the same as the scikit-learn implementation. Because this model trains so much faster, we can quickly iterate on the hyperparameter configuration and find a model that performs better. For example, let’s increase the maximum depth to 30.

[13]:
%%time

clf = RandomForestClassifier(n_estimators=100, max_depth=30, max_features=1.0, n_jobs=-1)
clf.fit(X_train, y_train)
CPU times: user 11.3 s, sys: 19.9 s, total: 31.2 s
Wall time: 9.75 s
[13]:
RandomForestClassifier(max_depth=30, max_features=1.0, n_jobs=-1)
[14]:
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           1       0.97      0.96      0.97     42381
           2       0.97      0.98      0.97     56719
           3       0.96      0.96      0.96      7153
           4       0.90      0.88      0.89       541
           5       0.92      0.85      0.89      1823
           6       0.94      0.92      0.93      3444
           7       0.97      0.96      0.96      4142

    accuracy                           0.97    116203
   macro avg       0.95      0.93      0.94    116203
weighted avg       0.97      0.97      0.97    116203

With a model that runs in just seconds, we can perform hyperparameter optimization using a method like the grid search shown above, and have results in just minutes instead of hours.
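As a simpler illustration than a full grid search, the following sketch sweeps a single hyperparameter and times each accelerated fit. Runtimes will vary with your GPU:

import time

for depth in [5, 10, 20, 30]:
    start = time.perf_counter()
    model = RandomForestClassifier(n_estimators=100, max_depth=depth,
                                   max_features=1.0, n_jobs=-1)
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    print(f"max_depth={depth}: fit in {elapsed:.1f}s, "
          f"test accuracy {model.score(X_test, y_test):.3f}")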

CPU Fallback

There are some algorithms and functionality from scikit-learn, UMAP, and HDBSCAN that are not implemented in cuML. For cases where the underlying functionality is not supported on GPU, the cuML accelerator will gracefully fall back and execute on the CPU instead.

For example, cuML’s RandomForest estimator does not support sparse inputs.

[15]:
%%time

from scipy import sparse

X_train_sparse = sparse.csr_matrix(X_train)

clf = RandomForestClassifier(n_estimators=100, max_depth=5, max_features=1.0, n_jobs=-1)
clf.fit(X_train_sparse, y_train)
CPU times: user 3min 14s, sys: 425 ms, total: 3min 14s
Wall time: 17.8 s
[15]:
RandomForestClassifier(max_depth=5, max_features=1.0, n_jobs=-1)

We can see that the model took longer to fit because it fell back to the CPU, but the code executed without errors. One side effect: since the model was fit on a sparse matrix without feature names, predicting on the X_test DataFrame below produces a harmless scikit-learn warning about feature names.
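If the densified matrix fits in memory, one way to stay on the GPU path is to convert the sparse input back to a dense array before fitting; a minimal sketch (here roughly 465,000 rows by 54 columns, about 200 MB as float64):

# Densify the input so cuML's GPU random forest can handle it directly.
clf_dense = RandomForestClassifier(n_estimators=100, max_depth=5, max_features=1.0, n_jobs=-1)
clf_dense.fit(X_train_sparse.toarray(), y_train)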

[16]:
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           1       0.68      0.69      0.69     42381
           2       0.73      0.78      0.76     56719
           3       0.63      0.85      0.73      7153
           4       0.53      0.39      0.45       541
           5       0.49      0.05      0.08      1823
           6       0.71      0.03      0.06      3444
           7       0.78      0.43      0.55      4142

    accuracy                           0.70    116203
   macro avg       0.65      0.46      0.47    116203
weighted avg       0.70      0.70      0.69    116203

/opt/conda/envs/docs/lib/python3.13/site-packages/sklearn/utils/validation.py:2742: UserWarning: X has feature names, but RandomForestClassifier was fitted without feature names
  warnings.warn(