Getting Started with cuML’s accelerator mode (cuml.accel) in Snowflake Notebooks#


Before getting started with this example, follow the cuDF and cuML in Snowflake Notebooks (ML Runtime) section of the Snowflake guide.

cuML is a Python GPU library for accelerating machine learning models using a scikit-learn-like API.

cuML now has an accelerator mode (cuml.accel) which allows you to bring accelerated computing to existing workflows with zero code changes required. In addition to scikit-learn, cuml.accel also provides acceleration to algorithms found in umap-learn (UMAP) and hdbscan (HDBSCAN).

This notebook is a brief introduction to cuml.accel.
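There are a few ways to enable cuml.accel. This notebook calls cuml.accel.install() later on; as a quick sketch, you can also enable it from the command line or as a notebook extension (check the cuML documentation for the exact options supported by your version).

# Option 1: run an unmodified script under the accelerator from the command line
#   python -m cuml.accel my_sklearn_script.py
#
# Option 2: in a Jupyter/IPython session, load the extension before importing estimators
#   %load_ext cuml.accel
#
# Option 3: enable it programmatically (the approach used later in this notebook)
import cuml.accel
cuml.accel.install()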

⚠️ Verify your setup#

First, we’ll verify that we are running on an NVIDIA GPU:

!nvidia-smi  # this should display information about available GPUs

With classical machine learning, there is a wide range of interesting problems we can explore. In this tutorial we’ll examine 3 of the more popular use cases: classification, clustering, and dimensionality reduction.

Classification#

Let’s load a dataset and see how we can use scikit-learn to classify that data. For this example we’ll use the Covertype dataset, which contains a number of features that can be used to predict forest cover type, such as elevation, aspect, slope, and soil type.

More information on this dataset can be found at https://archive.ics.uci.edu/dataset/31/covertype.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
url = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz"
)

# Column names for the dataset (from UCI Covertype description)
columns = [
    "Elevation",
    "Aspect",
    "Slope",
    "Horizontal_Distance_To_Hydrology",
    "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways",
    "Hillshade_9am",
    "Hillshade_Noon",
    "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points",
    "Wilderness_Area1",
    "Wilderness_Area2",
    "Wilderness_Area3",
    "Wilderness_Area4",
    "Soil_Type1",
    "Soil_Type2",
    "Soil_Type3",
    "Soil_Type4",
    "Soil_Type5",
    "Soil_Type6",
    "Soil_Type7",
    "Soil_Type8",
    "Soil_Type9",
    "Soil_Type10",
    "Soil_Type11",
    "Soil_Type12",
    "Soil_Type13",
    "Soil_Type14",
    "Soil_Type15",
    "Soil_Type16",
    "Soil_Type17",
    "Soil_Type18",
    "Soil_Type19",
    "Soil_Type20",
    "Soil_Type21",
    "Soil_Type22",
    "Soil_Type23",
    "Soil_Type24",
    "Soil_Type25",
    "Soil_Type26",
    "Soil_Type27",
    "Soil_Type28",
    "Soil_Type29",
    "Soil_Type30",
    "Soil_Type31",
    "Soil_Type32",
    "Soil_Type33",
    "Soil_Type34",
    "Soil_Type35",
    "Soil_Type36",
    "Soil_Type37",
    "Soil_Type38",
    "Soil_Type39",
    "Soil_Type40",
    "Cover_Type",
]

data = pd.read_csv(url, header=None)
data.columns = columns
data.shape

Next, we’ll separate out the classification variable (Cover_Type) from the rest of the data. This is what we will aim to predict with our classification model. We can also split our dataset into training and test data using the scikit-learn train_test_split function.

X, y = data.drop("Cover_Type", axis=1), data["Cover_Type"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Now that we have our dataset split, we’re ready to train a model. To start, we’ll fit a random forest with scikit-learn, using a max depth of 5 and all of the features. Note that we set n_jobs=-1 to utilize all available CPU cores for fitting the trees, which ensures we get the best performance possible on our system’s CPU.

import time
# Start timing cpu
start_time_cpu = time.time()

clf = RandomForestClassifier(n_estimators=100, max_depth=5, max_features=1.0, n_jobs=-1)
clf.fit(X_train, y_train)

# End timing
end_time_cpu = time.time()
# Report CPU duration
print(f"CPU Training completed in {end_time_cpu - start_time_cpu:.2f} seconds")

In about 38 seconds, we were able to fit our tree model using scikit-learn. This is not bad! Let’s use the model we just trained to predict coverage types in our test dataset and take a look at the accuracy of our model.

y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

We can also print out a full classification report to better understand how well we predicted the different Cover_Type categories.

print(classification_report(y_test, y_pred))

With scikit-learn, we built a model that trained in under a minute. From the classification report, we can see that we predicted the correct class around 70% of the time, which is not bad but could certainly be improved.
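If you want to dig into which classes the model confuses most often, a confusion matrix gives a per-class breakdown. This is an optional aside that isn’t part of the original workflow; it only applies confusion_matrix from sklearn.metrics to the predictions we already computed.

from sklearn.metrics import confusion_matrix

# Rows are true Cover_Type classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)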

Now let’s load cuml.accel and try running the same code again to see what kind of acceleration we can get.

import cuml.accel

cuml.accel.install()

IMPORTANT: After calling cuml.accel.install(), we need to re-import the scikit-learn estimators we wish to use, so that we pick up the accelerated versions.

from sklearn.ensemble import RandomForestClassifier
# Start timing gpu
start_time_gpu = time.time()

clf = RandomForestClassifier(n_estimators=100, max_depth=5, max_features=1.0, n_jobs=-1)
clf.fit(X_train, y_train)

# End timing
end_time_gpu = time.time()
# Report GPU duration
print(f"GPU Training completed in {end_time_gpu - start_time_gpu:.2f} seconds")

That was much faster! Using cuML we’re able to train this random forest model in just 3.5 seconds, which is more than a 10x speedup. One thing to note is that cuML’s implementation of RandomForestClassifier doesn’t use the n_jobs parameter the way scikit-learn does, but it still accepts it, which makes it possible to use this accelerator with zero code changes.
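Since we kept timers around both training runs, we can compute the speedup on our own hardware directly from the durations recorded above. Exact numbers will vary by GPU, driver, and dataset.

# Compare the CPU and GPU training times recorded in the cells above
cpu_duration = end_time_cpu - start_time_cpu
gpu_duration = end_time_gpu - start_time_gpu
print(f"Speedup: {cpu_duration / gpu_duration:.1f}x")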

Let’s take a look at the classification report again to compare the model’s performance.

y_pred = clf.predict(X_test)
cr = classification_report(y_test, y_pred)
print(cr)

Out of the box, the model performed about the same as the scikit-learn implementation. Because this model trains so much faster, we can quickly iterate on the hyperparameter configuration and find a model that performs better while still enjoying excellent speedups.

# Start timing gpu max_depth 30
start_time_gpu_md30 = time.time()

clf = RandomForestClassifier(
    n_estimators=100, max_depth=30, max_features=1.0, n_jobs=-1
)
clf.fit(X_train, y_train)

# End timing
end_time_gpu_md30 = time.time()

# Report GPU duration
print(
    f"GPU Training with max_depth=30 completed in {end_time_gpu_md30 - start_time_gpu_md30:.2f} seconds"
)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

We just trained a much deeper model in a few seconds and got better accuracy. With a model that trains in seconds, we can perform hyperparameter optimization using a method like grid search (see the sketch below) and have results in minutes instead of hours.
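As an illustration that isn’t part of the original notebook, here is a minimal grid search sketch using scikit-learn’s GridSearchCV; with cuml.accel installed, each underlying fit runs on the GPU. The parameter values below are arbitrary examples, not tuned recommendations.

from sklearn.model_selection import GridSearchCV

# Hypothetical small grid; adjust the values for your own search
param_grid = {
    "max_depth": [10, 20, 30],
    "n_estimators": [50, 100],
}

search = GridSearchCV(
    RandomForestClassifier(max_features=1.0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)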

Resources#

For more information on getting started with cuml.accel, check out RAPIDS.ai or the cuML Docs.

Find more usage examples in the cuml_sklearn_demo notebook.