Model Serialization and Persistence#

This notebook demonstrates how to save and load cuML models with pickle and joblib, and how to export them for cross-platform deployment through scikit-learn conversion, Treelite, and ONNX.

Security Warning#

Only unpickle or deserialize models from trusted sources.

The pickle module (and by extension joblib) is not secure. Malicious pickle data can execute arbitrary code during deserialization, potentially compromising your entire system.

⚠️ Never unpickle data that could have come from an untrusted source, or that could have been tampered with.

This warning applies to all serialization methods demonstrated in this notebook, including:

pickle.load() and pickle.loads()
joblib.load()
Any file-based model loading

For more information, see the Python pickle security documentation.
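
One way to reduce the risk of loading a tampered file is to record a checksum when the model is saved and compare it before unpickling. The sketch below uses only the standard library; the helper names `sha256_of_file` and `load_if_digest_matches` are hypothetical, and a plain dictionary stands in for a fitted model. A checksum only helps if the expected digest is stored or transmitted through a channel you trust, so it complements the warning above rather than replacing it.

```python
import hashlib
import hmac
import pickle
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_if_digest_matches(path, expected_digest):
    """Unpickle `path` only if its SHA-256 digest matches `expected_digest`."""
    actual = sha256_of_file(path)
    # compare_digest avoids leaking where the two strings first differ
    if not hmac.compare_digest(actual, expected_digest):
        raise ValueError(f"Digest mismatch for {path}; refusing to unpickle")
    with open(path, "rb") as f:
        return pickle.load(f)


# Demo with a stand-in object instead of a fitted model
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    with open(model_path, "wb") as f:
        pickle.dump({"n_clusters": 5}, f, protocol=5)
    trusted_digest = sha256_of_file(model_path)  # record this at save time

    restored = load_if_digest_matches(model_path, trusted_digest)

    # Simulate tampering: append a byte and confirm the load is refused
    with open(model_path, "ab") as f:
        f.write(b"\x00")
    try:
        load_if_digest_matches(model_path, trusted_digest)
        tamper_detected = False
    except ValueError:
        tamper_detected = True
```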

Single GPU Model Serialization#

All single-GPU cuML estimators support serialization using standard Python libraries. This section demonstrates:

Training a model on synthetic data
Saving the model using pickle and joblib
Loading the model for future use

Trained single-GPU models can also be used for distributed inference on Dask clusters, as shown in the Distributed Model Serialization section.

[1]:

from cuml.cluster import KMeans
from cuml.datasets import make_blobs

# Generate synthetic dataset for clustering
X, y = make_blobs(
    n_samples=50, n_features=10, centers=5, cluster_std=0.4, random_state=0
)
# Initialize and fit KMeans model
kmeans = KMeans(n_clusters=5).fit(X)

Recommendation: Use pickle protocol 5 for large arrays and models. Protocol 5 (available since Python 3.8) supports out-of-band data buffers, which can avoid extra memory copies when serializing NumPy arrays and cuML models with large parameter sets.

[2]:

import pickle

# Save the fitted model to disk
with open("kmeans_model.pkl", "wb") as output_file:
    pickle.dump(kmeans, output_file, protocol=5)
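
To illustrate what protocol 5 adds, the following sketch pickles a NumPy array twice: once in-band, and once with a `buffer_callback` so that the array data is handed over as a separate buffer. The NumPy array is only a stand-in for a large model attribute; whether a particular cuML estimator benefits in the same way depends on how its state is pickled, which is an assumption not verified here.

```python
import pickle

import numpy as np

# Stand-in for a large model attribute such as cluster centers
centers = np.random.default_rng(0).random((1000, 100), dtype=np.float32)

# In-band: the array bytes are embedded in the pickle stream
in_band = pickle.dumps(centers, protocol=5)

# Out-of-band: the pickler passes large buffers to the callback instead
buffers = []
out_of_band = pickle.dumps(centers, protocol=5, buffer_callback=buffers.append)

# The same buffers must be supplied, in the same order, when loading
restored = pickle.loads(out_of_band, buffers=buffers)

print(len(in_band), len(out_of_band), len(buffers))
```

The out-of-band stream holds only metadata, so the array data can be stored or transferred separately without being copied into the pickle bytes.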

Important: The model can be restored using pickle, but requires the same cuML version used for training. If you need to load models across different cuML versions, consider using the scikit-learn conversion approach instead.

[3]:

# Load the model from disk
with open("kmeans_model.pkl", "rb") as input_file:
    kmeans_loaded_model = pickle.load(input_file)

# Display the loaded model's cluster centers
kmeans_loaded_model.cluster_centers_

[3]:

array([[-3.0099041 ,  4.276852  , -4.349393  ,  2.3583264 ,  1.6495332 ,
        -2.6076655 , -5.1556077 , -1.7364908 , -7.954574  ,  2.6284964 ],
       [-4.3782754 ,  5.673881  , -5.8093653 , -1.7014151 , -9.236413  ,
         0.67963   ,  4.479341  , -3.0451188 , -5.0462966 ,  9.735213  ],
       [ 4.9611325 ,  8.417621  , -9.079699  ,  9.357526  ,  8.626832  ,
        -1.2055869 ,  3.361048  , -7.826317  , -0.6502562 ,  0.46297705],
       [ 5.5782743 , -3.9382834 ,  4.119426  , -2.667007  ,  3.6268754 ,
        -4.263903  , -3.6590233 ,  6.3893604 , -7.029616  , -1.2048612 ],
       [-5.8457527 ,  2.362199  , -3.8079762 , -1.5414352 , -5.348189  ,
         7.406154  ,  2.9328313 ,  8.409007  ,  1.5656946 ,  1.231472  ]],
      dtype=float32)
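
Because a pickled model must be loaded with the same cuML version that saved it, it can help to record that version next to the file and check it before loading. The sketch below is one possible approach using a JSON sidecar file; the helper names, the `.meta.json` suffix, and the version strings are hypothetical, and a dictionary stands in for a fitted model. In a real workflow the version argument would be `cuml.__version__`.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_with_version(model, path, library_version):
    """Pickle `model` and record the library version in a JSON sidecar."""
    path = Path(path)
    with open(path, "wb") as f:
        pickle.dump(model, f, protocol=5)
    sidecar = path.with_name(path.name + ".meta.json")
    sidecar.write_text(json.dumps({"library_version": library_version}))


def load_with_version_check(path, current_version):
    """Unpickle `path` only if the recorded version equals `current_version`."""
    path = Path(path)
    sidecar = path.with_name(path.name + ".meta.json")
    saved_version = json.loads(sidecar.read_text())["library_version"]
    if saved_version != current_version:
        raise RuntimeError(
            f"Model was saved with version {saved_version}, "
            f"but version {current_version} is installed"
        )
    with open(path, "rb") as f:
        return pickle.load(f)


# Demo with a stand-in object and placeholder version strings
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "kmeans_model.pkl"
    save_with_version({"n_clusters": 5}, model_path, "1.0.0")
    restored = load_with_version_check(model_path, "1.0.0")
    try:
        load_with_version_check(model_path, "2.0.0")
        mismatch_raised = False
    except RuntimeError:
        mismatch_raised = True
```

Keeping the version outside the pickle means the check runs before any model bytes are deserialized.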

Using joblib for Model Serialization#

joblib is an optimized alternative to pickle for machine learning models, offering:

Better performance for large NumPy arrays and cuML models
Efficient compression for models with many parameters
Memory mapping for faster loading of large models
Optimized serialization specifically designed for ML workloads

Note: While pickle and joblib files are often compatible, we recommend using the same library for both saving and loading to ensure reliability.

[4]:

import joblib

joblib.dump(kmeans, "kmeans_model.joblib")

[4]:

['kmeans_model.joblib']

Then reload the model with joblib.

[5]:

kmeans_loaded_model = joblib.load("kmeans_model.joblib")
kmeans_loaded_model.cluster_centers_

[5]:

array([[-3.0099041 ,  4.276852  , -4.349393  ,  2.3583264 ,  1.6495332 ,
        -2.6076655 , -5.1556077 , -1.7364908 , -7.954574  ,  2.6284964 ],
       [-4.3782754 ,  5.673881  , -5.8093653 , -1.7014151 , -9.236413  ,
         0.67963   ,  4.479341  , -3.0451188 , -5.0462966 ,  9.735213  ],
       [ 4.9611325 ,  8.417621  , -9.079699  ,  9.357526  ,  8.626832  ,
        -1.2055869 ,  3.361048  , -7.826317  , -0.6502562 ,  0.46297705],
       [ 5.5782743 , -3.9382834 ,  4.119426  , -2.667007  ,  3.6268754 ,
        -4.263903  , -3.6590233 ,  6.3893604 , -7.029616  , -1.2048612 ],
       [-5.8457527 ,  2.362199  , -3.8079762 , -1.5414352 , -5.348189  ,
         7.406154  ,  2.9328313 ,  8.409007  ,  1.5656946 ,  1.231472  ]],
      dtype=float32)
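
The compression and memory-mapping features listed above are controlled by the `compress` argument of `joblib.dump` and the `mmap_mode` argument of `joblib.load`. The sketch below shows both on a plain NumPy array used as a stand-in for model parameters; how much of this carries over to a given cuML estimator depends on how its state is stored, which is an assumption here. The file names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np

# Stand-in for a model with a large, repetitive parameter array
params = np.tile(np.arange(100, dtype=np.float32), 10_000)

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "params.joblib")
    compressed_path = os.path.join(tmp, "params_compressed.joblib")

    joblib.dump(params, raw_path)  # uncompressed
    joblib.dump(params, compressed_path, compress=3)  # zlib, level 3

    raw_size = os.path.getsize(raw_path)
    compressed_size = os.path.getsize(compressed_path)

    # Memory mapping only applies to uncompressed files
    mapped = joblib.load(raw_path, mmap_mode="r")
    is_memmap = isinstance(mapped, np.memmap)
    first_values = np.asarray(mapped[:3]).tolist()
    del mapped  # release the mapping before the directory is removed

    restored = joblib.load(compressed_path)

print(raw_size, compressed_size, is_memmap)
```

Compression trades CPU time for disk space, while memory mapping avoids reading the whole file up front, so the two options suit different situations and cannot be combined on the same file.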

Distributed Model Serialization#

When working with distributed cuML models using Dask, the distributed estimator wrappers in cuml.dask are not designed to be pickled directly. Instead, cuML provides a specialized workflow:

Workflow Steps#

Extract the combined model: Use get_combined_model() to extract a single-GPU version of the trained distributed model
Serialize the combined model: Save the extracted model using pickle or joblib (same as any cuML model)
Flexible inference: Use the saved model in multiple ways:
- Single-GPU inference: Load directly for single-GPU predictions
- Distributed inference: Use ParallelPostFit from Dask-ML to distribute inference across a Dask cluster

This approach allows you to choose the optimal resources for both training and inference phases.

[6]:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Set up Dask cluster
cluster = LocalCUDACluster()
client = Client(cluster)

[7]:

from cuml.dask.datasets import make_blobs
from cuml.dask.cluster import KMeans as DistributedKMeans

# Get number of workers
n_workers = client.scheduler_info()["n_workers"]

# Generate distributed dataset
X, y = make_blobs(
    n_samples=5000,
    n_features=30,
    centers=5,
    cluster_std=0.4,
    random_state=0,
    # 5 parts per worker to demonstrate distributed inference
    n_parts=n_workers * 5,
)

# Initialize and train the distributed KMeans model
distributed_kmeans = DistributedKMeans(n_clusters=5).fit(X)

The distributed model cannot be pickled directly. First combine it into a single-GPU model with get_combined_model(), then save that model with pickle as before.

[8]:

# Extract single-GPU model and save it
combined_kmeans = distributed_kmeans.get_combined_model()

with open("kmeans_model.pkl", "wb") as output_file:
    pickle.dump(combined_kmeans, output_file, protocol=5)

And we can reload this model just like before.

[9]:

# Load the single-GPU model
with open("kmeans_model.pkl", "rb") as input_file:
    combined_kmeans_loaded_model = pickle.load(input_file)

# Display the first 3 rows of the loaded model's cluster centers
combined_kmeans_loaded_model.cluster_centers_[:3]

[9]:

array([[ 4.8197994 ,  8.427121  , -9.216296  ,  9.382446  ,  8.491983  ,
        -1.0688543 ,  3.334123  , -7.809992  , -0.5915905 ,  0.26288542,
         5.514851  , -4.099182  ,  4.2884808 , -2.842087  ,  3.6343296 ,
        -4.1257997 , -3.5992353 ,  6.211793  , -6.920471  , -1.0867234 ,
        -5.840083  ,  2.2231956 , -3.8685322 , -1.7113882 , -5.3163567 ,
         7.6082354 ,  2.8888006 ,  8.524052  ,  1.5774297 ,  1.0878682 ],
       [-2.8834434 ,  4.4329147 , -4.4221015 ,  2.3741193 ,  1.7598952 ,
        -2.500677  , -5.195792  , -1.7164668 , -8.128957  ,  2.6508052 ,
        -4.2964835 ,  5.578642  , -5.7345295 , -1.749918  , -9.330766  ,
         0.7207611 ,  4.4164    , -2.928805  , -4.9428267 ,  9.708277  ,
         8.405027  , -6.253137  , -6.3673425 ,  1.9556513 ,  4.1646323 ,
        -9.158217  ,  4.611632  ,  8.79964   ,  6.865229  ,  2.2170906 ],
       [-6.942454  , -9.765982  , -6.5323243 , -0.43099803,  6.116464  ,
         3.72713   , -3.9549985 ,  6.1494184 , -1.8595839 ,  5.033421  ,
        -6.873248  ,  1.3255422 ,  9.020514  , -0.9951003 ,  9.649452  ,
         9.786568  , -8.636811  ,  5.990376  ,  2.1981692 , -3.6432    ,
         7.0768757 , -7.367705  , -5.299161  , -6.97071   , -7.934944  ,
         6.681906  , -5.602813  ,  7.1699843 ,  6.5931764 , -8.328606  ]],
      dtype=float32)
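
The distributed-inference option from the workflow steps is not exercised by the cells above. A minimal sketch of how it might look is given below, assuming dask-ml is installed and a Dask client is already connected. The helper name `predict_on_cluster` is hypothetical, and depending on the array backend the `predict_meta` argument of `ParallelPostFit` may also be needed; treat this as a starting point rather than a verified recipe.

```python
def predict_on_cluster(fitted_model, X_dask):
    """Wrap a fitted single-GPU model so predict() runs per partition.

    Assumes dask-ml is installed and a Dask client is already connected.
    """
    # Imported lazily so this function can be defined where dask-ml is absent
    from dask_ml.wrappers import ParallelPostFit

    wrapped = ParallelPostFit(estimator=fitted_model)
    # Returns a lazy Dask collection; call .compute() to materialize it
    return wrapped.predict(X_dask)
```

With the names used in the cells above, a call would look like `predict_on_cluster(combined_kmeans_loaded_model, X).compute()`.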

Converting Between cuML and scikit-learn Models#

Many cuML estimators provide as_sklearn() and from_sklearn() methods for seamless conversion between cuML and scikit-learn formats.

Use Cases#

Cross-platform deployment: Train on GPU systems, deploy on CPU-only machines
Maximum compatibility: Use standard scikit-learn serialization tools
Hybrid workflows: Mix cuML and scikit-learn in the same pipeline
Legacy integration: Convert existing scikit-learn models to cuML for GPU acceleration

This approach eliminates the need to install cuML on deployment machines while maintaining model compatibility.

[10]:

import pickle

from cuml.cluster import KMeans
from cuml.datasets import make_blobs
from cuml.metrics.cluster import adjusted_rand_score

# Generate synthetic dataset for clustering
X, y = make_blobs(
    n_samples=1000, n_features=20, centers=5, cluster_std=0.5, random_state=42
)

# Train cuML KMeans
kmeans = KMeans(n_clusters=5, random_state=42).fit(X)

# Make predictions with cuML model
predictions = kmeans.predict(X)
score = adjusted_rand_score(y, predictions)
print(f"cuML KMeans ARI score: {score:.4f}")
print(f"cuML KMeans cluster centers shape: {kmeans.cluster_centers_.shape}")

cuML KMeans ARI score: 1.0000
cuML KMeans cluster centers shape: (5, 20)

We can convert this cuML model into a native scikit-learn estimator using the as_sklearn() method. This enables standard scikit-learn serialization and deployment on any Python environment.

[11]:

# Convert cuML model to scikit-learn model
kmeans_sklearn = kmeans.as_sklearn()
print(f"Converted to scikit-learn model: {type(kmeans_sklearn)}")

# Save scikit-learn model to disk
with open("kmeans_model_sklearn.pkl", "wb") as output_file:
    pickle.dump(kmeans_sklearn, output_file, protocol=5)
print("scikit-learn KMeans model saved with pickle")

Converted to scikit-learn model: <class 'sklearn.cluster._kmeans.KMeans'>
scikit-learn KMeans model saved with pickle

The pickled scikit-learn model can be loaded and executed on any Python environment with only scikit-learn installed – no cuML or GPU required.

⚠️ Security reminder: Only load pickle files from trusted sources. See Security Warning above.

[12]:

from cupy import asnumpy

# Load scikit-learn model and verify prediction quality
with open("kmeans_model_sklearn.pkl", "rb") as input_file:
    kmeans_loaded_sklearn = pickle.load(input_file)
sklearn_predictions = kmeans_loaded_sklearn.predict(asnumpy(X))
sklearn_score = adjusted_rand_score(y, sklearn_predictions)
print(f"Loaded sklearn KMeans ARI score: {sklearn_score:.4f}")

Loaded sklearn KMeans ARI score: 1.0000

You can also reconstruct a cuML model from a scikit-learn model using from_sklearn(). This is particularly useful for:

Pre-trained models: Convert existing scikit-learn models for GPU acceleration
Performance optimization: Run faster inference on GPU hardware
Hybrid workflows: Switch between CPU and GPU execution as needed

[13]:

# Re-construct the cuML model from the scikit-learn model
kmeans_from_sklearn = KMeans.from_sklearn(kmeans_loaded_sklearn)
predictions = kmeans_from_sklearn.predict(X)
print("Re-constructed cuML KMeans ARI Score: ", adjusted_rand_score(y, predictions))

Re-constructed cuML KMeans ARI Score:  1.0

Exporting Random Forest Models for CPU-Only Deployment#

You can export cuML Random Forest models for deployment on machines without NVIDIA GPUs using the Treelite library.

Benefits#

CPU-only deployment: Run trained models on any machine
Optimized inference: Treelite provides highly optimized CPU inference
Small footprint: No cuML or GPU dependencies required
Production ready: Efficient serialization and fast loading

Export Process#

Convert to Treelite format: Use as_treelite() to transform your cuML Random Forest model
Serialize the model: Call .serialize() to create a portable checkpoint file
Deploy anywhere: Install Treelite on the target machine and load the model for inference

[14]:

import numpy as np
from cuml.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load and prepare iris dataset
X, y = load_iris(return_X_y=True)
X, y = X.astype(np.float32), y.astype(np.int32)

# Train Random Forest model
random_forest = RandomForestClassifier(
    max_depth=3, random_state=0, n_estimators=10
).fit(X, y)

# Export cuML RF model as Treelite checkpoint
treelite_checkpoint_path = "./checkpoint.tl"
random_forest.as_treelite().serialize(treelite_checkpoint_path)

Deployment Steps#

Copy the checkpoint file: Transfer checkpoint.tl to your target machine
Install Treelite: Run pip install treelite or conda install -c conda-forge treelite
- No NVIDIA GPUs required
- No cuML installation needed
Load and use the model: Run the code below on the target machine

[15]:

import treelite

# Load the Treelite model (checkpoint file has been copied over)
treelite_checkpoint_path = "./checkpoint.tl"
treelite_model = treelite.Model.deserialize(treelite_checkpoint_path)

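# Make predictions using Treelite.
# X is the feature matrix from the training cell above; on the target
# machine, supply your own array with the same number of features.
predictions = treelite.gtil.predict(treelite_model, X, pred_margin=True)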
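
The array returned by `treelite.gtil.predict` holds per-class scores, not class labels. The sketch below shows one way to turn such scores into labels, assuming the output has shape `(n_rows, n_targets, n_classes)` as in recent Treelite releases. That shape is an assumption, so inspect `predictions.shape` before relying on it. The helper name `scores_to_labels` is hypothetical, and the scores are synthetic.

```python
import numpy as np


def scores_to_labels(scores):
    """Collapse scores of shape (n_rows, n_targets, n_classes) to labels."""
    scores = np.asarray(scores)
    if scores.ndim != 3:
        raise ValueError(f"expected a 3D array, got shape {scores.shape}")
    # Single-target model: take target 0 and pick the highest-scoring class
    return scores[:, 0, :].argmax(axis=1)


# Synthetic scores for 3 rows, 1 target, 3 classes
example_scores = np.array(
    [[[0.9, 0.1, 0.0]], [[0.2, 0.7, 0.1]], [[0.1, 0.2, 0.7]]],
    dtype=np.float32,
)
example_labels = scores_to_labels(example_scores)
```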

Exporting to ONNX#

cuML models can be exported to the ONNX format using sklearn-onnx. Use as_sklearn() to convert the cuML model to a scikit-learn estimator, then pass it to skl2onnx.convert_sklearn(). The resulting .onnx file can be loaded with ONNX Runtime for inference on both CPU and GPU, with no cuML dependency at inference time.

Not all estimators are supported by sklearn-onnx — see the supported model list for details. UMAP, HDBSCAN, and some other estimators are not supported.

Under cuml.accel, proxy objects are recognized by sklearn-onnx directly without needing as_sklearn(). See the cuml.accel ONNX example for details.

Note: due to a known upstream incompatibility in onnxruntime version 1.26.0, predictions from a serialized RandomForestClassifier will be incorrect (see microsoft/onnxruntime#28557). We recommend using version 1.25.1 until the issue is resolved.
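
A deployment script can guard against the affected release mentioned in the note above by checking the installed version before running inference. The sketch below is one way to do this with the standard library only; the helper names are hypothetical, and the set of affected versions contains only the release named in the note.

```python
import re

# Releases reported as affected in the note above
KNOWN_BAD_ONNXRUNTIME = {(1, 26, 0)}


def parse_release(version):
    """Parse the leading 'X.Y.Z' of a version string into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_known_bad_onnxruntime(version):
    """Return True if `version` is listed as affected."""
    return parse_release(version) in KNOWN_BAD_ONNXRUNTIME


affected = is_known_bad_onnxruntime("1.26.0")
recommended_ok = not is_known_bad_onnxruntime("1.25.1")
```

On a machine with ONNX Runtime installed, the check would be called as `is_known_bad_onnxruntime(onnxruntime.__version__)`.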

Export Process#

Train with cuML, convert to scikit-learn with as_sklearn(), then export to ONNX.

[16]:

import numpy as np
from cuml.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Load and prepare dataset (ONNX requires float32 input)
X, y = load_iris(return_X_y=True)
X = X.astype(np.float32)
y = y.astype(np.int32)

# Train with cuML
clf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
clf.fit(X, y)

# Convert to scikit-learn, then to ONNX.
# zipmap=False returns class probabilities as a 2D array instead of a list
# of dicts. This option only applies to classifiers with predict_proba.
sklearn_clf = clf.as_sklearn()
initial_type = [("float_input", FloatTensorType([None, X.shape[1]]))]
onnx_model = convert_sklearn(
    sklearn_clf, initial_types=initial_type, options={"zipmap": False}
)

onnx_path = "./rf_classifier.onnx"
with open(onnx_path, "wb") as f:
    f.write(onnx_model.SerializeToString())

print(f"ONNX model saved to {onnx_path}")

ONNX model saved to ./rf_classifier.onnx

Inference with ONNX Runtime#

The saved .onnx file can be loaded on any machine with onnxruntime installed — no cuML or sklearn-onnx needed. Use onnxruntime for CPU inference, or onnxruntime-gpu for GPU inference.

[17]:

import onnxruntime as ort

sess = ort.InferenceSession("./rf_classifier.onnx")
input_name = sess.get_inputs()[0].name
onnx_results = sess.run(None, {input_name: X})

onnx_predictions = onnx_results[0]
onnx_probabilities = onnx_results[1]

print(f"Predictions shape: {onnx_predictions.shape}")
print(f"Probabilities shape: {onnx_probabilities.shape}")
print(f"First 5 predictions: {onnx_predictions[:5]}")

Predictions shape: (150,)
Probabilities shape: (150, 3)
First 5 predictions: [0 0 0 0 0]