Training and Evaluating Machine Learning Models#

This notebook explores several basic machine learning estimators in cuML, demonstrating how to train them and evaluate them with built-in metric functions. All of the models are trained on synthetic data generated by cuML’s dataset utilities. The following estimators are covered:

Random Forest Classifier
UMAP
DBSCAN
Linear Regression

Classification#

Random Forest Classification and Accuracy metrics#

The Random Forest classification model builds several decision trees and aggregates their outputs to make a prediction. For more information on cuML’s implementation of the Random Forest Classification model, please refer to: https://docs.rapids.ai/api/cuml/stable/api.html#cuml.ensemble.RandomForestClassifier
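As a rough illustration of the aggregation step only, the pure-Python sketch below combines per-tree class votes by majority, which is one common aggregation rule. The `majority_vote` helper and the vote lists are hypothetical; cuML’s GPU implementation may aggregate tree outputs differently.

```python
from collections import Counter


def majority_vote(tree_votes):
    """Return the most common class label for each sample.

    tree_votes: list of per-tree prediction lists, all of equal length.
    """
    n_samples = len(tree_votes[0])
    aggregated = []
    for i in range(n_samples):
        # Count how many trees voted for each label on sample i
        votes = Counter(tree[i] for tree in tree_votes)
        # most_common(1) returns [(label, count)] for the top label
        aggregated.append(votes.most_common(1)[0][0])
    return aggregated


# Hypothetical votes from three trees on four samples
tree_votes = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]
forest_prediction = majority_vote(tree_votes)
print(forest_prediction)  # [0, 1, 1, 0]
```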

Accuracy score is the ratio of correct predictions to the total number of predictions. It is used to measure the performance of classification models. For more information on the accuracy score metric please refer to: https://en.wikipedia.org/wiki/Accuracy_and_precision

For more information on cuML’s implementation of accuracy score metrics please refer to: https://docs.rapids.ai/api/cuml/stable/api.html#cuml.metrics.accuracy.accuracy_score
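The definition of accuracy can be checked by hand on a tiny example. The sketch below is a plain-Python illustration of the ratio described above; the `accuracy` helper and the label lists are made up for this example and are not part of cuML.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 4 of the 5 predictions are correct
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(accuracy(y_true, y_pred))  # 0.8
```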

The cell below shows an end-to-end pipeline for the Random Forest Classification model. The dataset is generated with cuML’s make_classification function and split into train and test sets. The model is trained on the train set, predictions are made on the test set, and the performance is evaluated using cuML’s accuracy_score metric.

[1]:

from cuml.datasets.classification import make_classification
from cuml.ensemble import RandomForestClassifier
from cuml.metrics import accuracy_score
from cuml.model_selection import train_test_split

# Generate synthetic data (binary classification task)
X, y = make_classification(
    n_classes=2,
    n_features=10,
    n_samples=1000,
    random_state=0
)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Initialize and train the model
random_forest = RandomForestClassifier(
    max_depth=10, n_estimators=25, random_state=0
).fit(X_train, y_train)

# Make predictions
predictions = random_forest.predict(X_test)

# Evaluate performance
score = accuracy_score(y_test, predictions)
print("Accuracy: ", score)

Accuracy:  0.996

Clustering#

UMAP and Trustworthiness metrics#

UMAP is a non-linear dimensionality reduction algorithm. It can also be used for visualization. For additional information on the UMAP model, please refer to the documentation at https://docs.rapids.ai/api/cuml/stable/api.html#cuml.UMAP

Trustworthiness measures the extent to which the local structure of the data is retained in the embedding. A sample that appears among the nearest neighbors of a point in the embedding, but not among its nearest neighbors in the original space, is penalized in proportion to its rank in the original space. The score lies between 0 and 1, with higher values indicating better preservation. For more information on the trustworthiness metric please refer to: https://scikit-learn.org/dev/modules/generated/sklearn.manifold.t_sne.trustworthiness.html

The documentation for cuML’s implementation of the trustworthiness metric is: https://docs.rapids.ai/api/cuml/stable/api.html#cuml.metrics.trustworthiness.trustworthiness
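To make the penalty concrete, here is a minimal brute-force sketch of trustworthiness following the definition given in the scikit-learn documentation. The `trustworthiness_sketch` function and the toy points are hypothetical and intended for very small inputs only; the sketch assumes `n_neighbors` is less than half the number of samples and is not how cuML computes the metric on the GPU.

```python
import math


def _neighbor_order(points, i):
    """Indices of all other points, sorted by distance to point i."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))


def trustworthiness_sketch(X, X_embedded, n_neighbors=2):
    """Brute-force trustworthiness; assumes n_neighbors < len(X) / 2."""
    n = len(X)
    k = n_neighbors
    penalty = 0
    for i in range(n):
        # Rank (1 = closest) of every other point in the original space
        rank_in = {j: r for r, j in enumerate(_neighbor_order(X, i), start=1)}
        # Penalize embedding neighbors that are not among the k nearest
        # neighbors in the original space, by how far down the ranking they are
        for j in _neighbor_order(X_embedded, i)[:k]:
            penalty += max(0, rank_in[j] - k)
    return 1.0 - penalty * 2.0 / (n * k * (2 * n - 3 * k - 1))


# Hypothetical data: six points on a line
X_toy = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
# An embedding that keeps the ordering, and one that scrambles it
good_embedding = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
bad_embedding = [(0.0,), (5.0,), (1.0,), (4.0,), (2.0,), (3.0,)]

print(trustworthiness_sketch(X_toy, good_embedding))  # 1.0
print(trustworthiness_sketch(X_toy, bad_embedding))   # lower than 1.0
```

An embedding that preserves every neighborhood scores exactly 1.0, while the scrambled embedding is penalized for bringing distant points close together.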

The cell below shows an end-to-end pipeline for the UMAP model. Here, the blobs dataset is created using cuML’s make_blobs function to be used as the input. The model is trained with fit, the same data is projected to lower dimensions with transform, and the resulting embedding is evaluated using cuML’s trustworthiness function.

[2]:

from cuml.datasets import make_blobs
from cuml.manifold.umap import UMAP
from cuml.metrics import trustworthiness

# Generate synthetic blobs data
X_blobs, y_blobs = make_blobs(
    n_samples=1000,
    cluster_std=0.1,
    n_features=100,
    random_state=0,
)

# Initialize and train the UMAP model
umap = UMAP(n_neighbors=10).fit(X_blobs)

# Transform data to lower dimensions
X_embedded = umap.transform(X_blobs)

# Evaluate trustworthiness
score = trustworthiness(X_blobs, X_embedded)
print("Trustworthiness score: ", score)

Trustworthiness score:  0.8558887096774194

DBSCAN and Adjusted Rand Index#

DBSCAN is a popular density-based clustering algorithm. For additional information on the DBSCAN model, please refer to the documentation at https://docs.rapids.ai/api/cuml/stable/api.html#cuml.DBSCAN

We create the blobs dataset using cuML’s make_blobs function.

The adjusted Rand index is a metric that measures the similarity between two clusterings of the same data, adjusted to account for the agreement that would be expected by chance. For more information on the adjusted Rand index please refer to: https://en.wikipedia.org/wiki/Rand_index
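The chance adjustment can be sketched from the contingency counts of the two labelings. The `adjusted_rand_index` helper below is a hypothetical pure-Python illustration of the standard pair-counting formula, assuming at least two samples; it is not cuML’s implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from pair counts; assumes len(labels_true) >= 2."""
    n = len(labels_true)
    # Pairs of samples placed together in both labelings
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together in each labeling taken on its own
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    # Agreement expected by chance, and the maximum achievable agreement
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical labelings: same grouping with permuted label names
labels_true = [0, 0, 1, 1, 2, 2]
labels_same = [1, 1, 0, 0, 2, 2]
labels_diff = [0, 0, 0, 1, 1, 1]

print(adjusted_rand_index(labels_true, labels_same))  # 1.0
print(adjusted_rand_index(labels_true, labels_diff))  # lower than 1.0
```

The score depends only on which samples are grouped together, so renaming the cluster labels does not change it.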

The cell below shows an end-to-end pipeline for the DBSCAN model. The output of DBSCAN’s fit_predict is evaluated using cuML’s adjusted_rand_score function.

[3]:

from cuml.cluster.dbscan import DBSCAN
from cuml.datasets import make_blobs
from cuml.metrics.cluster import adjusted_rand_score

# Generate synthetic blobs data
X_blobs, y_blobs = make_blobs(
    n_samples=1000,
    n_features=100,
    cluster_std=0.1,
    random_state=0,
)

# Initialize the DBSCAN model
dbscan = DBSCAN(eps=3, min_samples=2)

# Fit the model and get cluster predictions in a single step
predictions = dbscan.fit_predict(X_blobs)

# Evaluate clustering quality
score = adjusted_rand_score(y_blobs, predictions)
print("Adjusted Rand index score: ", score)

Adjusted Rand index score:  1.0

Regression#

Linear regression and R^2 score#

Linear Regression is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.
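For a single predictor, the linear combination reduces to a slope and an intercept that can be fitted in closed form by least squares. The sketch below is a hypothetical pure-Python illustration of that special case; `fit_simple_linear_regression` is not a cuML function, and cuML’s LinearRegression handles many predictors with GPU solvers.

```python
def fit_simple_linear_regression(x, y):
    """Least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums (unnormalized)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noise-free data generated from y = 2x + 1
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_simple_linear_regression(x, y)
print(slope, intercept)  # 2.0 1.0
```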

The R^2 score, also known as the coefficient of determination, is a metric for scoring regression models. It measures the proportion of the total variation in the response that is explained by the model’s predictions; a score of 1.0 indicates a perfect fit. For more information on the R^2 score metric please refer to: https://en.wikipedia.org/wiki/Coefficient_of_determination

For more information on cuML’s implementation of the R^2 score metric please refer to: https://docs.rapids.ai/api/cuml/stable/api.html#cuml.metrics.regression.r2_score
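The coefficient of determination can be computed directly as one minus the ratio of the residual sum of squares to the total sum of squares. The `r2` helper and the values below are a hypothetical plain-Python sketch, assuming the true responses are not all identical; they are not cuML’s implementation.

```python
def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot; assumes y_true is not constant."""
    mean_y = sum(y_true) / len(y_true)
    # Variation left unexplained by the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total variation of the response around its mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical responses and predictions
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(r2(y_true, y_pred))  # 0.8
```

Perfect predictions give a score of 1.0, and always predicting the mean of the true responses gives 0.0.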

The cell below trains the Linear Regression model and evaluates its performance using cuML’s R² score metric. For more information on cuML’s implementation of the Linear Regression model please refer to: https://docs.rapids.ai/api/cuml/stable/api.html#linear-regression

[4]:

from cuml.datasets import make_regression
from cuml.linear_model import LinearRegression
from cuml.metrics import r2_score
from cuml.model_selection import train_test_split

# Generate synthetic regression data
X, y = make_regression(
    n_samples=2**10,
    n_features=100,
    n_informative=70,
    random_state=123,
)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

# Initialize and train the linear regression model.
# The `normalize` option was deprecated in 25.12 and will be removed in 26.02;
# to normalize the data, apply a `StandardScaler` before fitting instead.
linear_regression = LinearRegression(
    fit_intercept=True, algorithm="eig"
).fit(X_train, y_train)

# Make predictions
predictions = linear_regression.predict(X_test)

# Evaluate performance
score = r2_score(y_test, predictions)
print("R² score: ", score)

R² score:  1.0