Training and Evaluating Machine Learning Models#

This notebook explores several basic machine learning estimators in cuML, demonstrating how to train them and evaluate them with built-in metrics functions. All of the models are trained on synthetic data, generated by cuML’s dataset utilities.

  1. Random Forest Classifier

  2. UMAP

  3. DBSCAN

  4. Linear Regression

Shared Library Imports#

[1]:
import cuml
from cupy import asnumpy
from joblib import dump, load

1. Classification#

Random Forest Classification and Accuracy metrics#

The Random Forest algorithm classification model builds several decision trees, and aggregates each of their outputs to make a prediction. For more information on cuML’s implementation of the Random Forest Classification model please refer to : https://docs.rapids.ai/api/cuml/stable/api.html#cuml.ensemble.RandomForestClassifier

Accuracy score is the ratio of correct predictions to the total number of predictions. It is used to measure the performance of classification models. For more information on the accuracy score metric please refer to: https://en.wikipedia.org/wiki/Accuracy_and_precision

For more information on cuML’s implementation of accuracy score metrics please refer to: https://docs.rapids.ai/api/cuml/stable/api.html#cuml.metrics.accuracy.accuracy_score

The cell below shows an end to end pipeline of the Random Forest Classification model. Here the dataset was generated by using sklearn’s make_classification dataset. The generated dataset was used to train and run predict on the model. Random forest’s performance is evaluated and then compared between the values obtained from the cuML and sklearn accuracy metrics.

[2]:
from cuml.datasets.classification import make_classification
from cuml.model_selection import train_test_split
from cuml.ensemble import RandomForestClassifier as cuRF
from sklearn.metrics import accuracy_score

# synthetic dataset dimensions
n_samples = 1000
n_features = 10
n_classes = 2

# random forest depth and size
n_estimators = 25
max_depth = 10

# generate synthetic data [ binary classification task ]
X, y = make_classification ( n_classes = n_classes,
                             n_features = n_features,
                             n_samples = n_samples,
                             random_state = 0 )

X_train, X_test, y_train, y_test = train_test_split( X, y, random_state = 0 )

model = cuRF( max_depth = max_depth,
              n_estimators = n_estimators,
              random_state  = 0 )

trained_RF = model.fit ( X_train, y_train )

predictions = model.predict ( X_test )

cu_score = cuml.metrics.accuracy_score( y_test, predictions )
sk_score = accuracy_score( asnumpy( y_test ), asnumpy( predictions ) )

print( " cuml accuracy: ", cu_score )
print( " sklearn accuracy : ", sk_score )

# save
dump( trained_RF, 'RF.model')

# to reload the model uncomment the line below
loaded_model = load('RF.model')
/opt/conda/envs/docs/lib/python3.12/site-packages/cuml/internals/api_decorators.py:344: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams=1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_state is set
  return func(**kwargs)
 cuml accuracy:  0.9959999918937683
 sklearn accuracy :  0.996

Clustering#

UMAP and Trustworthiness metrics#

UMAP is a dimensionality reduction algorithm which performs non-linear dimension reduction. It can also be used for visualization. For additional information on the UMAP model please refer to the documentation on https://docs.rapids.ai/api/cuml/stable/api.html#cuml.UMAP

Trustworthiness is a measure of the extent to which the local structure is retained in the embedding of the model. Therefore, if a sample predicted by the model lay within the unexpected region of the nearest neighbors, then those samples would be penalized. For more information on the trustworthiness metric please refer to: https://scikit-learn.org/dev/modules/generated/sklearn.manifold.t_sne.trustworthiness.html

the documentation for cuML’s implementation of the trustworthiness metric is: https://docs.rapids.ai/api/cuml/stable/api.html#cuml.metrics.trustworthiness.trustworthiness

The cell below shows an end to end pipeline of UMAP model. Here, the blobs dataset is created by cuml’s equivalent of make_blobs function to be used as the input. The output of UMAP’s fit_transform is evaluated using the trustworthiness function. The values obtained by sklearn and cuml’s trustworthiness are compared below.

[3]:
from cuml.datasets import make_blobs
from cuml.manifold.umap import UMAP as cuUMAP
from sklearn.manifold import trustworthiness
import numpy as np

n_samples = 1000
n_features = 100
cluster_std = 0.1

X_blobs, y_blobs = make_blobs( n_samples = n_samples,
                               cluster_std = cluster_std,
                               n_features = n_features,
                               random_state = 0,
                               dtype=np.float32 )

trained_UMAP = cuUMAP( n_neighbors = 10 ).fit( X_blobs )
X_embedded = trained_UMAP.transform( X_blobs )

cu_score = cuml.metrics.trustworthiness( X_blobs, X_embedded )
sk_score = trustworthiness( asnumpy( X_blobs ),  asnumpy( X_embedded ) )

print(" cuml's trustworthiness score : ", cu_score )
print(" sklearn's trustworthiness score : ", sk_score )

# save
dump( trained_UMAP, 'UMAP.model')

# to reload the model uncomment the line below
# loaded_model = load('UMAP.model')
 cuml's trustworthiness score :  0.8491381048387097
 sklearn's trustworthiness score :  0.8491379032258064
[3]:
['UMAP.model']

DBSCAN and Adjusted Random Index#

DBSCAN is a popular and a powerful clustering algorithm. For additional information on the DBSCAN model please refer to the documentation on https://docs.rapids.ai/api/cuml/stable/api.html#cuml.DBSCAN

We create the blobs dataset using the cuml equivalent of make_blobs function.

Adjusted random index is a metric which is used to measure the similarity between two data clusters, and it is adjusted to take into consideration the chance grouping of elements. For more information on Adjusted random index please refer to: https://en.wikipedia.org/wiki/Rand_index

The cell below shows an end to end model of DBSCAN. The output of DBSCAN’s fit_predict is evaluated using the Adjusted Random Index function. The values obtained by sklearn and cuml’s adjusted random metric are compared below.

[4]:
from cuml.datasets import make_blobs
from cuml import DBSCAN as cumlDBSCAN
from sklearn.metrics import adjusted_rand_score
import numpy as np

n_samples = 1000
n_features = 100
cluster_std = 0.1

X_blobs, y_blobs = make_blobs( n_samples = n_samples,
                               n_features = n_features,
                               cluster_std = cluster_std,
                               random_state = 0,
                               dtype=np.float32 )

cuml_dbscan = cumlDBSCAN( eps = 3,
                          min_samples = 2)

trained_DBSCAN = cuml_dbscan.fit( X_blobs )

cu_y_pred = trained_DBSCAN.fit_predict ( X_blobs )

cu_adjusted_rand_index = cuml.metrics.cluster.adjusted_rand_score( y_blobs, cu_y_pred )
sk_adjusted_rand_index = adjusted_rand_score( asnumpy(y_blobs), asnumpy(cu_y_pred) )

print(" cuml's adjusted random index score : ", cu_adjusted_rand_index)
print(" sklearn's adjusted random index score : ", sk_adjusted_rand_index)

# save and optionally reload
dump( trained_DBSCAN, 'DBSCAN.model')

# to reload the model uncomment the line below
# loaded_model = load('DBSCAN.model')
 cuml's adjusted random index score :  1.0
 sklearn's adjusted random index score :  1.0
[4]:
['DBSCAN.model']

Regression#

Linear regression and R^2 score#

Linear Regression is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.

R^2 score is also known as the coefficient of determination. It is used as a metric for scoring regression models. It scores the output of the model based on the proportion of total variation of the model. For more information on the R^2 score metrics please refer to: https://en.wikipedia.org/wiki/Coefficient_of_determination

For more information on cuML’s implementation of the r2 score metrics please refer to : https://docs.rapids.ai/api/cuml/stable/api.html#cuml.metrics.regression.r2_score

The cell below uses the Linear Regression model to compare the results between cuML and sklearn trustworthiness metric. For more information on cuML’s implementation of the Linear Regression model please refer to : https://docs.rapids.ai/api/cuml/stable/api.html#linear-regression

[5]:
from cuml.datasets import make_regression
from cuml.model_selection import train_test_split
from cuml.linear_model import LinearRegression as cuLR
from sklearn.metrics import r2_score

n_samples = 2**10
n_features = 100
n_info = 70

X_reg, y_reg = make_regression( n_samples = n_samples,
                                n_features = n_features,
                                n_informative = n_info,
                                random_state = 123 )

X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split( X_reg,
                                                                     y_reg,
                                                                     train_size = 0.8,
                                                                     random_state = 10 )
cuml_reg_model = cuLR( fit_intercept = True,
                       normalize = True,
                       algorithm = 'eig' )

trained_LR = cuml_reg_model.fit( X_reg_train, y_reg_train )
cu_preds = trained_LR.predict( X_reg_test )

cu_r2 = cuml.metrics.r2_score( y_reg_test, cu_preds )
sk_r2 = r2_score( asnumpy( y_reg_test ), asnumpy( cu_preds ) )

print("cuml's r2 score : ", cu_r2)
print("sklearn's r2 score : ", sk_r2)

# save and reload
dump( trained_LR, 'LR.model')

# to reload the model uncomment the line below
# loaded_model = load('LR.model')
cuml's r2 score :  1.0
sklearn's r2 score :  1.0
/opt/conda/envs/docs/lib/python3.12/site-packages/cuml/internals/api_decorators.py:382: UserWarning: Starting from version 23.08, the new 'copy_X' parameter defaults to 'True', ensuring a copy of X is created after passing it to fit(), preventing any changes to the input, but with increased memory usage. This represents a change in behavior from previous versions. With `copy_X=False` a copy might still be created if necessary. Explicitly set 'copy_X' to either True or False to suppress this warning.
  return init_func(self, *args, **filtered_kwargs)
[5]:
['LR.model']