Getting Started with Optuna and RAPIDS for HPO#

Hyperparameter optimization (HPO) automates the process of picking values for the hyperparameters of a machine learning algorithm to improve model performance. This can help boost the model accuracy, but can be resource-intensive, as it may require training the model for hundreds of hyperparameter combinations. Let’s take a look at how we can use Optuna and RAPIDS to make HPO less time-consuming.

RAPIDS#

The RAPIDS framework provides a suite of libraries to execute end-to-end data science pipelines entirely on GPUs. One of the libraries in this framework is cuML, which implements common machine learning models with a scikit-learn-compatible API and a GPU-accelerated backend. You can learn more about RAPIDS here.

Optuna#

Optuna is a lightweight framework for automatic hyperparameter optimization. It provides a define-by-run API, which makes it easy to adapt to existing code, keeps the setup modular, and allows hyperparameter search spaces to be constructed dynamically. As we'll see in this notebook, simply wrapping the objective function with Optuna lets us run a parallel, distributed HPO search over that space.

In this notebook, we'll use the BNP Paribas Cardif Claims Management dataset from Kaggle to predict whether a claim will receive accelerated approval. We'll explore how to use Optuna with RAPIDS, in combination with Dask, to run multi-GPU HPO experiments that can yield results faster than CPU-only runs.

## Run this cell to install optuna
#!pip install optuna optuna-integration
import cudf
import optuna
from cuml import LogisticRegression
from cuml.metrics import log_loss
from cuml.model_selection import train_test_split
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster

Set up CUDA Cluster#

We start a local cluster and keep it ready for running distributed tasks with Dask. The Dask scheduler can help leverage multiple nodes available on the cluster.

LocalCUDACluster launches one Dask worker for each GPU on the current system. It's developed as part of the RAPIDS project.

# This will use all GPUs on the local host by default
cluster = LocalCUDACluster(threads_per_worker=1, ip="", dashboard_address="8081")
c = Client(cluster)

# Query the client for all connected workers
workers = c.has_what().keys()
n_workers = len(workers)
c

Loading the data#

Data Acquisition#

The dataset can be acquired from Kaggle: BNP Paribas Cardif Claims Management. To download the dataset:

  1. Follow the instructions here to set up the Kaggle API.

  2. Run the following to download the data

mkdir -p ./data

kaggle competitions download \
  -c bnp-paribas-cardif-claims-management \
  --path ./data

unzip \
  -d ./data \
  ./data/bnp-paribas-cardif-claims-management.zip

This is an anonymized dataset containing categorical and numerical values for claims received by BNP Paribas Cardif. The “target” column in the train set is the variable to predict. It is equal to 1 for claims suitable for an accelerated approval. The task is to predict whether a claim will be suitable for accelerated approval or not. We’ll only use the train.csv.zip file as test.csv.zip does not have a target column.

import os

file_name = "train.csv.zip"

data_dir = "data/"
INPUT_FILE = os.path.join(data_dir, file_name)

Set N_TRIALS to the number of HPO trials to run.

N_TRIALS = 150

df = cudf.read_csv(INPUT_FILE)

# Drop ID column
df = df.drop("ID", axis=1)

# Drop non-numerical columns and fill NaNs before passing to cuML LogisticRegression
CAT_COLS = list(df.select_dtypes("object").columns)
df = df.drop(CAT_COLS, axis=1)
df = df.fillna(0)

df = df.astype("float32")
X, y = df.drop(["target"], axis=1), df["target"].astype("int32")

study_name = "dask_optuna_lr_log_loss_tpe"

Training and Evaluation#

The train_and_eval function accepts the different parameters to try out. This function should look very similar to any ML workflow. We'll use it inside the Optuna objective function to show how easily an existing workflow can be fit into an Optuna study.

def train_and_eval(
    X_param, y_param, penalty="l2", C=1.0, l1_ratio=None, fit_intercept=True
):
    """
    Splits the given data into train and validation sets, then trains and
    evaluates the model with the given parameters.

    Parameters
    ----------
    X_param : DataFrame
        The data to use for training and testing.
    y_param : Series
        The label for training.
    penalty, C, l1_ratio, fit_intercept :
        The parameter values for Logistic Regression.

    Returns
    -------
    score : float
        Log loss of the fitted model.
    """
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_param, y_param, random_state=42
    )
    classifier = LogisticRegression(
        penalty=penalty,
        C=C,
        l1_ratio=l1_ratio,
        fit_intercept=fit_intercept,
        max_iter=10000,
    )
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_valid)
    score = log_loss(y_valid, y_pred)
    return score

For a baseline number, let’s see what the default performance of the model is.

print("Score with default parameters : ", train_and_eval(X, y))
[W] [09:34:11.132560] L-BFGS line search failed (code 3); stopping at the last valid step
Score with default parameters :  8.24908383066997

Objective Function#

We will optimize the objective function using an Optuna study. The objective function tries out the specified values for the parameters being tuned and returns the score obtained with those parameters. The results of all trials are aggregated in study.trials_dataframe().

Let’s define the objective function for this HPO task by making use of the train_and_eval(). You can see that we simply choose a value for the parameters and call the train_and_eval method, making Optuna very easy to use in an existing workflow.

The objective function does not need to be changed when switching between samplers, which are Optuna's built-in options for selecting a sampling algorithm. Some of the available ones are GridSampler, RandomSampler, and TPESampler. We'll use TPESampler for this demo, but feel free to try different samplers and observe how performance changes.

Tree-structured Parzen Estimator (TPE) works by fitting two Gaussian Mixture Models during each trial: one to the set of parameter values associated with the best objective values, and another to the remaining parameter values. It then chooses the parameter value that maximizes the ratio between the two GMMs.
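If you want to experiment with a different sampler, only the study creation changes; the objective function defined below stays the same. Here is a minimal sketch (not part of the original notebook) using RandomSampler purely as an illustration; in the distributed setup later in this notebook, the shared storage would also be passed in.

# Illustrative only: swapping the sampler just changes the `sampler` argument
# passed to create_study; the objective function is untouched.
alternative_study = optuna.create_study(
    sampler=optuna.samplers.RandomSampler(seed=142),
    direction="minimize",
)
# alternative_study.optimize(lambda trial: objective(trial, X, y), n_trials=10)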

def objective(trial, X_param, y_param):
    C = trial.suggest_float("C", 0.01, 100.0, log=True)
    penalty = trial.suggest_categorical("penalty", ["none", "l1", "l2"])
    fit_intercept = trial.suggest_categorical("fit_intercept", [True, False])

    score = train_and_eval(
        X_param, y_param, penalty=penalty, C=C, fit_intercept=fit_intercept
    )
    return score

HPO Trials and Study#

Optuna uses studies and trials to keep track of the HPO experiments. Put simply, a trial is a single call of the objective function while a set of trials make up a study. We will pick the best observed trial from a study to get the best parameters that were used in that run.

Here, the DaskStorage class is used to set up storage shared by all workers in the cluster. Learn more about the storages that can be used here.
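For reference, an Optuna study can also be persisted to a relational database backend. Below is a minimal sketch, assuming a local SQLite file (the file name is an arbitrary example); note that this notebook uses DaskStorage instead, and SQLite is generally not well suited to highly parallel runs.

# Hedged sketch of an alternative storage backend (not used in this notebook):
# persist trials to a local SQLite database via Optuna's RDB storage.
sqlite_study = optuna.create_study(
    study_name="sqlite_example",
    storage="sqlite:///optuna_hpo.db",  # hypothetical local database file
    direction="minimize",
    load_if_exists=True,
)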

optuna.create_study is used to set up the study. As you can see, it specifies the study name, sampler to be used, the direction of the study, and the storage. With just a few lines of code, we have set up a distributed HPO experiment.

storage = optuna.integration.DaskStorage()
study = optuna.create_study(
    sampler=optuna.samplers.TPESampler(seed=142),
    study_name=study_name,
    direction="minimize",
    storage=storage,
)

# Optimize in parallel on your Dask cluster
#
# Submit `n_workers` optimization tasks, where each task runs N_TRIALS // n_workers trials,
# for a total of about N_TRIALS trials across the cluster
futures = [
    c.submit(
        study.optimize,
        lambda trial: objective(trial, X, y),
        n_trials=N_TRIALS // n_workers,
        pure=False,
    )
    for _ in range(n_workers)
]
wait(futures)
print(f"Best params: {study.best_params}")

print("Number of finished trials: ", len(study.trials))

You should see logs like the following.

[I 2024-08-06 09:41:40,161] Trial 1 finished with value: 8.238207899472073 and parameters: {'C': 40.573838784392514, 'penalty': 'l2', 'fit_intercept': True}. Best is trial 1 with value: 8.238207899472073.
... 
[I 2024-08-06 09:41:58,423] Trial 143 finished with value: 8.210414278942531 and parameters: {'C': 0.3152731188939818, 'penalty': 'l1', 'fit_intercept': True}. Best is trial 52 with value: 8.205579602300705.

Best params: {'C': 1.486491072441749, 'penalty': 'l2', 'fit_intercept': True}
Number of finished trials:  144
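Since all trials are aggregated in the shared storage, they can be pulled into a dataframe for inspection, as mentioned earlier. A minimal sketch (column names depend on the parameters that were suggested):

# Collect all finished trials into a pandas DataFrame and show the best ones.
trials_df = study.trials_dataframe()
print(trials_df.sort_values("value").head())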

Visualization#

Optuna provides an easy way to visualize the trials via built-in graphs. Read more about visualizations here.
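As a quick sketch, assuming Plotly (and scikit-learn, for parameter importances) is installed, the optimization history and parameter importances of the study can be plotted like this:

# Plot the objective value over trials and the relative importance of each parameter.
fig = optuna.visualization.plot_optimization_history(study)
fig.show()

fig = optuna.visualization.plot_param_importances(study)
fig.show()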

Concluding Remarks#

This notebook shows how RAPIDS and Optuna can be used along with Dask to run multi-GPU HPO jobs, and can serve as a starting point for anyone wanting to get started with these frameworks. We have seen how, with just a few added lines of code, we were able to integrate the libraries for multi-GPU HPO runs. This can also be scaled to multiple nodes, as sketched below.
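As a rough sketch of the multi-node case, the LocalCUDACluster could be replaced by a connection to an existing Dask cluster that already has GPU workers attached; the scheduler address below is a placeholder, not a real endpoint.

# Hypothetical multi-node setup: connect to an already-running Dask scheduler
# (e.g. a scheduler process plus dask-cuda workers started on each GPU node).
from dask.distributed import Client

c = Client("tcp://scheduler-host:8786")  # placeholder scheduler address
# The rest of the workflow (DaskStorage, study creation, c.submit calls) stays the same.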

Next Steps#

This example uses a small dataset; you are encouraged to try it on larger data and with wider parameter ranges. These experiments can yield further performance improvements. Refer to other examples in the rapidsai/cloud-ml-examples repository.

Resources#

Hyperparameter Tuning in Python

Overview of Hyperparameter tuning

How to make your model awesome with Optuna