HPO Benchmarking with RAPIDS and Dask
Hyper-Parameter Optimization (HPO) helps to find the best version of a model by exploring the space of possible configurations. While generally desirable, this search is computationally expensive and time-consuming.
In the notebook demo below, we compare benchmarking results to show how GPUs can accelerate HPO tuning jobs relative to CPUs.
For instance, we find a 48x speedup in wall-clock time (0.71 hrs vs 34.6 hrs) for XGBoost and a 16x speedup (3.86 hrs vs 63.2 hrs) for RandomForest when comparing p3.8xlarge (Tesla V100 GPU) and c5.24xlarge (CPU) EC2 instances on 100 HPO trials of the 3-year Airline dataset.
Preamble
You can set up a local environment, but we recommend launching a virtual machine with a cloud provider (Azure, AWS, GCP, etc.).
For the purposes of this notebook, we will use an Amazon Machine Image (AMI) as the starting point.
See Documentation
Please follow the instructions in AWS Elastic Compute Cloud (EC2) to launch an EC2 instance with GPUs, the NVIDIA Driver, and the NVIDIA Container Runtime.
Note
When configuring your instance, ensure you select the Deep Learning AMI GPU TensorFlow or PyTorch in the AMI selection box under “Amazon Machine Image (AMI)”.
Once your instance is running and you have access to Jupyter, save this notebook and run through the cells.
Python ML Workflow
To work with the RAPIDS container, the entrypoint logic should parse arguments; load, preprocess, and split the data; build and train a model; score/evaluate the trained model; and emit an output representing the final score for the given hyperparameter setting.
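As a rough sketch (purely illustrative, with the workflow stages shown as comments; the real implementation is the hpo.py file used later in this notebook), the entrypoint can be organized like this:
import argparse


def main():
    # parse arguments (model type, CPU/GPU target, HPO settings, ...)
    argparse.ArgumentParser().parse_args()
    # load, preprocess, and split the data
    # build and train a model for one hyperparameter configuration
    # score/evaluate the trained model on the held-out split
    # emit the final score so the HPO framework can rank this configuration


if __name__ == "__main__":
    main()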
Let’s have a step-by-step look at each stage of the ML workflow:
Dataset
We leverage the Airline dataset, a large public tracker of US domestic flight logs, which we offer in various sizes (1 year, 3 year, and 10 year) and in Parquet (compressed column storage) format. The machine learning objective with this dataset is to predict whether flights will be more than 15 minutes late arriving at their destination.
We host the demo dataset in public S3 demo buckets in both the us-east-1 and us-west-2 regions. To optimize performance, we recommend that you access the S3 bucket in the same region as your EC2 instance to reduce network latency and data transfer costs.
For this demo, we are using the 3_year dataset, which includes, among others, the following features:
Date and distance (Year, Month, Distance)
Airline / carrier (Flight_Number_Reporting_Airline)
Actual departure and arrival times (DepTime and ArrTime)
Difference between scheduled & actual times (ArrDelay and DepDelay)
Binary encoded version of late, aka our target variable (ArrDelay15)
Configure aws credentials for access to S3 storage
aws configure
Download dataset from S3 bucket to your current working dir
aws s3 cp --recursive s3://sagemaker-rapids-hpo-us-west-2/3_year/ ./data/
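As an optional sanity check on the download (this assumes a GPU-enabled RAPIDS environment; the notebook's own ingestion logic lives in hpo.py), you can load the Parquet files with dask_cudf:
import dask_cudf

# Lazily read the downloaded Parquet files into a GPU-backed Dask DataFrame
df = dask_cudf.read_parquet("./data/")
print(df.head())    # small sample, computed eagerly
print(df.columns)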
Algorithm
From an ML/algorithm perspective, we offer XGBoost and RandomForest. You are free to switch between these algorithm choices, and everything in the example will continue to work.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--model-type", type=str, required=True, choices=["XGBoost", "RandomForest"]
)
We can also optionally increase robustness via reshuffles of the train-test split (i.e., cross-validation folds). Typical values are between 3 and 10 folds. We will use:
n_cv_folds = 5
Dask Cluster
To maximize efficiency, we launch a Dask LocalCluster for CPU runs or a LocalCUDACluster that utilizes GPUs for distributed computing, then connect a Dask Client to submit and manage computations on the cluster.
We can then ingest the data and “persist” it in memory using Dask as follows:
import os

from dask.distributed import Client, LocalCluster
from dask_cuda import LocalCUDACluster

if args.mode == "gpu":
    cluster = LocalCUDACluster()
else:  # mode == "cpu"
    cluster = LocalCluster(n_workers=os.cpu_count())

with Client(cluster) as client:
    dataset = ingest_data(mode=args.mode)
    # persist() returns the persisted collection, so keep the reference
    dataset = client.persist(dataset)
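As an optional sanity check (run inside the with block above; these are standard Dask client calls), you can confirm that the expected workers joined and find the dashboard address:
# Run inside the `with Client(cluster) as client:` block
print(len(client.scheduler_info()["workers"]), "workers connected")
print("Dask dashboard:", client.dashboard_link)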
Search Range
One of the most important choices when running HPO is choosing the bounds of the hyperparameter search. In this notebook, we leverage the power of Optuna, a widely used Python library for hyperparameter optimization.
Here are the quick steps to get started with Optuna:
Define the Objective Function, which represents the model training and evaluation process. It takes hyperparameters as inputs and returns a metric to optimize (e.g., accuracy in our case). Refer to train_xgboost() and train_randomforest() in hpo.py.
Specify the search space using the Trial object’s methods to define the hyperparameters and their corresponding value ranges or distributions. For example:
"max_depth": trial.suggest_int("max_depth", 4, 8),
"max_features": trial.suggest_float("max_features", 0.1, 1.0),
"learning_rate": trial.suggest_float("learning_rate", 0.001, 0.1, log=True),
"min_samples_split": trial.suggest_int("min_samples_split", 2, 1000, log=True),
Create an Optuna study object to keep track of trials and their corresponding hyperparameter configurations and evaluation metrics.
import optuna
from optuna.samplers import RandomSampler

study = optuna.create_study(
    sampler=RandomSampler(seed=args.seed), direction="maximize"
)
Select an optimization algorithm to determine how Optuna explores and exploits the search space to find optimal configurations. For instance, the RandomSampler is an algorithm provided by the Optuna library that samples hyperparameter configurations randomly from the search space.
Run the Optimization by calling Optuna’s optimize() function on the study object. You can specify the number of trials and the number of parallel jobs to run.
study.optimize(
    lambda trial: train_xgboost(
        trial, dataset=dataset, client=client, mode=args.mode
    ),
    n_trials=100,
    n_jobs=1,
)
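Once the trials finish, the best configuration and its score can be read back from the study object:
print("Best score:", study.best_value)
print("Best hyperparameters:", study.best_params)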
Run HPO
Let’s try this out!
The example file hpo.py included here implements the patterns described above.
First, make sure you have the correct CUDA Toolkit version by running nvidia-smi. See the RAPIDS installation docs (link) for details on the supported range of GPUs and drivers.
!nvidia-smi
Executing benchmark tests can be an arduous and time-consuming procedure that may extend over multiple days. By using a tool like tmux, you can maintain active terminal sessions, ensuring that your tasks continue running even if the SSH connection is interrupted.
tmux
Run the following command to launch hyper-parameter optimization in a Docker container.
If you don’t yet have that image locally, the first time this runs it might take a few minutes to pull it. After that, startup should be very fast.
Here’s what the arguments in that command below are doing:
--gpus all = make all GPUs on the system available to processes in the container
--env EXTRA_CONDA_PACKAGES = install the optuna and optuna-integration conda packages (the image already comes with all of the RAPIDS libraries and their dependencies installed)
-p 8787:8787 = forward port 8787 on the host to port 8787 in the container (navigate to {public IP of box}:8787 to see the Dask dashboard!)
-v / -w = mount the current directory from the host machine into the container and use it as the working directory; this allows processes in the container to read the data you downloaded to the ./data directory earlier, and it also means that any changes made to these files from inside the container will be reflected back on the host
Piping the output to a file called xgboost_hpo_logs.txt is helpful, as it preserves all the logs for later inspection.
!docker run \
    --gpus all \
    --env EXTRA_CONDA_PACKAGES="optuna optuna-integration" \
    -p 8787:8787 \
    -v $(pwd):/home/rapids/xgboost-hpo-example \
    -w /home/rapids/xgboost-hpo-example \
    -it nvcr.io/nvidia/rapidsai/base:25.02-cuda12.8-py3.12 \
    /bin/bash -c "python ./hpo.py --model-type 'XGBoost' --target 'gpu'" \
    > ./xgboost_hpo_logs.txt 2>&1
Try Some Modifications
Now that you’ve run this example, try some modifications!
For example:
use --model-type "RandomForest" to see how a random forest model compares to XGBoost
use --target "cpu" to estimate the speedup from GPU-accelerated training
modify the pipeline in hpo.py with other customizations