Scaling up hyperparameter optimization with NVIDIA DGX Cloud and XGBoost GPU algorithm#

Choosing an optimal set of hyperparameters is a daunting task, especially for an algorithm like XGBoost, which exposes many of them. In this notebook, we show how to speed up hyperparameter optimization by running multiple training jobs in parallel on NVIDIA DGX Cloud.

Prerequisites#

See Documentation

Please follow the instructions in NVIDIA DGX Cloud (Base Command Platform) to launch a Base Command Platform (BCP) job with RAPIDS.

Note

When configuring your cluster, make sure optuna is installed, as we will use it later in this notebook (a quick way to install it from the notebook is shown below).

Once your cluster is running and you have access to Jupyter, save this notebook and run through the cells.
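
If optuna was not included when the cluster was configured, it can usually be installed into the running environment directly from a notebook cell. A minimal sketch; whether pip is the right installer depends on the image you launched:

# Install optuna into the current environment; adjust to your image as needed.
!pip install optuna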

Connect to Dask cluster#

from dask.distributed import Client

client = Client("ws://localhost:8786")
client

Client

Client-e30a2ceb-d856-11ed-826a-8a2d4b69b538

Connection method: Direct
Dashboard: http://localhost:8787/status

Scheduler: Scheduler-be131a42-88e8-41ec-a14a-1be043f3f2bd (ws://100.96.148.133:8786), started 4 minutes ago
Workers: 14, with 1 thread each and 800.00 GiB total memory
GPU per worker: Tesla V100-SXM2-32GB-LS (32.00 GiB GPU memory)

n_workers = len(client.scheduler_info()["workers"])
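
Before submitting any trials, it can be worth confirming that the packages the objective relies on are importable on every worker. This optional check is not part of the original workflow; client.run executes a function on each worker and returns a dictionary keyed by worker address:

# Optional sanity check: report the optuna and xgboost versions on every worker.
def package_versions():
    import optuna
    import xgboost

    return {"optuna": optuna.__version__, "xgboost": xgboost.__version__}


client.run(package_versions)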

Perform hyperparameter optimization with a toy example#

Now we can run hyperparameter optimization. Each Optuna trial is submitted to the Dask cluster as its own task, so the workers evaluate many hyperparameter combinations in parallel. A single study is shared across all workers through Optuna's DaskStorage, and trials are submitted in batches of n_workers * 4 so that progress can be reported and the number of in-flight tasks stays bounded.

def objective(trial):
    # Simple quadratic objective, minimized at x = 2
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2
import optuna
from dask.distributed import wait

# Total number of hyperparameter combinations (trials) to evaluate
n_trials = 100

# Optimize in parallel on your Dask cluster
backend_storage = optuna.storages.InMemoryStorage()
dask_storage = optuna.integration.DaskStorage(storage=backend_storage, client=client)
study = optuna.create_study(direction="minimize", storage=dask_storage)

futures = []
for i in range(0, n_trials, n_workers * 4):
    iter_range = (i, min([i + n_workers * 4, n_trials]))
    futures.append(
        {
            "range": iter_range,
            "futures": [
                client.submit(study.optimize, objective, n_trials=1, pure=False)
                for _ in range(*iter_range)
            ],
        }
    )
for partition in futures:
    iter_range = partition["range"]
    print(f"Testing hyperparameter combinations {iter_range[0]}..{iter_range[1]}")
    _ = wait(partition["futures"])
/tmp/ipykernel_618/3307148639.py:9: ExperimentalWarning: DaskStorage is experimental (supported from v3.1.0). The interface can change in the future.
  dask_storage = optuna.integration.DaskStorage(storage=backend_storage, client=client)
Testing hyperparameter combinations 0..56
Testing hyperparameter combinations 56..100
study.best_params
{'x': 1.9717191009722854}
study.best_value
0.0007998092498157874
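
Optuna also records the full trial history, which can be useful for a quick sanity check before moving on to a real model; study.trials_dataframe() returns the trials as a pandas DataFrame:

# Inspect the first few completed trials (values, parameters, timings).
study.trials_dataframe().head()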

Perform hyperparameter optimization with XGBoost GPU algorithm#

Now let’s optimize hyperparameters for an XGBoost model. Each trial trains a GPU-accelerated classifier (tree_method="gpu_hist") on the scikit-learn breast cancer dataset and scores it with 5-fold cross-validated negative log loss, which is why the study below maximizes its objective.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, KFold
import xgboost as xgb
from optuna.samplers import RandomSampler


def objective(trial):
    X, y = load_breast_cancer(return_X_y=True)
    params = {
        "n_estimators": 10,
        "verbosity": 0,
        "tree_method": "gpu_hist",
        # L2 regularization weight.
        "lambda": trial.suggest_float("lambda", 1e-8, 100.0, log=True),
        # L1 regularization weight.
        "alpha": trial.suggest_float("alpha", 1e-8, 100.0, log=True),
        # Fraction of features (columns) sampled for each tree.
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
        "max_depth": trial.suggest_int("max_depth", 2, 10, step=1),
        # Minimum child weight; larger values make the tree more conservative.
        "min_child_weight": trial.suggest_float(
            "min_child_weight", 1e-8, 100, log=True
        ),
        "learning_rate": trial.suggest_float("learning_rate", 1e-8, 1.0, log=True),
        # Minimum loss reduction required to make a further split; larger values are more conservative.
        "gamma": trial.suggest_float("gamma", 1e-8, 1.0, log=True),
        "grow_policy": "depthwise",
        "eval_metric": "logloss",
    }
    clf = xgb.XGBClassifier(**params)
    fold = KFold(n_splits=5, shuffle=True, random_state=0)
    score = cross_val_score(clf, X, y, cv=fold, scoring="neg_log_loss")
    return score.mean()
# Total number of hyperparameter combinations (trials) to evaluate
n_trials = 1000

# Optimize in parallel on your Dask cluster
backend_storage = optuna.storages.InMemoryStorage()
dask_storage = optuna.integration.DaskStorage(storage=backend_storage, client=client)
study = optuna.create_study(
    direction="maximize", sampler=RandomSampler(seed=0), storage=dask_storage
)
futures = []
for i in range(0, n_trials, n_workers * 4):
    iter_range = (i, min([i + n_workers * 4, n_trials]))
    futures.append(
        {
            "range": iter_range,
            "futures": [
                client.submit(study.optimize, objective, n_trials=1, pure=False)
                for _ in range(*iter_range)
            ],
        }
    )
for partition in futures:
    iter_range = partition["range"]
    print(f"Testing hyperparameter combinations {iter_range[0]}..{iter_range[1]}")
    _ = wait(partition["futures"])
/tmp/ipykernel_618/4264989174.py:6: ExperimentalWarning: DaskStorage is experimental (supported from v3.1.0). The interface can change in the future.
  dask_storage = optuna.integration.DaskStorage(storage=backend_storage, client=client)
Testing hyperparameter combinations 0..56
Testing hyperparameter combinations 56..112
Testing hyperparameter combinations 112..168
Testing hyperparameter combinations 168..224
Testing hyperparameter combinations 224..280
Testing hyperparameter combinations 280..336
Testing hyperparameter combinations 336..392
Testing hyperparameter combinations 392..448
Testing hyperparameter combinations 448..504
Testing hyperparameter combinations 504..560
Testing hyperparameter combinations 560..616
Testing hyperparameter combinations 616..672
Testing hyperparameter combinations 672..728
Testing hyperparameter combinations 728..784
Testing hyperparameter combinations 784..840
Testing hyperparameter combinations 840..896
Testing hyperparameter combinations 896..952
Testing hyperparameter combinations 952..1000
study.best_params
{'lambda': 1.9471539598103378,
 'alpha': 1.1141784696858766e-08,
 'colsample_bytree': 0.7422532294369841,
 'max_depth': 4,
 'min_child_weight': 0.2248745054413427,
 'learning_rate': 0.4983200494234886,
 'gamma': 9.77293810275356e-07}
study.best_value
-0.10351123544715746
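
With the search complete, the best hyperparameters can be used to refit a single model on the full dataset. A minimal sketch, reusing study.best_params together with the fixed settings from the objective above; it assumes the notebook itself has a GPU available (otherwise the fit could be submitted to a worker with client.submit):

# Refit one GPU-accelerated XGBoost classifier with the best hyperparameters
# found by the study. The fixed settings mirror those used in the objective.
from sklearn.datasets import load_breast_cancer
import xgboost as xgb

X, y = load_breast_cancer(return_X_y=True)
best_clf = xgb.XGBClassifier(
    n_estimators=10,
    tree_method="gpu_hist",
    grow_policy="depthwise",
    eval_metric="logloss",
    **study.best_params,
)
best_clf.fit(X, y)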

Finally, let’s visualize the progress made during hyperparameter optimization, along with the relative importance of each hyperparameter.

from optuna.visualization.matplotlib import (
    plot_optimization_history,
    plot_param_importances,
)
plot_optimization_history(study)
/tmp/ipykernel_618/3324289224.py:1: ExperimentalWarning: plot_optimization_history is experimental (supported from v2.2.0). The interface can change in the future.
  plot_optimization_history(study)
<AxesSubplot:title={'center':'Optimization History Plot'}, xlabel='Trial', ylabel='Objective Value'>
[Figure: Optimization History Plot — objective value vs. trial]
plot_param_importances(study)
/tmp/ipykernel_618/3836449081.py:1: ExperimentalWarning: plot_param_importances is experimental (supported from v2.2.0). The interface can change in the future.
  plot_param_importances(study)
<AxesSubplot:title={'center':'Hyperparameter Importances'}, xlabel='Importance for Objective Value', ylabel='Hyperparameter'>
[Figure: Hyperparameter Importances — importance for objective value per hyperparameter]