Train and Hyperparameter-Tune with RAPIDS on AzureML#

Choosing an optimal set of hyperparameters is a daunting task, especially for algorithms like XGBoost that have many hyperparameters to tune.

In this notebook, we will show how to speed up hyperparameter optimization by running multiple training jobs in parallel on the Azure Machine Learning (AzureML) service.

Prerequisites#

See Documentation

Create an Azure ML Workspace, then follow the instructions in Microsoft Azure Machine Learning to launch an Azure ML Compute instance with RAPIDS.

Once your instance is running and you have access to Jupyter, save this notebook and run through the cells.

Initialize Workspace#

Initialize the MLClient class to handle the workspace you created in the prerequisites step.

You can manually provide the workspace details or call MLClient.from_config(credential, path) to create a client from the details stored in config.json.

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Get a handle to the workspace.
#
# Azure ML places the workspace config in the default working
# directory for notebooks.
#
# If it isn't found, open a shell and look in the
# directory indicated by 'echo ${JUPYTER_SERVER_ROOT}'.
ml_client = MLClient.from_config(
    credential=DefaultAzureCredential(),
    path="./config.json",
)
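If config.json is missing, it can be written by hand. The following is a minimal sketch, assuming the usual three-key layout (subscription_id, resource_group, workspace_name) that MLClient.from_config reads. The helper name write_workspace_config and the placeholder values are hypothetical, and the file is written to a temporary directory here so that an existing config is not overwritten.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(path, subscription_id, resource_group, workspace_name):
    """Write a workspace config file containing the three workspace identifiers."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    Path(path).write_text(json.dumps(config, indent=2))
    return config


# Placeholder values (hypothetical); replace them with your own workspace details.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.json"
    written = write_workspace_config(
        config_path,
        subscription_id="<subscription-id>",
        resource_group="<resource-group>",
        workspace_name="<workspace-name>",
    )
    # Read the file back to confirm it round-trips as valid JSON.
    reloaded = json.loads(config_path.read_text())
```

In a real notebook you would write the file next to the notebook and pass that location as the path argument of MLClient.from_config.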

Access Data from Datastore URI#

In this example, we will use 20 million rows of the airline dataset. The datastore URI below references a data storage location (path) containing the parquet files.

datastore_name = "workspaceartifactstore"
dataset = "airline_20000000.parquet"

# Datastore uri format:
data_uri = f"azureml://subscriptions/{ml_client.subscription_id}/resourcegroups/{ml_client.resource_group_name}/workspaces/{ml_client.workspace_name}/datastores/{datastore_name}/paths/{dataset}"

print("data uri:", "\n", data_uri)
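The URI layout can also be wrapped in a small helper so that the pieces are easier to see and reuse. This is only a sketch: build_datastore_uri is a hypothetical name, and the subscription, resource group, and workspace values in the example are made up.

```python
def build_datastore_uri(subscription_id, resource_group, workspace, datastore, path):
    """Assemble an azureml:// datastore URI from its components."""
    return (
        f"azureml://subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore}"
        # Drop any leading slash so the path segment is not doubled up.
        f"/paths/{path.lstrip('/')}"
    )


# Example with made-up workspace identifiers.
example_uri = build_datastore_uri(
    "my-sub", "my-rg", "my-ws", "workspaceartifactstore", "airline_20000000.parquet"
)
print(example_uri)
```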

Create AML Compute#

You will need to create an Azure ML managed compute target (AmlCompute) to serve as the environment for training your model.

This notebook will use up to 5 nodes for hyperparameter optimization; you can modify max_instances based on the available quota in the desired region. As with other Azure ML services, there are limits on AmlCompute. The Azure ML article on managing and increasing quotas includes details on the default limits and how to request more quota.

size describes the virtual machine type and size that will be used in the cluster. See “System Requirements” in the RAPIDS docs (link) and “GPU optimized virtual machine sizes” in the Azure docs (link) to identify an instance type.

Let’s create an AmlCompute cluster of Standard_NC12s_v3 (Tesla V100) GPU VMs:

from azure.ai.ml.entities import AmlCompute
from azure.ai.ml.exceptions import MlException
from azure.core.exceptions import ResourceNotFoundError

# specify aml compute name.
target_name = "rapids-cluster"

try:
    # let's see if the compute target already exists
    gpu_target = ml_client.compute.get(target_name)
    print(f"found compute target. Will use {gpu_target.name}")
except (MlException, ResourceNotFoundError):
    # a missing compute target typically surfaces as ResourceNotFoundError
    print("Creating a new gpu compute target...")

    gpu_target = AmlCompute(
        name=target_name,
        type="amlcompute",
        size="STANDARD_NC12S_V3",
        max_instances=5,
        idle_time_before_scale_down=300,
    )
    ml_client.compute.begin_create_or_update(gpu_target).result()

    print(
        f"AMLCompute with name {gpu_target.name} is created, the compute size is {gpu_target.size}"
    )
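Before raising max_instances, it helps to check the cluster size against your quota. The sketch below assumes that quota is counted in vCPUs per VM family and that a Standard_NC12s_v3 node has 12 vCPUs; the quota figure of 48 is a made-up example, and max_nodes_for_quota is a hypothetical helper.

```python
VCPUS_PER_NC12S_V3 = 12  # Standard_NC12s_v3: 12 vCPUs and 2 V100 GPUs per node


def max_nodes_for_quota(quota_vcpus, vcpus_per_node):
    """Return how many whole nodes fit within a vCPU quota."""
    if vcpus_per_node <= 0:
        raise ValueError("vcpus_per_node must be positive")
    return quota_vcpus // vcpus_per_node


# vCPUs needed for the 5-node cluster configured above.
required_vcpus = 5 * VCPUS_PER_NC12S_V3

# With a hypothetical quota of 48 vCPUs, only 4 nodes would fit.
nodes_with_small_quota = max_nodes_for_quota(48, VCPUS_PER_NC12S_V3)
print(required_vcpus, nodes_with_small_quota)
```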

Prepare training script#

Make sure the current directory contains the code to run on the remote resource. This includes the training script and all of its dependency files. In this example, the training script is provided:

train_rapids.py: entry script for the RAPIDS environment. It loads the dataset into a cuDF dataframe, trains a Random Forest, and runs inference using cuML.

We will log some parameters and metrics including highest accuracy, using mlflow within the training script:

import mlflow

# np.float was removed from recent NumPy releases; the built-in float works here
mlflow.log_metric("Accuracy", float(global_best_test_accuracy))

These run metrics will become particularly important when we begin hyperparameter tuning our model in the ‘Tune model hyperparameters’ section.
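The logging step can be sketched in isolation as follows. This is not the code from train_rapids.py: log_best_accuracy and the fold accuracies are hypothetical, and the logger is passed in as a callable so the example runs without an MLflow tracking server. In the training script you would pass mlflow.log_metric instead. The metric name must match the primary_metric given to the sweep, which is "Accuracy" in this notebook.

```python
def log_best_accuracy(fold_accuracies, log_metric):
    """Log the highest cross-validation accuracy under the name 'Accuracy'."""
    # Convert to a plain Python float so NumPy or GPU scalars log cleanly.
    best = float(max(fold_accuracies))
    log_metric("Accuracy", best)
    return best


# Stand-in for mlflow.log_metric that records metrics in a dictionary.
logged_metrics = {}


def record_metric(key, value):
    logged_metrics[key] = value


# Made-up accuracies for three cross-validation folds.
best_accuracy = log_best_accuracy([0.912, 0.927, 0.919], record_metric)
print(logged_metrics)
```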

Train Model on Remote Compute#

Setup Environment#

We’ll be using a custom RAPIDS docker image to set up the environment. This is available in the rapidsai/base repo on DockerHub.

%%bash
# create a Dockerfile defining the image the code will run in
cat > ./Dockerfile <<EOF
FROM nvcr.io/nvidia/rapidsai/base:24.10-cuda12.5-py3.12

RUN conda install --yes -c conda-forge 'dask-ml>=2024.4.4' \
 && pip install azureml-mlflow
EOF

The Docker build context is set to os.getcwd(), so make sure the Dockerfile created above is in the current working directory.

import os

from azure.ai.ml.entities import BuildContext, Environment

env_docker_image = Environment(
    build=BuildContext(path=os.getcwd()),
    name="rapids-hpo",
    description="RAPIDS environment with azureml-mlflow",
)

ml_client.environments.create_or_update(env_docker_image)

Submit the Training Job#

We will configure and run a training job using the command function. The resulting command can be run as a standalone job or used as a function inside pipelines. inputs is a dictionary of command-line arguments to pass to the training script.

from azure.ai.ml import Input, command

command_job = command(
    environment=f"{env_docker_image.name}:{env_docker_image.version}",
    experiment_name="test_rapids_aml_hpo_cluster",
    code=os.getcwd(),
    inputs={
        "data_dir": Input(type="uri_file", path=data_uri),
        "n_bins": 32,
        "compute": "single-GPU",  # multi-GPU for algorithms via Dask
        "cv_folds": 5,
        "n_estimators": 100,
        "max_depth": 6,
        "max_features": 0.3,
    },
    command="python train_rapids.py \
                    --data_dir ${{inputs.data_dir}} \
                    --n_bins ${{inputs.n_bins}} \
                    --compute ${{inputs.compute}} \
                    --cv_folds ${{inputs.cv_folds}} \
                    --n_estimators ${{inputs.n_estimators}} \
                    --max_depth ${{inputs.max_depth}} \
                    --max_features ${{inputs.max_features}}",
    compute=gpu_target.name,
)


# submit the command
returned_job = ml_client.jobs.create_or_update(command_job)

# get a URL for the status of the job
returned_job.studio_url
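On the script side, each ${{inputs.<name>}} placeholder arrives as an ordinary command-line argument. The following is a sketch of how a script such as train_rapids.py might parse them. The flag names come from the command above, but the types and defaults are assumptions and may differ from the actual script.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags passed by the command job (types and defaults are assumed)."""
    parser = argparse.ArgumentParser(description="RAPIDS Random Forest training")
    parser.add_argument("--data_dir", type=str, required=True, help="path to the parquet data")
    parser.add_argument("--n_bins", type=int, default=32)
    parser.add_argument("--compute", type=str, default="single-GPU")
    parser.add_argument("--cv_folds", type=int, default=5)
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=6)
    parser.add_argument("--max_features", type=float, default=0.3)
    return parser.parse_args(argv)


# Simulate the arguments one trial might receive (the data path is made up).
args = parse_args(
    ["--data_dir", "/tmp/airline.parquet", "--n_estimators", "250", "--max_features", "0.5"]
)
print(args)
```

Declaring numeric flags with type=int or type=float means the values substituted by the job, which arrive as strings, are converted before they reach the model.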

Tune Model Hyperparameters#

We can optimize our model’s hyperparameters and improve the accuracy using Azure Machine Learning’s hyperparameter tuning capabilities.

Start a Hyperparameter Sweep#

Let’s define the hyperparameter space to sweep over. We will tune the n_estimators, max_depth and max_features parameters. In this example we will use random sampling to try different sets of hyperparameters and maximize Accuracy.

from azure.ai.ml.sweep import Choice, Uniform

command_job_for_sweep = command_job(
    n_estimators=Choice(values=range(50, 500)),
    max_depth=Choice(values=range(5, 19)),
    max_features=Uniform(min_value=0.2, max_value=1.0),
)

# apply sweep parameter to obtain the sweep_job
sweep_job = command_job_for_sweep.sweep(
    compute=gpu_target.name,
    sampling_algorithm="random",
    primary_metric="Accuracy",
    goal="Maximize",
)


# Relax these limits to run more trials
sweep_job.set_limits(
    max_total_trials=5, max_concurrent_trials=5, timeout=18000, trial_timeout=3600
)

# Specify your experiment details
sweep_job.display_name = "RF-rapids-sweep-job"
sweep_job.description = "Run RAPIDS hyperparameter sweep job"
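To make random sampling concrete, the sketch below draws trial configurations from the same search space using only the Python standard library. It illustrates the idea and is not how Azure ML implements its sampler; SEARCH_SPACE and sample_configuration are hypothetical names. Note that range(50, 500) covers 50 to 499 and range(5, 19) covers 5 to 18.

```python
import random

# Mirror of the sweep's search space: two discrete choices and one uniform range.
SEARCH_SPACE = {
    "n_estimators": ("choice", range(50, 500)),
    "max_depth": ("choice", range(5, 19)),
    "max_features": ("uniform", (0.2, 1.0)),
}


def sample_configuration(space, rng):
    """Draw one hyperparameter configuration, sampling each parameter independently."""
    config = {}
    for name, (kind, spec) in space.items():
        if kind == "choice":
            config[name] = rng.choice(spec)
        elif kind == "uniform":
            low, high = spec
            config[name] = rng.uniform(low, high)
        else:
            raise ValueError(f"unknown distribution: {kind}")
    return config


# Seeded generator so repeated runs of this sketch draw the same trials.
rng = random.Random(0)
trials = [sample_configuration(SEARCH_SPACE, rng) for _ in range(5)]
for trial in trials:
    print(trial)
```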

This will launch the RAPIDS training script with parameters that were specified in the cell above.

# submit the hpo job
returned_sweep_job = ml_client.create_or_update(sweep_job)

Monitor runs#

print(f"Monitor your job at {returned_sweep_job.studio_url}")
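Besides following the Studio link, the job status can be polled from the notebook. The sketch below takes the status lookup as a callable so it runs on its own. wait_for_job is a hypothetical helper, and treating "Completed", "Failed", and "Canceled" as the terminal states is an assumption about the status strings the service reports.

```python
import time

# Assumed terminal status strings; adjust if your jobs report different values.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_job(get_status, poll_seconds=30.0, max_polls=None, sleep=time.sleep):
    """Poll get_status() until a terminal state is seen or max_polls is reached."""
    polls = 0
    while True:
        status = get_status()
        polls += 1
        if status in TERMINAL_STATES:
            return status
        if max_polls is not None and polls >= max_polls:
            # Give up and report the last status seen.
            return status
        sleep(poll_seconds)


# Simulated sequence of statuses standing in for calls to the service.
fake_statuses = iter(["Queued", "Running", "Running", "Completed"])
final_status = wait_for_job(
    lambda: next(fake_statuses), poll_seconds=0, sleep=lambda seconds: None
)
print(final_status)
```

In the notebook, the callable could be something like lambda: ml_client.jobs.get(returned_sweep_job.name).status.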

Find and Register Best Model#

Download the model output of the best trial:

ml_client.jobs.download(returned_sweep_job.name, output_name="model")

Delete Cluster#

ml_client.compute.begin_delete(gpu_target.name).wait()