Train and Hyperparameter-Tune with RAPIDS on AzureML#
Choosing an optimal set of hyperparameters is a daunting task, especially for algorithms like XGBoost that have many hyperparameters to tune.
In this notebook, we will show how to speed up hyperparameter optimization by running multiple training jobs in parallel on the Azure Machine Learning (AzureML) service.
Prerequisites#
See Documentation
Create an Azure ML Workspace, then follow the instructions in Microsoft Azure Machine Learning to launch an Azure ML Compute instance with RAPIDS.
Once your instance is running and you have access to Jupyter, save this notebook and run through the cells.
Initialize Workspace#
Initialize the MLClient class to handle the workspace you created in the prerequisites step.
You can manually provide the workspace details or call MLClient.from_config(credential, path) to create a workspace object from the details stored in config.json.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
# Get a handle to the workspace.
#
# Azure ML places the workspace config at the default working
# directory for notebooks by default.
#
# If it isn't found, open a shell and look in the
# directory indicated by 'echo ${JUPYTER_SERVER_ROOT}'.
ml_client = MLClient.from_config(
    credential=DefaultAzureCredential(),
    path="./config.json",
)
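If MLClient.from_config cannot load the workspace, it can help to inspect config.json directly. The sketch below assumes the file carries the three workspace keys subscription_id, resource_group and workspace_name; the helper name missing_workspace_keys is hypothetical and not part of the Azure SDK.

```python
import json
import tempfile
from pathlib import Path

# Keys assumed to be present in a workspace config.json.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_workspace_keys(path):
    """Return the required keys that are absent or empty in the config file."""
    config = json.loads(Path(path).read_text())
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Demonstrate on a throwaway file that deliberately omits workspace_name.
with tempfile.TemporaryDirectory() as tmp:
    example_path = Path(tmp) / "config.json"
    example_path.write_text(
        json.dumps(
            {
                "subscription_id": "<subscription-id>",
                "resource_group": "<resource-group>",
            }
        )
    )
    missing = missing_workspace_keys(example_path)

print(missing)
```

In the notebook you would call missing_workspace_keys("./config.json") before creating the client; an empty list means all three keys are set.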
Access Data from Datastore URI#
In this example, we will use 20 million rows of the airline dataset. The datastore URI below references the storage location (path) that contains the parquet files.
datastore_name = "workspaceartifactstore"
dataset = "airline_20000000.parquet"
# Datastore uri format:
data_uri = f"azureml://subscriptions/{ml_client.subscription_id}/resourcegroups/{ml_client.resource_group_name}/workspaces/{ml_client.workspace_name}/datastores/{datastore_name}/paths/{dataset}"
print("data uri:", "\n", data_uri)
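The URI above is plain string assembly, so it can be factored into a small helper when several datasets are involved. This is a minimal sketch; the function name datastore_uri and the placeholder subscription, resource group and workspace values are made up for illustration.

```python
def datastore_uri(subscription_id, resource_group, workspace, datastore, path):
    """Assemble an azureml:// datastore URI from its components."""
    return (
        f"azureml://subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore}"
        f"/paths/{path.lstrip('/')}"  # avoid a double slash before the path
    )


# Placeholder identifiers; in the notebook these come from ml_client.
example_uri = datastore_uri(
    "my-subscription",
    "my-resource-group",
    "my-workspace",
    "workspaceartifactstore",
    "airline_20000000.parquet",
)
print(example_uri)
```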
Create AML Compute#
You will need to create an Azure ML managed compute target (AmlCompute) to serve as the environment for training your model.
This notebook uses up to 5 nodes for hyperparameter optimization; you can modify max_instances based on the available quota in the desired region. As with other Azure ML services, there are limits on AmlCompute; the Azure documentation on quotas describes the default limits and how to request more quota.
size describes the virtual machine type and size that will be used in the cluster. See “System Requirements” in the RAPIDS docs and “GPU optimized virtual machine sizes” in the Azure docs to identify an instance type.
Let’s create an AmlCompute cluster of Standard_NC12s_v3 (Tesla V100) GPU VMs:
from azure.ai.ml.entities import AmlCompute
from azure.ai.ml.exceptions import MlException
# specify aml compute name.
target_name = "rapids-cluster"
try:
    # let's see if the compute target already exists
    gpu_target = ml_client.compute.get(target_name)
    print(f"found compute target. Will use {gpu_target.name}")
except MlException:
    print("Creating a new gpu compute target...")
    gpu_target = AmlCompute(
        name=target_name,
        type="amlcompute",
        size="STANDARD_NC12S_V3",
        max_instances=5,
        idle_time_before_scale_down=300,
    )
    ml_client.compute.begin_create_or_update(gpu_target).result()
    print(
        f"AMLCompute with name {gpu_target.name} is created, the compute size is {gpu_target.size}"
    )
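The node count bounds how many trials can run at once during the sweep configured later in this notebook. The sketch below is back-of-the-envelope arithmetic only: it assumes one trial occupies one node and ignores queueing, node start-up time and early termination. The helper name sweep_waves is hypothetical.

```python
import math


def sweep_waves(max_total_trials, max_concurrent_trials, max_instances):
    """Return (trials running at once, number of sequential waves needed)."""
    # Concurrency is capped by both the sweep limit and the cluster size.
    effective_concurrency = min(max_concurrent_trials, max_instances)
    waves = math.ceil(max_total_trials / effective_concurrency)
    return effective_concurrency, waves


# Values used in this notebook: 5 trials, 5 concurrent, 5 nodes.
print(sweep_waves(5, 5, 5))
# A larger sweep on the same 5-node cluster.
print(sweep_waves(20, 10, 5))
```

Raising max_concurrent_trials above max_instances therefore does not add parallelism under this assumption; the cluster size has to grow with it.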
Prepare training script#
Make sure the current directory contains the code to run on the remote resource. This includes the training script and all of its dependency files. In this example, the training script is provided:
train_rapids.py - entry script for the RAPIDS environment; it loads the dataset into a cuDF dataframe, trains a Random Forest, and runs inference using cuML.
We will log some parameters and metrics including highest accuracy, using mlflow within the training script:
import mlflow
# log the best accuracy as a built-in Python float
mlflow.log_metric("Accuracy", float(global_best_test_accuracy))
These run metrics will become particularly important when we begin hyperparameter tuning our model in the ‘Tune model hyperparameters’ section.
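As a rough illustration of the logging pattern, the sketch below keeps the highest accuracy seen across cross-validation folds and passes it to a logging callable. The function name log_best_accuracy and the fold accuracies are placeholders, not values from this notebook; inside the training script you would pass mlflow.log_metric as the callable.

```python
def log_best_accuracy(fold_accuracies, log_metric=print):
    """Log the highest accuracy under the metric name the sweep optimizes."""
    # Convert to a built-in float so the value is safe to serialize.
    best = float(max(fold_accuracies))
    log_metric("Accuracy", best)
    return best


# Placeholder accuracies; a stand-in logger records what would be sent.
logged = []
best = log_best_accuracy(
    [0.912, 0.934, 0.927],
    log_metric=lambda key, value: logged.append((key, value)),
)
print(best, logged)
```

The metric name must match the primary_metric given to the sweep exactly, including capitalization.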
Train Model on Remote Compute#
Setup Environment#
We’ll be using a custom RAPIDS Docker image to set up the environment. The base image is published as rapidsai/base on Docker Hub; the Dockerfile below pulls it from the NVIDIA NGC registry (nvcr.io).
%%bash
# create a Dockerfile defining the image the code will run in
cat > ./Dockerfile <<EOF
FROM nvcr.io/nvidia/rapidsai/base:24.10-cuda12.5-py3.12

RUN conda install --yes -c conda-forge 'dask-ml>=2024.4.4' \
 && pip install azureml-mlflow
EOF
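Outside Jupyter, where the %%bash cell magic is not available, the same Dockerfile can be written from Python. This is a sketch under that assumption; the helper name write_dockerfile is hypothetical, and the demo writes to a temporary directory so it does not overwrite an existing file.

```python
import tempfile
from pathlib import Path

# Same contents as the Dockerfile created by the bash cell.
DOCKERFILE = """\
FROM nvcr.io/nvidia/rapidsai/base:24.10-cuda12.5-py3.12

RUN conda install --yes -c conda-forge 'dask-ml>=2024.4.4' \\
 && pip install azureml-mlflow
"""


def write_dockerfile(directory="."):
    """Write the Dockerfile into the given build-context directory."""
    path = Path(directory) / "Dockerfile"
    path.write_text(DOCKERFILE)
    return path


with tempfile.TemporaryDirectory() as tmp:
    written = write_dockerfile(tmp)
    first_line = written.read_text().splitlines()[0]

print(first_line)
```

In the notebook you would call write_dockerfile(os.getcwd()) so the file lands in the build context used by the environment.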
Make sure the path to the Docker build context is correct; here it is the current working directory, os.getcwd().
import os
from azure.ai.ml.entities import BuildContext, Environment
env_docker_image = Environment(
    build=BuildContext(path=os.getcwd()),
    name="rapids-hpo",
    description="RAPIDS environment with azureml-mlflow",
)
# keep the returned object so its registered version is available below
env_docker_image = ml_client.environments.create_or_update(env_docker_image)
Submit the Training Job#
We will configure and run a training job using the command class. The command can be used to run standalone jobs or as a function inside pipelines.
inputs is a dictionary of command-line arguments to pass to the training script.
from azure.ai.ml import Input, command
command_job = command(
    environment=f"{env_docker_image.name}:{env_docker_image.version}",
    experiment_name="test_rapids_aml_hpo_cluster",
    code=os.getcwd(),
    inputs={
        "data_dir": Input(type="uri_file", path=data_uri),
        "n_bins": 32,
        "compute": "single-GPU",  # multi-GPU for algorithms via Dask
        "cv_folds": 5,
        "n_estimators": 100,
        "max_depth": 6,
        "max_features": 0.3,
    },
    command="python train_rapids.py \
        --data_dir ${{inputs.data_dir}} \
        --n_bins ${{inputs.n_bins}} \
        --compute ${{inputs.compute}} \
        --cv_folds ${{inputs.cv_folds}} \
        --n_estimators ${{inputs.n_estimators}} \
        --max_depth ${{inputs.max_depth}} \
        --max_features ${{inputs.max_features}}",
    compute=gpu_target.name,
)
# submit the command
returned_job = ml_client.jobs.create_or_update(command_job)
# get a URL for the status of the job
returned_job.studio_url
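On the receiving side, the training script has to parse the flags passed in the command string. train_rapids.py is not reproduced here, so the following is only a sketch of how such a script might read them with argparse; the flag names come from the command above, while the types and defaults are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical argument parser matching the flags in the job command."""
    parser = argparse.ArgumentParser(description="RAPIDS training job arguments")
    parser.add_argument("--data_dir", type=str, required=True)
    parser.add_argument("--n_bins", type=int, default=32)
    parser.add_argument("--compute", type=str, default="single-GPU")
    parser.add_argument("--cv_folds", type=int, default=5)
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=6)
    parser.add_argument("--max_features", type=float, default=0.3)
    return parser


# Simulate the arguments a sweep trial might pass; the path is a placeholder.
args = build_parser().parse_args(
    ["--data_dir", "/tmp/airline.parquet", "--n_estimators", "250", "--max_features", "0.5"]
)
print(vars(args))
```

Declaring explicit types matters because every value arrives as a string on the command line, including the ones sampled by the sweep.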
Tune Model Hyperparameters#
We can optimize our model’s hyperparameters and improve the accuracy using Azure Machine Learning’s hyperparameter tuning capabilities.
Start a Hyperparameter Sweep#
Let’s define the hyperparameter space to sweep over. We will tune the n_estimators, max_depth and max_features parameters. In this example we will use random sampling to try different configuration sets of hyperparameters and maximize Accuracy.
from azure.ai.ml.sweep import Choice, Uniform
command_job_for_sweep = command_job(
    n_estimators=Choice(values=range(50, 500)),
    max_depth=Choice(values=range(5, 19)),
    max_features=Uniform(min_value=0.2, max_value=1.0),
)
# apply sweep parameter to obtain the sweep_job
sweep_job = command_job_for_sweep.sweep(
    compute=gpu_target.name,
    sampling_algorithm="random",
    primary_metric="Accuracy",
    goal="Maximize",
)
# Relax these limits to run more trials
sweep_job.set_limits(
    max_total_trials=5, max_concurrent_trials=5, timeout=18000, trial_timeout=3600
)
# Specify your experiment details
sweep_job.display_name = "RF-rapids-sweep-job"
sweep_job.description = "Run RAPIDS hyperparameter sweep job"
This will launch the RAPIDS training script with parameters that were specified in the cell above.
# submit the hpo job
returned_sweep_job = ml_client.create_or_update(sweep_job)
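To build intuition for what random sampling does over this search space, the sketch below draws configurations locally with Python's random module. It only mimics the idea (a uniform pick from each discrete range and a uniform draw from the continuous interval) and is not how Azure ML implements its sampler; sample_configuration is a hypothetical helper.

```python
import random


def sample_configuration(rng):
    """Draw one hyperparameter configuration from the sweep's search space."""
    return {
        "n_estimators": rng.choice(range(50, 500)),  # integers 50..499
        "max_depth": rng.choice(range(5, 19)),  # integers 5..18
        "max_features": rng.uniform(0.2, 1.0),  # continuous fraction
    }


# Five draws, matching max_total_trials in the sweep limits.
rng = random.Random(0)
trials = [sample_configuration(rng) for _ in range(5)]
for trial in trials:
    print(trial)
```

Each trial is drawn independently, so with only five trials large parts of the space stay unexplored; raising max_total_trials is the main lever for better coverage.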
Monitor runs#
print(f"Monitor your job at {returned_sweep_job.studio_url}")
Find and Register Best Model#
Download the best trial model output
ml_client.jobs.download(returned_sweep_job.name, output_name="model")
Delete Cluster#
ml_client.compute.begin_delete(gpu_target.name).wait()