Train and Hyperparameter-Tune with RAPIDS#
Choosing an optimal set of hyperparameters is a daunting task, especially for algorithms like XGBoost that have many hyperparameters to tune.
In this notebook, we will show how to speed up hyperparameter optimization by running multiple training jobs in parallel on Azure Machine Learning (AzureML) service.
Prerequisites#
See Documentation
Create an Azure ML Workspace then follow instructions in Microsoft Azure Machine Learning to launch an Azure ML Compute instance with RAPIDS.
Once your instance is running and you have access to Jupyter, save this notebook and run through the cells.
# verify Azure ML SDK version
%pip show azure-ai-ml
Name: azure-ai-ml
Version: 1.8.0
Summary: Microsoft Azure Machine Learning Client Library for Python
Home-page: https://github.com/Azure/azure-sdk-for-python
Author: Microsoft Corporation
Author-email: azuresdkengsysadmins@microsoft.com
License: MIT License
Location: /anaconda/envs/rapids/lib/python3.10/site-packages
Requires: azure-common, azure-core, azure-mgmt-core, azure-storage-blob, azure-storage-file-datalake, azure-storage-file-share, colorama, isodate, jsonschema, marshmallow, msrest, opencensus-ext-azure, pydash, pyjwt, pyyaml, strictyaml, tqdm, typing-extensions
Required-by:
Note: you may need to restart the kernel to use updated packages.
Initialize Workspace#
Initialize the MLClient class to handle the workspace you created in the prerequisites step.
You can manually provide the workspace details or call MLClient.from_config(credential, path) to create a workspace object from the details stored in config.json.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
# Get a handle to the workspace
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="fc4f4a6b-4041-4b1c-8249-854d68edcf62",
    resource_group_name="rapidsai-deployment",
    workspace_name="rapids-aml-cluster",
)
print(
    "Workspace name: " + ml_client.workspace_name,
    "Subscription id: " + ml_client.subscription_id,
    "Resource group: " + ml_client.resource_group_name,
    sep="\n",
)
Workspace name: rapids-aml-cluster
Subscription id: fc4f4a6b-4041-4b1c-8249-854d68edcf62
Resource group: rapidsai-deployment
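As an alternative to passing the workspace details inline, they can be stored in a config.json file and picked up by MLClient.from_config. The sketch below only writes such a file using the standard library; the placeholder subscription ID and the helper name write_workspace_config are assumptions for illustration, and the file is written to a temporary directory so nothing in your project is touched.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values (assumptions): substitute your own workspace details.
workspace_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "rapidsai-deployment",
    "workspace_name": "rapids-aml-cluster",
}


def write_workspace_config(directory, config):
    """Write config.json into `directory` and return the path to the file."""
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=2))
    return path


# Use a temporary directory so this sketch has no side effects.
config_dir = tempfile.mkdtemp()
config_path = write_workspace_config(config_dir, workspace_config)
print(config_path.read_text())

# With the Azure SDK installed and credentials available, the client could
# then be created from the file instead of from inline arguments:
# ml_client = MLClient.from_config(credential=DefaultAzureCredential(), path=str(config_path))
```

Keeping the details in config.json avoids hard-coding the subscription ID in the notebook itself.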
Access Data from Datastore URI#
In this example, we will use 20 million rows of the airline dataset. The datastore URI below references a data storage location (path) containing the parquet files.
datastore_name = "workspaceartifactstore"
dataset = "airline_20000000.parquet"
# Datastore uri format:
data_uri = f"azureml://subscriptions/{ml_client.subscription_id}/resourcegroups/{ml_client.resource_group_name}/workspaces/{ml_client.workspace_name}/datastores/{datastore_name}/paths/{dataset}"
print("data uri:", "\n", data_uri)
data uri:
azureml://subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/resourcegroups/rapidsai-deployment/workspaces/rapids-aml-cluster/datastores/workspaceartifactstore/paths/airline_20000000.parquet
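The URI format above can be wrapped in a small helper so that each component is explicit. This is a minimal sketch: the function name build_datastore_uri and the placeholder subscription ID are assumptions for illustration, and the helper only does string formatting without contacting Azure.

```python
def build_datastore_uri(subscription_id, resource_group, workspace, datastore, path):
    """Assemble an azureml:// datastore URI from its components."""
    return (
        f"azureml://subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore}"
        # Drop a leading slash so the result never contains "paths//".
        f"/paths/{path.lstrip('/')}"
    )


uri = build_datastore_uri(
    subscription_id="<subscription-id>",  # placeholder, not a real ID
    resource_group="rapidsai-deployment",
    workspace="rapids-aml-cluster",
    datastore="workspaceartifactstore",
    path="airline_20000000.parquet",
)
print(uri)
```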
Create AML Compute#
You will need to create an Azure ML managed compute target (AmlCompute) to serve as the environment for training your model.
The cluster below is capped at 5 nodes (max_instances=5) for hyperparameter optimization; you can modify max_instances based on the available quota in the desired region. Similar to other Azure ML services, there are limits on AmlCompute; this article includes details on the default limits and how to request more quota.
size describes the virtual machine type and size that will be used in the cluster. See “System Requirements” in the RAPIDS docs (link) and “GPU optimized virtual machine sizes” in the Azure docs (link) to identify an instance type.
Let’s create an AmlCompute cluster of Standard_NC12s_v3 (Tesla V100) GPU VMs:
from azure.ai.ml.entities import AmlCompute
from azure.ai.ml.exceptions import MlException
# specify aml compute name.
gpu_compute_target = "rapids-cluster"
try:
    # let's see if the compute target already exists
    gpu_target = ml_client.compute.get(gpu_compute_target)
    print(f"found compute target. Will use {gpu_compute_target}")
except MlException:
    print("Creating a new gpu compute target...")
    gpu_target = AmlCompute(
        name="rapids-cluster",
        type="amlcompute",
        size="STANDARD_NC12S_V3",
        max_instances=5,
        idle_time_before_scale_down=300,
    )
    ml_client.compute.begin_create_or_update(gpu_target).result()
    print(
        f"AMLCompute with name {gpu_target.name} is created, the compute size is {gpu_target.size}"
    )
found compute target. Will use rapids-cluster
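When choosing max_instances, it can help to translate the node count into the vCPU quota it consumes. The sketch below assumes 12 vCPUs per Standard_NC12s_v3 node and that quota is counted in vCPUs for the VM family; the helper names and the example quota of 100 are hypothetical, so check the actual figures for your subscription and region.

```python
# Assumption: a Standard_NC12s_v3 node counts as 12 vCPUs against the family quota.
VCPUS_PER_NODE = 12


def required_vcpus(max_instances, vcpus_per_node=VCPUS_PER_NODE):
    """vCPU quota consumed if the cluster scales out to max_instances nodes."""
    return max_instances * vcpus_per_node


def max_nodes_for_quota(quota_vcpus, vcpus_per_node=VCPUS_PER_NODE):
    """Largest max_instances value that fits within a given vCPU quota."""
    return quota_vcpus // vcpus_per_node


print("vCPUs needed for 5 nodes:", required_vcpus(5))
print("Nodes that fit in a hypothetical quota of 100 vCPUs:", max_nodes_for_quota(100))
```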
Prepare training script#
Make sure the current directory contains the code to run on the remote resource. This includes the training script and all its dependency files. In this example, the training script is provided:
train_rapids.py - entry script for the RAPIDS environment; it loads the dataset into a cuDF DataFrame, trains a Random Forest model, and runs inference using cuML.
We will log some parameters and metrics, including the highest accuracy, using MLflow within the training script:
import mlflow
mlflow.log_metric('Accuracy', float(global_best_test_accuracy))
These run metrics will become particularly important when we begin hyperparameter tuning our model in the ‘Tune model hyperparameters’ section.
rapids_script = "./train_rapids.py"
azure_script = "./rapids_csp_azure.py"
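The training script receives its settings as command-line flags; the same flag names appear in the job command later in this notebook. The following is a minimal sketch of how an entry script might parse them with argparse. The parser, its defaults, and the choices list are assumptions for illustration and are not taken from the provided train_rapids.py.

```python
import argparse


def build_parser():
    """Hypothetical argument parser mirroring the flags passed to the training script."""
    parser = argparse.ArgumentParser(description="RAPIDS training entry point (sketch)")
    parser.add_argument("--data_dir", type=str, required=True, help="location of the parquet data")
    parser.add_argument("--n_bins", type=int, default=32)
    parser.add_argument(
        "--compute",
        type=str,
        default="single-GPU",
        choices=["single-GPU", "multi-GPU"],  # assumed set of valid values
    )
    parser.add_argument("--cv_folds", type=int, default=5)
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=6)
    parser.add_argument("--max_features", type=float, default=0.3)
    return parser


# Parse an explicit argument list so the sketch runs outside a job as well.
args = build_parser().parse_args(
    ["--data_dir", "airline_20000000.parquet", "--max_depth", "8"]
)
print(args)
```

Typed arguments matter during a sweep: values arrive as strings on the command line, and the parser converts them to int or float before they reach the model.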
Train Model on Remote Compute#
Create Experiment#
Track all the runs in your workspace
experiment_name = "test_rapids_aml_cluster"
Setup Environment#
We’ll be using a custom RAPIDS docker image to [set up the environment](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python#create-an-environment-from-a-docker-image). This is available in the rapidsai/rapidsai repo on DockerHub.
Make sure the path passed as the docker build context is correct; this example uses os.getcwd().
# RUN THIS CODE ONCE TO SETUP ENVIRONMENT
import os
from azure.ai.ml.entities import BuildContext, Environment
env_docker_image = Environment(
    build=BuildContext(path=os.getcwd()),
    name="rapids-mlflow",
    description="RAPIDS environment with azureml-mlflow",
)
ml_client.environments.create_or_update(env_docker_image)
Uploading code (0.33 MBs): 100%|██████████| 325450/325450 [00:00<00:00, 2363322.62it/s]
Environment({'intellectual_property': None, 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'rapids-mlflow', 'description': 'RAPIDS environment with azureml-mlflow', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/resourceGroups/rapidsai-deployment/providers/Microsoft.MachineLearningServices/workspaces/rapids-aml-cluster/environments/rapids-mlflow/versions/10', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/skirui1/code', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f9ce47101f0>, 'serialize': <msrest.serialization.Serializer object at 0x7f9ce4710d30>, 'version': '10', 'latest_version': None, 'conda_file': None, 'image': None, 'build': <azure.ai.ml.entities._assets.environment.BuildContext object at 0x7f9ce4713580>, 'inference_config': None, 'os_type': 'Linux', 'arm_type': 'environment_version', 'conda_file_path': None, 'path': None, 'datastore': None, 'upload_hash': None, 'translated_conda_file': None})
Submit the Training Job#
We will configure and run a training job using the command function, which returns a Command object that can be run as a standalone job or used as a function inside pipelines.
inputs is a dictionary of command-line arguments to pass to the training script.
from azure.ai.ml import Input, command
command_job = command(
    environment="rapids-mlflow:1",
    experiment_name=experiment_name,
    code=os.getcwd(),
    inputs={
        "data_dir": Input(type="uri_file", path=data_uri),
        "n_bins": 32,
        "compute": "single-GPU",  # multi-GPU for algorithms via Dask
        "cv_folds": 5,
        "n_estimators": 100,
        "max_depth": 6,
        "max_features": 0.3,
    },
    command=(
        "python train_rapids.py --data_dir ${{inputs.data_dir}} --n_bins ${{inputs.n_bins}} "
        "--compute ${{inputs.compute}} --cv_folds ${{inputs.cv_folds}} --n_estimators ${{inputs.n_estimators}} "
        "--max_depth ${{inputs.max_depth}} --max_features ${{inputs.max_features}}"
    ),
    compute="rapids-cluster",
)
# submit the command
returned_job = ml_client.jobs.create_or_update(command_job)
# get a URL for the status of the job
returned_job.studio_url
Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Uploading code (0.33 MBs): 100%|██████████| 327210/327210 [00:00<00:00, 1802654.05it/s]
'https://ml.azure.com/runs/zen_eye_lm7dcp68jz?wsid=/subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/resourcegroups/rapidsai-deployment/workspaces/rapids-aml-cluster&tid=43083d15-7273-40c1-b7db-39efd9ccc17a'
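The command string in the job above follows a regular pattern: each input name becomes a flag whose value is the placeholder ${{inputs.<name>}}, which Azure ML fills in when the job runs. A small helper can generate the string from the input names so that the two cannot drift apart. This is a sketch only; build_command is a hypothetical helper, not part of the Azure ML SDK.

```python
def build_command(script, input_names):
    """Build a job command string with one ${{inputs.<name>}} placeholder per input."""
    flags = " ".join(
        "--" + name + " ${{inputs." + name + "}}" for name in input_names
    )
    return "python " + script + " " + flags


# Same input names, in the same order, as the job defined above.
input_names = [
    "data_dir",
    "n_bins",
    "compute",
    "cv_folds",
    "n_estimators",
    "max_depth",
    "max_features",
]
command_string = build_command("train_rapids.py", input_names)
print(command_string)
```

Passing list(inputs) for a dictionary of inputs would keep the flags in the dictionary's insertion order.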
Tune Model Hyperparameters#
We can optimize our model’s hyperparameters and improve the accuracy using Azure Machine Learning’s hyperparameter tuning capabilities.
Start a Hyperparameter Sweep#
Let’s define the hyperparameter space to sweep over. We will tune the n_estimators, max_depth and max_features parameters. In this example we will use random sampling to try different configuration sets of hyperparameters and maximize Accuracy.
from azure.ai.ml.sweep import Choice, Uniform
command_job_for_sweep = command_job(
    n_estimators=Choice(values=range(50, 500)),
    max_depth=Choice(values=range(5, 19)),
    max_features=Uniform(min_value=0.2, max_value=1.0),
)
# apply sweep parameter to obtain the sweep_job
sweep_job = command_job_for_sweep.sweep(
    compute="rapids-cluster",
    sampling_algorithm="random",
    primary_metric="Accuracy",
    goal="Maximize",
)
# Define the limits for this sweep
sweep_job.set_limits(
    max_total_trials=10, max_concurrent_trials=2, timeout=18000, trial_timeout=3600
)
# Specify your experiment details
sweep_job.display_name = "RF-rapids-sweep-job"
sweep_job.description = "Run RAPIDS hyperparameter sweep job"
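To build intuition for what the sweep does, the sketch below imitates random sampling over the same search space with the standard library and works out a rough upper bound on run time from the limits set above. It is an approximation for illustration, not Azure ML's sampler: the helper sample_trial is hypothetical, and the time estimate ignores queueing and node start-up.

```python
import math
import random


def sample_trial(rng):
    """Draw one hyperparameter configuration from the search space."""
    return {
        "n_estimators": rng.choice(range(50, 500)),  # integers 50..499
        "max_depth": rng.choice(range(5, 19)),       # integers 5..18
        "max_features": rng.uniform(0.2, 1.0),       # continuous value
    }


max_total_trials = 10
max_concurrent_trials = 2
trial_timeout = 3600  # seconds

rng = random.Random(0)  # seeded so repeated runs draw the same configurations
trials = [sample_trial(rng) for _ in range(max_total_trials)]
for trial in trials:
    print(trial)

# If every trial ran for its full trial_timeout, the trials would finish in
# ceil(10 / 2) = 5 back-to-back rounds of two.
rounds = math.ceil(max_total_trials / max_concurrent_trials)
worst_case_seconds = rounds * trial_timeout
print("Worst-case trial time (s):", worst_case_seconds)
```

Under these assumptions the worst case matches the overall timeout of 18000 seconds passed to set_limits. Note that range(50, 500) excludes 500 and range(5, 19) excludes 19.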
This will launch the RAPIDS training script with parameters that were specified in the cell above.
# submit the hpo job
returned_sweep_job = ml_client.create_or_update(sweep_job)
Monitor SweepJob Runs#
aml_url = returned_sweep_job.studio_url
print("Monitor your job at", aml_url)
Monitor your job at https://ml.azure.com/runs/eager_turtle_r7fs2xzcty?wsid=/subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/resourcegroups/rapidsai-deployment/workspaces/rapids-aml-cluster&tid=43083d15-7273-40c1-b7db-39efd9ccc17a
Find and Register Best Model#
Download the best trial model output
ml_client.jobs.download(returned_sweep_job.name, output_name="model")
Delete Cluster#
ml_client.compute.begin_delete(gpu_compute_target).wait()