Train and Hyperparameter-Tune with RAPIDS#

Choosing an optimal set of hyperparameters is a daunting task, especially for algorithms like XGBoost that have many hyperparameters to tune.

In this notebook, we will show how to speed up hyperparameter optimization by running multiple training jobs in parallel on Azure Machine Learning (AzureML) service.

Prerequisites#


Create an Azure ML Workspace, then follow the instructions in Microsoft Azure Machine Learning to launch an Azure ML Compute instance with RAPIDS.

Once your instance is running and you have access to Jupyter, save this notebook and run through the cells.


# verify Azure ML SDK version

%pip show azure-ai-ml

Name: azure-ai-ml
Version: 1.8.0
Summary: Microsoft Azure Machine Learning Client Library for Python
Home-page: https://github.com/Azure/azure-sdk-for-python
Author: Microsoft Corporation
Author-email: azuresdkengsysadmins@microsoft.com
License: MIT License
Location: /anaconda/envs/rapids/lib/python3.10/site-packages
Requires: azure-common, azure-core, azure-mgmt-core, azure-storage-blob, azure-storage-file-datalake, azure-storage-file-share, colorama, isodate, jsonschema, marshmallow, msrest, opencensus-ext-azure, pydash, pyjwt, pyyaml, strictyaml, tqdm, typing-extensions
Required-by: 
Note: you may need to restart the kernel to use updated packages.

Initialize Workspace#

Initialize the MLClient class to handle the workspace you created in the prerequisites step.

You can manually provide the workspace details or call MLClient.from_config(credential, path) to create a workspace object from the details stored in config.json.

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Get a handle to the workspace
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="fc4f4a6b-4041-4b1c-8249-854d68edcf62",
    resource_group_name="rapidsai-deployment",
    workspace_name="rapids-aml-cluster",
)

print(
    "Workspace name: " + ml_client.workspace_name,
    "Subscription id: " + ml_client.subscription_id,
    "Resource group: " + ml_client.resource_group_name,
    sep="\n",
)

Workspace name: rapids-aml-cluster
Subscription id: fc4f4a6b-4041-4b1c-8249-854d68edcf62
Resource group: rapidsai-deployment
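
As a rough sketch of the config.json alternative mentioned above, the snippet below writes a file with the three keys that the Azure ML workspace config file normally carries, then reads it back. The values are placeholders, and the scratch directory is only an assumption for illustration; in practice the file would usually sit next to the notebook.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values; substitute your own workspace details.
workspace_details = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Write config.json into a scratch directory (illustration only).
config_dir = Path(tempfile.mkdtemp())
config_path = config_dir / "config.json"
config_path.write_text(json.dumps(workspace_details, indent=4))

# Read it back to confirm the three keys are present.
loaded = json.loads(config_path.read_text())
print(sorted(loaded))

# With the file in place, the client could then be created with:
#   ml_client = MLClient.from_config(
#       credential=DefaultAzureCredential(), path=str(config_path)
#   )
```

Keeping the workspace details in config.json avoids hard-coding the subscription ID in the notebook itself.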

Access Data from Datastore URI#

In this example, we will use 20 million rows of the airline dataset. The datastore URI below references a data storage location (path) containing the parquet files.

datastore_name = "workspaceartifactstore"
dataset = "airline_20000000.parquet"

# Datastore uri format:
data_uri = f"azureml://subscriptions/{ml_client.subscription_id}/resourcegroups/{ml_client.resource_group_name}/workspaces/{ml_client.workspace_name}/datastores/{datastore_name}/paths/{dataset}"

print("data uri:", "\n", data_uri)

data uri: 
 azureml://subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/resourcegroups/rapidsai-deployment/workspaces/rapids-aml-cluster/datastores/workspaceartifactstore/paths/airline_20000000.parquet
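
The URI format shown above can also be assembled and taken apart with two small helpers. This is a minimal sketch: build_datastore_uri and parse_datastore_uri are hypothetical names introduced here, not part of the Azure ML SDK.

```python
def build_datastore_uri(subscription_id, resource_group, workspace, datastore, path):
    """Assemble an azureml:// datastore URI from its parts."""
    return (
        f"azureml://subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore}"
        f"/paths/{path}"
    )


def parse_datastore_uri(uri):
    """Split a datastore URI back into a dict of its named parts."""
    prefix = "azureml://"
    if not uri.startswith(prefix):
        raise ValueError("not an azureml:// URI")
    head, _, path = uri[len(prefix):].partition("/paths/")
    segments = head.split("/")
    # Segments alternate key, value: subscriptions/<id>/resourcegroups/<rg>/...
    parts = dict(zip(segments[0::2], segments[1::2]))
    parts["paths"] = path
    return parts


uri = build_datastore_uri(
    "fc4f4a6b-4041-4b1c-8249-854d68edcf62",
    "rapidsai-deployment",
    "rapids-aml-cluster",
    "workspaceartifactstore",
    "airline_20000000.parquet",
)
parts = parse_datastore_uri(uri)
print(parts["datastores"], parts["paths"])
```

Parsing the URI back into its parts is a cheap way to catch a misspelled datastore or path before a job is submitted.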

Create AML Compute#

You will need to create an Azure ML managed compute target (AmlCompute) to serve as the environment for training your model.

The cell below caps the cluster at 5 nodes for hyperparameter optimization; you can modify max_instances based on the available quota in the desired region. As with other Azure ML services, AmlCompute is subject to limits. The Azure documentation on quotas describes the default limits and how to request more quota.

size describes the virtual machine type and size that will be used in the cluster. See “System Requirements” in the RAPIDS docs and “GPU optimized virtual machine sizes” in the Azure docs to identify an instance type.

Let’s create an AmlCompute cluster of Standard_NC12s_v3 (Tesla V100) GPU VMs:

from azure.ai.ml.entities import AmlCompute
from azure.ai.ml.exceptions import MlException
from azure.core.exceptions import ResourceNotFoundError

# specify aml compute name.
gpu_compute_target = "rapids-cluster"

try:
    # let's see if the compute target already exists
    gpu_target = ml_client.compute.get(gpu_compute_target)
    print(f"found compute target. Will use {gpu_compute_target}")
except (MlException, ResourceNotFoundError):
    # a missing compute target may surface as either exception type
    print("Creating a new gpu compute target...")

    gpu_target = AmlCompute(
        name=gpu_compute_target,
        type="amlcompute",
        size="STANDARD_NC12S_V3",
        max_instances=5,
        idle_time_before_scale_down=300,
    )
    ml_client.compute.begin_create_or_update(gpu_target).result()

    print(
        f"AMLCompute with name {gpu_target.name} is created, the compute size is {gpu_target.size}"
    )

found compute target. Will use rapids-cluster

Prepare training script#

Make sure the current directory contains the code to run on the remote resource. This includes the training script and all of its dependency files. In this example, the training script is provided:

train_rapids.py: entry script for the RAPIDS environment; it loads the dataset into a cuDF dataframe, then trains a Random Forest and runs inference using cuML.

We will log some parameters and metrics, including the highest accuracy, using mlflow within the training script:

import mlflow

mlflow.log_metric('Accuracy', float(global_best_test_accuracy))

These run metrics will become particularly important when we begin hyperparameter tuning our model in the ‘Tune model hyperparameters’ section.
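
To make the logging pattern concrete, here is a minimal sketch of tracking the best accuracy across cross-validation folds and logging it once under the key Accuracy. The helper track_best_accuracy, the per-fold metric names, and the accuracy values are all hypothetical; the real logic lives in train_rapids.py. The log_metric argument stands in for mlflow.log_metric so the sketch runs without MLflow installed.

```python
def track_best_accuracy(fold_accuracies, log_metric):
    """Log each fold's accuracy, then the best one under the key 'Accuracy'."""
    global_best_test_accuracy = 0.0
    for fold, accuracy in enumerate(fold_accuracies):
        # Per-fold metric names are an assumption made for this sketch.
        log_metric(f"fold_{fold}_accuracy", float(accuracy))
        global_best_test_accuracy = max(global_best_test_accuracy, float(accuracy))
    # The sweep's primary metric must match this key exactly.
    log_metric("Accuracy", global_best_test_accuracy)
    return global_best_test_accuracy


# Placeholder accuracies; in the training script these come from each CV fold.
logged = {}
best = track_best_accuracy([0.81, 0.84, 0.83], log_metric=logged.__setitem__)
print(best, logged["Accuracy"])
```

In the training script, passing mlflow.log_metric as log_metric would send the same values to the Azure ML run.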

rapids_script = "./train_rapids.py"
azure_script = "./rapids_csp_azure.py"

Train Model on Remote Compute#

Create Experiment#

Track all the runs in your workspace

experiment_name = "test_rapids_aml_cluster"

Setup Environment#

We’ll use a custom RAPIDS Docker image to set up the environment (https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?tabs=python#create-an-environment-from-a-docker-image). The base image is available in the rapidsai/rapidsai repo on DockerHub.

Make sure the path to the Docker build context is correct; the cell below uses os.getcwd().

# RUN THIS CODE ONCE TO SETUP ENVIRONMENT
import os

from azure.ai.ml.entities import BuildContext, Environment

env_docker_image = Environment(
    build=BuildContext(path=os.getcwd()),
    name="rapids-mlflow",
    description="RAPIDS environment with azureml-mlflow",
)

ml_client.environments.create_or_update(env_docker_image)

Uploading code (0.33 MBs): 100%|██████████| 325450/325450 [00:00<00:00, 2363322.62it/s]

Environment({'intellectual_property': None, 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'rapids-mlflow', 'description': 'RAPIDS environment with azureml-mlflow', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/resourceGroups/rapidsai-deployment/providers/Microsoft.MachineLearningServices/workspaces/rapids-aml-cluster/environments/rapids-mlflow/versions/10', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/skirui1/code', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f9ce47101f0>, 'serialize': <msrest.serialization.Serializer object at 0x7f9ce4710d30>, 'version': '10', 'latest_version': None, 'conda_file': None, 'image': None, 'build': <azure.ai.ml.entities._assets.environment.BuildContext object at 0x7f9ce4713580>, 'inference_config': None, 'os_type': 'Linux', 'arm_type': 'environment_version', 'conda_file_path': None, 'path': None, 'datastore': None, 'upload_hash': None, 'translated_conda_file': None})

Submit the Training Job#

We will configure and run a training job using the command function. A command can be used to run standalone jobs or as a function inside pipelines. inputs is a dictionary of command-line arguments to pass to the training script.

from azure.ai.ml import Input, command

command_job = command(
    environment="rapids-mlflow:1",
    experiment_name=experiment_name,
    code=os.getcwd(),
    inputs={
        "data_dir": Input(type="uri_file", path=data_uri),
        "n_bins": 32,
        "compute": "single-GPU",  # multi-GPU for algorithms via Dask
        "cv_folds": 5,
        "n_estimators": 100,
        "max_depth": 6,
        "max_features": 0.3,
    },
    command=(
        "python train_rapids.py --data_dir ${{inputs.data_dir}} --n_bins ${{inputs.n_bins}} "
        "--compute ${{inputs.compute}} --cv_folds ${{inputs.cv_folds}} --n_estimators ${{inputs.n_estimators}} "
        "--max_depth ${{inputs.max_depth}}  --max_features ${{inputs.max_features}}"
    ),
    compute="rapids-cluster",
)


# submit the command
returned_job = ml_client.jobs.create_or_update(command_job)

# get a URL for the status of the job
returned_job.studio_url

Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Uploading code (0.33 MBs): 100%|██████████| 327210/327210 [00:00<00:00, 1802654.05it/s]

'https://ml.azure.com/runs/zen_eye_lm7dcp68jz?wsid=/subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/resourcegroups/rapidsai-deployment/workspaces/rapids-aml-cluster&tid=43083d15-7273-40c1-b7db-39efd9ccc17a'
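
On the receiving side, the training script has to parse the flags that the command string passes in. The parser below is a hypothetical sketch of that step, with flag names and defaults taken from the inputs dictionary above; the actual train_rapids.py may define its arguments differently.

```python
import argparse


def build_parser():
    """Argument parser mirroring the flags in the command string (sketch)."""
    parser = argparse.ArgumentParser(description="RAPIDS Random Forest training (sketch)")
    parser.add_argument("--data_dir", type=str, required=True, help="path to the parquet data")
    parser.add_argument("--n_bins", type=int, default=32)
    parser.add_argument("--compute", type=str, default="single-GPU")
    parser.add_argument("--cv_folds", type=int, default=5)
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=6)
    parser.add_argument("--max_features", type=float, default=0.3)
    return parser


# Simulate the command line that a single trial might receive.
args = build_parser().parse_args(
    ["--data_dir", "/tmp/airline_20000000.parquet", "--n_estimators", "250", "--max_features", "0.5"]
)
print(args.n_estimators, args.max_depth, args.max_features)
```

Typed arguments matter because every value arrives as a string on the command line, and the model expects integers and floats.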

Tune Model Hyperparameters#

We can optimize our model’s hyperparameters and improve the accuracy using Azure Machine Learning’s hyperparameter tuning capabilities.

Start a Hyperparameter Sweep#

Let’s define the hyperparameter space to sweep over. We will tune the n_estimators, max_depth and max_features parameters. In this example, we will use random sampling to try different hyperparameter configurations and maximize Accuracy.

from azure.ai.ml.sweep import Choice, Uniform

command_job_for_sweep = command_job(
    n_estimators=Choice(values=range(50, 500)),
    max_depth=Choice(values=range(5, 19)),
    max_features=Uniform(min_value=0.2, max_value=1.0),
)

# apply sweep parameter to obtain the sweep_job
sweep_job = command_job_for_sweep.sweep(
    compute="rapids-cluster",
    sampling_algorithm="random",
    primary_metric="Accuracy",
    goal="Maximize",
)


# Define the limits for this sweep
sweep_job.set_limits(
    max_total_trials=10, max_concurrent_trials=2, timeout=18000, trial_timeout=3600
)


# Specify your experiment details
sweep_job.display_name = "RF-rapids-sweep-job"
sweep_job.description = "Run RAPIDS hyperparameter sweep job"

This will launch the RAPIDS training script with parameters that were specified in the cell above.

# submit the hpo job
returned_sweep_job = ml_client.create_or_update(sweep_job)
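
Conceptually, random sampling draws each trial's configuration independently from the search space. The sketch below imitates that with the standard library only. It illustrates the idea and is not how Azure ML implements its sampler; search_space and sample_configuration are names invented for this example.

```python
import random

# Same ranges as the sweep definition; tuples tag each entry's sampling kind.
search_space = {
    "n_estimators": ("choice", range(50, 500)),
    "max_depth": ("choice", range(5, 19)),
    "max_features": ("uniform", (0.2, 1.0)),
}


def sample_configuration(space, rng):
    """Draw one hyperparameter configuration at random from the space."""
    config = {}
    for name, (kind, spec) in space.items():
        if kind == "choice":
            config[name] = rng.choice(spec)
        else:
            low, high = spec
            config[name] = rng.uniform(low, high)
    return config


rng = random.Random(0)  # fixed seed so the draws are repeatable
trials = [sample_configuration(search_space, rng) for _ in range(10)]
for trial in trials[:3]:
    print(trial)
```

Because Python ranges exclude their upper bound, n_estimators is drawn from 50 to 499 and max_depth from 5 to 18.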

Monitor Sweep Job Runs#

aml_url = returned_sweep_job.studio_url

print("Monitor your job at", aml_url)

Monitor your job at https://ml.azure.com/runs/eager_turtle_r7fs2xzcty?wsid=/subscriptions/fc4f4a6b-4041-4b1c-8249-854d68edcf62/resourcegroups/rapidsai-deployment/workspaces/rapids-aml-cluster&tid=43083d15-7273-40c1-b7db-39efd9ccc17a
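
While the sweep runs, it can help to know the longest it could take. The arithmetic below is a back-of-the-envelope sketch using the limits set earlier. It assumes both timeouts are in seconds and that trials run in back-to-back waves of max_concurrent_trials, ignoring node startup and scheduling overhead.

```python
import math

max_total_trials = 10
max_concurrent_trials = 2
trial_timeout = 3600  # seconds allowed per trial
timeout = 18000  # seconds allowed for the whole sweep

# Number of sequential waves needed to run every trial.
waves = math.ceil(max_total_trials / max_concurrent_trials)

# Upper bound if every trial ran right up to its own timeout.
worst_case_seconds = waves * trial_timeout

print(waves, worst_case_seconds, worst_case_seconds <= timeout)
```

Under these assumptions the overall timeout equals five waves at the full per-trial timeout. Raising max_concurrent_trials, up to the cluster's max_instances, would shorten the wall-clock time.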

Find and Register Best Model#

Download the best trial model output

ml_client.jobs.download(returned_sweep_job.name, output_name="model")

Delete Cluster#

ml_client.compute.begin_delete(gpu_compute_target).wait()