# Running RAPIDS hyperparameter experiments at scale on Amazon SageMaker

### Import packages and create Amazon SageMaker and Boto3 sessions

In [None]:
import time

import boto3
import sagemaker

In [2]:
execution_role = sagemaker.get_execution_role()
session = sagemaker.Session()

region = boto3.Session().region_name
account = boto3.client("sts").get_caller_identity().get("Account")

In [3]:
account, region

('561241433344', 'us-east-2')

### Upload the Higgs boson dataset to an S3 bucket

In [None]:
!mkdir dataset
!wget -P dataset https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
!gunzip dataset/HIGGS.csv.gz

In [6]:
s3_data_dir = session.upload_data(path="dataset", key_prefix="dataset/higgs-dataset")

In [7]:
s3_data_dir

's3://sagemaker-us-east-2-561241433344/dataset/higgs-dataset'
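
The returned URI follows from the session's default bucket and the `key_prefix` passed to `upload_data`. The sketch below reproduces that composition in plain Python. It assumes the default bucket naming `sagemaker-<region>-<account>` seen in the output above, and `default_upload_uri` is a hypothetical helper for illustration, not part of the SageMaker SDK.

```python
def default_upload_uri(region: str, account: str, key_prefix: str) -> str:
    """Compose the S3 URI where the uploaded dataset is expected to land.

    Assumes the session uses its default bucket, named
    sagemaker-<region>-<account>; a custom bucket would change the result.
    """
    bucket = f"sagemaker-{region}-{account}"
    return f"s3://{bucket}/{key_prefix.strip('/')}"


print(default_upload_uri("us-east-2", "561241433344", "dataset/higgs-dataset"))
```

Comparing this value with `s3_data_dir` is a quick way to confirm that the data went to the bucket and prefix you intended.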

### Download latest RAPIDS container from DockerHub

To build a RAPIDS Docker container compatible with Amazon SageMaker, you’ll start with the base RAPIDS container, which the nice people at NVIDIA have already built and pushed to [DockerHub](https://hub.docker.com/r/rapidsai/base/tags).

You will need to extend this container by creating a Dockerfile, copying the training script, and installing the [SageMaker Training toolkit](https://github.com/aws/sagemaker-training-toolkit) to make RAPIDS compatible with SageMaker.

In [8]:
estimator_info = {
    "rapids_container": "rapidsai/base:24.10a-cuda12.5-py3.11",
    "ecr_image": "sagemaker-rapids-higgs:latest",
    "ecr_repository": "sagemaker-rapids-higgs",
}

In [9]:
%%time
!docker pull {estimator_info['rapids_container']}

24.06a-cuda11.8-py3.10: Pulling from rapidsai/base

8493d397: Pulling fs layer
7ee77381: Pulling fs layer
37f007fd: Pulling fs layer
774ad2ec: Pulling fs layer
22adee62: Pulling fs layer
68414a39: Pulling fs layer
3710f323: Pulling fs layer
8390d4e8: Pulling fs layer
d0879975: Pulling fs layer
1ab494af: Pulling fs layer
763525ea: Pulling fs layer
4ab83532: Pulling fs layer
bc16c2b9: Pulling fs layer
024a16c8: Pull complete

In [10]:
!cat Dockerfile

ARG RAPIDS_IMAGE

FROM $RAPIDS_IMAGE as rapids

# Installs a few more dependencies
RUN conda install --yes -n base \
        cupy \
        flask \
        protobuf \
        sagemaker

# Copies the training code inside the container
COPY rapids-higgs.py /opt/ml/code/rapids-higgs.py

# Defines rapids-higgs.py as script entry point
# ref: https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html
ENV SAGEMAKER_PROGRAM rapids-higgs.py

# override entrypoint from the base image with one that accepts
# 'train' and 'serve' (as SageMaker expects to provide)
COPY entrypoint.sh /opt/entrypoint.sh
ENTRYPOINT ["/opt/entrypoint.sh"]


In [23]:
!docker build -t {estimator_info['ecr_image']} --build-arg RAPIDS_IMAGE={estimator_info['rapids_container']} .

Sending build context to Docker daemon   7.68kB
Step 1/7 : ARG RAPIDS_IMAGE
Step 2/7 : FROM $RAPIDS_IMAGE as rapids
 ---> a80bdce0d796
Step 3/7 : RUN conda install --yes -n base         cupy         flask         protobuf         sagemaker
 ---> Running in f6522ce9b303
Channels:
 - rapidsai-nightly
 - dask/label/dev
 - pytorch
 - conda-forge
 - nvidia
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - cupy
    - flask
    - protobuf
    - sagemaker


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    blinker-1.8.2              |     pyhd8ed1ab_0          14 KB  conda-forge
    boto3-1.34.118             |     pyhd8ed1ab_0          78 KB  conda-forge
    botocore-1.34.118          |pyge310_1234567_0         6.8 MB  conda-forge
    dil

In [13]:
!docker images

REPOSITORY               TAG                      IMAGE ID       CREATED              SIZE
sagemaker-rapids-higgs   latest                   f198baf959a7   About a minute ago   12GB
rapidsai/base            24.06a-cuda11.8-py3.10   a80bdce0d796   41 hours ago         11.3GB


### Publish to Elastic Container Registry

When running a large-scale training job, either for distributed training or for independent experiments, you need to make sure that datasets and training scripts are replicated on each instance in your cluster. Thankfully, the more painful of the two, moving datasets, is taken care of by Amazon SageMaker. As for the training code, you already have a Docker container ready: you simply need to push it to a container registry, and Amazon SageMaker will then pull it onto each of the training compute instances in the cluster.

Note: SageMaker does not support using training images from a private Docker registry (e.g. DockerHub), so you need to push the SageMaker-compatible RAPIDS container to the Amazon Elastic Container Registry (Amazon ECR) to make it available to Amazon SageMaker.

In [24]:
ECR_container_fullname = (
    f"{account}.dkr.ecr.{region}.amazonaws.com/{estimator_info['ecr_image']}"
)

In [25]:
ECR_container_fullname

'561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs:latest'
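
The full name combines the account, the region, and the `repository:tag` string stored in `estimator_info['ecr_image']`. As a minimal sketch, the helpers below (hypothetical names, not part of any AWS library) rebuild the URI and check that the repository part of the image name matches the repository you are about to create in ECR, since the push fails if the two differ.

```python
from typing import Tuple


def split_image(image: str) -> Tuple[str, str]:
    """Split 'repository:tag' into its parts, defaulting the tag to 'latest'."""
    repository, _, tag = image.partition(":")
    return repository, tag or "latest"


def ecr_image_uri(account: str, region: str, image: str) -> str:
    """Build the fully qualified ECR image name for an account and region."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{image}"


# Values copied from the cells above.
info = {
    "ecr_image": "sagemaker-rapids-higgs:latest",
    "ecr_repository": "sagemaker-rapids-higgs",
}

repository, tag = split_image(info["ecr_image"])
if repository != info["ecr_repository"]:
    raise ValueError("ecr_image and ecr_repository refer to different repositories")

print(ecr_image_uri("561241433344", "us-east-2", info["ecr_image"]))
```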

In [26]:
!docker tag {estimator_info['ecr_image']} {ECR_container_fullname}

In [27]:
print(
    f"source      : {estimator_info['ecr_image']}\n"
    f"destination : {ECR_container_fullname}"
)

source      : sagemaker-rapids-higgs:latest
destination : 561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs:latest


In [None]:
!aws ecr create-repository --repository-name {estimator_info['ecr_repository']}
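# `aws ecr get-login` was removed in AWS CLI v2; pipe a password to `docker login` instead
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com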

In [28]:
!docker push {ECR_container_fullname}

The push refers to repository [561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs]

3be3c6f4: Preparing
a7112765: Preparing
5c05c772: Preparing
bdce5066: Preparing
923ec1b3: Preparing
3fcfb3d4: Preparing
bf18a086: Preparing
f3ff1008: Preparing
b6fb91b8: Preparing
7bf1eb99: Preparing
264186e1: Preparing
7d7711e0: Preparing
ee96f292: Preparing
e2a80b3f: Preparing
0a873d7a: Preparing
bcc60d01: Preparing
1dcee623: Preparing
9a46b795: Preparing
5e83c163: Preparing
5c05c772: Pushed

### Testing your Amazon SageMaker compatible RAPIDS container locally

Before you go off and spend time and money on running a large experiment on a large cluster, you should run a local Amazon SageMaker training job to ensure the container performs as expected. Make sure you have the [SageMaker SDK](https://github.com/aws/sagemaker-python-sdk#installing-the-sagemaker-python-sdk) installed on your local machine.

Define some default hyperparameters. Take your best guess; you can find the full list of RandomForest hyperparameters on the [cuML docs](https://docs.rapids.ai/api/cuml/stable/api.html#random-forest) page.

In [29]:
hyperparams = {
    "n_estimators": 15,
    "max_depth": 5,
    "n_bins": 8,
    "split_criterion": 0,  # GINI:0, ENTROPY:1
    "bootstrap": 0,  # true: sample with replacement, false: sample without replacement
    "max_leaves": -1,  # unlimited leaves
    "max_features": 0.2,
}
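
SageMaker hands hyperparameters to the training container as strings, so the training script has to cast them back to the types the model expects. The sketch below shows one way this could look. The `CASTS` table and `parse_hyperparameters` are hypothetical and are not taken from `rapids-higgs.py`, which is not shown here. Note that `bootstrap` may arrive as `"0"`/`"1"` (from the dictionary above) or as `"True"`/`"False"` (from a categorical tuning range), so the cast accepts both spellings.

```python
import json

# Expected Python type for each hyperparameter (an assumption of this sketch).
CASTS = {
    "n_estimators": int,
    "max_depth": int,
    "n_bins": int,
    "split_criterion": int,
    "bootstrap": lambda v: str(v).strip().lower() in ("1", "true"),
    "max_leaves": int,
    "max_features": float,
}


def parse_hyperparameters(raw: dict) -> dict:
    """Cast string-valued hyperparameters to the types listed in CASTS."""
    return {name: CASTS[name](value) for name, value in raw.items() if name in CASTS}


# Example payload with every value serialized as a string.
raw = json.loads(
    '{"n_estimators": "15", "max_depth": "5", "bootstrap": "0", "max_features": "0.2"}'
)
params = parse_hyperparameters(raw)
print(params)
```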

Now, specify the instance type. Setting `instance_type` to `local_gpu` runs the job on your own machine and assumes that you have a GPU locally. If you don’t have a local GPU, you can test on an Amazon SageMaker managed GPU instance instead: simply replace `local_gpu` with a `p3` or `p2` GPU instance type, as in the cell below, which uses `ml.p3.2xlarge`.

In [30]:
from sagemaker.estimator import Estimator

rapids_estimator = Estimator(
    image_uri=ECR_container_fullname,
    role=execution_role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # or "local_gpu" to run on a local GPU
    max_run=60 * 60 * 24,
    max_wait=(60 * 60 * 24) + 1,
    use_spot_instances=True,
    hyperparameters=hyperparams,
    metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)
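
The `metric_definitions` entry tells SageMaker how to scrape the objective metric from the training logs: the first capture group of the regex is read as the metric value. As a quick local check, the sketch below applies the same pattern to a sample log line. The sample line and the `extract_metric` helper are illustrative assumptions about what the training script prints.

```python
import re

# Same pattern as in metric_definitions, written as a raw string.
METRIC_REGEX = r"test_acc: ([0-9\.]+)"


def extract_metric(log_line: str):
    """Return the captured metric as a float, or None if the line has no match."""
    match = re.search(METRIC_REGEX, log_line)
    return float(match.group(1)) if match else None


print(extract_metric("test_acc: 0.7134"))
print(extract_metric("epoch finished, no metric on this line"))
```

If the script's log format changes (for example to `test_acc=0.71`), the pattern no longer matches and the tuner has no objective value to optimize, so it is worth testing the regex before launching a large search.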

In [31]:
%%time
rapids_estimator.fit(inputs=s3_data_dir)

INFO:sagemaker:Creating training-job with name: sagemaker-rapids-higgs-2024-06-05-02-14-30-371


2024-06-05 02:14:30 Starting - Starting the training job...
2024-06-05 02:14:54 Starting - Preparing the instances for training...
2024-06-05 02:15:26 Downloading - Downloading input data..................
2024-06-05 02:18:16 Downloading - Downloading the training image...
2024-06-05 02:18:47 Training - Training image download completed. Training in progress...
@ entrypoint -> launching training script

2024-06-05 02:19:27 Uploading - Uploading generated training model
test_acc: 0.7133834362030029

2024-06-05 02:19:35 Completed - Training job completed
Training seconds: 249
Billable seconds: 78
Managed Spot Training savings: 68.7%
CPU times: user 793 ms, sys: 29.8 ms, total: 823 ms
Wall time: 5min 43s
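
The reported Managed Spot Training savings are consistent with the ratio of billable seconds to training seconds in the log above. A small worked check, where `spot_savings_percent` is a hypothetical helper for illustration:

```python
def spot_savings_percent(training_seconds: int, billable_seconds: int) -> float:
    """Savings as the share of training time that was not billed, in percent."""
    return (1 - billable_seconds / training_seconds) * 100


# Values from the job output: 249 training seconds, 78 billable seconds.
print(f"{spot_savings_percent(249, 78):.1f}%")  # prints 68.7%
```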


Congrats, you successfully trained your Random Forest model on the HIGGS dataset using an Amazon SageMaker compatible RAPIDS container. Now you are ready to run experiments on a cluster to try out different hyperparameters and options in parallel.

### Define hyperparameter ranges and run a large-scale search experiment

Few code changes are required to go from local training to training at scale. First, rather than defining a fixed set of hyperparameters, you’ll define a range for each one using the SageMaker SDK:

In [32]:
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

hyperparameter_ranges = {
    "n_estimators": IntegerParameter(10, 200),
    "max_depth": IntegerParameter(1, 22),
    "n_bins": IntegerParameter(5, 24),
    "split_criterion": CategoricalParameter([0, 1]),
    "bootstrap": CategoricalParameter([True, False]),
    "max_features": ContinuousParameter(0.01, 0.5),
}
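
Before launching the tuner, it can help to draw a few configurations from the same ranges and run the training script on them locally, to catch values the script cannot handle. The sketch below mirrors the ranges above in plain Python and samples uniformly at random. This is not the Bayesian strategy SageMaker uses, and `RANGES` and `sample_config` are hypothetical names for illustration.

```python
import random

# Plain-Python mirror of hyperparameter_ranges: (kind, low, high) or (kind, choices).
RANGES = {
    "n_estimators": ("int", 10, 200),
    "max_depth": ("int", 1, 22),
    "n_bins": ("int", 5, 24),
    "split_criterion": ("cat", [0, 1]),
    "bootstrap": ("cat", [True, False]),
    "max_features": ("float", 0.01, 0.5),
}


def sample_config(rng: random.Random) -> dict:
    """Draw one configuration uniformly at random from RANGES."""
    config = {}
    for name, spec in RANGES.items():
        kind = spec[0]
        if kind == "int":
            config[name] = rng.randint(spec[1], spec[2])  # randint includes both bounds
        elif kind == "float":
            config[name] = rng.uniform(spec[1], spec[2])
        else:
            config[name] = rng.choice(spec[1])
    return config


rng = random.Random(0)  # fixed seed so the draws are repeatable
for _ in range(3):
    print(sample_config(rng))
```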

Next, you’ll change the instance type to the actual GPU instance you want to train on in the cloud. Here you’ll choose `ml.p3.8xlarge`, an Amazon SageMaker compute instance with 4 NVIDIA Tesla V100 GPUs. If you have a training script that can leverage multiple GPUs, you can choose up to 8 GPUs per instance for faster training.

In [36]:
from sagemaker.estimator import Estimator

rapids_estimator = Estimator(
    image_uri=ECR_container_fullname,
    role=execution_role,
    instance_count=2,
    instance_type="ml.p3.8xlarge",
    max_run=60 * 60 * 24,
    max_wait=(60 * 60 * 24) + 1,
    use_spot_instances=True,
    hyperparameters=hyperparams,
    metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)

Now define a `HyperparameterTuner` object using the estimator you defined above.

In [37]:
tuner = HyperparameterTuner(
    rapids_estimator,
    objective_metric_name="test_acc",
    hyperparameter_ranges=hyperparameter_ranges,
    strategy="Bayesian",
    max_jobs=2,
    max_parallel_jobs=2,
    objective_type="Maximize",
    metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)
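
Each tuning trial launches its own training job with `instance_count` instances, and up to `max_parallel_jobs` trials run at the same time, so the peak GPU footprint is the product of the three numbers. The sketch below works this out for the settings above. The GPU counts per instance type are hard-coded here for the `p3` family, and `peak_gpus` is a hypothetical helper.

```python
# GPUs per instance for the p3 family (V100-based instances).
GPUS_PER_INSTANCE = {
    "ml.p3.2xlarge": 1,
    "ml.p3.8xlarge": 4,
    "ml.p3.16xlarge": 8,
}


def peak_gpus(instance_type: str, instance_count: int, max_parallel_jobs: int) -> int:
    """GPUs in use when all parallel trials are running at once."""
    return GPUS_PER_INSTANCE[instance_type] * instance_count * max_parallel_jobs


# Settings from the estimator and tuner above: 4 GPUs x 2 instances x 2 parallel jobs.
print(peak_gpus("ml.p3.8xlarge", 2, 2))  # prints 16
```

This number is useful to compare against your account's instance quota before raising `max_jobs` and `max_parallel_jobs` for a larger search.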

In [None]:
job_name = "rapidsHPO" + time.strftime("%Y-%m-%d-%H-%M-%S-%j", time.gmtime())
tuner.fit({"dataset": s3_data_dir}, job_name=job_name)
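
Tuning job names are length-limited, and the prefix plus timestamp used above comes out at exactly 32 characters, which leaves no room for a longer prefix. The sketch below validates a name before it is submitted. The 32-character limit and the name pattern reflect the SageMaker API documentation as I understand it and should be treated as assumptions to verify; `make_job_name` is a hypothetical helper.

```python
import re
import time

MAX_NAME_LENGTH = 32  # assumed limit for tuning job names; verify against the API docs
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")  # assumed allowed pattern


def make_job_name(prefix: str, now=None) -> str:
    """Build '<prefix><timestamp>' and fail early if the name would be rejected."""
    moment = now if now is not None else time.gmtime()
    name = prefix + time.strftime("%Y-%m-%d-%H-%M-%S-%j", moment)
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"{name!r} has {len(name)} characters; limit is {MAX_NAME_LENGTH}")
    if not NAME_PATTERN.match(name):
        raise ValueError(f"{name!r} contains characters outside the allowed pattern")
    return name


job_name = make_job_name("rapidsHPO")
print(job_name, len(job_name))
```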

## Clean up

- Delete S3 buckets and files you don't need
- Kill training jobs that you don't want running
- Delete container images and the repository you just created

In [None]:
!aws ecr delete-repository --force --repository-name {estimator_info['ecr_repository']}