Running RAPIDS Hyperparameter Experiments at Scale on Amazon SageMaker#

Import packages and create Amazon SageMaker and Boto3 sessions#

import time

import boto3
import sagemaker

execution_role = sagemaker.get_execution_role()
session = sagemaker.Session()

region = boto3.Session().region_name
account = boto3.client("sts").get_caller_identity().get("Account")

account, region

('561241433344', 'us-east-2')

Upload the higgs-boson dataset to s3 bucket#

!mkdir -p ./dataset
!if [ ! -f "dataset/HIGGS.csv" ]; then wget -P dataset https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz; fi
!if [ ! -f "dataset/HIGGS.csv" ]; then gunzip dataset/HIGGS.csv.gz; fi

s3_data_dir = session.upload_data(path="dataset", key_prefix="dataset/higgs-dataset")

s3_data_dir

's3://sagemaker-us-east-2-561241433344/dataset/higgs-dataset'

Download latest RAPIDS container from DockerHub#

To build a RAPIDS Docker container compatible with Amazon SageMaker, you’ll start with the base RAPIDS container, which the nice people at NVIDIA have already built and pushed to DockerHub.

You will need to extend this container by creating a Dockerfile, copying in the training script, and installing the SageMaker Training toolkit to make RAPIDS compatible with SageMaker.

estimator_info = {
    "rapids_container": "nvcr.io/nvidia/rapidsai/base:24.10-cuda12.5-py3.12",
    "ecr_image": "sagemaker-rapids-higgs:latest",
    "ecr_repository": "sagemaker-rapids-higgs",
}

%%time
!docker pull {estimator_info['rapids_container']}

24.06a-cuda11.8-py3.10: Pulling from rapidsai/base

8493d397: Pulling fs layer 
7ee77381: Pulling fs layer 
37f007fd: Pulling fs layer 
774ad2ec: Pulling fs layer 
22adee62: Pulling fs layer 
68414a39: Pulling fs layer 
3710f323: Pulling fs layer 
8390d4e8: Pulling fs layer 
d0879975: Pulling fs layer 
1ab494af: Pulling fs layer 
763525ea: Pulling fs layer 
774ad2ec: Waiting
4ab83532: Pulling fs layer
22adee62: Waiting
bc16c2b9: Pulling fs layer
3710f323: Waiting
024a16c8: Pull complete
Digest: sha256:e1995b699520fbe87a0196e3c24b6fecdd7e45797702e7dca49b4f44da1b23dd
Status: Downloaded newer image for rapidsai/base:24.06a-cuda11.8-py3.10
docker.io/rapidsai/base:24.06a-cuda11.8-py3.10
CPU times: user 3.68 s, sys: 990 ms, total: 4.67 s
Wall time: 2min 29s

!cat Dockerfile

ARG RAPIDS_IMAGE

FROM $RAPIDS_IMAGE as rapids

# Installs a few more dependencies
RUN conda install --yes -n base \
        cupy \
        flask \
        protobuf \
        sagemaker

# Copies the training code inside the container
COPY rapids-higgs.py /opt/ml/code/rapids-higgs.py

# Defines rapids-higgs.py as script entry point
# ref: https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html
ENV SAGEMAKER_PROGRAM rapids-higgs.py

# override entrypoint from the base image with one that accepts
# 'train' and 'serve' (as SageMaker expects to provide)
COPY entrypoint.sh /opt/entrypoint.sh
ENTRYPOINT ["/opt/entrypoint.sh"]
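
The entrypoint.sh referenced in the Dockerfile is not shown here. As a rough illustration of the behaviour its comment describes, the Python sketch below dispatches on the first container argument: SageMaker starts training containers with the argument train and hosting containers with serve. The function name build_command and the choice of python to launch the script are assumptions made for illustration, not the contents of the actual entrypoint.

```python
import os

CODE_DIR = "/opt/ml/code"  # where the Dockerfile copies the training script


def build_command(args, environ=None):
    """Return the command a train/serve-aware entrypoint might run (hypothetical)."""
    environ = os.environ if environ is None else environ
    mode = args[0] if args else ""
    if mode == "train":
        # SAGEMAKER_PROGRAM is set in the Dockerfile to the script's file name.
        program = environ.get("SAGEMAKER_PROGRAM", "rapids-higgs.py")
        return ["python", f"{CODE_DIR}/{program}"]
    if mode == "serve":
        # This walkthrough only trains, so serving is left unimplemented here.
        raise NotImplementedError("'serve' is not handled in this sketch")
    # Any other arguments are passed through unchanged.
    return list(args)


print(build_command(["train"], {"SAGEMAKER_PROGRAM": "rapids-higgs.py"}))
```

Keeping the dispatch in one small function makes it easy to check locally what the container would run before building the image.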

!docker build -t {estimator_info['ecr_image']} --build-arg RAPIDS_IMAGE={estimator_info['rapids_container']} .

Sending build context to Docker daemon   7.68kB
Step 1/7 : ARG RAPIDS_IMAGE
Step 2/7 : FROM $RAPIDS_IMAGE as rapids
 ---> a80bdce0d796
Step 3/7 : RUN conda install --yes -n base         cupy         flask         protobuf         sagemaker
 ---> Running in f6522ce9b303
Channels:
 - rapidsai-nightly
 - dask/label/dev
 - pytorch
 - conda-forge
 - nvidia
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - cupy
    - flask
    - protobuf
    - sagemaker


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    blinker-1.8.2              |     pyhd8ed1ab_0          14 KB  conda-forge
    boto3-1.34.118             |     pyhd8ed1ab_0          78 KB  conda-forge
    botocore-1.34.118          |pyge310_1234567_0         6.8 MB  conda-forge
    dill-0.3.8                 |     pyhd8ed1ab_0          86 KB  conda-forge
    flask-3.0.3                |     pyhd8ed1ab_0          79 KB  conda-forge
    google-pasta-0.2.0         |     pyh8c360ce_0          42 KB  conda-forge
    itsdangerous-2.2.0         |     pyhd8ed1ab_0          19 KB  conda-forge
    jmespath-1.0.1             |     pyhd8ed1ab_0          21 KB  conda-forge
    multiprocess-0.70.16       |  py310h2372a71_0         238 KB  conda-forge
    openssl-3.3.1              |       h4ab18f5_0         2.8 MB  conda-forge
    pathos-0.3.2               |     pyhd8ed1ab_1          52 KB  conda-forge
    pox-0.3.4                  |     pyhd8ed1ab_0          26 KB  conda-forge
    ppft-1.7.6.8               |     pyhd8ed1ab_0          33 KB  conda-forge
    protobuf-4.25.3            |  py310ha8c1f0e_0         325 KB  conda-forge
    protobuf3-to-dict-0.1.5    |  py310hff52083_8          14 KB  conda-forge
    s3transfer-0.10.1          |     pyhd8ed1ab_0          61 KB  conda-forge
    sagemaker-2.75.1           |     pyhd8ed1ab_0         377 KB  conda-forge
    smdebug-rulesconfig-1.0.1  |     pyhd3deb0d_1          20 KB  conda-forge
    werkzeug-3.0.3             |     pyhd8ed1ab_0         237 KB  conda-forge
    ------------------------------------------------------------
                                           Total:        11.2 MB

The following NEW packages will be INSTALLED:

  blinker            conda-forge/noarch::blinker-1.8.2-pyhd8ed1ab_0 
  boto3              conda-forge/noarch::boto3-1.34.118-pyhd8ed1ab_0 
  botocore           conda-forge/noarch::botocore-1.34.118-pyge310_1234567_0 
  dill               conda-forge/noarch::dill-0.3.8-pyhd8ed1ab_0 
  flask              conda-forge/noarch::flask-3.0.3-pyhd8ed1ab_0 
  google-pasta       conda-forge/noarch::google-pasta-0.2.0-pyh8c360ce_0 
  itsdangerous       conda-forge/noarch::itsdangerous-2.2.0-pyhd8ed1ab_0 
  jmespath           conda-forge/noarch::jmespath-1.0.1-pyhd8ed1ab_0 
  multiprocess       conda-forge/linux-64::multiprocess-0.70.16-py310h2372a71_0 
  pathos             conda-forge/noarch::pathos-0.3.2-pyhd8ed1ab_1 
  pox                conda-forge/noarch::pox-0.3.4-pyhd8ed1ab_0 
  ppft               conda-forge/noarch::ppft-1.7.6.8-pyhd8ed1ab_0 
  protobuf           conda-forge/linux-64::protobuf-4.25.3-py310ha8c1f0e_0 
  protobuf3-to-dict  conda-forge/linux-64::protobuf3-to-dict-0.1.5-py310hff52083_8 
  s3transfer         conda-forge/noarch::s3transfer-0.10.1-pyhd8ed1ab_0 
  sagemaker          conda-forge/noarch::sagemaker-2.75.1-pyhd8ed1ab_0 
  smdebug-rulesconf~ conda-forge/noarch::smdebug-rulesconfig-1.0.1-pyhd3deb0d_1 
  werkzeug           conda-forge/noarch::werkzeug-3.0.3-pyhd8ed1ab_0 

The following packages will be UPDATED:

  openssl                                  3.3.0-h4ab18f5_3 --> 3.3.1-h4ab18f5_0 



Downloading and Extracting Packages: ...working... done
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Removing intermediate container f6522ce9b303
 ---> 883c682b36bc
Step 4/7 : COPY rapids-higgs.py /opt/ml/code/rapids-higgs.py
 ---> 2f6b3e0bec44
Step 5/7 : ENV SAGEMAKER_PROGRAM rapids-higgs.py
 ---> Running in df524941c02e
Removing intermediate container df524941c02e
 ---> 4cf437176c8c
Step 6/7 : COPY entrypoint.sh /opt/entrypoint.sh
 ---> 32d95ff5bd74
Step 7/7 : ENTRYPOINT ["/opt/entrypoint.sh"]
 ---> Running in c396fa9e98ad
Removing intermediate container c396fa9e98ad
 ---> 39f900bfeba0
Successfully built 39f900bfeba0
Successfully tagged sagemaker-rapids-higgs:latest

!docker images

REPOSITORY               TAG                      IMAGE ID       CREATED              SIZE
sagemaker-rapids-higgs   latest                   f198baf959a7   About a minute ago   12GB
rapidsai/base            24.06a-cuda11.8-py3.10   a80bdce0d796   41 hours ago         11.3GB

Publish to Elastic Container Registry#

When running a large-scale training job, either for distributed training or for independent experiments, you will need to make sure that datasets and training scripts are all replicated at each instance in your cluster. Thankfully, the more painful of the two, moving datasets, is taken care of by Amazon SageMaker. As for the training code, you already have a Docker container ready; you simply need to push it to a container registry, and Amazon SageMaker will then pull it into each of the training compute instances in the cluster.

Note: SageMaker does not support using training images from a private Docker registry (e.g., DockerHub), so you need to push the SageMaker-compatible RAPIDS container to Amazon Elastic Container Registry (Amazon ECR) to make it available to Amazon SageMaker.

ECR_container_fullname = (
    f"{account}.dkr.ecr.{region}.amazonaws.com/{estimator_info['ecr_image']}"
)

ECR_container_fullname

'561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs:latest'

!docker tag {estimator_info['ecr_image']} {ECR_container_fullname}

print(
    f"source      : {estimator_info['ecr_image']}\n"
    f"destination : {ECR_container_fullname}"
)

source      : sagemaker-rapids-higgs:latest
destination : 561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs:latest

!aws ecr create-repository --repository-name {estimator_info['ecr_repository']}
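# `aws ecr get-login` was removed in AWS CLI v2; `get-login-password` works in both v1 and v2
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com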

!docker push {ECR_container_fullname}

The push refers to repository [561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs]

3be3c6f4: Preparing 
a7112765: Preparing 
5c05c772: Preparing 
bdce5066: Preparing 
923ec1b3: Preparing 
3fcfb3d4: Preparing 
bf18a086: Preparing 
f3ff1008: Preparing 
b6fb91b8: Preparing 
7bf1eb99: Preparing 
264186e1: Preparing 
7d7711e0: Preparing 
ee96f292: Preparing 
e2a80b3f: Preparing 
0a873d7a: Preparing 
bcc60d01: Preparing 
1dcee623: Preparing 
9a46b795: Preparing 
5e83c163: Preparing 
5c05c772: Pushed
latest: digest: sha256:c8172a0ad30cd39b091f5fc3f3cde922ceabb103d0a0ec90beb1a5c4c9c6c97c size: 4504

Testing your Amazon SageMaker compatible RAPIDS container locally#

Before you go off and spend time and money on running a large experiment on a large cluster, you should run a local Amazon SageMaker training job to ensure the container performs as expected. Make sure you have the SageMaker SDK installed on your local machine.

Define some default hyperparameters. Take your best guess; you can find the full list of RandomForest hyperparameters on the cuML docs page.

hyperparams = {
    "n_estimators": 15,
    "max_depth": 5,
    "n_bins": 8,
    "split_criterion": 0,  # GINI:0, ENTROPY:1
    "bootstrap": 0,  # true: sample with replacement, false: sample without replacement
    "max_leaves": -1,  # unlimited leaves
    "max_features": 0.2,
}
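
SageMaker hands hyperparameters to the training container as strings, in the file /opt/ml/input/config/hyperparameters.json. The training script itself is not shown in this walkthrough, so the following is only a sketch of how such a script might cast them back to usable types. The names PARAM_TYPES, parse_hyperparameters, and load_hyperparameters are hypothetical.

```python
import json

# Hypothetical type map for the hyperparameters defined above.
PARAM_TYPES = {
    "n_estimators": int,
    "max_depth": int,
    "n_bins": int,
    "split_criterion": int,
    "bootstrap": lambda v: str(v).lower() in ("1", "true"),
    "max_leaves": int,
    "max_features": float,
}


def parse_hyperparameters(raw):
    """Cast string-valued hyperparameters to the types the model expects."""
    return {
        name: PARAM_TYPES[name](value)
        for name, value in raw.items()
        if name in PARAM_TYPES  # ignore keys this sketch does not know about
    }


def load_hyperparameters(path="/opt/ml/input/config/hyperparameters.json"):
    """Read and cast the hyperparameters file that SageMaker writes in the container."""
    with open(path) as f:
        return parse_hyperparameters(json.load(f))


raw = {"n_estimators": "15", "max_depth": "5", "bootstrap": "0", "max_features": "0.2"}
print(parse_hyperparameters(raw))
```

Casting in one place keeps the rest of the script independent of how the values arrive, which matters because a tuner may send "True" or "False" for a flag that was set to 0 or 1 by hand.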

Now, specify the instance type as local_gpu. This assumes that you have a GPU locally. If you don’t have a local GPU, you can test this on an Amazon SageMaker managed GPU instance: simply replace local_gpu with a p3 or p2 GPU instance by updating the instance_type variable. The estimator below uses a managed ml.p3.2xlarge instance, with local_gpu left as a comment.

from sagemaker.estimator import Estimator

rapids_estimator = Estimator(
    image_uri=ECR_container_fullname,
    role=execution_role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",  #'local_gpu'
    max_run=60 * 60 * 24,
    max_wait=(60 * 60 * 24) + 1,
    use_spot_instances=True,
    hyperparameters=hyperparams,
    metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)
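
The metric_definitions entry tells SageMaker which regular expression to apply to the training logs; the first capture group becomes the metric value. The sketch below applies the same pattern to a log line with Python's re module to show what gets captured. The helper name extract_metric is hypothetical.

```python
import re

# Same pattern as in metric_definitions, written as a raw string.
METRIC_REGEX = r"test_acc: ([0-9\.]+)"


def extract_metric(log_line, pattern=METRIC_REGEX):
    """Return the captured metric as a float, or None if the line has no match."""
    match = re.search(pattern, log_line)
    return float(match.group(1)) if match else None


print(extract_metric("test_acc: 0.7133834362030029"))
print(extract_metric("Training in progress..."))
```

If the training script changes the text it prints, the pattern has to change with it; otherwise no metric is recorded for the job.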

%%time
rapids_estimator.fit(inputs=s3_data_dir)

INFO:sagemaker:Creating training-job with name: sagemaker-rapids-higgs-2024-06-05-02-14-30-371

2024-06-05 02:14:30 Starting - Starting the training job...
2024-06-05 02:14:54 Starting - Preparing the instances for training...
2024-06-05 02:15:26 Downloading - Downloading input data..................
2024-06-05 02:18:16 Downloading - Downloading the training image...
2024-06-05 02:18:47 Training - Training image download completed. Training in progress...@ entrypoint -> launching training script 

2024-06-05 02:19:27 Uploading - Uploading generated training modeltest_acc: 0.7133834362030029

2024-06-05 02:19:35 Completed - Training job completed
Training seconds: 249
Billable seconds: 78
Managed Spot Training savings: 68.7%
CPU times: user 793 ms, sys: 29.8 ms, total: 823 ms
Wall time: 5min 43s
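
The spot savings figure in the log above is consistent with comparing billable seconds to total training seconds. A small worked check, assuming that is how the percentage is derived (the function name spot_savings_percent is hypothetical):

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage of training time that was not billed, rounded to one decimal."""
    return round(100.0 * (1 - billable_seconds / training_seconds), 1)


# Values taken from the training log above: 249 training seconds, 78 billable seconds.
print(spot_savings_percent(249, 78))  # 68.7
```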

Congrats, you successfully trained your Random Forest model on the HIGGS dataset using an Amazon SageMaker compatible RAPIDS container. Now you are ready to run experiments on a cluster to try out different hyperparameters and options in parallel.

Define hyperparameter ranges and run a large-scale search experiment#

Few code changes are required to go from local training to training at scale. First, rather than define a fixed set of hyperparameters, you’ll define ranges using the SageMaker SDK:

from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

hyperparameter_ranges = {
    "n_estimators": IntegerParameter(10, 200),
    "max_depth": IntegerParameter(1, 22),
    "n_bins": IntegerParameter(5, 24),
    "split_criterion": CategoricalParameter([0, 1]),
    "bootstrap": CategoricalParameter([True, False]),
    "max_features": ContinuousParameter(0.01, 0.5),
}
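
To get a feel for the search space these ranges describe, the sketch below draws a random candidate from the same bounds in plain Python. It does not use the SageMaker SDK and does not reproduce the tuner's Bayesian strategy; SEARCH_SPACE and sample_candidate are hypothetical names, and the integer bounds are assumed to be inclusive.

```python
import random

# Plain-Python mirror of hyperparameter_ranges (illustration only).
SEARCH_SPACE = {
    "n_estimators": ("int", 10, 200),
    "max_depth": ("int", 1, 22),
    "n_bins": ("int", 5, 24),
    "split_criterion": ("cat", [0, 1]),
    "bootstrap": ("cat", [True, False]),
    "max_features": ("float", 0.01, 0.5),
}


def sample_candidate(space, rng):
    """Draw one hyperparameter configuration uniformly at random."""
    candidate = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "int":
            candidate[name] = rng.randint(spec[1], spec[2])  # both ends inclusive
        elif kind == "float":
            candidate[name] = rng.uniform(spec[1], spec[2])
        else:
            candidate[name] = rng.choice(spec[1])
    return candidate


rng = random.Random(0)  # seeded so repeated runs draw the same candidate
print(sample_candidate(SEARCH_SPACE, rng))
```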

Next, you’ll change the instance type to the actual GPU instance you want to train on in the cloud. Here you’ll choose ml.p3.8xlarge, an Amazon SageMaker compute instance with 4 NVIDIA Tesla V100 GPUs. If you have a training script that can leverage multiple GPUs, you can choose up to 8 GPUs per instance for faster training.

from sagemaker.estimator import Estimator

rapids_estimator = Estimator(
    image_uri=ECR_container_fullname,
    role=execution_role,
    instance_count=2,
    instance_type="ml.p3.8xlarge",
    max_run=60 * 60 * 24,
    max_wait=(60 * 60 * 24) + 1,
    use_spot_instances=True,
    hyperparameters=hyperparams,
    metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)

Now you define a HyperparameterTuner object using the estimator you defined above.

tuner = HyperparameterTuner(
    rapids_estimator,
    objective_metric_name="test_acc",
    hyperparameter_ranges=hyperparameter_ranges,
    strategy="Bayesian",
    max_jobs=2,
    max_parallel_jobs=2,
    objective_type="Maximize",
    metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)

job_name = "rapidsHPO" + time.strftime("%Y-%m-%d-%H-%M-%S-%j", time.gmtime())
tuner.fit({"dataset": s3_data_dir}, job_name=job_name)
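
Once the tuning job has been submitted, you will usually want to wait for it and look up the winning configuration. A minimal sketch, assuming the tuner object defined above; it relies on the HyperparameterTuner methods wait() and best_training_job(), and the wrapper name report_best_job is hypothetical.

```python
def report_best_job(tuner):
    """Block until the tuning job finishes, then return the best training job's name."""
    tuner.wait()  # returns once the hyperparameter tuning job has completed
    best = tuner.best_training_job()  # name of the job with the best objective value
    print(f"Best training job: {best}")
    return best


# Usage in the notebook, after tuner.fit(...):
# best_job_name = report_best_job(tuner)
```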

Clean up#

Delete S3 buckets and files you don’t need
Kill training jobs that you don’t want running
Delete container images and the repository you just created

!aws ecr delete-repository --force --repository-name {estimator_info['ecr_repository']}
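
For the second clean-up item, the sketch below stops training jobs that are still running. It assumes a boto3 SageMaker client and uses its list_training_jobs paginator and stop_training_job call; the function name stop_in_progress_jobs and the name filter are hypothetical, so check the filter against your own job names before running it.

```python
def stop_in_progress_jobs(sm_client, name_contains):
    """Stop every in-progress training job whose name contains the given text."""
    stopped = []
    paginator = sm_client.get_paginator("list_training_jobs")
    pages = paginator.paginate(StatusEquals="InProgress", NameContains=name_contains)
    for page in pages:
        for summary in page["TrainingJobSummaries"]:
            name = summary["TrainingJobName"]
            sm_client.stop_training_job(TrainingJobName=name)
            stopped.append(name)
    return stopped


# Usage in the notebook (requires AWS credentials):
# import boto3
# stop_in_progress_jobs(boto3.client("sagemaker"), "rapidsHPO")
```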