Running RAPIDS Hyperparameter Experiments at Scale on Amazon SageMaker#
Import packages and create Amazon SageMaker and Boto3 sessions#
import time
import boto3
import sagemaker
execution_role = sagemaker.get_execution_role()
session = sagemaker.Session()
region = boto3.Session().region_name
account = boto3.client("sts").get_caller_identity().get("Account")
account, region
('561241433344', 'us-east-2')
Upload the Higgs-boson dataset to an S3 bucket#
!mkdir -p ./dataset
!if [ ! -f "dataset/HIGGS.csv" ]; then wget -P dataset https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz; fi
!if [ ! -f "dataset/HIGGS.csv" ]; then gunzip dataset/HIGGS.csv.gz; fi
s3_data_dir = session.upload_data(path="dataset", key_prefix="dataset/higgs-dataset")
s3_data_dir
's3://sagemaker-us-east-2-561241433344/dataset/higgs-dataset'
Download the latest RAPIDS container from DockerHub#
To build a RAPIDS Docker container compatible with Amazon SageMaker, you’ll start from the base RAPIDS container, which the nice people at NVIDIA have already built and pushed to DockerHub.
You will need to extend this container by creating a Dockerfile, copying the training script into it, and installing the SageMaker Training Toolkit to make RAPIDS compatible with SageMaker.
estimator_info = {
    "rapids_container": "nvcr.io/nvidia/rapidsai/base:24.10-cuda12.5-py3.12",
    "ecr_image": "sagemaker-rapids-higgs:latest",
    "ecr_repository": "sagemaker-rapids-higgs",
}
%%time
!docker pull {estimator_info['rapids_container']}
24.06a-cuda11.8-py3.10: Pulling from rapidsai/base
Digest: sha256:e1995b699520fbe87a0196e3c24b6fecdd7e45797702e7dca49b4f44da1b23dd
Status: Downloaded newer image for rapidsai/base:24.06a-cuda11.8-py3.10
docker.io/rapidsai/base:24.06a-cuda11.8-py3.10
CPU times: user 3.68 s, sys: 990 ms, total: 4.67 s
Wall time: 2min 29s
!cat Dockerfile
ARG RAPIDS_IMAGE
FROM $RAPIDS_IMAGE as rapids
# Installs a few more dependencies
RUN conda install --yes -n base \
        cupy \
        flask \
        protobuf \
        sagemaker
# Copies the training code inside the container
COPY rapids-higgs.py /opt/ml/code/rapids-higgs.py
# Defines rapids-higgs.py as script entry point
# ref: https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html
ENV SAGEMAKER_PROGRAM rapids-higgs.py
# override entrypoint from the base image with one that accepts
# 'train' and 'serve' (as SageMaker expects to provide)
COPY entrypoint.sh /opt/entrypoint.sh
ENTRYPOINT ["/opt/entrypoint.sh"]
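The Dockerfile copies rapids-higgs.py into the container as the training entry point. For orientation, here is a minimal sketch of what such a script can look like; it is an illustration under assumptions (the argument names, the "training" input channel, and the use of accuracy_score are not taken from the actual script in the example repository): it reads hyperparameters from the command line, loads the HIGGS CSV with cuDF, trains a cuML RandomForestClassifier, and prints the test_acc line that the metric regex defined later will scrape.
# Hypothetical sketch of rapids-higgs.py -- the script shipped with the example may differ
import argparse
import os

import cudf
from cuml.ensemble import RandomForestClassifier
from cuml.metrics import accuracy_score
from cuml.model_selection import train_test_split

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # SageMaker passes hyperparameters as command-line arguments;
    # only a subset is handled here, unknown arguments are ignored below
    parser.add_argument("--n_estimators", type=int, default=15)
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--n_bins", type=int, default=8)
    parser.add_argument("--max_features", type=float, default=0.2)
    # input channel directory; the "training" channel is assumed here
    parser.add_argument(
        "--data_dir",
        default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
    )
    args, _ = parser.parse_known_args()

    # HIGGS.csv has the label in the first column followed by 28 features
    col_names = ["label"] + [f"feature_{i}" for i in range(28)]
    data = cudf.read_csv(os.path.join(args.data_dir, "HIGGS.csv"), names=col_names)

    X = data.drop(columns=["label"]).astype("float32")
    y = data["label"].astype("int32")
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=77)

    model = RandomForestClassifier(
        n_estimators=args.n_estimators,
        max_depth=args.max_depth,
        n_bins=args.n_bins,
        max_features=args.max_features,
    )
    model.fit(X_train, y_train)

    # this line is what the metric_definitions regex "test_acc: ([0-9\.]+)" picks up
    print(f"test_acc: {accuracy_score(y_test, model.predict(X_test))}")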
!docker build -t {estimator_info['ecr_image']} --build-arg RAPIDS_IMAGE={estimator_info['rapids_container']} .
Sending build context to Docker daemon 7.68kB
Step 1/7 : ARG RAPIDS_IMAGE
Step 2/7 : FROM $RAPIDS_IMAGE as rapids
---> a80bdce0d796
Step 3/7 : RUN conda install --yes -n base cupy flask protobuf sagemaker
---> Running in f6522ce9b303
Channels:
- rapidsai-nightly
- dask/label/dev
- pytorch
- conda-forge
- nvidia
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
## Package Plan ##
environment location: /opt/conda
added / updated specs:
- cupy
- flask
- protobuf
- sagemaker
The following packages will be downloaded:
package | build
---------------------------|-----------------
blinker-1.8.2 | pyhd8ed1ab_0 14 KB conda-forge
boto3-1.34.118 | pyhd8ed1ab_0 78 KB conda-forge
botocore-1.34.118 |pyge310_1234567_0 6.8 MB conda-forge
dill-0.3.8 | pyhd8ed1ab_0 86 KB conda-forge
flask-3.0.3 | pyhd8ed1ab_0 79 KB conda-forge
google-pasta-0.2.0 | pyh8c360ce_0 42 KB conda-forge
itsdangerous-2.2.0 | pyhd8ed1ab_0 19 KB conda-forge
jmespath-1.0.1 | pyhd8ed1ab_0 21 KB conda-forge
multiprocess-0.70.16 | py310h2372a71_0 238 KB conda-forge
openssl-3.3.1 | h4ab18f5_0 2.8 MB conda-forge
pathos-0.3.2 | pyhd8ed1ab_1 52 KB conda-forge
pox-0.3.4 | pyhd8ed1ab_0 26 KB conda-forge
ppft-1.7.6.8 | pyhd8ed1ab_0 33 KB conda-forge
protobuf-4.25.3 | py310ha8c1f0e_0 325 KB conda-forge
protobuf3-to-dict-0.1.5 | py310hff52083_8 14 KB conda-forge
s3transfer-0.10.1 | pyhd8ed1ab_0 61 KB conda-forge
sagemaker-2.75.1 | pyhd8ed1ab_0 377 KB conda-forge
smdebug-rulesconfig-1.0.1 | pyhd3deb0d_1 20 KB conda-forge
werkzeug-3.0.3 | pyhd8ed1ab_0 237 KB conda-forge
------------------------------------------------------------
Total: 11.2 MB
The following NEW packages will be INSTALLED:
blinker conda-forge/noarch::blinker-1.8.2-pyhd8ed1ab_0
boto3 conda-forge/noarch::boto3-1.34.118-pyhd8ed1ab_0
botocore conda-forge/noarch::botocore-1.34.118-pyge310_1234567_0
dill conda-forge/noarch::dill-0.3.8-pyhd8ed1ab_0
flask conda-forge/noarch::flask-3.0.3-pyhd8ed1ab_0
google-pasta conda-forge/noarch::google-pasta-0.2.0-pyh8c360ce_0
itsdangerous conda-forge/noarch::itsdangerous-2.2.0-pyhd8ed1ab_0
jmespath conda-forge/noarch::jmespath-1.0.1-pyhd8ed1ab_0
multiprocess conda-forge/linux-64::multiprocess-0.70.16-py310h2372a71_0
pathos conda-forge/noarch::pathos-0.3.2-pyhd8ed1ab_1
pox conda-forge/noarch::pox-0.3.4-pyhd8ed1ab_0
ppft conda-forge/noarch::ppft-1.7.6.8-pyhd8ed1ab_0
protobuf conda-forge/linux-64::protobuf-4.25.3-py310ha8c1f0e_0
protobuf3-to-dict conda-forge/linux-64::protobuf3-to-dict-0.1.5-py310hff52083_8
s3transfer conda-forge/noarch::s3transfer-0.10.1-pyhd8ed1ab_0
sagemaker conda-forge/noarch::sagemaker-2.75.1-pyhd8ed1ab_0
smdebug-rulesconf~ conda-forge/noarch::smdebug-rulesconfig-1.0.1-pyhd3deb0d_1
werkzeug conda-forge/noarch::werkzeug-3.0.3-pyhd8ed1ab_0
The following packages will be UPDATED:
openssl 3.3.0-h4ab18f5_3 --> 3.3.1-h4ab18f5_0
Downloading and Extracting Packages: ...working... done
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Removing intermediate container f6522ce9b303
---> 883c682b36bc
Step 4/7 : COPY rapids-higgs.py /opt/ml/code/rapids-higgs.py
---> 2f6b3e0bec44
Step 5/7 : ENV SAGEMAKER_PROGRAM rapids-higgs.py
---> Running in df524941c02e
Removing intermediate container df524941c02e
---> 4cf437176c8c
Step 6/7 : COPY entrypoint.sh /opt/entrypoint.sh
---> 32d95ff5bd74
Step 7/7 : ENTRYPOINT ["/opt/entrypoint.sh"]
---> Running in c396fa9e98ad
Removing intermediate container c396fa9e98ad
---> 39f900bfeba0
Successfully built 39f900bfeba0
Successfully tagged sagemaker-rapids-higgs:latest
!docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
sagemaker-rapids-higgs latest f198baf959a7 About a minute ago 12GB
rapidsai/base 24.06a-cuda11.8-py3.10 a80bdce0d796 41 hours ago 11.3GB
Publish to Elastic Container Registry#
When running a large-scale training job either for distributed training or for independent experiments, you will need to make sure that datasets and training scripts are all replicated at each instance in your cluster. Thankfully, the more painful of the two — moving datasets — is taken care of by Amazon SageMaker. As for the training code, you already have a Docker container ready, you simply need to push it to a container registry, and Amazon SageMaker will then pull it into each of the training compute instances in the cluster.
Note: SageMaker does not support training images hosted in a private Docker registry (e.g., DockerHub), so you need to push the SageMaker-compatible RAPIDS container to Amazon Elastic Container Registry (Amazon ECR) to make it available to Amazon SageMaker.
ECR_container_fullname = (
f"{account}.dkr.ecr.{region}.amazonaws.com/{estimator_info['ecr_image']}"
)
ECR_container_fullname
'561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs:latest'
!docker tag {estimator_info['ecr_image']} {ECR_container_fullname}
print(
f"source : {estimator_info['ecr_image']}\n"
f"destination : {ECR_container_fullname}"
)
source : sagemaker-rapids-higgs:latest
destination : 561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs:latest
!aws ecr create-repository --repository-name {estimator_info['ecr_repository']}
!$(aws ecr get-login --no-include-email --region {region})
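Note that the get-login subcommand only exists in AWS CLI v1; if your environment has AWS CLI v2, the equivalent login step is:
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com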
!docker push {ECR_container_fullname}
The push refers to repository [561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs]
latest: digest: sha256:c8172a0ad30cd39b091f5fc3f3cde922ceabb103d0a0ec90beb1a5c4c9c6c97c size: 4504
Testing your Amazon SageMaker compatible RAPIDS container locally#
Before you go off and spend time and money on running a large experiment on a large cluster, you should run a local Amazon SageMaker training job to ensure the container performs as expected. Make sure you have the SageMaker SDK installed on your local machine.
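If it isn’t, something along these lines installs it (the local extra pulls in the additional dependencies used by SageMaker local mode):
!pip install --upgrade "sagemaker[local]"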
Define some default hyperparameters. Take your best guess; you can find the full list of RandomForest hyperparameters on the cuML docs page.
hyperparams = {
"n_estimators": 15,
"max_depth": 5,
"n_bins": 8,
"split_criterion": 0, # GINI:0, ENTROPY:1
"bootstrap": 0, # true: sample with replacement, false: sample without replacement
"max_leaves": -1, # unlimited leaves
"max_features": 0.2,
}
Now, specify the instance type as local_gpu. This assumes that you have a GPU locally. If you don’t have a local GPU, you can test this on an Amazon SageMaker managed GPU instance: simply replace local_gpu with a p3 or p2 GPU instance by updating the instance_type variable.
from sagemaker.estimator import Estimator
rapids_estimator = Estimator(
image_uri=ECR_container_fullname,
role=execution_role,
instance_count=1,
instance_type="ml.p3.2xlarge", #'local_gpu'
max_run=60 * 60 * 24,
max_wait=(60 * 60 * 24) + 1,
use_spot_instances=True,
hyperparameters=hyperparams,
metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)
%%time
rapids_estimator.fit(inputs=s3_data_dir)
INFO:sagemaker:Creating training-job with name: sagemaker-rapids-higgs-2024-06-05-02-14-30-371
2024-06-05 02:14:30 Starting - Starting the training job...
2024-06-05 02:14:54 Starting - Preparing the instances for training...
2024-06-05 02:15:26 Downloading - Downloading input data..................
2024-06-05 02:18:16 Downloading - Downloading the training image...
2024-06-05 02:18:47 Training - Training image download completed. Training in progress...
@ entrypoint -> launching training script
2024-06-05 02:19:27 Uploading - Uploading generated training model
test_acc: 0.7133834362030029
2024-06-05 02:19:35 Completed - Training job completed
Training seconds: 249
Billable seconds: 78
Managed Spot Training savings: 68.7%
CPU times: user 793 ms, sys: 29.8 ms, total: 823 ms
Wall time: 5min 43s
Congrats, you successfully trained your Random Forest model on the HIGGS dataset using an Amazon SageMaker compatible RAPIDS container. Now you are ready to run experiments on a cluster to try out different hyperparameters and options in parallel.
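SageMaker uploads the resulting model artifact to S3; if you want to locate it from the notebook, the estimator exposes its S3 path:
rapids_estimator.model_data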
Define hyperparameter ranges and run a large-scale search experiment#
There’s not a whole lot of code changes required to go from local training to training at scale. First, rather than define a fixed set of hyperparameters, you’ll define a range using the SageMaker SDK:
from sagemaker.tuner import (
CategoricalParameter,
ContinuousParameter,
HyperparameterTuner,
IntegerParameter,
)
hyperparameter_ranges = {
"n_estimators": IntegerParameter(10, 200),
"max_depth": IntegerParameter(1, 22),
"n_bins": IntegerParameter(5, 24),
"split_criterion": CategoricalParameter([0, 1]),
"bootstrap": CategoricalParameter([True, False]),
"max_features": ContinuousParameter(0.01, 0.5),
}
Next, you’ll change the instance type to the actual GPU instance you want to train on in the cloud. Here you’ll choose ml.p3.8xlarge, an Amazon SageMaker compute instance with four NVIDIA Tesla V100 GPUs. If you have a training script that can leverage multiple GPUs, you can choose an instance with up to 8 GPUs for faster training.
from sagemaker.estimator import Estimator
rapids_estimator = Estimator(
image_uri=ECR_container_fullname,
role=execution_role,
instance_count=2,
instance_type="ml.p3.8xlarge",
max_run=60 * 60 * 24,
max_wait=(60 * 60 * 24) + 1,
use_spot_instances=True,
hyperparameters=hyperparams,
metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)
Now you define a HyperparameterTuner object using the estimator you defined above.
tuner = HyperparameterTuner(
rapids_estimator,
objective_metric_name="test_acc",
hyperparameter_ranges=hyperparameter_ranges,
strategy="Bayesian",
max_jobs=2,
max_parallel_jobs=2,
objective_type="Maximize",
metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)
job_name = "rapidsHPO" + time.strftime("%Y-%m-%d-%H-%M-%S-%j", time.gmtime())
tuner.fit({"dataset": s3_data_dir}, job_name=job_name)
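Once the tuning job finishes, you can pull its results back into the notebook. For example, a quick look at the best job and the per-job metrics (FinalObjectiveValue is the column the SageMaker analytics API uses for the objective metric):
# block until the tuning job completes
tuner.wait()
# name of the best-performing training job
print(tuner.best_training_job())
# one row per training job: hyperparameters plus the final objective value
results = tuner.analytics().dataframe()
results.sort_values("FinalObjectiveValue", ascending=False).head()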
Clean up#
Delete S3 buckets and files you don’t need
Kill training jobs that you don’t want running
Delete container images and the repository you just created
!aws ecr delete-repository --force --repository-name {estimator_info['ecr_repository']}
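For the remaining clean-up items, a sketch along these lines works; adjust it to whatever you actually want to keep (the S3 prefix below is the one uploaded earlier in this notebook):
# remove the uploaded dataset from S3
!aws s3 rm --recursive {s3_data_dir}
# stop the tuning job (and its training jobs) if it is still running
tuner.stop_tuning_job()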