# Running RAPIDS hyperparameter experiments at scale on Amazon SageMaker

### Import packages and create Amazon SageMaker and Boto3 sessions

In [1]:
import sagemaker
import time
import boto3

In [2]:
execution_role = sagemaker.get_execution_role()
session = sagemaker.Session()

region = boto3.Session().region_name
account = boto3.client("sts").get_caller_identity().get("Account")

In [3]:
account, region

('561241433344', 'us-west-2')

### Upload the HIGGS dataset to an S3 bucket

In [None]:
!mkdir dataset
!wget -P dataset https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
!gunzip dataset/HIGGS.csv.gz

In [4]:
s3_data_dir = session.upload_data(path="dataset", key_prefix="dataset/higgs-dataset")

In [5]:
s3_data_dir

's3://sagemaker-us-west-2-561241433344/dataset/higgs-dataset'
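The URI returned by `upload_data` combines the session's default bucket with the `key_prefix` you passed in. If you later need the bucket and prefix separately (for example, to list or delete the uploaded objects), a small helper like the one below can split them. The function name `split_s3_uri` is hypothetical and only illustrates the URI layout.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The path starts with "/", which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri(
    "s3://sagemaker-us-west-2-561241433344/dataset/higgs-dataset"
)
print(bucket)
print(prefix)
```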

### Download latest RAPIDS container from DockerHub

To build a RAPIDS Docker container compatible with Amazon SageMaker, you’ll start with the base RAPIDS container, which the nice people at NVIDIA have already built and pushed to [DockerHub](https://hub.docker.com/r/rapidsai/rapidsai-core).

You will need to extend this container by creating a Dockerfile, copying the training script, and installing the [SageMaker Training toolkit](https://github.com/aws/sagemaker-training-toolkit) to make RAPIDS compatible with SageMaker.

In [6]:
estimator_info = {
    "rapids_container": "nvcr.io/nvidia/rapidsai/base:24.04-cuda11.8-py3.10",
    "ecr_image": "sagemaker-rapids-higgs:latest",
    "ecr_repository": "sagemaker-rapids-higgs",
}

In [7]:
%%time
!docker pull {estimator_info['rapids_container']}

22.12-cuda11.5-runtime-ubuntu18.04-py3.9: Pulling from rapidsai/rapidsai-core

In [1]:
!cat Dockerfile

ARG RAPIDS_IMAGE

FROM $RAPIDS_IMAGE as rapids

# add sagemaker-training-toolkit [ requires build tools ], flask [ serving ], and dask-ml
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
    && source activate rapids \
    && pip3 install sagemaker-training cupy-cuda11x flask \
    && pip3 install --upgrade protobuf

# Copies the training code inside the container
COPY rapids-higgs.py /opt/ml/code/rapids-higgs.py

# Defines rapids-higgs.py as script entry point
ENV SAGEMAKER_PROGRAM rapids-higgs.py


In [10]:
!docker build -t sagemaker-rapids-higgs --build-arg RAPIDS_IMAGE=nvcr.io/nvidia/rapidsai/base:24.04-cuda11.8-py3.10 .

Sending build context to Docker daemon  10.75kB
Step 1/4 : FROM rapidsai/rapidsai-core:22.12-cuda11.5-runtime-ubuntu18.04-py3.9
 ---> 9de590bd08c5
Step 2/4 : RUN apt-get update && apt-get install -y --no-install-recommends build-essential     && source activate rapids     && pip3 install sagemaker-training cupy-cuda11x flask     && pip3 install --upgrade protobuf
 ---> Running in bc5688af0059
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease [1581 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages [1124 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic InRelease [242 kB]
Get:4 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:5 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [3161 kB]
Get:6 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:7 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [83.3 kB]
Get:8 http://archive.ubun

In [11]:
!docker images

REPOSITORY               TAG                                        IMAGE ID       CREATED                  SIZE
sagemaker-rapids-higgs   latest                                     2af65998a4b2   Less than a second ago   13.7GB
rapidsai/rapidsai-core   22.12-cuda11.5-runtime-ubuntu18.04-py3.9   9de590bd08c5   7 weeks ago              13.1GB


### Publish to Elastic Container Registry

When running a large-scale training job, either for distributed training or for independent experiments, you need to make sure that datasets and training scripts are replicated on each instance in your cluster. Thankfully, the more painful of the two, moving datasets, is taken care of by Amazon SageMaker. As for the training code, you already have a Docker container ready; you simply need to push it to a container registry, and Amazon SageMaker will then pull it onto each of the training compute instances in the cluster.

Note: SageMaker does not support using training images from a private Docker registry (e.g. DockerHub), so you need to push the SageMaker-compatible RAPIDS container to Amazon Elastic Container Registry (Amazon ECR) to make it available to Amazon SageMaker.

In [12]:
ECR_container_fullname = (
    f"{account}.dkr.ecr.{region}.amazonaws.com/{estimator_info['ecr_image']}"
)

In [13]:
ECR_container_fullname

'561241433344.dkr.ecr.us-west-2.amazonaws.com/sagemaker-rapids-higgs:22.12-cuda11.5-runtime-ubuntu18.04-py3.9'
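The full image name follows the pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. The sketch below rebuilds that string from its parts so you can see which piece comes from where. The helper name `ecr_image_uri` is hypothetical, and falling back to the `latest` tag when none is given is an assumption that mirrors Docker's default.

```python
def ecr_image_uri(account, region, image):
    """Return the fully qualified ECR URI for `image` ("repository[:tag]")."""
    repository, _, tag = image.partition(":")
    if not repository:
        raise ValueError("image must include a repository name")
    # Assumption: use "latest" when no tag is given, as Docker does.
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag or 'latest'}"


print(ecr_image_uri("561241433344", "us-west-2", "sagemaker-rapids-higgs:latest"))
```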

In [14]:
!docker tag {estimator_info['ecr_image']} {ECR_container_fullname}

In [15]:
print(
    f"source      : {estimator_info['ecr_image']}\n"
    f"destination : {ECR_container_fullname}"
)

source      : rapidsai/rapidsai-core:22.12-cuda11.5-runtime-ubuntu18.04-py3.9
destination : 561241433344.dkr.ecr.us-west-2.amazonaws.com/sagemaker-rapids-higgs:22.12-cuda11.5-runtime-ubuntu18.04-py3.9


In [16]:
!aws ecr create-repository --repository-name {estimator_info['ecr_repository']}
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com

{
    "repository": {
        "repositoryArn": "arn:aws:ecr:us-west-2:561241433344:repository/sagemaker-rapids-higgs",
        "registryId": "561241433344",
        "repositoryName": "sagemaker-rapids-higgs",
        "repositoryUri": "561241433344.dkr.ecr.us-west-2.amazonaws.com/sagemaker-rapids-higgs",
        "createdAt": 1675720898.0,
        "imageTagMutability": "MUTABLE",
        "imageScanningConfiguration": {
            "scanOnPush": false
        },
        "encryptionConfiguration": {
            "encryptionType": "AES256"
        }
    }
}
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


In [17]:
!docker push {ECR_container_fullname}

The push refers to repository [561241433344.dkr.ecr.us-west-2.amazonaws.com/sagemaker-rapids-higgs]

### Testing your Amazon SageMaker compatible RAPIDS container locally

Before you go off and spend time and money on running a large experiment on a large cluster, you should run a local Amazon SageMaker training job to ensure the container performs as expected. Make sure you have the [SageMaker SDK](https://github.com/aws/sagemaker-python-sdk#installing-the-sagemaker-python-sdk) installed on your local machine.

Define some default hyperparameters. Take your best guess; you can find the full list of RandomForest hyperparameters on the [cuML docs](https://docs.rapids.ai/api/cuml/stable/api.html#random-forest) page.

In [18]:
hyperparams = {
    "n_estimators": 15,
    "max_depth": 5,
    "n_bins": 8,
    "split_criterion": 0,  # GINI:0, ENTROPY:1
    "bootstrap": 0,  # true: sample with replacement, false: sample without replacement
    "max_leaves": -1,  # unlimited leaves
    "max_features": 0.2,
}
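The SageMaker Training toolkit hands these hyperparameters to the entry-point script as command-line arguments (`--n_estimators 15` and so on). The contents of `rapids-higgs.py` are not shown in this notebook, so the parser below is a hypothetical sketch of how such a script might read them, not the actual script. The helper names `str_to_bool` and `parse_hyperparameters` are made up for illustration.

```python
import argparse


def str_to_bool(value):
    """Interpret "0"/"1" and "True"/"False" style strings as booleans."""
    text = str(value).strip().lower()
    if text in {"1", "true", "yes"}:
        return True
    if text in {"0", "false", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"not a boolean: {value!r}")


def parse_hyperparameters(argv=None):
    """Parse hyperparameters; defaults mirror the `hyperparams` dict above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=15)
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--n_bins", type=int, default=8)
    parser.add_argument("--split_criterion", type=int, default=0)
    parser.add_argument("--bootstrap", type=str_to_bool, default=False)
    parser.add_argument("--max_leaves", type=int, default=-1)
    parser.add_argument("--max_features", type=float, default=0.2)
    # parse_known_args tolerates extra arguments the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return vars(args)


hp = parse_hyperparameters(
    ["--n_estimators", "15", "--bootstrap", "0", "--max_features", "0.2"]
)
print(hp)
```

Accepting both `0`/`1` and `True`/`False` for `bootstrap` matters because the fixed hyperparameters above use `0`, while the tuning ranges later in this notebook use `True` and `False`.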

Next, choose the instance type. To run the job on your own machine, set `instance_type` to `local_gpu`; this assumes that you have a GPU locally. If you don’t have a local GPU, you can test on an Amazon SageMaker managed GPU instance instead by setting `instance_type` to a `p3` or `p2` GPU instance, as in the cell below, which uses `ml.p3.2xlarge`.

In [26]:
from sagemaker.estimator import Estimator

rapids_estimator = Estimator(
    image_uri=ECR_container_fullname,
    role=execution_role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",  #'local_gpu'
    max_run=60 * 60 * 24,
    max_wait=(60 * 60 * 24) + 1,
    use_spot_instances=True,
    hyperparameters=hyperparams,
    metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)
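The `metric_definitions` entry tells SageMaker how to find the metric in the training logs: it applies the regular expression to the log output and takes the first capture group as the value. The sketch below applies the same pattern locally with Python's `re` module so you can check that it matches what your script prints. The sample log lines are assumptions, since the script's actual output format is not shown here.

```python
import re

metric_definition = {"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}


def extract_metric(log_line, definition=metric_definition):
    """Return the captured metric as a float, or None if the line has no match."""
    match = re.search(definition["Regex"], log_line)
    return float(match.group(1)) if match else None


# Hypothetical log lines, for illustration only.
print(extract_metric("test_acc: 0.7243"))
print(extract_metric("finished loading data"))
```

If the pattern does not match any log line, the tuning job has no objective value to optimize, so it is worth testing the regex against real output from the local run first.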

In [None]:
%%time
rapids_estimator.fit(inputs=s3_data_dir)

INFO:sagemaker:Creating training-job with name: sagemaker-rapids-higgs-2023-02-07-03-57-40-523


2023-02-07 03:57:41 Starting - Starting the training job...
2023-02-07 03:58:10 Starting - Preparing the instances for training.........
2023-02-07 03:59:21 Downloading - Downloading input data........................
2023-02-07 04:03:38 Training - Training image download completed. Training in progress...
[WARN  tini (7)] Tini is not running as PID 1 .
Zombie processes will not be re-parented to Tini, so zombie reaping won't work.
To fix the problem, run Tini as PID 1.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.download.nvidia.com/licenses/NVIDIA_Deep_Learning_Container_License.pdf
[I 2023-02-07 04:03:52.469 ServerApp] dask_labextension | extension was successfully linked.
[I 2023-02-07 04:03:52.470 ServerApp] jupyter_server_proxy | extension was successfully 

Congrats, you successfully trained your Random Forest model on the HIGGS dataset using an Amazon SageMaker compatible RAPIDS container. Now you are ready to run experiments on a cluster to try out different hyperparameters and options in parallel.

### Define hyperparameter ranges and run a large-scale search experiment
Few code changes are required to go from local training to training at scale. First, rather than define a fixed set of hyperparameters, you’ll define a range for each one using the SageMaker SDK:

In [28]:
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

hyperparameter_ranges = {
    "n_estimators": IntegerParameter(10, 200),
    "max_depth": IntegerParameter(1, 22),
    "n_bins": IntegerParameter(5, 24),
    "split_criterion": CategoricalParameter([0, 1]),
    "bootstrap": CategoricalParameter([True, False]),
    "max_features": ContinuousParameter(0.01, 0.5),
}
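To get a feel for the search space these ranges describe, the sketch below draws a few random configurations from the same bounds using only the standard library. This is plain random sampling, a stand-in for illustration; it is not the Bayesian strategy that the tuner uses, and the `search_space` dictionary and `sample_configuration` helper are hypothetical names that do not come from the SageMaker SDK.

```python
import random

# Same bounds as hyperparameter_ranges above, in a plain-Python form.
search_space = {
    "n_estimators": ("int", 10, 200),
    "max_depth": ("int", 1, 22),
    "n_bins": ("int", 5, 24),
    "split_criterion": ("cat", [0, 1]),
    "bootstrap": ("cat", [True, False]),
    "max_features": ("float", 0.01, 0.5),
}


def sample_configuration(space, rng):
    """Draw one configuration uniformly at random from `space`."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "int":
            config[name] = rng.randint(spec[1], spec[2])  # both ends inclusive
        elif kind == "float":
            config[name] = rng.uniform(spec[1], spec[2])
        else:
            config[name] = rng.choice(spec[1])
    return config


rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
for _ in range(3):
    print(sample_configuration(search_space, rng))
```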

Next, you’ll change the instance type to the actual GPU instance you want to train on in the cloud. Here you’ll choose `ml.p3.8xlarge`, an Amazon SageMaker compute instance with 4 NVIDIA Tesla V100 GPUs. If you have a training script that can leverage multiple GPUs, you can choose up to 8 GPUs per instance for faster training.

In [29]:
from sagemaker.estimator import Estimator

rapids_estimator = Estimator(
    image_uri=ECR_container_fullname,
    role=execution_role,
    instance_count=2,
    instance_type="ml.p3.8xlarge",
    max_run=60 * 60 * 24,
    max_wait=(60 * 60 * 24) + 1,
    use_spot_instances=True,
    hyperparameters=hyperparams,
    metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)
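The estimator sets `max_wait` one second above `max_run` because, with `use_spot_instances=True`, the total wait time (time spent waiting for Spot capacity plus training time) must be at least as long as the maximum training time. The sketch below checks that relationship before you submit a job. The helper `check_spot_settings` is hypothetical and is not part of the SageMaker SDK; it encodes the constraint as I understand it.

```python
def check_spot_settings(max_run, max_wait, use_spot_instances):
    """Return a list of problems with the Spot-related settings (empty if none)."""
    if not use_spot_instances:
        return []
    problems = []
    if max_wait is None:
        problems.append("max_wait must be set when use_spot_instances=True")
    elif max_wait < max_run:
        problems.append(
            f"max_wait ({max_wait}s) must be >= max_run ({max_run}s)"
        )
    return problems


one_day = 60 * 60 * 24
print(check_spot_settings(one_day, one_day + 1, True))  # settings used above
print(check_spot_settings(one_day, 60 * 60, True))      # max_wait too short
```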

Now you define a HyperparameterTuner object using the estimator you defined above.

In [30]:
tuner = HyperparameterTuner(
    rapids_estimator,
    objective_metric_name="test_acc",
    hyperparameter_ranges=hyperparameter_ranges,
    strategy="Bayesian",
    max_jobs=2,
    max_parallel_jobs=2,
    objective_type="Maximize",
    metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)

In [None]:
job_name = "rapidsHPO" + time.strftime("%Y-%m-%d-%H-%M-%S-%j", time.gmtime())
tuner.fit({"dataset": s3_data_dir}, job_name=job_name)

INFO:sagemaker:Creating hyperparameter tuning job with name: rapidsHPO2023-02-07-16-09-47-038
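The job name above is exactly 32 characters long: the 9-character prefix `rapidsHPO` plus a 23-character timestamp. The sketch below assumes that tuning job names are limited to 32 characters and may contain only letters, digits, and hyphens, so a longer prefix would be rejected. The helper `make_tuning_job_name` is hypothetical, and the limit is stated as an assumption rather than read from the SDK.

```python
import re
import time

MAX_NAME_LENGTH = 32  # assumed limit for tuning job names
ALLOWED = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9-]*$")


def make_tuning_job_name(prefix="rapidsHPO", now=None):
    """Build prefix + UTC timestamp and check it against the assumed limits."""
    moment = now if now is not None else time.gmtime()
    name = prefix + time.strftime("%Y-%m-%d-%H-%M-%S-%j", moment)
    if len(name) > MAX_NAME_LENGTH or not ALLOWED.match(name):
        raise ValueError(f"invalid tuning job name: {name!r}")
    return name


name = make_tuning_job_name()
print(name, len(name))
```

If you want a more descriptive prefix, you could shorten the timestamp format (for example, by dropping the `%j` day-of-year field) to stay within the limit.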



## Clean up

- Delete S3 buckets and files you don't need
- Kill training jobs that you don't want running
- Delete container images and the repository you just created

In [None]:
!aws ecr delete-repository --force --repository-name {estimator_info['ecr_repository']}