Running RAPIDS hyperparameter experiments at scale on Amazon SageMaker

Import packages and create Amazon SageMaker and Boto3 sessions

import sagemaker
import time
import boto3
execution_role = sagemaker.get_execution_role()
session = sagemaker.Session()

region = boto3.Session().region_name
account = boto3.client("sts").get_caller_identity().get("Account")
account, region
('561241433344', 'us-west-2')

Upload the HIGGS dataset to an S3 bucket

!mkdir dataset
!wget -P dataset https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
!gunzip dataset/HIGGS.csv.gz
s3_data_dir = session.upload_data(path="dataset", key_prefix="dataset/higgs-dataset")
s3_data_dir
's3://sagemaker-us-west-2-561241433344/dataset/higgs-dataset'
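The upload call returns an S3 URI rather than separate bucket and key values. If you want to inspect the uploaded objects afterwards, it can help to split that URI first. The sketch below uses only the standard library; the helper name `split_s3_uri` is hypothetical and not part of the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/prefix URI into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The path starts with "/", which is not part of the object key prefix.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri(
    "s3://sagemaker-us-west-2-561241433344/dataset/higgs-dataset"
)
print(bucket)
print(prefix)
```

The resulting bucket and prefix could then be passed to an S3 listing call to confirm that HIGGS.csv arrived.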

Download the latest RAPIDS container from DockerHub

To build a RAPIDS Docker container that is compatible with Amazon SageMaker, you’ll start with the base RAPIDS container, which the nice people at NVIDIA have already built and pushed to DockerHub.

You will need to extend this container by creating a Dockerfile, copying the training script, and installing the SageMaker Training toolkit to make RAPIDS compatible with SageMaker.

estimator_info = {
    "rapids_container": "nvcr.io/nvidia/rapidsai/base:24.04-cuda11.8-py3.10",
    "ecr_image": "sagemaker-rapids-higgs:latest",
    "ecr_repository": "sagemaker-rapids-higgs",
}
%%time
!docker pull {estimator_info['rapids_container']}
22.12-cuda11.5-runtime-ubuntu18.04-py3.9: Pulling from rapidsai/rapidsai-core

e5416296: Pulling fs layer 
2d3ed59c: Pulling fs layer 
1b38369f: Pulling fs layer 
4c8e4d7e: Pulling fs layer 
a06239d6: Pulling fs layer 
cb87b249: Pulling fs layer 
61c55367: Pulling fs layer 
fb9847e6: Pulling fs layer 
0cc4d9ef: Pulling fs layer 
161bebe2: Pull complete
Digest: sha256:959a2e80642e881ef99705473d95165cda8383543cff4ae5ca554da782021e47
Status: Downloaded newer image for rapidsai/rapidsai-core:22.12-cuda11.5-runtime-ubuntu18.04-py3.9
docker.io/rapidsai/rapidsai-core:22.12-cuda11.5-runtime-ubuntu18.04-py3.9
CPU times: user 5.79 s, sys: 1 s, total: 6.79 s
Wall time: 4min 10s
!cat Dockerfile
ARG RAPIDS_IMAGE

FROM $RAPIDS_IMAGE as rapids

# add sagemaker-training-toolkit [ requires build tools ], cupy, and flask [ serving ]
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
    && source activate rapids \
    && pip3 install sagemaker-training cupy-cuda11x flask \
    && pip3 install --upgrade protobuf

# Copies the training code inside the container
COPY rapids-higgs.py /opt/ml/code/rapids-higgs.py

# Defines rapids-higgs.py as script entry point
ENV SAGEMAKER_PROGRAM rapids-higgs.py
!docker build -t sagemaker-rapids-higgs --build-arg RAPIDS_IMAGE=nvcr.io/nvidia/rapidsai/base:24.04-cuda11.8-py3.10 .
Sending build context to Docker daemon  10.75kB
Step 1/4 : FROM rapidsai/rapidsai-core:22.12-cuda11.5-runtime-ubuntu18.04-py3.9
 ---> 9de590bd08c5
Step 2/4 : RUN apt-get update && apt-get install -y --no-install-recommends build-essential     && source activate rapids     && pip3 install sagemaker-training cupy-cuda11x flask     && pip3 install --upgrade protobuf
 ---> Running in bc5688af0059
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease [1581 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages [1124 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic InRelease [242 kB]
Get:4 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:5 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [3161 kB]
Get:6 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:7 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [83.3 kB]
Get:8 http://archive.ubuntu.com/ubuntu bionic/main amd64 Packages [1344 kB]
Get:9 http://archive.ubuntu.com/ubuntu bionic/multiverse amd64 Packages [186 kB]
Get:10 http://security.ubuntu.com/ubuntu bionic-security/restricted amd64 Packages [1389 kB]
Get:11 http://archive.ubuntu.com/ubuntu bionic/universe amd64 Packages [11.3 MB]
Get:12 http://security.ubuntu.com/ubuntu bionic-security/multiverse amd64 Packages [23.8 kB]
Get:13 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [1578 kB]
Get:14 http://archive.ubuntu.com/ubuntu bionic/restricted amd64 Packages [13.5 kB]
Get:15 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [2353 kB]
Get:16 http://archive.ubuntu.com/ubuntu bionic-updates/multiverse amd64 Packages [30.8 kB]
Get:17 http://archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages [1430 kB]
Get:18 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [3581 kB]
Get:19 http://archive.ubuntu.com/ubuntu bionic-backports/main amd64 Packages [64.0 kB]
Get:20 http://archive.ubuntu.com/ubuntu bionic-backports/universe amd64 Packages [20.5 kB]
Fetched 28.1 MB in 4s (7151 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  binutils binutils-common binutils-x86-64-linux-gnu cpp cpp-7 dpkg-dev g++
  g++-7 gcc gcc-7 gcc-7-base libasan4 libatomic1 libbinutils libcc1-0
  libcilkrts5 libdpkg-perl libgcc-7-dev libgomp1 libisl19 libitm1 liblsan0
  libmpc3 libmpfr6 libmpx2 libquadmath0 libstdc++-7-dev libtsan0 libubsan0
  make xz-utils
Suggested packages:
  binutils-doc cpp-doc gcc-7-locales debian-keyring g++-multilib
  g++-7-multilib gcc-7-doc libstdc++6-7-dbg gcc-multilib manpages-dev libtool
  flex bison gdb gcc-doc gcc-7-multilib libgcc1-dbg libgomp1-dbg libitm1-dbg
  libatomic1-dbg libasan4-dbg liblsan0-dbg libtsan0-dbg libubsan0-dbg
  libcilkrts5-dbg libmpx2-dbg libquadmath0-dbg bzr libstdc++-7-doc make-doc
Recommended packages:
  fakeroot libalgorithm-merge-perl libfile-fcntllock-perl
  liblocale-gettext-perl
The following NEW packages will be installed:
  binutils binutils-common binutils-x86-64-linux-gnu build-essential cpp cpp-7
  dpkg-dev g++ g++-7 gcc gcc-7 gcc-7-base libasan4 libatomic1 libbinutils
  libcc1-0 libcilkrts5 libdpkg-perl libgcc-7-dev libgomp1 libisl19 libitm1
  liblsan0 libmpc3 libmpfr6 libmpx2 libquadmath0 libstdc++-7-dev libtsan0
  libubsan0 make xz-utils
0 upgraded, 32 newly installed, 0 to remove and 26 not upgraded.
Need to get 37.2 MB of archives.
After this operation, 137 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 xz-utils amd64 5.2.2-1.3ubuntu0.1 [83.8 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 binutils-common amd64 2.30-21ubuntu1~18.04.8 [197 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libbinutils amd64 2.30-21ubuntu1~18.04.8 [488 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 binutils-x86-64-linux-gnu amd64 2.30-21ubuntu1~18.04.8 [1839 kB]
Get:5 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 binutils amd64 2.30-21ubuntu1~18.04.8 [3388 B]
Get:6 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 gcc-7-base amd64 7.5.0-3ubuntu1~18.04 [18.3 kB]
Get:7 http://archive.ubuntu.com/ubuntu bionic/main amd64 libisl19 amd64 0.19-1 [551 kB]
Get:8 http://archive.ubuntu.com/ubuntu bionic/main amd64 libmpfr6 amd64 4.0.1-1 [243 kB]
Get:9 http://archive.ubuntu.com/ubuntu bionic/main amd64 libmpc3 amd64 1.1.0-1 [40.8 kB]
Get:10 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 cpp-7 amd64 7.5.0-3ubuntu1~18.04 [8591 kB]
Get:11 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 cpp amd64 4:7.4.0-1ubuntu2.3 [27.7 kB]
Get:12 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libcc1-0 amd64 8.4.0-1ubuntu1~18.04 [39.4 kB]
Get:13 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libgomp1 amd64 8.4.0-1ubuntu1~18.04 [76.5 kB]
Get:14 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libitm1 amd64 8.4.0-1ubuntu1~18.04 [27.9 kB]
Get:15 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libatomic1 amd64 8.4.0-1ubuntu1~18.04 [9192 B]
Get:16 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libasan4 amd64 7.5.0-3ubuntu1~18.04 [358 kB]
Get:17 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 liblsan0 amd64 8.4.0-1ubuntu1~18.04 [133 kB]
Get:18 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libtsan0 amd64 8.4.0-1ubuntu1~18.04 [288 kB]
Get:19 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libubsan0 amd64 7.5.0-3ubuntu1~18.04 [126 kB]
Get:20 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libcilkrts5 amd64 7.5.0-3ubuntu1~18.04 [42.5 kB]
Get:21 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libmpx2 amd64 8.4.0-1ubuntu1~18.04 [11.6 kB]
Get:22 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libquadmath0 amd64 8.4.0-1ubuntu1~18.04 [134 kB]
Get:23 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libgcc-7-dev amd64 7.5.0-3ubuntu1~18.04 [2378 kB]
Get:24 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 gcc-7 amd64 7.5.0-3ubuntu1~18.04 [9381 kB]
Get:25 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 gcc amd64 4:7.4.0-1ubuntu2.3 [5184 B]
Get:26 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libstdc++-7-dev amd64 7.5.0-3ubuntu1~18.04 [1471 kB]
Get:27 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 g++-7 amd64 7.5.0-3ubuntu1~18.04 [9697 kB]
Get:28 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 g++ amd64 4:7.4.0-1ubuntu2.3 [1568 B]
Get:29 http://archive.ubuntu.com/ubuntu bionic/main amd64 make amd64 4.1-9.1ubuntu1 [154 kB]
Get:30 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libdpkg-perl all 1.19.0.5ubuntu2.4 [212 kB]
Get:31 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 dpkg-dev all 1.19.0.5ubuntu2.4 [607 kB]
Get:32 http://archive.ubuntu.com/ubuntu bionic/main amd64 build-essential amd64 12.4ubuntu1 [4758 B]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 37.2 MB in 3s (11.7 MB/s)
Selecting previously unselected package xz-utils.
(Reading database ... 13756 files and directories currently installed.)
Preparing to unpack .../00-xz-utils_5.2.2-1.3ubuntu0.1_amd64.deb ...
Unpacking xz-utils (5.2.2-1.3ubuntu0.1) ...
Selecting previously unselected package binutils-common:amd64.
Preparing to unpack .../01-binutils-common_2.30-21ubuntu1~18.04.8_amd64.deb ...
Unpacking binutils-common:amd64 (2.30-21ubuntu1~18.04.8) ...
Selecting previously unselected package libbinutils:amd64.
Preparing to unpack .../02-libbinutils_2.30-21ubuntu1~18.04.8_amd64.deb ...
Unpacking libbinutils:amd64 (2.30-21ubuntu1~18.04.8) ...
Selecting previously unselected package binutils-x86-64-linux-gnu.
Preparing to unpack .../03-binutils-x86-64-linux-gnu_2.30-21ubuntu1~18.04.8_amd64.deb ...
Unpacking binutils-x86-64-linux-gnu (2.30-21ubuntu1~18.04.8) ...
Selecting previously unselected package binutils.
Preparing to unpack .../04-binutils_2.30-21ubuntu1~18.04.8_amd64.deb ...
Unpacking binutils (2.30-21ubuntu1~18.04.8) ...
Selecting previously unselected package gcc-7-base:amd64.
Preparing to unpack .../05-gcc-7-base_7.5.0-3ubuntu1~18.04_amd64.deb ...
Unpacking gcc-7-base:amd64 (7.5.0-3ubuntu1~18.04) ...
Selecting previously unselected package libisl19:amd64.
Preparing to unpack .../06-libisl19_0.19-1_amd64.deb ...
Unpacking libisl19:amd64 (0.19-1) ...
Selecting previously unselected package libmpfr6:amd64.
Preparing to unpack .../07-libmpfr6_4.0.1-1_amd64.deb ...
Unpacking libmpfr6:amd64 (4.0.1-1) ...
Selecting previously unselected package libmpc3:amd64.
Preparing to unpack .../08-libmpc3_1.1.0-1_amd64.deb ...
Unpacking libmpc3:amd64 (1.1.0-1) ...
Selecting previously unselected package cpp-7.
Preparing to unpack .../09-cpp-7_7.5.0-3ubuntu1~18.04_amd64.deb ...
Unpacking cpp-7 (7.5.0-3ubuntu1~18.04) ...
Selecting previously unselected package cpp.
Preparing to unpack .../10-cpp_4%3a7.4.0-1ubuntu2.3_amd64.deb ...
Unpacking cpp (4:7.4.0-1ubuntu2.3) ...
Selecting previously unselected package libcc1-0:amd64.
Preparing to unpack .../11-libcc1-0_8.4.0-1ubuntu1~18.04_amd64.deb ...
Unpacking libcc1-0:amd64 (8.4.0-1ubuntu1~18.04) ...
Selecting previously unselected package libgomp1:amd64.
Preparing to unpack .../12-libgomp1_8.4.0-1ubuntu1~18.04_amd64.deb ...
Unpacking libgomp1:amd64 (8.4.0-1ubuntu1~18.04) ...
Selecting previously unselected package libitm1:amd64.
Preparing to unpack .../13-libitm1_8.4.0-1ubuntu1~18.04_amd64.deb ...
Unpacking libitm1:amd64 (8.4.0-1ubuntu1~18.04) ...
Selecting previously unselected package libatomic1:amd64.
Preparing to unpack .../14-libatomic1_8.4.0-1ubuntu1~18.04_amd64.deb ...
Unpacking libatomic1:amd64 (8.4.0-1ubuntu1~18.04) ...
Selecting previously unselected package libasan4:amd64.
Preparing to unpack .../15-libasan4_7.5.0-3ubuntu1~18.04_amd64.deb ...
Unpacking libasan4:amd64 (7.5.0-3ubuntu1~18.04) ...
Selecting previously unselected package liblsan0:amd64.
Preparing to unpack .../16-liblsan0_8.4.0-1ubuntu1~18.04_amd64.deb ...
Unpacking liblsan0:amd64 (8.4.0-1ubuntu1~18.04) ...
Selecting previously unselected package libtsan0:amd64.
Preparing to unpack .../17-libtsan0_8.4.0-1ubuntu1~18.04_amd64.deb ...
Unpacking libtsan0:amd64 (8.4.0-1ubuntu1~18.04) ...
Selecting previously unselected package libubsan0:amd64.
Preparing to unpack .../18-libubsan0_7.5.0-3ubuntu1~18.04_amd64.deb ...
Unpacking libubsan0:amd64 (7.5.0-3ubuntu1~18.04) ...
Selecting previously unselected package libcilkrts5:amd64.
Preparing to unpack .../19-libcilkrts5_7.5.0-3ubuntu1~18.04_amd64.deb ...
Unpacking libcilkrts5:amd64 (7.5.0-3ubuntu1~18.04) ...
Selecting previously unselected package libmpx2:amd64.
Preparing to unpack .../20-libmpx2_8.4.0-1ubuntu1~18.04_amd64.deb ...
Unpacking libmpx2:amd64 (8.4.0-1ubuntu1~18.04) ...
Selecting previously unselected package libquadmath0:amd64.
Preparing to unpack .../21-libquadmath0_8.4.0-1ubuntu1~18.04_amd64.deb ...
Unpacking libquadmath0:amd64 (8.4.0-1ubuntu1~18.04) ...
Selecting previously unselected package libgcc-7-dev:amd64.
Preparing to unpack .../22-libgcc-7-dev_7.5.0-3ubuntu1~18.04_amd64.deb ...
Unpacking libgcc-7-dev:amd64 (7.5.0-3ubuntu1~18.04) ...
Selecting previously unselected package gcc-7.
Preparing to unpack .../23-gcc-7_7.5.0-3ubuntu1~18.04_amd64.deb ...
Unpacking gcc-7 (7.5.0-3ubuntu1~18.04) ...
Selecting previously unselected package gcc.
Preparing to unpack .../24-gcc_4%3a7.4.0-1ubuntu2.3_amd64.deb ...
Unpacking gcc (4:7.4.0-1ubuntu2.3) ...
Selecting previously unselected package libstdc++-7-dev:amd64.
Preparing to unpack .../25-libstdc++-7-dev_7.5.0-3ubuntu1~18.04_amd64.deb ...
Unpacking libstdc++-7-dev:amd64 (7.5.0-3ubuntu1~18.04) ...
Selecting previously unselected package g++-7.
Preparing to unpack .../26-g++-7_7.5.0-3ubuntu1~18.04_amd64.deb ...
Unpacking g++-7 (7.5.0-3ubuntu1~18.04) ...
Selecting previously unselected package g++.
Preparing to unpack .../27-g++_4%3a7.4.0-1ubuntu2.3_amd64.deb ...
Unpacking g++ (4:7.4.0-1ubuntu2.3) ...
Selecting previously unselected package make.
Preparing to unpack .../28-make_4.1-9.1ubuntu1_amd64.deb ...
Unpacking make (4.1-9.1ubuntu1) ...
Selecting previously unselected package libdpkg-perl.
Preparing to unpack .../29-libdpkg-perl_1.19.0.5ubuntu2.4_all.deb ...
Unpacking libdpkg-perl (1.19.0.5ubuntu2.4) ...
Selecting previously unselected package dpkg-dev.
Preparing to unpack .../30-dpkg-dev_1.19.0.5ubuntu2.4_all.deb ...
Unpacking dpkg-dev (1.19.0.5ubuntu2.4) ...
Selecting previously unselected package build-essential.
Preparing to unpack .../31-build-essential_12.4ubuntu1_amd64.deb ...
Unpacking build-essential (12.4ubuntu1) ...
Setting up libquadmath0:amd64 (8.4.0-1ubuntu1~18.04) ...
Setting up libgomp1:amd64 (8.4.0-1ubuntu1~18.04) ...
Setting up libatomic1:amd64 (8.4.0-1ubuntu1~18.04) ...
Setting up libcc1-0:amd64 (8.4.0-1ubuntu1~18.04) ...
Setting up make (4.1-9.1ubuntu1) ...
Setting up libtsan0:amd64 (8.4.0-1ubuntu1~18.04) ...
Setting up libmpfr6:amd64 (4.0.1-1) ...
Setting up libdpkg-perl (1.19.0.5ubuntu2.4) ...
Setting up liblsan0:amd64 (8.4.0-1ubuntu1~18.04) ...
Setting up gcc-7-base:amd64 (7.5.0-3ubuntu1~18.04) ...
Setting up binutils-common:amd64 (2.30-21ubuntu1~18.04.8) ...
Setting up libmpx2:amd64 (8.4.0-1ubuntu1~18.04) ...
Setting up xz-utils (5.2.2-1.3ubuntu0.1) ...
update-alternatives: using /usr/bin/xz to provide /usr/bin/lzma (lzma) in auto mode
update-alternatives: warning: skip creation of /usr/share/man/man1/lzma.1.gz because associated file /usr/share/man/man1/xz.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/unlzma.1.gz because associated file /usr/share/man/man1/unxz.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzcat.1.gz because associated file /usr/share/man/man1/xzcat.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzmore.1.gz because associated file /usr/share/man/man1/xzmore.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzless.1.gz because associated file /usr/share/man/man1/xzless.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzdiff.1.gz because associated file /usr/share/man/man1/xzdiff.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzcmp.1.gz because associated file /usr/share/man/man1/xzcmp.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzgrep.1.gz because associated file /usr/share/man/man1/xzgrep.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzegrep.1.gz because associated file /usr/share/man/man1/xzegrep.1.gz (of link group lzma) doesn't exist
update-alternatives: warning: skip creation of /usr/share/man/man1/lzfgrep.1.gz because associated file /usr/share/man/man1/xzfgrep.1.gz (of link group lzma) doesn't exist
Setting up libmpc3:amd64 (1.1.0-1) ...
Setting up libitm1:amd64 (8.4.0-1ubuntu1~18.04) ...
Setting up libisl19:amd64 (0.19-1) ...
Setting up libasan4:amd64 (7.5.0-3ubuntu1~18.04) ...
Setting up libbinutils:amd64 (2.30-21ubuntu1~18.04.8) ...
Setting up libcilkrts5:amd64 (7.5.0-3ubuntu1~18.04) ...
Setting up libubsan0:amd64 (7.5.0-3ubuntu1~18.04) ...
Setting up libgcc-7-dev:amd64 (7.5.0-3ubuntu1~18.04) ...
Setting up cpp-7 (7.5.0-3ubuntu1~18.04) ...
Setting up libstdc++-7-dev:amd64 (7.5.0-3ubuntu1~18.04) ...
Setting up binutils-x86-64-linux-gnu (2.30-21ubuntu1~18.04.8) ...
Setting up cpp (4:7.4.0-1ubuntu2.3) ...
Setting up binutils (2.30-21ubuntu1~18.04.8) ...
Setting up gcc-7 (7.5.0-3ubuntu1~18.04) ...
Setting up g++-7 (7.5.0-3ubuntu1~18.04) ...
Setting up gcc (4:7.4.0-1ubuntu2.3) ...
Setting up dpkg-dev (1.19.0.5ubuntu2.4) ...
Setting up g++ (4:7.4.0-1ubuntu2.3) ...
update-alternatives: using /usr/bin/g++ to provide /usr/bin/c++ (c++) in auto mode
update-alternatives: warning: skip creation of /usr/share/man/man1/c++.1.gz because associated file /usr/share/man/man1/g++.1.gz (of link group c++) doesn't exist
Setting up build-essential (12.4ubuntu1) ...
Processing triggers for libc-bin (2.27-3ubuntu1.6) ...
Collecting sagemaker-training
  Downloading sagemaker_training-4.4.5.tar.gz (58 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58.6/58.6 kB 4.5 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting cupy-cuda11x
  Downloading cupy_cuda115-10.6.0-cp39-cp39-manylinux1_x86_64.whl (81.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.5/81.5 MB 18.4 MB/s eta 0:00:00
Collecting flask
  Downloading Flask-2.2.2-py3-none-any.whl (101 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 101.5/101.5 kB 31.1 MB/s eta 0:00:00
Requirement already satisfied: numpy in /opt/conda/envs/rapids/lib/python3.9/site-packages (from sagemaker-training) (1.23.5)
Collecting boto3
  Downloading boto3-1.26.65-py3-none-any.whl (132 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 132.7/132.7 kB 29.0 MB/s eta 0:00:00
Requirement already satisfied: six in /opt/conda/envs/rapids/lib/python3.9/site-packages (from sagemaker-training) (1.16.0)
Requirement already satisfied: pip in /opt/conda/envs/rapids/lib/python3.9/site-packages (from sagemaker-training) (22.3.1)
Collecting retrying>=1.3.3
  Downloading retrying-1.3.4-py3-none-any.whl (11 kB)
Collecting gevent
  Downloading gevent-22.10.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.4/6.4 MB 106.5 MB/s eta 0:00:00
Collecting inotify_simple==1.2.1
  Downloading inotify_simple-1.2.1.tar.gz (7.9 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting werkzeug>=0.15.5
  Downloading Werkzeug-2.2.2-py3-none-any.whl (232 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 232.7/232.7 kB 48.1 MB/s eta 0:00:00
Collecting paramiko>=2.4.2
  Downloading paramiko-3.0.0-py3-none-any.whl (210 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 210.8/210.8 kB 43.6 MB/s eta 0:00:00
Requirement already satisfied: psutil>=5.6.7 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from sagemaker-training) (5.9.4)
Requirement already satisfied: protobuf<=3.20.2,>=3.9.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from sagemaker-training) (3.20.2)
Requirement already satisfied: scipy>=1.2.2 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from sagemaker-training) (1.6.0)
Requirement already satisfied: fastrlock>=0.5 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from cupy-cuda11x) (0.8)
Requirement already satisfied: Jinja2>=3.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from flask) (3.1.2)
Requirement already satisfied: click>=8.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from flask) (8.1.3)
Collecting itsdangerous>=2.0
  Downloading itsdangerous-2.1.2-py3-none-any.whl (15 kB)
Requirement already satisfied: importlib-metadata>=3.6.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from flask) (5.1.0)
Requirement already satisfied: zipp>=0.5 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from importlib-metadata>=3.6.0->flask) (3.11.0)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from Jinja2>=3.0->flask) (2.1.1)
Collecting pynacl>=1.5
  Downloading PyNaCl-1.5.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (856 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 856.7/856.7 kB 8.2 MB/s eta 0:00:00
Collecting bcrypt>=3.2
  Downloading bcrypt-4.0.1-cp36-abi3-manylinux_2_24_x86_64.whl (593 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 593.2/593.2 kB 83.3 MB/s eta 0:00:00
Requirement already satisfied: cryptography>=3.3 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from paramiko>=2.4.2->sagemaker-training) (38.0.4)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from boto3->sagemaker-training) (1.0.1)
Collecting botocore<1.30.0,>=1.29.65
  Downloading botocore-1.29.65-py3-none-any.whl (10.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.4/10.4 MB 107.3 MB/s eta 0:00:00
Collecting s3transfer<0.7.0,>=0.6.0
  Downloading s3transfer-0.6.0-py3-none-any.whl (79 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79.6/79.6 kB 22.9 MB/s eta 0:00:00
Requirement already satisfied: setuptools in /opt/conda/envs/rapids/lib/python3.9/site-packages (from gevent->sagemaker-training) (65.5.1)
Collecting zope.interface
  Downloading zope.interface-5.5.2-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (257 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 257.9/257.9 kB 49.1 MB/s eta 0:00:00
Collecting zope.event
  Downloading zope.event-4.6-py2.py3-none-any.whl (6.8 kB)
Collecting greenlet>=2.0.0
  Downloading greenlet-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (610 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 610.9/610.9 kB 81.4 MB/s eta 0:00:00
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from botocore<1.30.0,>=1.29.65->boto3->sagemaker-training) (2.8.2)
Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from botocore<1.30.0,>=1.29.65->boto3->sagemaker-training) (1.26.13)
Requirement already satisfied: cffi>=1.12 in /opt/conda/envs/rapids/lib/python3.9/site-packages (from cryptography>=3.3->paramiko>=2.4.2->sagemaker-training) (1.15.1)
Requirement already satisfied: pycparser in /opt/conda/envs/rapids/lib/python3.9/site-packages (from cffi>=1.12->cryptography>=3.3->paramiko>=2.4.2->sagemaker-training) (2.21)
Building wheels for collected packages: sagemaker-training, inotify_simple
  Building wheel for sagemaker-training (setup.py): started
  Building wheel for sagemaker-training (setup.py): finished with status 'done'
  Created wheel for sagemaker-training: filename=sagemaker_training-4.4.5-cp39-cp39-linux_x86_64.whl size=77869 sha256=525d9529335cbe745c94db978088ec5e16fc6df7ee9b5d8253e3a2b6eb20aedf
  Stored in directory: /root/.cache/pip/wheels/27/ce/8c/61fd993cc09c869afdfe6fa5dda848e6e66ba38f10357fa9bd
  Building wheel for inotify_simple (setup.py): started
  Building wheel for inotify_simple (setup.py): finished with status 'done'
  Created wheel for inotify_simple: filename=inotify_simple-1.2.1-py3-none-any.whl size=8201 sha256=bab7fdd63ad2075fb3bdad0ed1e910ab4e048c819ccf71eaf538e515659ae701
  Stored in directory: /root/.cache/pip/wheels/3f/c2/6a/6f6c65836d2fad9ae7008373d82e38b519187113fac6b720c8
Successfully built sagemaker-training inotify_simple
Installing collected packages: inotify_simple, zope.interface, zope.event, werkzeug, retrying, itsdangerous, greenlet, cupy-cuda11x, bcrypt, pynacl, gevent, flask, botocore, s3transfer, paramiko, boto3, sagemaker-training
  Attempting uninstall: botocore
    Found existing installation: botocore 1.27.59
    Uninstalling botocore-1.27.59:
      Successfully uninstalled botocore-1.27.59
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aiobotocore 2.4.0 requires botocore<1.27.60,>=1.27.59, but you have botocore 1.29.65 which is incompatible.
Successfully installed bcrypt-4.0.1 boto3-1.26.65 botocore-1.29.65 cupy-cuda11x-10.6.0 flask-2.2.2 gevent-22.10.2 greenlet-2.0.2 inotify_simple-1.2.1 itsdangerous-2.1.2 paramiko-3.0.0 pynacl-1.5.0 retrying-1.3.4 s3transfer-0.6.0 sagemaker-training-4.4.5 werkzeug-2.2.2 zope.event-4.6 zope.interface-5.5.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Requirement already satisfied: protobuf in /opt/conda/envs/rapids/lib/python3.9/site-packages (3.20.2)
Collecting protobuf
  Downloading protobuf-4.21.12-cp37-abi3-manylinux2014_x86_64.whl (409 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 409.8/409.8 kB 14.2 MB/s eta 0:00:00
Installing collected packages: protobuf
  Attempting uninstall: protobuf
    Found existing installation: protobuf 3.20.2
    Uninstalling protobuf-3.20.2:
      Successfully uninstalled protobuf-3.20.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 22.12.1 requires cupy-cuda11x, which is not installed.
sagemaker-training 4.4.5 requires protobuf<=3.20.2,>=3.9.2, but you have protobuf 4.21.12 which is incompatible.
cudf 22.12.1 requires protobuf<3.21.0a0,>=3.20.1, but you have protobuf 4.21.12 which is incompatible.
Successfully installed protobuf-4.21.12
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Removing intermediate container bc5688af0059
 ---> b58e1fb082e7
Step 3/4 : COPY rapids-higgs.py /opt/ml/code/rapids-higgs.py
 ---> 7020b9ee91b2
Step 4/4 : ENV SAGEMAKER_PROGRAM rapids-higgs.py
 ---> Running in b01db1cfa675
Removing intermediate container b01db1cfa675
 ---> 2af65998a4b2
Successfully built 2af65998a4b2
Successfully tagged sagemaker-rapids-higgs:latest
!docker images
REPOSITORY               TAG                                        IMAGE ID       CREATED                  SIZE
sagemaker-rapids-higgs   latest                                     2af65998a4b2   Less than a second ago   13.7GB
rapidsai/rapidsai-core   22.12-cuda11.5-runtime-ubuntu18.04-py3.9   9de590bd08c5   7 weeks ago              13.1GB
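The Dockerfile declares `ARG RAPIDS_IMAGE`, so the base image has to reach `docker build` as a `NAME=value` pair after `--build-arg`, followed by the build context. As a minimal sketch, the command can be assembled from the `estimator_info` dictionary so that the tag and base image stay in one place. The helper `docker_build_command` is a hypothetical name used only for illustration.

```python
import shlex


def docker_build_command(tag, base_image, context="."):
    """Assemble a `docker build` command that passes the base image as a build argument."""
    args = [
        "docker", "build",
        "-t", tag,
        # The Dockerfile reads this value through `ARG RAPIDS_IMAGE`.
        "--build-arg", f"RAPIDS_IMAGE={base_image}",
        context,
    ]
    return shlex.join(args)


estimator_info = {
    "rapids_container": "nvcr.io/nvidia/rapidsai/base:24.04-cuda11.8-py3.10",
    "ecr_image": "sagemaker-rapids-higgs:latest",
    "ecr_repository": "sagemaker-rapids-higgs",
}

command = docker_build_command(
    estimator_info["ecr_image"], estimator_info["rapids_container"]
)
print(command)
```

The printed string can be run from a notebook cell with `!{command}` or passed to a shell.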

Publish to Elastic Container Registry

When running a large-scale training job, either for distributed training or for independent experiments, you will need to make sure that datasets and training scripts are all replicated at each instance in your cluster. Thankfully, the more painful of the two, moving datasets, is taken care of by Amazon SageMaker. As for the training code, you already have a Docker container ready; you simply need to push it to a container registry, and Amazon SageMaker will then pull it into each of the training compute instances in the cluster.

Note: By default, SageMaker pulls training images from Amazon Elastic Container Registry (Amazon ECR) rather than from an external registry such as DockerHub, so you need to push the SageMaker-compatible RAPIDS container to Amazon ECR to make it available to Amazon SageMaker.

ECR_container_fullname = (
    f"{account}.dkr.ecr.{region}.amazonaws.com/{estimator_info['ecr_image']}"
)
ECR_container_fullname
'561241433344.dkr.ecr.us-west-2.amazonaws.com/sagemaker-rapids-higgs:22.12-cuda11.5-runtime-ubuntu18.04-py3.9'
!docker tag {estimator_info['ecr_image']} {ECR_container_fullname}
print(
    f"source      : {estimator_info['ecr_image']}\n"
    f"destination : {ECR_container_fullname}"
)
source      : rapidsai/rapidsai-core:22.12-cuda11.5-runtime-ubuntu18.04-py3.9
destination : 561241433344.dkr.ecr.us-west-2.amazonaws.com/sagemaker-rapids-higgs:22.12-cuda11.5-runtime-ubuntu18.04-py3.9
!aws ecr create-repository --repository-name {estimator_info['ecr_repository']}
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com
{
    "repository": {
        "repositoryArn": "arn:aws:ecr:us-west-2:561241433344:repository/sagemaker-rapids-higgs",
        "registryId": "561241433344",
        "repositoryName": "sagemaker-rapids-higgs",
        "repositoryUri": "561241433344.dkr.ecr.us-west-2.amazonaws.com/sagemaker-rapids-higgs",
        "createdAt": 1675720898.0,
        "imageTagMutability": "MUTABLE",
        "imageScanningConfiguration": {
            "scanOnPush": false
        },
        "encryptionConfiguration": {
            "encryptionType": "AES256"
        }
    }
}
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
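Re-running the `create-repository` cell fails once the repository exists. As a hedged alternative, the same step can be made repeatable from Python with the Boto3 ECR client. The function name `ensure_ecr_repository` is hypothetical; it relies only on the client's `create_repository` and `describe_repositories` calls and takes the client as a parameter.

```python
def ensure_ecr_repository(ecr_client, repository_name):
    """Return the repository URI, creating the repository only if it is missing."""
    try:
        response = ecr_client.create_repository(repositoryName=repository_name)
        return response["repository"]["repositoryUri"]
    except ecr_client.exceptions.RepositoryAlreadyExistsException:
        # The repository was created earlier, so look up its URI instead.
        response = ecr_client.describe_repositories(
            repositoryNames=[repository_name]
        )
        return response["repositories"][0]["repositoryUri"]


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# uri = ensure_ecr_repository(boto3.client("ecr"), "sagemaker-rapids-higgs")
```

Passing the client in keeps the function easy to exercise with a stub before touching a real account.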
!docker push {ECR_container_fullname}
The push refers to repository [561241433344.dkr.ecr.us-west-2.amazonaws.com/sagemaker-rapids-higgs]

601675bf: Preparing 
a211643c: Preparing 
51d8b000: Preparing 
f7b7f229: Preparing 
48598b79: Preparing 
2b6403fc: Preparing 
ca9f5267: Preparing 
e36e26b2: Preparing 
2c4843ad: Preparing 
01675bf: Pushed
22.12-cuda11.5-runtime-ubuntu18.04-py3.9: digest: sha256:959a2e80642e881ef99705473d95165cda8383543cff4ae5ca554da782021e47 size: 2432

Testing your Amazon SageMaker compatible RAPIDS container locally#

Before you spend time and money running a large experiment on a large cluster, you should run a local Amazon SageMaker training job to ensure the container performs as expected. Make sure you have the SageMaker SDK installed on your local machine.

Define some default hyperparameters. Take your best guess; you can find the full list of RandomForest hyperparameters on the cuML docs page.

hyperparams = {
    "n_estimators": 15,
    "max_depth": 5,
    "n_bins": 8,
    "split_criterion": 0,  # GINI:0, ENTROPY:1
    "bootstrap": 0,  # true: sample with replacement, false: sample without replacement
    "max_leaves": -1,  # unlimited leaves
    "max_features": 0.2,
}
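SageMaker hands hyperparameters to the container as strings, so the training script has to convert them back to numbers and booleans. The following is a minimal sketch of how a script might do that with `argparse`, assuming the SageMaker Training toolkit passes each hyperparameter as a `--name value` command-line argument. The function name `parse_hyperparameters` is hypothetical and this is not necessarily how the training script in this example is written.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse RandomForest hyperparameters from command-line style arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=15)
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--n_bins", type=int, default=8)
    parser.add_argument("--split_criterion", type=int, default=0)
    # Accept 0/1 as well as true/false, since values arrive as strings.
    parser.add_argument(
        "--bootstrap",
        type=lambda value: str(value).lower() in ("1", "true"),
        default=False,
    )
    parser.add_argument("--max_leaves", type=int, default=-1)
    parser.add_argument("--max_features", type=float, default=0.2)
    # Ignore any extra arguments that are not listed above.
    args, _unknown = parser.parse_known_args(argv)
    return vars(args)


parsed = parse_hyperparameters(
    ["--n_estimators", "15", "--bootstrap", "0", "--max_leaves", "-1", "--max_features", "0.2"]
)
print(parsed)
```

Using `parse_known_args` keeps the script tolerant of additional arguments it does not recognize.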

Now, specify the instance type. Setting instance_type to local_gpu runs the job on your own machine and assumes that you have a GPU locally. If you don’t have a local GPU, you can test on an Amazon SageMaker managed GPU instance instead: simply replace local_gpu with a p3 or p2 GPU instance by updating the instance_type variable, as the cell below does with ml.p3.2xlarge.

from sagemaker.estimator import Estimator

rapids_estimator = Estimator(
    image_uri=ECR_container_fullname,
    role=execution_role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # or "local_gpu" to run on a local GPU
    max_run=60 * 60 * 24,
    max_wait=(60 * 60 * 24) + 1,
    use_spot_instances=True,
    hyperparameters=hyperparams,
    metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)
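The `metric_definitions` entry tells SageMaker how to find the objective metric in the training logs: it applies the regular expression to the log output and captures the number in the first group. A minimal sketch of that matching using Python's `re` module follows; the helper `extract_metric` and the sample log lines are hypothetical, and SageMaker's own matching may differ in detail.

```python
import re

# Same pattern as in metric_definitions above.
METRIC_REGEX = "test_acc: ([0-9\\.]+)"


def extract_metric(log_line, pattern=METRIC_REGEX):
    """Return the captured metric as a float, or None if the line does not match."""
    match = re.search(pattern, log_line)
    return float(match.group(1)) if match else None


# Hypothetical log lines, for illustration only.
print(extract_metric("fold 0 test_acc: 0.7312"))
print(extract_metric("loading dataset"))
```

The training script therefore needs to print a line in exactly this format, otherwise the tuner will not see any metric values.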
%%time
rapids_estimator.fit(inputs=s3_data_dir)
INFO:sagemaker:Creating training-job with name: sagemaker-rapids-higgs-2023-02-07-03-57-40-523
2023-02-07 03:57:41 Starting - Starting the training job...
2023-02-07 03:58:10 Starting - Preparing the instances for training.........
2023-02-07 03:59:21 Downloading - Downloading input data........................
2023-02-07 04:03:38 Training - Training image download completed. Training in progress...[WARN  tini (7)] Tini is not running as PID 1 .
Zombie processes will not be re-parented to Tini, so zombie reaping won't work.
To fix the problem, run Tini as PID 1.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.download.nvidia.com/licenses/NVIDIA_Deep_Learning_Container_License.pdf
[I 2023-02-07 04:03:52.469 ServerApp] dask_labextension | extension was successfully linked.
[I 2023-02-07 04:03:52.470 ServerApp] jupyter_server_proxy | extension was successfully linked.
[W 2023-02-07 04:03:52.472 LabApp] 'token' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2023-02-07 04:03:52.472 LabApp] 'allow_origin' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2023-02-07 04:03:52.472 LabApp] 'base_url' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[I 2023-02-07 04:03:52.476 ServerApp] jupyterlab | extension was successfully linked.
[I 2023-02-07 04:03:52.476 ServerApp] jupyterlab_nvdashboard | extension was successfully linked.
[I 2023-02-07 04:03:52.480 ServerApp] nbclassic | extension was successfully linked.
[I 2023-02-07 04:03:52.481 ServerApp] Writing Jupyter server cookie secret to /root/.local/share/jupyter/runtime/jupyter_cookie_secret
[I 2023-02-07 04:03:52.485 ServerApp] notebook_shim | extension was successfully linked.
[I 2023-02-07 04:03:52.485 ServerApp] panel.io.jupyter_server_extension | extension was successfully linked.
[W 2023-02-07 04:03:52.508 ServerApp] All authentication is disabled.  Anyone who can connect to this server will be able to run code.
[I 2023-02-07 04:03:52.510 ServerApp] notebook_shim | extension was successfully loaded.
[I 2023-02-07 04:03:52.511 ServerApp] dask_labextension | extension was successfully loaded.
[I 2023-02-07 04:03:53.114 ServerApp] jupyter_server_proxy | extension was successfully loaded.
[I 2023-02-07 04:03:53.115 LabApp] JupyterLab extension loaded from /opt/conda/envs/rapids/lib/python3.9/site-packages/jupyterlab
[I 2023-02-07 04:03:53.115 LabApp] JupyterLab application directory is /opt/conda/envs/rapids/share/jupyter/lab
[I 2023-02-07 04:03:53.119 ServerApp] jupyterlab | extension was successfully loaded.
[W 2023-02-07 04:03:53.119 ServerApp] jupyterlab_nvdashboard | extension failed loading with message: 'NoneType' object is not callable
[E 2023-02-07 04:03:53.119 ServerApp] jupyterlab_nvdashboard | stack trace
    Traceback (most recent call last):
      File "/opt/conda/envs/rapids/lib/python3.9/site-packages/jupyter_server/extension/manager.py", line 355, in load_extension
        extension.load_all_points(self.serverapp)
      File "/opt/conda/envs/rapids/lib/python3.9/site-packages/jupyter_server/extension/manager.py", line 229, in load_all_points
        return [self.load_point(point_name, serverapp) for point_name in self.extension_points]
      File "/opt/conda/envs/rapids/lib/python3.9/site-packages/jupyter_server/extension/manager.py", line 229, in <listcomp>
        return [self.load_point(point_name, serverapp) for point_name in self.extension_points]
      File "/opt/conda/envs/rapids/lib/python3.9/site-packages/jupyter_server/extension/manager.py", line 222, in load_point
        return point.load(serverapp)
      File "/opt/conda/envs/rapids/lib/python3.9/site-packages/jupyter_server/extension/manager.py", line 148, in load
        return loader(serverapp)
    TypeError: 'NoneType' object is not callable
[I 2023-02-07 04:03:53.123 ServerApp] nbclassic | extension was successfully loaded.
[I 2023-02-07 04:03:53.124 ServerApp] panel.io.jupyter_server_extension | extension was successfully loaded.
[I 2023-02-07 04:03:53.125 ServerApp] Serving notebooks from local directory: /rapids/notebooks
[I 2023-02-07 04:03:53.125 ServerApp] Jupyter Server 1.23.3 is running at:
[I 2023-02-07 04:03:53.125 ServerApp] http://ip-10-0-233-172.us-west-2.compute.internal:8888/lab
[I 2023-02-07 04:03:53.125 ServerApp]  or http://127.0.0.1:8888/lab
[I 2023-02-07 04:03:53.125 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

Congrats, you successfully trained your Random Forest model on the HIGGS dataset using an Amazon SageMaker compatible RAPIDS container. Now you are ready to run experiments on a cluster to try out different hyperparameters and options in parallel.

Define hyperparameter ranges and run a large-scale search experiment#

Few code changes are required to go from local training to training at scale. First, rather than define a fixed set of hyperparameters, you’ll define a range for each one using the SageMaker SDK:

from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

hyperparameter_ranges = {
    "n_estimators": IntegerParameter(10, 200),
    "max_depth": IntegerParameter(1, 22),
    "n_bins": IntegerParameter(5, 24),
    "split_criterion": CategoricalParameter([0, 1]),
    "bootstrap": CategoricalParameter([True, False]),
    "max_features": ContinuousParameter(0.01, 0.5),
}
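To get a feel for the search space these ranges describe, the sketch below draws random configurations from the same bounds in plain Python. It is an illustration only: the `search_space` dictionary and `sample_configuration` helper are hypothetical, and uniform random sampling is not how SageMaker's Bayesian strategy chooses candidates.

```python
import random

# Same bounds as hyperparameter_ranges above, written as plain tuples.
search_space = {
    "n_estimators": ("int", 10, 200),
    "max_depth": ("int", 1, 22),
    "n_bins": ("int", 5, 24),
    "split_criterion": ("cat", [0, 1]),
    "bootstrap": ("cat", [True, False]),
    "max_features": ("float", 0.01, 0.5),
}


def sample_configuration(space, rng):
    """Draw one configuration uniformly at random from the search space."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "int":
            config[name] = rng.randint(spec[1], spec[2])  # both ends inclusive
        elif kind == "float":
            config[name] = rng.uniform(spec[1], spec[2])
        else:
            config[name] = rng.choice(spec[1])
    return config


rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
samples = [sample_configuration(search_space, rng) for _ in range(5)]
for sample in samples:
    print(sample)
```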

Next, you’ll change the instance type to the actual GPU instance you want to train on in the cloud. Here you’ll choose ml.p3.8xlarge, an Amazon SageMaker compute instance with 4 NVIDIA Tesla V100 GPUs. If you have a training script that can leverage multiple GPUs, you can choose up to 8 GPUs per instance for faster training.

from sagemaker.estimator import Estimator

rapids_estimator = Estimator(
    image_uri=ECR_container_fullname,
    role=execution_role,
    instance_count=2,
    instance_type="ml.p3.8xlarge",
    max_run=60 * 60 * 24,
    max_wait=(60 * 60 * 24) + 1,
    use_spot_instances=True,
    hyperparameters=hyperparams,
    metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)
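With `use_spot_instances=True`, `max_run` caps the training time and `max_wait` caps the total time including waiting for spot capacity, so `max_wait` must be at least as large as `max_run`. The arithmetic in the estimator can be checked with a small sketch; the helper `check_spot_limits` is hypothetical.

```python
def check_spot_limits(max_run: int, max_wait: int) -> int:
    """Return the seconds available for waiting on spot capacity, or raise if invalid."""
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return max_wait - max_run


max_run = 60 * 60 * 24        # 24 hours of training time, in seconds
max_wait = (60 * 60 * 24) + 1  # total limit, including time spent waiting
slack = check_spot_limits(max_run, max_wait)
print(max_run, slack)
```

With the values used above, only one second is left for waiting on spot capacity, so a larger `max_wait` may be worth considering if spot instances are scarce.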

Now you define a HyperparameterTuner object using the estimator you defined above.

tuner = HyperparameterTuner(
    rapids_estimator,
    objective_metric_name="test_acc",
    hyperparameter_ranges=hyperparameter_ranges,
    strategy="Bayesian",
    max_jobs=2,
    max_parallel_jobs=2,
    objective_type="Maximize",
    metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)
job_name = "rapidsHPO" + time.strftime("%Y-%m-%d-%H-%M-%S-%j", time.gmtime())
tuner.fit({"dataset": s3_data_dir}, job_name=job_name)
WARNING:sagemaker.estimator:No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config
INFO:sagemaker:Creating hyperparameter tuning job with name: rapidsHPO2023-02-07-16-09-47-038
...
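Once the tuning job finishes, the individual training jobs can be compared by their final objective metric. The sketch below picks the best job from a list shaped like the `TrainingJobSummaries` returned by boto3's `list_training_jobs_for_hyper_parameter_tuning_job`. The helper `best_training_job`, the job names, and the metric values are placeholders for illustration, not results from this experiment.

```python
def best_training_job(summaries, maximize=True):
    """Return the completed job summary with the best final objective metric, or None."""
    completed = [
        s
        for s in summaries
        if s.get("TrainingJobStatus") == "Completed"
        and "FinalHyperParameterTuningJobObjectiveMetric" in s
    ]
    if not completed:
        return None

    def objective(summary):
        return summary["FinalHyperParameterTuningJobObjectiveMetric"]["Value"]

    return max(completed, key=objective) if maximize else min(completed, key=objective)


# Placeholder summaries; real ones would come from the SageMaker API.
example_summaries = [
    {
        "TrainingJobName": "example-job-001",
        "TrainingJobStatus": "Completed",
        "TunedHyperParameters": {"max_depth": "12", "n_estimators": "80"},
        "FinalHyperParameterTuningJobObjectiveMetric": {"MetricName": "test_acc", "Value": 0.5},
    },
    {
        "TrainingJobName": "example-job-002",
        "TrainingJobStatus": "Completed",
        "TunedHyperParameters": {"max_depth": "18", "n_estimators": "150"},
        "FinalHyperParameterTuningJobObjectiveMetric": {"MetricName": "test_acc", "Value": 0.75},
    },
    {"TrainingJobName": "example-job-003", "TrainingJobStatus": "Failed"},
]

best = best_training_job(example_summaries)
print(best["TrainingJobName"], best["TunedHyperParameters"])
```

The SageMaker SDK also exposes `tuner.best_training_job()`, which returns the name of the best job directly when you still have the tuner object available.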

Clean up#

  • Delete S3 buckets and files you don’t need

  • Kill training jobs that you don’t want running

  • Delete container images and the repository you just created

aws ecr delete-repository --force --repository-name
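The cleanup steps above could also be scripted with boto3. The following is a minimal sketch, assuming you pass in boto3 `ecr` and `sagemaker` clients; the function name `clean_up` is hypothetical, and the tuning job should only be named if it is still running.

```python
def clean_up(ecr_client, sagemaker_client, repository_name, tuning_job_name=None):
    """Stop a running tuning job (optional) and delete an ECR repository with its images."""
    if tuning_job_name is not None:
        # Stops the tuning job and the training jobs it launched.
        sagemaker_client.stop_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=tuning_job_name
        )
    # force=True deletes the repository even if it still contains images.
    ecr_client.delete_repository(repositoryName=repository_name, force=True)


# Example wiring (requires AWS credentials, so it is left commented out):
# import boto3
# clean_up(
#     boto3.client("ecr"),
#     boto3.client("sagemaker"),
#     repository_name=estimator_info["ecr_repository"],
# )
```

Passing the clients in as arguments keeps the function easy to try out against stand-in objects before pointing it at a real account.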