Google Kubernetes Engine#

RAPIDS can be deployed on Google Cloud via Google Kubernetes Engine (GKE).

To run RAPIDS you’ll need a Kubernetes cluster with GPUs available.

Prerequisites#

First, you’ll need the gcloud CLI installed, along with kubectl, helm, and any other tools you use for managing Kubernetes.
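As a quick sanity check before continuing, the following minimal sketch reports which of the tools named above are on your PATH. The variable name `missing` is illustrative only, and the check does not verify tool versions.

```shell
# Check that each CLI tool this guide relies on is available on PATH.
missing=0
for tool in gcloud kubectl helm; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
    missing=$((missing + 1))
  fi
done
echo "$missing tool(s) missing"
```

Install anything reported as missing before moving on.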

Ensure you are logged into the gcloud CLI.

$ gcloud init

Create the Kubernetes cluster#

Now we can launch a GPU-enabled GKE cluster.

$ gcloud container clusters create rapids-gpu-kubeflow \
  --accelerator type=nvidia-tesla-a100,count=2,gpu-driver-version=latest --machine-type a2-highgpu-2g \
  --zone us-central1-c --release-channel stable

With this command, you’ve launched a GKE cluster called rapids-gpu-kubeflow. You’ve specified that it should use nodes of type a2-highgpu-2g, each with two A100 GPUs, along with the latest GPU drivers for the current GKE version.
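If you expect to change the cluster name, zone, or GPU type, one option is to keep those values in shell variables. The sketch below rebuilds the same create command from variables and prints it for review rather than running it. The variable names (`CLUSTER_NAME`, `ZONE`, and so on) are hypothetical and not part of gcloud; the values are the ones used in this guide.

```shell
# Hypothetical variable names; the values match the command used in this guide.
CLUSTER_NAME="rapids-gpu-kubeflow"
ZONE="us-central1-c"
MACHINE_TYPE="a2-highgpu-2g"
GPU_TYPE="nvidia-tesla-a100"
GPU_COUNT=2

# Assemble the create command as a single line and print it for review.
create_cmd="gcloud container clusters create ${CLUSTER_NAME} \
--accelerator type=${GPU_TYPE},count=${GPU_COUNT},gpu-driver-version=latest \
--machine-type ${MACHINE_TYPE} --zone ${ZONE} --release-channel stable"
echo "$create_cmd"
```

Once the printed command looks right, paste it into your shell to create the cluster.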

Note

After creating your cluster, if you get a message saying

CRITICAL: ACTION REQUIRED: gke-gcloud-auth-plugin, which is needed for continued use of kubectl, was not found or is not
executable. Install gke-gcloud-auth-plugin for use with kubectl by following https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_plugin

you will need to install the gke-gcloud-auth-plugin before you can get the cluster credentials. To do so, run:

$ gcloud components install gke-gcloud-auth-plugin

Get the cluster credentials#

$ gcloud container clusters get-credentials rapids-gpu-kubeflow \
    --zone us-central1-c

With this command, your kubeconfig is updated with credentials and endpoint information for the rapids-gpu-kubeflow cluster.
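To confirm that kubectl is now pointing at the new cluster, you can compare the active context with the name gcloud is expected to write, which follows the pattern `gke_PROJECT_LOCATION_CLUSTER`. This is a minimal sketch: `expected_context` is a hypothetical helper and `my-project` is a placeholder for your own project ID.

```shell
# Build the context name gcloud writes for a GKE cluster: gke_PROJECT_LOCATION_CLUSTER.
expected_context() {
  printf 'gke_%s_%s_%s\n' "$1" "$2" "$3"
}

# "my-project" is a placeholder; substitute your own project ID.
want="$(expected_context "my-project" "us-central1-c" "rapids-gpu-kubeflow")"
echo "expected context: $want"

# Only query kubectl if it is installed.
if command -v kubectl >/dev/null 2>&1; then
  have="$(kubectl config current-context 2>/dev/null || true)"
  echo "current context:  ${have:-<none>}"
fi
```

If the two names differ, re-run the get-credentials command or switch contexts with `kubectl config use-context`.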

Verify drivers#

Verify that the NVIDIA drivers are successfully installed.

$ kubectl get po -A --watch | grep nvidia
kube-system          nvidia-gpu-device-plugin-medium-cos-h5kkz                       2/2     Running   0          3m42s
kube-system          nvidia-gpu-device-plugin-medium-cos-pw89w                       2/2     Running   0          3m42s
kube-system          nvidia-gpu-device-plugin-medium-cos-wdnm9                       2/2     Running   0          3m42s

Once the GPU device plugin pods are in the Running state, you are ready to test your cluster.
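As a complementary check, you can ask Kubernetes how many GPUs each node advertises as allocatable. The sketch below lists the `nvidia.com/gpu` resource per node and totals it; `sum_gpus` is a hypothetical helper, and nodes without GPUs are assumed to show `<none>` in the table.

```shell
# Sum the GPUS column (second field), skipping the header and non-numeric rows.
sum_gpus() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# Only query the cluster if kubectl is installed.
if command -v kubectl >/dev/null 2>&1; then
  table="$(kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' \
    2>/dev/null || true)"
  echo "$table"
  echo "allocatable GPUs: $(echo "$table" | sum_gpus)"
fi
```

A total of zero usually means the device plugin has not finished registering the GPUs yet.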

Let’s create a sample Pod that uses some GPU compute to make sure that everything is working as expected.

cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.6.0-ubuntu18.04"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

$ kubectl logs pod/cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

If you see Test PASSED in the output, you can be confident that your Kubernetes cluster has GPU compute set up correctly.
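If you would rather script this check than read the log by eye, the following sketch waits for the Pod to finish and then searches its log for the marker. `log_passed` is a hypothetical helper, the 120 second timeout is an arbitrary choice, and waiting on a jsonpath condition assumes kubectl v1.23 or newer.

```shell
# Succeed only if the text on stdin contains the "Test PASSED" marker.
log_passed() {
  grep -q "Test PASSED"
}

# Only talk to the cluster if kubectl is installed.
if command -v kubectl >/dev/null 2>&1; then
  # Wait for the Pod to reach the Succeeded phase, then inspect its log.
  kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
      pod/cuda-vectoradd --timeout=120s \
    && kubectl logs pod/cuda-vectoradd | log_passed \
    && echo "GPU smoke test passed" \
    || echo "GPU smoke test did not pass"
fi
```

Waiting first avoids reading the log while the container is still starting, when it may be empty.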

Next, clean up that Pod.

$ kubectl delete pod cuda-vectoradd
pod "cuda-vectoradd" deleted

Install RAPIDS#

Now that you have a GPU-enabled Kubernetes cluster on GKE, you can install RAPIDS with any of the supported methods.

Clean up#

When you are finished, delete the GKE cluster with the following command to stop billing.

$ gcloud container clusters delete rapids-gpu-kubeflow --zone us-central1-c
Deleting cluster rapids...⠼

Related Examples#

Autoscaling Multi-Tenant Kubernetes Deep-Dive

cloud/gcp/gke tools/dask-operator library/cuspatial library/dask library/cudf data-format/parquet data-storage/gcs platforms/kubernetes

Perform Time Series Forecasting on Google Kubernetes Engine with NVIDIA GPUs

cloud/gcp/gke tools/dask-operator workflow/hpo workflow/xgboost library/dask library/dask-cuda library/xgboost library/optuna data-storage/gcs platforms/kubernetes
