Azure Kubernetes Service#

RAPIDS can be deployed on Azure via the Azure Kubernetes Service (AKS).

To run RAPIDS you’ll need a Kubernetes cluster with GPUs available.

Prerequisites#

First, install the az CLI along with kubectl, helm, and any other tools you use for managing Kubernetes.
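As a quick sanity check before continuing, the sketch below reports which of the three tools named above are not on your PATH. The helper name `missing_tools` is made up for this example and is not part of any of the tools.

```shell
# Print the name of every given tool that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v prints nothing and returns non-zero for unknown commands
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing="$(missing_tools az kubectl helm)"
if [ -n "$missing" ]; then
    # Left unquoted on purpose so the names print on a single line
    echo "Missing tools:" $missing
else
    echo "All required tools found"
fi
```

If anything is reported as missing, install it before moving on to the next step.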

Ensure you are logged into the az CLI.

$ az login

Create the Kubernetes cluster#

Now we can launch a GPU-enabled AKS cluster. Start by creating the base cluster.

$ az aks create -g <resource group> -n rapids \
        --enable-managed-identity \
        --node-count 1 \
        --enable-addons monitoring \
        --enable-msi-auth-for-monitoring  \
        --generate-ssh-keys

Once the cluster has been created, we need to pull the credentials into our local config.

$ az aks get-credentials -g <resource group> --name rapids
Merged "rapids" as current context in ~/.kube/config

Next we need to add an additional node group with GPUs, which you can learn more about in the Azure docs.

Note

You will need the GPUDedicatedVHDPreview feature enabled so that NVIDIA drivers are installed automatically.

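You can check whether it is enabled with the following command. A State of NotRegistered, as in the example output, means the feature is not yet enabled for your subscription.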

$ az feature list -o table --query "[?contains(name, 'Microsoft.ContainerService/GPUDedicatedVHDPreview')].{Name:name,State:properties.state}"
Name                                               State
-------------------------------------------------  -------------
Microsoft.ContainerService/GPUDedicatedVHDPreview  NotRegistered
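If the state is NotRegistered, a registration along the following lines may be what you need. This is a sketch that assumes the preview feature is still offered for your subscription. The wrapper function `register_gpu_vhd_feature` is a hypothetical name used only to group the two commands.

```shell
# Assumption: GPUDedicatedVHDPreview is still available as a preview feature.
register_gpu_vhd_feature() {
    # Request the preview feature for the current subscription
    az feature register \
        --namespace Microsoft.ContainerService \
        --name GPUDedicatedVHDPreview
    # Refresh the resource provider registration so the change is picked up
    az provider register --namespace Microsoft.ContainerService
}
```

Call `register_gpu_vhd_feature` once, then re-run the check above. Registration can take a while to complete. Once the state reads Registered, add the GPU node pool: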

$ az aks nodepool add \
    --resource-group <resource group> \
    --cluster-name rapids \
    --name gpunp \
    --node-count 1 \
    --node-vm-size Standard_NC48ads_A100_v4 \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 3

Here we have added a new pool made up of Standard_NC48ads_A100_v4 instances, each of which has two A100 GPUs. We’ve also enabled autoscaling between one and three nodes on the pool.

Then we can install the NVIDIA drivers using the NVIDIA GPU Operator Helm chart.

$ helm install --wait --generate-name --repo https://helm.ngc.nvidia.com/nvidia \
    -n gpu-operator --create-namespace \
    gpu-operator \
    --set operator.runtimeClass=nvidia-container-runtime
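Before scheduling a workload, it can help to confirm that the nodes advertise GPUs to Kubernetes. The sketch below lists each node with its allocatable `nvidia.com/gpu` count. The function name `list_gpu_capacity` is illustrative only.

```shell
# Show every node alongside the number of GPUs Kubernetes can allocate on it.
list_gpu_capacity() {
    # The backslash escapes the dot inside the nvidia.com/gpu resource name
    kubectl get nodes \
        -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

Run `list_gpu_capacity` once the operator has finished rolling out. Nodes without GPUs show `<none>` in the GPUS column, while nodes in the GPU pool should report a count.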

Once our new pool has been created and configured, we can test the cluster.

Let’s create a sample Pod that uses some GPU compute to make sure that everything is working as expected.

cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.6.0-ubuntu18.04"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
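The Pod needs a moment to pull its image and run, so reading the logs straight away may fail or show nothing. One way to block until it finishes is sketched below. It assumes kubectl 1.23 or newer, where `kubectl wait` accepts a JSONPath condition. The function name and the 300s default timeout are choices made for this example.

```shell
# Block until the named Pod reaches the Succeeded phase, or the timeout expires.
wait_for_pod_succeeded() {
    pod_name="$1"
    timeout="${2:-300s}"   # assumed default, adjust to your image pull times
    kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
        "pod/${pod_name}" --timeout="${timeout}"
}
```

For the sample above this would be `wait_for_pod_succeeded cuda-vectoradd`, after which the logs can be read.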

$ kubectl logs pod/cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

If you see Test PASSED in the output, you can be confident that your Kubernetes cluster has GPU compute set up correctly.
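If you would rather check this in a script than by eye, a small filter like the one below could be used. The helper name `vectoradd_passed` is hypothetical. It simply searches the log text for the success line shown above.

```shell
# Read log text on stdin and succeed only if the sample reported success.
vectoradd_passed() {
    grep -q "Test PASSED"
}
```

A possible invocation is `kubectl logs pod/cuda-vectoradd | vectoradd_passed && echo "GPU test passed"`.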

Next, clean up that Pod.

$ kubectl delete pod cuda-vectoradd
pod "cuda-vectoradd" deleted

At this point we have confirmed that the cluster can schedule GPU Pods.

Install RAPIDS#

Now that you have a GPU-enabled Kubernetes cluster on AKS, you can install RAPIDS with any of the supported methods.

Clean up#

When you are finished, you can delete the AKS cluster to stop billing with the following command.

$ az aks delete -g <resource group> -n rapids
/ Running ..
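For scripted teardown, a non-interactive variant might look like the sketch below. The function name `delete_cluster` is made up for this example, and the default cluster name matches the one used throughout this page.

```shell
# Delete an AKS cluster without prompting and without waiting for completion.
delete_cluster() {
    resource_group="$1"
    cluster_name="${2:-rapids}"
    # --yes skips the confirmation prompt; --no-wait returns immediately
    az aks delete -g "$resource_group" -n "$cluster_name" --yes --no-wait
}
```

Because `--no-wait` returns before the deletion has finished, the cluster may remain visible in your resource group for a short time afterwards.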