Caching Docker Images For Autoscaling Workloads#
The Dask Autoscaler leverages Dask’s adaptive mode and allows the scheduler to scale the number of workers up and down based on the task graph.
When scaling the Dask cluster up or down, there is no guarantee that newly created worker pods will be scheduled on the same node as previously removed workers. As a result, when a new node is allocated for a worker pod, the cluster will incur a pull penalty due to the need to download the Docker image.
Using a Daemonset to cache images#
To guarantee that each node runs a consistent workload, we will deploy a Kubernetes DaemonSet utilizing the RAPIDS image. This DaemonSet will prevent Dask worker pods created from this image from entering a pending state when tasks are scheduled.
This is an example manifest to deploy a Daemonset with the RAPIDS container.
#caching-daemonset.yaml apiVersion: apps/v1 kind: DaemonSet metadata: name: prepuller namespace: image-cache spec: selector: matchLabels: name: prepuller template: metadata: labels: name: prepuller spec: initContainers: - name: prepuller-1 image: "rapidsai/base:24.12a-cuda12.5-py3.12" command: ["sh", "-c", "'true'"] containers: - name: pause image: gcr.io/google_containers/pause:3.2 resources: limits: cpu: 1m memory: 8Mi requests: cpu: 1m memory: 8Mi
You can create this Daemonset with kubectl
.
$ kubectl apply -f caching-daemonset.yaml
The DaemonSet is deployed in the image-cache
namespace. In the initContainers
section, we specify the image to be pulled and cached within the cluster, utilizing any executable command that terminates successfully. Additionally, the pause
container is used to ensure the pod transitions into a Running state without consuming resources or running any processes.
When deploying the DaemonSet, after all pre-puller pods are running successfully, you can confirm that the images have been cached across all nodes in the cluster. As the Kubernetes cluster is scaled up or down, the DaemonSet will automatically pull and cache the necessary images on any newly added nodes, ensuring consistent image availability throughout