Running GPU Containers in Kubernetes

Hello everyone, I hope you all had a great holiday break and are starting the new year off well. I'm back, and as ever I spent the break working on my homelab. (Okay, and a fair amount of time in Satisfactory as well.)

Today we're getting into some of the more fun stuff, working with GPUs in Kubernetes.

To recap so far, I've talked about setting up hypervisors at home and setting up a basic Kubernetes cluster. In case you don't know, Kubernetes is simply an orchestration layer that schedules your pods across different VMs.

This orchestration makes running numerous pods, whether on-premises or at scale, a breeze, with only minimal setup needed on the machine itself. (If you're curious about how I run Kubernetes at home, I recommend checking out my post on k3s, a lightweight Kubernetes distribution.)

Why choose Kubernetes for GPU workloads?

It's a fair question: why go through the extra orchestration layer at all? Hey Rob, I have VMs, I have GPUs, why can't I just run my workloads on the VMs directly? You absolutely can, and I did for quite a while. GPUs are massive workhorses; pass one directly into your VM and you can transcode video or train models right there.

As mentioned, though, the problem is scheduling those workloads. Without an orchestration layer you are left with the task of deciding which workload runs on which node yourself. Both on-premises and in the cloud that is a daunting task, and adding graphics cards to the mix only makes it more complex.

On-premises/homelabbing, you have a finite number of graphics cards. I won't say how many I personally have, but it's more than two and I can count them on one hand. Having a finite pool of resources makes scheduling those workloads crucial. If I have a transcode or training job to run, I don't want to spend time figuring out which nodes are running what and which one is almost done so I can start the next; I need those workloads to be queued on their own.

The cloud has a very different problem: cost. While there is an "infinite" amount of compute power available, GPU workloads are expensive, starting-at-$10-an-hour expensive. If you're running in the cloud, your goals are also directly tied to scheduling. When you're paying that much per hour, you want to start a job as quickly as you can, use as much of the power you're paying for as possible, and then stop the node as quickly as you can so you aren't paying for reserved cores you no longer need.

Both of these are prime use cases for Kubernetes on their own, but what if we could have the best of both worlds? As a company, is there another option?

The Hybrid-Model

Admittedly this is stepping outside of homelabbing, but let's build a scenario. You're a company that needs to run heavy GPU workloads. You see the prices that AWS, GCP, and Azure are charging and do the obvious jaw drop. Let's analyze your workload.

Most GPU workloads have a baseline of things running, and most are fairly predictable. You may be training models regularly, or transcoding video regularly, and you have metrics that show roughly how many jobs you are running at any given time. For simplicity's sake, let's say you are running 10 GPU workloads on average at any given time, but the problem is that you can sometimes spike. Maybe there's an event and everyone uploads videos?

With hard iron, it's cheaper to simply buy some GPUs, but it can't scale. In the cloud you can scale easily, but the cost of keeping those 10 GPUs running around the clock is substantial.

Kubernetes again solves this problem easily. Out of the box you can attach on-premises nodes (or nodes in a datacenter, anywhere really) to your cloud-based cluster. If you run Kubernetes on AWS or Azure, you can attach on-premises nodes to that cluster, and even better, you can give those nodes priority over the cloud nodes.

To use our example, for that average 10-GPU workload let's say we buy two servers with 12 GPUs between them. We attach them to our cloud-based Kubernetes cluster and give priority to the on-premises nodes. With that alone, our baseline 10-GPU workload is now completely runnable on-premises, and we're only paying the datacenter cost of running the servers we bought (negligible compared to running them in the cloud) plus the negligible cost of orchestrating Kubernetes in the cloud.
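How do you "give priority to the on-premises nodes"? There's more than one way, but a common one is preferred node affinity keyed on a label you put on your own hardware. Here's a minimal sketch, assuming a made-up label like node-location=on-prem on the on-premises nodes; this fragment goes inside a pod (or job template) spec:

# Prefer nodes labelled node-location=on-prem, but still allow
# scheduling onto cloud nodes when the on-prem ones are full.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: node-location
          operator: In
          values:
          - on-prem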

Then we can attach an autoscaler on the cloud provider's side so that when demand does go up, a few more GPU nodes are quickly and easily added to the cluster, knowing that when demand goes back down those extra nodes will be removed and our cost minimized!

Kubernetes really does bring immediate value when running GPUs. Okay, I've talked about the why enough; it's time to get into some of the nuts and bolts.

How does Kubernetes schedule pods?

Before we can actually add a GPU to the cluster, let's pause and talk about how the Kubernetes scheduler works, at least at a high level.

The scheduler in Kubernetes is in charge of deciding where your pod is going to run. There are many factors in how it decides, and I'm going to stay very high level for this explanation. Things it weighs include:

- The CPU, memory, and other resources the pod requests versus what each node has free
- Labels on nodes, and node selectors or affinity rules on the pod
- Taints on nodes and tolerations on the pod
- Spreading rules, such as keeping replicas on different nodes or in different zones

There's of course more, and there's plenty of nuance there. What I want to convey is that the scheduler is flexible. You can set up rules that say things like:

- "Only run this pod on nodes that have a certain label"
- "Never put two replicas of this service on the same node"
- "Prefer this group of nodes, but fall back to the others if they're full"

Any one of these is doable and valid when scheduling pods. So, we know it's flexible. Let's actually get started.
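Before we do, here's a quick illustration of the first kind of rule: a nodeSelector that pins a pod to nodes carrying a label you've applied yourself (the disktype: nvme label here is made up for the example):

# A pod that will only schedule onto nodes labelled disktype=nvme
apiVersion: v1
kind: Pod
metadata:
  name: needs-fast-disk
spec:
  nodeSelector:
    disktype: nvme
  containers:
  - name: app
    image: busybox:latest
    command: ["sleep", "3600"]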

Preparing the cluster (on-premise / homelab only)

Most cloud clusters will come prepared for GPU workloads, but in case yours doesn't, or you're like me and prefer the hard-iron approach, let's talk about how to pass NVidia GPUs into our containers.

Installing the NVidia Container Toolkit

GPU support is not enabled out of the box when you install k3s, Docker, or any of the other container runtimes. It's enabled by installing the NVidia Container Toolkit.

There are multiple how-tos on that site; I am running a Debian-based setup, so I followed the Debian steps. Install the drivers, add the repository, run the install command. Honestly, installing the toolkit is pretty simple, and it's mandatory even if you just want to run a container with docker run --gpus.
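For reference, the Debian steps boil down to something like the following. This is only a sketch based on the toolkit's documented apt repository at the time of writing; double-check the official install guide for the current key and repository setup, and note the CUDA image tag in the smoke test is just an example.

# Add NVidia's apt repository and signing key (verify against the current toolkit docs)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -sL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit itself
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Quick sanity check with plain Docker (assumes the NVidia drivers are already installed)
sudo docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi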

Now comes the more complex part: making GPUs available to Kubernetes. Again, I'm using k3s as described in my previous blog post; many cloud providers will have this set up for you already.

Installing the NVidia Device Plugin

The next step is to install the NVidia Device Plugin. The device plugin exposes which nodes have GPUs, reports their health, and lets Kubernetes hand GPUs out to pods. It will also automatically apply some labels to your nodes and expose metadata.

This is available as a Helm chart that can be easily applied to your cluster. Really, this is a set-it-and-forget-it operation; in the now two years of running my cluster, I have not needed to even think about it. (I should probably check to see if there's an update...)

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace nvidia-device-plugin --create-namespace --version 0.14.3 --set runtimeClassName=nvidia
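Once the chart is installed, a quick way to confirm it's working is to check that the plugin's pods are running and that your GPU node now advertises nvidia.com/gpu as a resource (the node name below is a placeholder):

# The plugin runs as a daemonset pod on each GPU node
kubectl -n nvidia-device-plugin get pods

# The node's Capacity/Allocatable sections should now list nvidia.com/gpu
kubectl describe node <your-gpu-node> | grep -i "nvidia.com/gpu"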

Adding the NVidia Runtime Class

The runtime class is one of the lowest-level options Kubernetes gives you for controlling how your containers actually run. Think of it as a way of telling Kubernetes which container runtime a pod should run on, for example Docker versus containerd. We're going to add another option, nvidia, so we can mark pods as needing to run through the NVidia Container Toolkit described above.

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia

Simply save that as a YAML file (I just named mine runtime.yaml) and apply it with kubectl apply -f runtime.yaml. Congrats, you have now added the nvidia runtime. Pods can now be run on the NVidia Container Toolkit!
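If you want to verify the whole chain before moving on, a one-off pod that uses the new runtime class and runs nvidia-smi works nicely. A minimal sketch; the CUDA image tag is just an example, any image containing nvidia-smi will do:

# A quick smoke test: request one GPU via the nvidia runtime class and print what it sees
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1

Apply it, and kubectl logs gpu-smoke-test should print the familiar nvidia-smi table.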

Let's finally start running some workloads!

Running a GPU pod

Here's a very basic Kubernetes job.

I'm choosing a Job because most workloads will be something we want to run once and then close out. There are cases for a Deployment, where an API holds onto a GPU indefinitely, but many of our harder-to-schedule workloads need a GPU, run for several hours, and then complete.

apiVersion: batch/v1
kind: Job
metadata:
  name: myheavyworkload
  labels:
    job: myheavyworkload
spec:
  template:
    metadata:
      name: heavyworkload
      labels:
        job: myheavyworkload
    spec:
      containers:
      - name: heavyworkload
        image: busybox:latest
        imagePullPolicy: Always
        env:
        - name: ENVIRONMENT_VARIABLE
          value: foobar
        resources:
          requests:
            cpu: 2
      restartPolicy: Never
  backoffLimit: 4

Let's look at this quickly. This is a fairly simple Job; for the example it uses the busybox:latest Docker image. We set one environment variable and a few labels on it. Towards the bottom we have the resources section, where we request two CPUs. The pod is set to never restart, and if the job fails it will be retried up to four more times before giving up.

Hopefully this is pretty standard to people, even if you don't know kubernetes well I'm hoping that reading through that configuration you can see how it's laid out.

So how do we add a GPU to it? It must be very complex right? Well nope, I'd say the most complex things are behind us! Let's see what it takes to add a GPU to this workload now.

apiVersion: batch/v1
kind: Job
metadata:
  name: myheavyworkload
  labels:
    job: myheavyworkload
spec:
  template:
    metadata:
      name: heavyworkload
      labels:
        job: myheavyworkload
    spec:
      runtimeClassName: nvidia
      containers:
      - name: heavyworkload
        image: busybox:latest
        imagePullPolicy: Always
        env:
        - name: ENVIRONMENT_VARIABLE
          value: foobar
        resources:
          requests:
            cpu: 2
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Never
  backoffLimit: 4

What changed? Well, we set runtimeClassName: nvidia, which as I described above means we explicitly want to use the NVidia Container Toolkit. Then we also set a limit of one nvidia.com/gpu.

That's it! That's all it takes to say you want this container to use a GPU. When you submit the job, Kubernetes will find a node with a GPU available and start the pod as soon as it can. If no GPU is available, the pod will remain Pending until one frees up, either from another job finishing or from an autoscaler adding more nodes based on whatever rules you set.
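To tie it together, here's roughly how submitting and watching that job looks from the command line (the filename is just whatever you saved the manifest as):

kubectl apply -f gpu-job.yaml

# Watch the job's pod get scheduled, or sit in Pending if no GPU is free
kubectl get pods -l job=myheavyworkload -w

# If it stays Pending, the events usually say why, e.g. "Insufficient nvidia.com/gpu"
kubectl describe pod -l job=myheavyworkload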

I hope you found this post interesting. GPU workloads add another level of complexity, but the freedom of abstracting GPUs away from specific nodes opens up amazing new opportunities. Scheduling pods, re-runnable jobs, dynamically adding more GPUs, starting jobs programmatically: all of these become incredibly easy with Kubernetes!

Thank you for making it this far; if you have any questions, feel free to reach out as always. I'll be adding more social handles, but for now you can find me on LinkedIn. Take care everyone!