
Running LLMs on Kubernetes

Lena Fuhrimann

Large language models (LLMs) power many modern apps. Chatbots, coding helpers, and document tools all use them. The question isn’t whether you need LLMs, but how to run them well. Kubernetes helps you deploy and manage these heavy workloads next to your other services.

Running LLMs on Kubernetes gives you a few benefits. You get a standard way to deploy them. You can easily manage GPU resources. Furthermore, you can scale up when demand grows. Most importantly, though, you can keep your data private by hosting models yourself instead of calling external APIs.

Here, we’ll look at two ways to deploy LLMs on Kubernetes. First, we’ll cover Ollama for simple setups. Then, we’ll explore vLLM Production Stack for high-traffic scenarios.

Ollama

Ollama is popular because it’s easy to use. You download a model, and it just works. The Ollama Helm Chart brings this same ease to Kubernetes.

What You Need

For CPU-only use, you need Kubernetes 1.16.0 or newer. For GPU support with NVIDIA or AMD cards, you need Kubernetes 1.26.0 or newer.

How to Install

Add the Helm repository and install:

helm repo add otwld https://helm.otwld.com/
helm repo update
helm install ollama otwld/ollama --namespace ollama --create-namespace

This sets up Ollama with good defaults. The service runs on port 11434. You can use the normal Ollama tools to talk to it.
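
To make sure everything came up, you can list the pods and the service in the new namespace. This is plain kubectl, nothing chart-specific:

kubectl get pods -n ollama
kubectl get svc -n ollama

The service should list port 11434, so the next steps can reach it.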

To test your deployment, forward the port to your local machine and run a model:

kubectl port-forward -n ollama svc/ollama 11434:11434

Then, in another terminal, you can interact with Ollama:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Adding GPU Support

To use a GPU, create a file called values.yaml:

ollama:
  gpu:
    enabled: true
    type: "nvidia"
    number: 1

Then update the install:

helm upgrade ollama otwld/ollama --namespace ollama --values values.yaml
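
To confirm the GPU is actually being requested, you can inspect the pod's resources. This assumes your cluster runs the NVIDIA device plugin, which exposes GPUs as the nvidia.com/gpu resource:

kubectl describe pods -n ollama | grep nvidia.com/gpu

If GPU support is active, you should see a nvidia.com/gpu: 1 limit on the Ollama pod.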

Downloading Models Early

Ollama downloads models when you first ask for them. This can be slow. You can tell it to download models when the pod starts:

ollama:
  gpu:
    enabled: true
    type: "nvidia"
    number: 1
  models:
    pull:
      - llama3.2:1b
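
Once the pod is running, you can check that the model is already there by listing models inside the container. This sketch assumes the Helm release, and therefore the Deployment, is named ollama, as in the install command above:

kubectl exec -n ollama deploy/ollama -- ollama list

The preloaded model should show up without you having to request it first.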

Making Custom Models

You can also make custom versions of models with different settings:

ollama:
  models:
    create:
      - name: llama3.2-1b-large-context
        template: |
          FROM llama3.2:1b
          PARAMETER num_ctx 32768
    run:
      - llama3.2-1b-large-context

This creates a version of Llama 3.2 that can handle longer text.
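
You can then call the custom model by its new name, just like any other model. Using the port-forward from earlier (the prompt here is only a placeholder):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2-1b-large-context",
  "prompt": "Summarize the following long document: ...",
  "stream": false
}'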

Opening Access from Outside

To let people reach the API from outside the cluster, add an Ingress:

ollama:
  models:
    pull:
      - llama3.2:1b
ingress:
  enabled: true
  hosts:
    - host: ollama.example.com
      paths:
        - path: /
          pathType: Prefix

Now you can reach the API at ollama.example.com.
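
Assuming DNS for ollama.example.com points at your ingress controller and you have not added TLS yet, the same request works without a port-forward:

curl http://ollama.example.com/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'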

When to Use Ollama

Ollama is great when you want things simple. It works well for getting started fast, running different models without much setup, or when you don’t need to handle lots of traffic. If you’ve used Ollama on your laptop, using it on Kubernetes will feel familiar.

vLLM

vLLM is built for speed. It uses optimizations such as PagedAttention, continuous batching, and prefix caching to handle many requests at once. The vLLM Production Stack wraps vLLM in a Kubernetes-friendly package with routing, monitoring, and caching.

How It Works

The stack has three main parts. First, serving engines run the LLMs. Second, a router sends requests to the right place. Third, monitoring tools (Prometheus and Grafana) show you what’s happening.

This setup lets you grow from one instance to many without changing your app code. The router uses an API that works like OpenAI’s, so you can swap it in easily.

(Figure: Production Stack architecture)

How to Install

Add the Helm repository and install with a config file:

helm repo add vllm https://vllm-project.github.io/production-stack
helm install vllm vllm/vllm-stack -f values.yaml

A simple values.yaml looks like this:

servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
    - name: "llama3"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "meta-llama/Llama-3.2-3B-Instruct"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 1

After the release is ready, kubectl get pods shows two pods. One is the router, the other runs the model:

NAME                                          READY   STATUS    AGE
vllm-deployment-router-859d8fb668-2x2b7       1/1     Running   2m
vllm-llama3-deployment-vllm-84dfc9bd7-vb9bs   1/1     Running   2m

Using the API

Forward the router to your machine:

kubectl port-forward svc/vllm-router-service 30080:80

Check which models are available:

curl http://localhost:30080/v1/models

Send a chat message:

curl -X POST http://localhost:30080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
  }'
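
Because the router follows the OpenAI API schema, existing OpenAI-compatible clients usually only need their base URL changed. Recent official OpenAI SDKs, for example, read these environment variables, so pointing such a client at the router can be as simple as:

export OPENAI_BASE_URL=http://localhost:30080/v1   # the router's OpenAI-style endpoint
export OPENAI_API_KEY=placeholder                  # only needed if you configure the router to enforce keys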

Running More Than One Model

You can run different models at the same time. The router sends each request to the right one:

servingEngineSpec:
  modelSpec:
    - name: "llama3"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "meta-llama/Llama-3.2-3B-Instruct"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 1
    - name: "mistral"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "mistralai/Mistral-7B-Instruct-v0.3"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "24Gi"
      requestGPU: 1
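
The router matches the model field of each request against the models it serves. With the port-forward from the previous section still running, you can reach the Mistral model through the same endpoint:

curl -X POST http://localhost:30080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
  }'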

Logging In to Hugging Face

Some models on Hugging Face are gated and need an access token. You can add your token like this:

servingEngineSpec:
  modelSpec:
    - name: "llama3"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "meta-llama/Llama-3.2-3B-Instruct"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 1
      env:
        - name: HF_TOKEN
          value: "your-huggingface-token"

For real deployments, store the token in a Kubernetes Secret instead.
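
A minimal sketch of that approach, assuming the chart passes env entries through to the serving container unchanged (check the chart's values for a dedicated token option as well): first create the Secret with kubectl:

kubectl create secret generic hf-token --from-literal=token=your-huggingface-token

Then reference it in values.yaml with the standard Kubernetes secretKeyRef syntax:

servingEngineSpec:
  modelSpec:
    - name: "llama3"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "meta-llama/Llama-3.2-3B-Instruct"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 1
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token   # read from the Secret created above
              key: token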

Watching How It Runs

The stack comes with a Grafana dashboard. It shows you how many instances are healthy, how fast requests finish, how long users wait for the first response, how many requests are running or waiting, and how much GPU memory the cache uses. This helps you spot problems and plan for growth.

When to Use vLLM

Use vLLM and its production stack when you need to handle lots of requests fast. Its router is smart about reusing cached work, which saves time and money. The OpenAI-style API makes it easy to plug into existing apps. The monitoring tools help you run it well in production.

Wrapping Up

Ollama and vLLM serve different needs.

Ollama with its Helm chart gets you running fast with little setup. It’s good for development, lighter workloads, and teams that want things simple.

vLLM Production Stack gives you the tools for heavy traffic. The router, multi-model support, and monitoring make it fit for production where speed and uptime matter.

Both use standard Kubernetes and Helm, so they’ll feel familiar if you know containers. Pick based on how much traffic you expect and how much complexity you’re willing to manage.

“Ollama runs one, vLLM a batch — take your time, pick a match”

Lena F.

Do you want to run your own LLMs on Kubernetes? Check out our Pragmatic AI Adoption service and book your free workshop.