Self-Hosted AI Inference: Run LLMs at Scale on Your Homelab
Ollama is fun for hobby projects. vLLM is for when you're serious.
I run 4 inference servers on my homelab handling 10K+ requests/day. Zero cloud costs. Full control.
This is advanced — skip if you're happy with Ollama.
Ollama vs vLLM vs TGI
| Tool | Ease | Performance | Features | Cost |
|---|---|---|---|---|
| Ollama | ⭐⭐⭐⭐⭐ | Medium | Basic | Free |
| vLLM | ⭐⭐⭐ | Very Fast | Advanced | Free |
| TGI (HF) | ⭐⭐ | Fast | Enterprise | Free |
| Ray Serve | ⭐ | Scalable | Complex | Free |
My choice: vLLM + Kubernetes on homelab.
vLLM (Fastest Open-Source)
What it does:
- Continuous batching: concurrent requests share the GPU (Ollama largely serves them one at a time)
- GPU memory optimization via PagedAttention
- 5-10x higher throughput than Ollama on the same hardware under concurrent load
- Production-ready, OpenAI-compatible API
Hardware needed:
- NVIDIA GPU (RTX 4090 is ideal; 10-12GB cards like the RTX 3080 or 2080 Ti can serve 7B models with a quantized variant, e.g. AWQ, since fp16 7B weights alone are ~14GB)
- 32GB+ RAM
- 500GB SSD
Installation
```shell
# Python 3.10+
pip install vllm

# Start an OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.1 \
  --gpu-memory-utilization 0.9 \
  --host 0.0.0.0 \
  --port 8000

# Takes 1-2 min to load the model, then it's ready for requests
```
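Once the server is up, any OpenAI-compatible client works against it. Here's a minimal stdlib-only sketch; the endpoint path and model name assume the exact server command above:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # the vLLM server started above

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def extract_reply(response: dict) -> str:
    """Pull the assistant's text out of a chat completion response."""
    return response["choices"][0]["message"]["content"]

def chat(model: str, prompt: str) -> str:
    """POST one chat completion to the local vLLM server."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))

# With the server running:
#   chat("mistralai/Mistral-7B-Instruct-v0.1", "What is AI?")
```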
Load Testing
```shell
# Single request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "messages": [{"role": "user", "content": "What is AI?"}]
  }'

# 100 concurrent POSTs with ApacheBench (needs the JSON body in a file)
ab -n 100 -c 100 -p payload.json -T application/json \
  http://localhost:8000/v1/chat/completions

# For realistic load tests, use locust or k6 instead
```
Performance on RTX 3080:
- Ollama: ~4 tokens/sec
- vLLM: ~45 tokens/sec (10x faster!)
- Cost per 1M tokens: $0 (hardware amortized)
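Numbers like these are easy to reproduce with a small harness. A hedged sketch: `send_request` is a placeholder you'd implement with the request payload above; only the timing logic is shown here.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def tokens_per_second(total_tokens: int, elapsed_s: float) -> float:
    """The throughput metric quoted above."""
    return total_tokens / elapsed_s if elapsed_s > 0 else 0.0

def run_benchmark(send_request, n_requests: int = 100, concurrency: int = 10):
    """Fire n_requests through a thread pool and measure wall-clock throughput.

    send_request is a placeholder: a zero-arg callable that performs one HTTP
    request against the server and returns its generated token count.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        counts = list(pool.map(lambda _: send_request(), range(n_requests)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    return total, elapsed, tokens_per_second(total, elapsed)
```

Raising `concurrency` is what separates vLLM from Ollama: batched decoding keeps per-request latency nearly flat while aggregate tokens/sec climbs.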
Multi-GPU Setup (Scaling)
One GPU maxes out. Add more:
Tensor Parallelism (1 model, multiple GPUs)
```shell
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9
```
Llama 2 70B is ~140GB in fp16, so even 4x 24GB cards (RTX 3090/4090, ~96GB total) need a quantized variant; 10GB RTX 3080s won't hold it at all.
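To sanity-check whether a model fits before downloading 140GB, a back-of-the-envelope helper (the 20% activation/KV-cache overhead is my rough assumption, not a vLLM number):

```python
def tp_memory_per_gpu(n_params_b: float, bytes_per_param: int, tp_size: int,
                      overhead_frac: float = 0.2) -> float:
    """Rough per-GPU weight memory (GB) under tensor parallelism.

    Weights are sharded evenly across tp_size GPUs; overhead_frac is a
    crude allowance for activations and the KV cache (an assumption).
    """
    weights_gb = n_params_b * bytes_per_param  # 1B params * 2 bytes = 2 GB
    return (weights_gb / tp_size) * (1 + overhead_frac)

# Llama 2 70B in fp16 across 4 GPUs:
print(tp_memory_per_gpu(70, 2, 4))  # roughly 42 GB per GPU -> needs quantization
```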
Pipeline Parallelism (Advanced)
For models too big even for tensor parallelism alone, split the model's layers into pipeline stages.
```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    pipeline_parallel_size=2,  # 2 pipeline stages (layer split)
    tensor_parallel_size=2,    # each stage sharded across 2 GPUs
)
```
Result: 4 GPUs (2 stages x 2 shards) running one large model efficiently.
Kubernetes Deployment
This is where it gets serious.
Single Node (Homelab Box)
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          # the official image takes api_server flags as container args
          args: ["--model", "mistralai/Mistral-7B-Instruct-v0.1"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "32Gi"
            requests:
              nvidia.com/gpu: "1"
              memory: "24Gi"
          volumeMounts:
            - name: models
              mountPath: /root/.cache/huggingface
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-svc
spec:
  selector:
    app: vllm
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: LoadBalancer
```
Deploy:
```shell
kubectl apply -f deployment.yaml
# Verify: kubectl port-forward svc/vllm-svc 8000:8000
```
Multi-Node (Multiple Servers)
```yaml
# Multiple replicas behind the Service's load balancer
spec:
  replicas: 3  # run 3 inference servers
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - vllm
                topologyKey: kubernetes.io/hostname
```
Result: 3 servers, each handling requests independently, load-balanced.
Request Batching (The Secret Sauce)
vLLM batches requests automatically. This is why it's fast.
Request 1: "What is AI?" → 100 tokens
Request 2: "Summarize quantum..." → 50 tokens

- Ollama: process 1, then 2 (sequential)
- vLLM: process both in one batch (simultaneous, sharing GPU memory)

Result: much higher throughput under concurrent load.
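A toy model of why batching wins. This is idealized (it ignores prefill and per-step batch overhead), but the shape is right: batched requests decode in lockstep, so wall time is set by the longest request rather than the sum.

```python
def sequential_latency(token_counts, tok_per_s=45.0):
    """One request at a time: wall time is the sum of each decode time."""
    return sum(n / tok_per_s for n in token_counts)

def batched_latency(token_counts, tok_per_s=45.0):
    """Idealized batch: all requests decode together, so wall time is
    set by the longest request (real batches add some per-step cost)."""
    return max(token_counts) / tok_per_s

reqs = [100, 50]  # the two requests above
print(sequential_latency(reqs))  # (100 + 50) / 45, about 3.3 s
print(batched_latency(reqs))     # 100 / 45, about 2.2 s
```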
Monitoring (Prometheus + Grafana)
Track:
- Requests/sec
- Token/sec throughput
- GPU utilization
- Queue depth
- Cache hit rate
```shell
# vLLM exports Prometheus metrics automatically
# Scrape them at :8000/metrics
```
Grafana dashboard:
- Real-time throughput
- GPU memory usage
- Model latency (p50, p99)
- Error rate
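If you want to poll metrics from a script before standing up the full Prometheus + Grafana stack, the text exposition format is easy to parse with the stdlib. A sketch; exact metric names vary by vLLM version, so treat the example name as illustrative:

```python
import re
import urllib.request

def parse_prometheus_text(text: str) -> dict:
    """Parse 'name{labels} value' lines from Prometheus text exposition format."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        m = re.match(r"([A-Za-z_:][A-Za-z0-9_:]*(?:\{[^}]*\})?)\s+(\S+)", line)
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics

def scrape(url: str = "http://localhost:8000/metrics") -> dict:
    """Fetch and parse the vLLM metrics endpoint."""
    with urllib.request.urlopen(url) as resp:
        return parse_prometheus_text(resp.read().decode())

# e.g. scrape().get("vllm:num_requests_running")  # name varies by version
```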
Cost Analysis (Homelab vs Cloud)
Cloud (OpenAI API)
- GPT-3.5-class rates: ~$0.0015 per 1K input tokens, ~$0.002 per 1K output tokens
- That's roughly $1.50-2 per 1M tokens; GPT-4-class pricing runs 10-30x higher
Self-Hosted (vLLM)
- RTX 4090: $1,500 up front (amortize over 1-3 years)
- Electricity: $5-10/month
- Marginal cost per 1M tokens: ~$0 once the hardware is paid off
Breakeven depends on volume. At my ~10K requests/day (roughly 100M tokens/month), GPT-3.5-class pricing would run $150-200/month, so the card pays for itself in under a year; against GPT-4-class pricing, in a few months.
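The breakeven math generalizes. A small calculator; the default rate, power cost, and the 100M tokens/month figure are the estimates above, not measurements:

```python
def monthly_cloud_cost(tokens_per_month: float, usd_per_1k: float) -> float:
    """Cloud API spend at a blended per-1K-token rate."""
    return tokens_per_month / 1000 * usd_per_1k

def breakeven_months(hw_cost: float, tokens_per_month: float,
                     usd_per_1k: float, power_per_month: float = 10.0) -> float:
    """Months until the GPU is covered by avoided API spend (minus power)."""
    saved = monthly_cloud_cost(tokens_per_month, usd_per_1k) - power_per_month
    return float("inf") if saved <= 0 else hw_cost / saved

# RTX 4090 vs a GPT-3.5-class blended rate at 100M tokens/month (assumptions):
print(round(breakeven_months(1500, 100e6, 0.00175), 1))  # about 9 months
```

Note the flip side: at only 1M tokens/month the avoided spend never covers electricity, and breakeven is infinite; self-hosting only pays at real volume.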
Gotchas
- ❌ NVIDIA GPU required: no CUDA, no vLLM advantage
- ❌ Memory bandwidth is the bottleneck, not compute
- ❌ HuggingFace model loading is slow: cache the weights properly
- ❌ Scaling is hard: Kubernetes adds complexity
✅ But when it works? Insanely fast and cheap.
Next Steps
- Get a GPU (RTX 3080+ if possible)
- Install vLLM (pip, takes 5 min)
- Test locally (single model, one GPU)
- Add K8s (if you have multiple servers)
- Monitor & scale (based on load)
Resources
- vLLM Docs: https://github.com/vllm-project/vllm
- LLM Perf Leaderboard: https://huggingface.co/spaces/optimum/llm-perf-leaderboard
- K8s GPU Plugin: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
Running inference at scale is hard. But it's worth it once you own the hardware.
Next: Building a RAG pipeline with vLLM + vector DBs (LangChain integration).
Questions on setup? Drop them. I've debugged GPU memory leaks so you don't have to.