Run Your Own AI Locally: Ollama + Local LLMs on Your NUC

3/24/2026 ai

Run Your Own AI Locally: Ollama + Local LLMs on Your NUC

Your AI doesn't need to call home to OpenAI. Run it locally, keep your data private, save money.

I run local LLMs on my NUC for personal projects. Zero API costs. Zero data leaks. Just me and the model.

Why Local?

Cloud LLMs (ChatGPT, Claude API):

$0.01-0.10 per request
Your data goes to Anthropic/OpenAI
Rate limited
Requires internet

Local LLMs:

One-time cost (~$0, if you have the hardware)
Your data stays on your machine
No rate limits
Works offline

Tradeoff: Slower (but getting better), less smart (but usable), more setup.

Hardware Requirements

Minimum: 8GB RAM Recommended: 16GB RAM Comfortable: 32GB RAM

Your NUC from the homelab guide? Perfect. 16GB RAM = solid local LLM box.

GPU acceleration (optional):

NVIDIA GPU = 3-10x faster
Apple Silicon = built-in acceleration
Intel iGPU = slight speedup
CPU only = slowest, but works

Best Local Models (2026)

Mistral 7B — Best Speed/Quality

7 billion parameters (small = fast)
Trained on code + general knowledge
Runs on 8GB RAM
Inference: 3-5 tokens/sec (fast enough)

ollama pull mistral
ollama run mistral "What is a homelab?"

Llama 2 13B — Best for Coding

13B params (more capable)
Great at Python/JavaScript/Go
Needs 16GB+ RAM
Slower but smarter

ollama pull llama2:13b

Neural Chat 7B — Best for Chat

Fine-tuned for conversation
Fast, friendly responses
7B = runs anywhere
Best for personal assistant tasks

ollama pull neural-chat

Orca Mini 3B — Best for Tiny Hardware

3B params (runs on 4GB)
Surprisingly capable
Slowest, but works

ollama pull orca-mini

Installation (5 minutes)

Step 1: Get Ollama

# macOS / Linux:
curl https://ollama.ai/install.sh | sh

# Windows:
# Download installer from https://ollama.ai/download

Step 2: Start Ollama

ollama serve
# Default: http://localhost:11434

Step 3: Pull a Model

ollama pull mistral
# Takes 2-10 min depending on internet

Step 4: Run It

ollama run mistral
# Type your prompt
# Ctrl+D to exit

Done. You're running an LLM locally.

Use Cases

1. Personal Assistant

ollama run mistral "Summarize the key points of machine learning"

2. Code Generation

ollama run llama2:13b "Write a Python function to scrape a webpage with BeautifulSoup"

3. Document Q&A (RAG)

Use with LangChain or LlamaIndex to chat with your own documents.

from llama_index import GPTSimpleVectorIndex, Document

docs = [Document(text="Your document here")]
index = GPTSimpleVectorIndex.from_documents(docs, llm_model="local")
response = index.query("What's the main topic?")

4. Homelab Automation

# Run daily summaries, backups analysis, system health reports
# Via cron job + local LLM

API Server (For Apps)

Ollama exposes a local API. Use it from apps:

# Start Ollama server
ollama serve

# Curl test
curl -X POST http://localhost:11434/api/generate \
  -d '{
    "model": "mistral",
    "prompt": "What is AI?",
    "stream": false
  }'

Use with:

LangChain (Python)
LlamaIndex (Python)
OpenAI API wrapper (any language, just point to localhost:11434)

Performance Tips

GPU Acceleration (NVIDIA)

# Install CUDA
# Ollama will auto-detect and use GPU

# Run bigger models faster
ollama pull llama2:70b
# 70B params = slow on CPU, fast on GPU

Quantization (Smaller Models)

Models come in sizes:

Full (fp32) = 30GB for 70B model
Quantized (int4) = 3GB for 70B model

Quantized = faster + smaller, slightly less accurate.

ollama pull llama2:13b  # Full
ollama pull llama2:13b-q4_K_M  # Quantized (recommended)

Batch Requests

# Don't fire 100 requests at once
# Queue them, process serially
# Ollama has built-in rate limiting

Privacy Wins

✅ Your data never leaves your machine ✅ No tracking, no analytics ✅ No subscription ✅ No rate limits ✅ Fully offline capable (once model is cached)

vs. ❌ ChatGPT sends to OpenAI servers ❌ Claude API tracked by Anthropic ❌ Rate limits + pricing ❌ Internet required

Cost Analysis

Approach	Cost	Privacy	Speed
ChatGPT	$20/month	None	Fast
Claude API	$0.01/req	None	Fast
Local Ollama	$0 (hardware)	100%	Slower

If you already have a NUC? Local LLM is free.

Limitations (Be Real)

Speed: 2-5 tokens/sec vs 20+ for cloud
Intelligence: Mistral 7B ≈ ChatGPT 3.5 (not 4.0)
Hallucinations: More likely than cloud models
Latency: First token takes 1-3 seconds

Best for: Offline work, automation, low-latency needs. Not for: Real-time chat (better to use cloud).

Next Steps

Install Ollama (5 min)
Pull Mistral (5 min download)
Test it (ask some questions)
Integrate with your homelab (Python script, cron job)
Celebrate (you're running your own AI)

Resources

Ollama: https://ollama.ai/
Available Models: https://ollama.ai/library
LangChain: https://github.com/hwchase17/langchain
LlamaIndex: https://github.com/jerryjliu/llama_index

Privacy-first tech is possible. You don't need to send everything to the cloud.

Next article: Building a RAG chatbot with your personal documents (advanced).

Thinking of running local LLMs? Great choice. Your data, your AI, your rules.