Running Models¶

A practical guide to running models of any size on your hardware.

Understanding model sizes¶

The "B" in model names (7B, 13B, 70B) means billions of parameters. More parameters = smarter model, but more RAM needed.

Quantization explained¶

Raw models use 16-bit floats (2 bytes per parameter). Quantization reduces precision to save space:

Format	Bits/param	7B model size	Quality loss
FP16	16	14 GB	None (reference)
Q8_0	8	7 GB	Negligible
Q6_K	6	5.3 GB	Barely perceptible
Q5_K_M	5	4.5 GB	Minor
Q4_K_M	4	4.1 GB	Small (sweet spot)
Q3_K_M	3	3.3 GB	Noticeable
Q2_K	2	2.8 GB	Significant

The sweet spot

Q4_K_M is the recommended quantization for most users. It halves the size with minimal quality loss.

GPU offloading¶

Ollama¶

# Ollama auto-detects GPU — just run
ollama run mistral:7b

llama.cpp¶

# Offload N layers to GPU (33 layers for Mistral 7B)
./llama-cli -m model.gguf -ngl 33

# Offload everything
./llama-cli -m model.gguf -ngl 99

# Check how many layers your model has
./llama-cli -m model.gguf --verbose-prompt 2>&1 | grep "n_layer"

Running 70B+ models locally¶

Yes, it's possible. Here's how:

Multi-GPU (Ollama)CPU-only (llama.cpp)Quantize further

# Ollama 1.0+ auto-splits across GPUs
ollama run llama4:70b

# With 64GB+ RAM, 70B Q4 works on CPU
./llama-cli \
  -m llama-4-70b-q4_k_m.gguf \
  -t 16 \          # 16 threads
  -c 4096 \        # 4K context
  -n -1            # Unlimited output

# Use IQ3_XXS for 70B on 32GB RAM
./llama-quantize \
  llama-4-70b-f16.gguf \
  llama-4-70b-iq3_xxs.gguf \
  IQ3_XXS

Performance tuning¶

# llama.cpp flags that matter
./llama-cli \
  -m model.gguf \
  -t 8 \              # Thread count (match physical cores)
  -ngl 99 \           # GPU layers
  -c 8192 \           # Context size (bigger = more RAM)
  -n 512 \            # Max output tokens
  --mlock \           # Pin memory (prevents swapping)
  --no-mmap \         # Disable mmap if having issues
  --temp 0.7          # Temperature (0 = deterministic, 1 = creative)

Troubleshooting¶

Problem	Likely fix
Out of memory	Reduce context (`-c 2048`), use lower quant, or smaller model
Slow on GPU	Check `-ngl` is set, update GPU drivers
Garbled output	Wrong prompt format — check the model card for chat template
Model won't load	Verify GGUF format, check file isn't corrupted (`md5sum`)