
Llama 4 just dropped — and it runs on a laptop

Meta has released Llama 4, and the big story isn't just the benchmark numbers: the 8B model runs smoothly on consumer laptops at 4-bit (Q4_K_M) quantization.

The numbers

We ran Llama 4 8B (Q4_K_M quantization) on a MacBook Pro M3 Pro (18GB) via Ollama and llama.cpp. Here's what we got:

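| Metric | Llama 4 8B | Llama 3.1 8B | Improvement |
|---|---|---|---|
| Prompt processing (tokens/sec) | 112 | 98 | +14% |
| Generation (tokens/sec) | 48 | 42 | +14% |
| MMLU-Pro (score) | 68.2 | 62.5 | +5.7 points |
| HumanEval (score) | 81.4 | 76.1 | +5.3 points |
| GSM8K (score) | 84.9 | 79.3 | +5.6 points |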
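
The Improvement column mixes two kinds of numbers: a relative gain for throughput and an absolute point difference for benchmark scores. A small sketch that recomputes both from the table values (the helper names are ours, chosen for illustration):

```python
# Values copied from the table above: (metric, Llama 4 8B, Llama 3.1 8B)
throughput = [("Prompt tokens/sec", 112, 98), ("Generation tokens/sec", 48, 42)]
benchmarks = [("MMLU-Pro", 68.2, 62.5), ("HumanEval", 81.4, 76.1), ("GSM8K", 84.9, 79.3)]


def relative_gain_pct(new, old):
    """Relative improvement in percent, used for the throughput rows."""
    return (new - old) / old * 100


def point_gain(new, old):
    """Absolute score difference in points, used for the benchmark rows."""
    return round(new - old, 1)


for name, new, old in throughput:
    print(f"{name}: +{relative_gain_pct(new, old):.1f}%")
for name, new, old in benchmarks:
    print(f"{name}: +{point_gain(new, old):.1f} points")
```

Both throughput rows come out at roughly 14.3%, which the table rounds to +14%.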

What's new

  • Mixture-of-Experts (MoE) architecture: Only about 2B of the 8B parameters are active for any given token, which cuts per-token compute (though not memory, since all weights are still loaded)
  • 128K context window: Long-context support out of the box
  • Multimodal: Accepts image inputs natively (vision encoder included)
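  • Apache 2.0 license: Permissive, with commercial use, modification, and redistribution allowed as long as you keep the license and attribution notices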
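
To put the MoE bullet in numbers, here is a back-of-the-envelope sketch. It assumes the common rule of thumb of about 2 FLOPs (one multiply, one add) per active weight per token, and it ignores attention over the context and router overhead. The function name is hypothetical, and real laptop throughput is often limited by memory bandwidth, so treat the ratio as an upper bound on the compute saving and not as a predicted speedup.

```python
def forward_flops_per_token(active_params):
    """Rough forward-pass cost: ~2 FLOPs per active weight per token (rule of thumb)."""
    return 2 * active_params


TOTAL_PARAMS = 8e9   # total parameters, from the article
ACTIVE_PARAMS = 2e9  # active parameters per token, from the article

dense_cost = forward_flops_per_token(TOTAL_PARAMS)  # if every weight were used
moe_cost = forward_flops_per_token(ACTIVE_PARAMS)   # only the routed experts
compute_ratio = dense_cost / moe_cost

print(f"Dense 8B: {dense_cost:.1e} FLOPs/token")
print(f"MoE with 2B active: {moe_cost:.1e} FLOPs/token")
print(f"Compute reduction: {compute_ratio:.0f}x")
```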

Getting started

# Pull with Ollama
ollama pull llama4:8b

# Or with llama.cpp
wget https://huggingface.co/meta-llama/Llama-4-8B-Instruct-GGUF/resolve/main/llama-4-8b-q4_k_m.gguf
./llama-cli -m llama-4-8b-q4_k_m.gguf -p "Explain transformer attention"
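
If you'd rather call the model from code, here is a minimal Python sketch against Ollama's local REST API. It assumes Ollama is running on its default port (11434) and that the `llama4:8b` tag from the pull command above is available locally. The helper names are ours. The `eval_count` and `eval_duration` fields (the latter in nanoseconds) are what Ollama reports for the generation phase in a non-streaming response, so you can check generation speed on your own machine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(prompt, model="llama4:8b"):
    """Request body for a single, non-streaming completion."""
    return {"model": model, "prompt": prompt, "stream": False}


def generation_tokens_per_second(result):
    """Generation speed from Ollama's timing fields (eval_duration is in nanoseconds)."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(prompt, model="llama4:8b", url=OLLAMA_URL):
    """POST the prompt to a running Ollama server and return the parsed JSON reply."""
    data = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With the server running, `result = generate("Explain transformer attention")` returns a dict whose `"response"` key holds the generated text, and `generation_tokens_per_second(result)` gives the measured generation rate.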

The catch

The 70B model needs 32GB+ of VRAM (or unified memory on a Mac) even at Q4. The MoE architecture helps with speed, but the full weights still have to be loaded, so the memory footprint tracks total parameters, not active ones. For local developers, 8B is the sweet spot, and it's remarkably capable.
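
A rough way to sanity-check memory requirements is to multiply the total parameter count by the bits stored per weight. The sketch below assumes about 4.8 bits per weight for Q4_K_M (an approximation; the exact figure varies by model and tensor mix) and counts weights only, leaving out the KV cache, activations, and runtime overhead. The function name is hypothetical.

```python
def weight_memory_gb(total_params, bits_per_weight):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


Q4_K_M_BITS = 4.8  # assumption: approximate average bits per weight for Q4_K_M

for name, params in [("8B", 8e9), ("70B", 70e9)]:
    size = weight_memory_gb(params, Q4_K_M_BITS)
    print(f"{name} at Q4_K_M: about {size:.1f} GB of weights")
```

Under these assumptions the 8B weights come to about 4.8 GB, which leaves plenty of headroom on an 18GB machine, while the 70B weights land around 42 GB, well past what most laptops offer.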

Bottom line

Llama 4 8B is the best open model you can run on a laptop right now. Period. The combination of MoE efficiency, 128K context, and native multimodal support makes it the new default for local LLM development.