Llama 4 just dropped, and it runs on a laptop
Meta has released Llama 4, and the big story isn't just the benchmark numbers. It's that the 8B model runs smoothly on a consumer laptop at 4-bit (Q4_K_M) quantization.
The numbers
We ran Llama 4 8B (Q4_K_M quantization) on a MacBook Pro M3 Pro (18GB) via Ollama and llama.cpp. Here's what we got:
| Metric | Llama 4 8B | Llama 3.1 8B | Change |
|---|---|---|---|
| Prompt processing (tokens/sec) | 112 | 98 | +14% |
| Generation (tokens/sec) | 48 | 42 | +14% |
| MMLU-Pro | 68.2 | 62.5 | +5.7 pts |
| HumanEval | 81.4 | 76.1 | +5.3 pts |
| GSM8K | 84.9 | 79.3 | +5.6 pts |
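The last column of the table mixes two kinds of change: relative percent for the throughput rows and absolute points for the benchmark rows. A small sketch of that arithmetic, using only the figures from the table (the function names are ours, chosen for illustration):

```python
def relative_gain_pct(new: float, old: float) -> float:
    """Relative change of `new` over `old`, in percent (used for tokens/sec)."""
    return (new - old) / old * 100.0


def point_gain(new: float, old: float) -> float:
    """Absolute difference in score points (used for benchmark rows)."""
    return new - old


# Figures copied from the table above: (Llama 4 8B, Llama 3.1 8B).
throughput = {"prompt": (112, 98), "generation": (48, 42)}
benchmarks = {"MMLU-Pro": (68.2, 62.5), "HumanEval": (81.4, 76.1), "GSM8K": (84.9, 79.3)}

for name, (new, old) in throughput.items():
    print(f"{name}: {relative_gain_pct(new, old):+.0f}%")
for name, (new, old) in benchmarks.items():
    print(f"{name}: {point_gain(new, old):+.1f} pts")
```

Keeping the two apart matters: a 5.7-point gain on MMLU-Pro is about a 9% relative gain over 62.5, so the benchmark and throughput rows are not directly comparable.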
What's new
- Mixture-of-Experts (MoE) architecture: The 8B model activates only about 2B of its parameters per token, so each token costs roughly a quarter of the compute of a dense 8B model
- 128K context window: Long-context support out of the box
- Multimodal: Accepts image inputs natively (vision encoder included)
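- Apache 2.0 license: Permissive terms that allow commercial use, modification, and redistribution, as long as the license text and attribution notices are kept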
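To make the MoE bullet concrete, here is a toy sketch of top-k expert routing in plain Python. It is an illustration of the general technique only: we are assuming softmax gating with renormalized top-k weights, which is a common MoE design, and the expert count, `k`, router logits, and scalar "experts" are all made up. Nothing here is taken from Llama 4's actual implementation.

```python
import math

calls = []  # records which experts actually ran


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts; renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits, k):
    """Weighted sum over the selected experts only; the others are never evaluated."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


def make_expert(idx, fn):
    """Wrap a function so that each call is logged in `calls` (hypothetical helper)."""
    def expert(x):
        calls.append(idx)
        return fn(x)
    return expert


# Four toy experts operating on a scalar; real experts are feed-forward blocks.
experts = [
    make_expert(0, lambda x: x + 1),
    make_expert(1, lambda x: 2 * x),
    make_expert(2, lambda x: -x),
    make_expert(3, lambda x: x * x),
]
router_logits = [0.1, 2.0, -1.0, 1.5]  # made-up router scores for one token

output = moe_forward(3.0, experts, router_logits, k=2)
print(sorted(calls), output)
```

Only two of the four experts run for this token, which is where the compute saving comes from. All four still have to exist in memory, because a different token may route to a different pair.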
Getting started
# Pull with Ollama
ollama pull llama4:8b
# Or with llama.cpp
wget https://huggingface.co/meta-llama/Llama-4-8B-Instruct-GGUF/resolve/main/llama-4-8b-q4_k_m.gguf
./llama-cli -m llama-4-8b-q4_k_m.gguf -p "Explain transformer attention"
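Once the model is pulled, Ollama also serves a local HTTP API, so you can call it from a script. A minimal sketch using only the Python standard library, assuming Ollama is running on its default port (11434) and that the `llama4:8b` tag from the pull command above is available. The helper names `build_request` and `generate` are ours.

```python
import json
import urllib.request

# Default address of a local Ollama server; adjust if yours differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


# Example call (requires a running Ollama server with the model pulled):
# print(generate("llama4:8b", "Explain transformer attention"))
```

Setting `"stream": False` returns one JSON object instead of a stream of chunks, which keeps the example short. For interactive use you would likely want streaming.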
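The catch
The 70B model needs well over 32 GB of memory even at Q4: at 4 bits per weight, the weights alone come to about 35 GB, before the KV cache and runtime overhead. The MoE architecture helps with speed, because only the active experts are computed for each token, but the full set of weights still has to be loaded. For local developers, 8B is the sweet spot, and it's remarkably capable.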
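A back-of-the-envelope estimate shows why the larger model is out of reach on most laptops while 8B fits comfortably. This sketch counts weights only and assumes exactly 4 bits per weight; real Q4 files are somewhat larger because they also store per-block scale data, and the KV cache and runtime need memory on top. The function names are ours.

```python
def weights_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 10^9 bytes)."""
    return total_params_billion * bits_per_weight / 8


def active_fraction(active_params_billion: float, total_params_billion: float) -> float:
    """Share of the parameters an MoE model uses per token."""
    return active_params_billion / total_params_billion


for size in (8, 70):
    print(f"{size}B at 4 bits/weight: ~{weights_gb(size, 4):.0f} GB of weights")

# MoE reduces per-token compute, not the memory needed to hold the model.
print(f"8B MoE, 2B active: {active_fraction(2, 8):.0%} of weights used per token")
```

Memory scales with the total parameter count and speed scales with the active count, which is why MoE makes the 8B model faster without making the 70B model any easier to load.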
Bottom line
In our testing, Llama 4 8B is the best open model we've run on a laptop so far. The combination of MoE efficiency, 128K context, and native multimodal support makes it our new default for local LLM development.