Ollama 1.0 is here
After two years of rapid iteration, Ollama has reached 1.0. The milestone release brings stability guarantees and a handful of long-requested features.
What's new in 1.0
Multi-GPU support
Ollama can now split models across multiple GPUs automatically:
# No config needed — Ollama detects and uses all GPUs
ollama run llama4:70b # Splits across 2x 24GB GPUs
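As a rough sanity check on the 2x 24GB figure, the weight footprint of a model can be estimated from its parameter count and bits per weight. The sketch below is a back-of-the-envelope estimate only, not how Ollama plans its split. The quantization level, the per-GPU overhead reserve, and the helper names `weights_gib` and `fits` are all assumptions for illustration; the post does not say which quantization the llama4:70b tag uses.

```python
def weights_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billions: float, bits_per_weight: float,
         gpu_gib: list[float], overhead_gib: float = 2.0) -> bool:
    """True if the weights plus a per-GPU reserve fit in the combined VRAM.

    overhead_gib is an assumed reserve per GPU for the KV cache and
    runtime buffers; real usage depends on context length and backend.
    """
    needed = weights_gib(params_billions, bits_per_weight)
    needed += overhead_gib * len(gpu_gib)
    return needed <= sum(gpu_gib)


# Assumption: 70B parameters at 4 bits per weight on two 24 GiB cards.
print(round(weights_gib(70, 4), 1))   # weights at 4-bit
print(fits(70, 4, [24, 24]))          # 4-bit quantized
print(fits(70, 16, [24, 24]))         # unquantized 16-bit
```

Under these assumptions a 4-bit 70B model needs about 33 GiB for weights and fits in 48 GiB of combined VRAM, while the 16-bit version does not.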
Model caching
Frequently used models are now kept warm in VRAM, cutting cold-start latency from seconds to milliseconds:
ollama cache warm llama4:8b # Pre-load into VRAM
ollama cache list # View cached models
ollama cache drop llama4:8b # Free VRAM
Streaming API improvements
The REST API now supports Server-Sent Events (SSE) with per-token metadata:
{"token": "Hello", "prob": 0.98, "speed": 48.2, "index": 0}
{"token": " world", "prob": 0.95, "speed": 48.5, "index": 1}
Breaking changes (minimal!)
- The Modelfile syntax is versioned (FROM llama4:8b is now FROM ollama://llama4:8b)
- Deprecated ollama create --from; use ollama pull instead
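For Modelfiles kept in version control, the FROM change can be previewed with a small script before touching anything. The sketch below only illustrates the single rewrite described in the first bullet; it is not the logic of ollama migrate, which may do more. The function names are hypothetical, and the rule for skipping local paths (references starting with `.`, `/`, or `~`) is an assumption.

```python
import re

# Matches a FROM instruction: leading keyword, the model reference, any trailer.
FROM_LINE = re.compile(r"^(\s*FROM\s+)(\S+)(.*)$", re.IGNORECASE)


def migrate_from_line(line: str) -> str:
    """Prefix a bare model reference in a FROM line with ollama://."""
    match = FROM_LINE.match(line)
    if not match:
        return line  # not a FROM instruction, leave untouched
    prefix, ref, rest = match.groups()
    # Assumption: refs with a scheme or a path-like start are left alone.
    if "://" in ref or ref.startswith((".", "/", "~")):
        return line
    return f"{prefix}ollama://{ref}{rest}"


def migrate_modelfile(text: str) -> str:
    """Apply the FROM rewrite to every line of a Modelfile."""
    return "\n".join(migrate_from_line(line) for line in text.split("\n"))


before = "FROM llama4:8b\nPARAMETER temperature 0.7"
print(migrate_modelfile(before))
```

The rewrite is idempotent, so running it on an already migrated file leaves it unchanged.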
Migration guide
# Upgrade
curl -fsSL https://ollama.com/install.sh | sh
# Check version
ollama --version # 1.0.0
# Migrate Modelfiles (automatic for most cases)
ollama migrate
What's next
The 1.x roadmap includes built-in function calling, speculative decoding, and a web dashboard. In the meantime, 1.0 already delivers a polished local LLM experience, and it's still free.