llama.cpp gets Vulkan backend
The wait is over. llama.cpp now ships with a production-ready Vulkan backend, bringing GPU-accelerated inference to nearly any GPU with a Vulkan 1.2+ driver, whatever the vendor.
Why this is huge
Until now, GPU inference in llama.cpp meant:
- CUDA — NVIDIA only
- Metal — Apple only
- ROCm — AMD Linux only (and finicky)
Vulkan changes everything. It runs on NVIDIA, AMD, Intel, and mobile GPUs across Windows, Linux, and Android. One backend, everywhere.
Quick setup
```bash
# Build with Vulkan (requires Vulkan SDK)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

# Run any GGUF model on GPU
./build/bin/llama-cli \
  -m mistral-7b-q4_k_m.gguf \
  -ngl 99 \
  -p "Write a haiku about local AI"
```
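The -ngl flag (short for --n-gpu-layers) sets how many model layers are offloaded to the GPU. A value of 99 is higher than the layer count of a 7B model, so every layer lands on the GPU. If the model does not fit in VRAM, lower the number and the remaining layers run on the CPU.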
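If you want a starting point for -ngl rather than trial and error, a back-of-the-envelope estimate can help. The sketch below is not part of llama.cpp: `estimate_ngl` is a hypothetical helper that assumes every layer takes an equal share of the model file and that a fixed reserve (1024 MiB by default, an arbitrary figure) is held back for the KV cache and scratch buffers. The model size used in the example is illustrative, not measured.

```shell
#!/bin/sh
# Rough starting value for -ngl given a VRAM budget.
# Assumptions (not llama.cpp behaviour): layers are equally sized, and a
# fixed reserve is kept free for the KV cache and scratch buffers.
estimate_ngl() {
  model_mib=$1            # size of the .gguf file in MiB
  n_layers=$2             # number of transformer layers in the model
  vram_mib=$3             # total VRAM in MiB
  reserve_mib=${4:-1024}  # hypothetical reserve, tune for your context size

  # Per-layer size, rounded up so the estimate errs on the safe side.
  per_layer=$(( (model_mib + n_layers - 1) / n_layers ))
  usable=$(( vram_mib - reserve_mib ))

  fit=0
  if [ "$usable" -gt 0 ]; then
    fit=$(( usable / per_layer ))
  fi
  # Never report more layers than the model has.
  if [ "$fit" -gt "$n_layers" ]; then
    fit=$n_layers
  fi
  echo "$fit"
}

# Illustrative numbers: a ~4200 MiB model with 32 layers.
estimate_ngl 4200 32 8192   # 8 GiB card
estimate_ngl 4200 32 4096   # 4 GiB card
```

With these inputs the helper prints 32 for the 8 GiB card (everything fits) and 23 for the 4 GiB card. Treat the result as a first guess: real memory use depends on context length and batch size, so step the value down if loading fails.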
Benchmarks
AMD RX 7800 XT (16GB) — Mistral 7B Q4_K_M:
| Backend | Tokens/sec | Notes |
|---|---|---|
| Vulkan | 98.2 | New default for AMD |
| ROCm | 95.7 | Requires Linux + ROCm SDK |
| CPU (32 threads) | 18.4 | Ryzen 9 7950X |
Intel Arc A770 (16GB):
| Backend | Tokens/sec |
|---|---|
| Vulkan | 72.1 |
| CPU | 14.8 |
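To put the tables in relative terms, the ratios can be worked out directly from the figures above. This is only arithmetic on the published numbers; `speedup` is a hypothetical helper name, and nothing here re-measures anything.

```shell
#!/bin/sh
# Ratio of two tokens/sec figures, rounded to two decimals.
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speedup 98.2 18.4   # RX 7800 XT: Vulkan vs CPU
speedup 98.2 95.7   # RX 7800 XT: Vulkan vs ROCm
speedup 72.1 14.8   # Arc A770:   Vulkan vs CPU
```

That works out to 5.34x over the CPU on the RX 7800 XT, 1.03x over ROCm (about 3% ahead), and 4.87x over the CPU on the Arc A770. To collect comparable numbers on your own hardware, llama.cpp includes a llama-bench tool that accepts the same -m and -ngl flags; whether the figures above were produced with it is not stated here.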
Limitations
- Vulkan compute shaders can't always match hand-tuned CUDA kernels: on NVIDIA cards, expect roughly 15% lower throughput than the CUDA backend
- Requires a GPU and driver that support Vulkan 1.2 or later
- Some exotic quantization types are not yet supported
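The Vulkan 1.2 requirement is easy to check before you build. The sketch below assumes the `vulkaninfo` tool (shipped with the Vulkan SDK and the vulkan-tools package) is installed and that its `--summary` output contains lines with "apiVersion" and a dotted version number. The exact layout differs between tool versions, so the grep pattern is a guess that may need adjusting. `vulkan_version_ok` is a hypothetical helper, not part of llama.cpp.

```shell
#!/bin/sh
# Succeeds (exit 0) when a dotted numeric version such as "1.3.280"
# is at least 1.2.
vulkan_version_ok() {
  major=${1%%.*}     # text before the first dot
  rest=${1#*.}       # text after the first dot
  minor=${rest%%.*}  # text before the next dot
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 2 ]; }
}

# Assumption: apiVersion lines include a version like 1.3.280.
if command -v vulkaninfo >/dev/null 2>&1; then
  vulkaninfo --summary 2>/dev/null |
    grep -i 'apiVersion' |
    grep -oE '[0-9]+\.[0-9]+\.[0-9]+' |
    while read -r v; do
      if vulkan_version_ok "$v"; then
        echo "$v: OK for the Vulkan backend"
      else
        echo "$v: below Vulkan 1.2"
      fi
    done
fi
```

If `vulkaninfo` is missing or prints nothing, that usually points to a missing driver or loader rather than an old GPU, so install the vendor's current driver before ruling the card out.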
Bottom line
If you have an AMD or Intel GPU, this is a game-changer. A two-command build, no driver drama, and performance that finally makes local LLMs practical outside the NVIDIA ecosystem.