llama.cpp gets Vulkan backend

The wait is over. llama.cpp now ships with a production-ready Vulkan backend, bringing GPU-accelerated inference to nearly any GPU with a Vulkan 1.2+ driver, on Windows, Linux, and Android.

Why this is huge

Until now, GPU inference in llama.cpp meant:

CUDA — NVIDIA only
Metal — Apple only
ROCm — AMD Linux only (and finicky)

Vulkan changes everything. It runs on NVIDIA, AMD, Intel, and mobile GPUs across Windows, Linux, and Android. One backend, everywhere.

Quick setup

# Build with Vulkan (requires Vulkan SDK)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

# Run any GGUF model on GPU
./build/bin/llama-cli \
  -m mistral-7b-q4_k_m.gguf \
  -ngl 99 \
  -p "Write a haiku about local AI"

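The -ngl flag sets how many model layers are offloaded to the GPU. 99 is more than a 7B model has, so every layer ends up on the GPU. Lower the number if the model does not fit in VRAM; the remaining layers run on the CPU.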
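If the model does not fit, a starting value for -ngl can be guessed from the file size and the free VRAM. The sketch below is only a rough estimate: `estimate_ngl` is a hypothetical helper (not part of llama.cpp), it assumes all layers are about the same size, and the reserve for the KV cache and scratch buffers is a number you pick yourself.

```shell
#!/bin/sh
# Rough guess at how many layers fit in VRAM.
# Hypothetical helper, not a llama.cpp tool. Assumptions: layers are roughly
# equal in size, and RESERVE_MIB covers the KV cache and scratch buffers.
# usage: estimate_ngl MODEL_MIB N_LAYERS VRAM_MIB RESERVE_MIB
estimate_ngl() {
    model_mib=$1; n_layers=$2; vram_mib=$3; reserve_mib=$4
    usable=$((vram_mib - reserve_mib))
    if [ "$usable" -le 0 ]; then echo 0; return; fi
    # Layers that fit = usable / (model size per layer); integer division rounds down.
    fit=$((usable * n_layers / model_mib))
    if [ "$fit" -gt "$n_layers" ]; then fit=$n_layers; fi
    echo "$fit"
}

# Illustrative numbers: a 4200 MiB model with 32 layers,
# a 4096 MiB card, 1024 MiB held back.
estimate_ngl 4200 32 4096 1024   # prints 23
```

Treat the result as a first try: if llama-cli still runs out of memory, step the value down a few layers at a time.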

Benchmarks

AMD RX 7800 XT (16GB) — Mistral 7B Q4_K_M:

Backend	Tokens/sec	Notes
Vulkan	98.2	New default for AMD
ROCm	95.7	Requires Linux + ROCm SDK
CPU (32 threads)	18.4	Ryzen 9 7950X

Intel Arc A770 (16GB):

Backend	Tokens/sec
Vulkan	72.1
CPU	14.8
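To put the tables in relative terms, the ratios can be worked out directly from the numbers above. This is plain arithmetic on the reported figures, not a new measurement; `speedup` is just an illustrative helper name.

```shell
#!/bin/sh
# Ratio of two tokens/sec figures, rounded to two decimals.
# usage: speedup FASTER SLOWER
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speedup 98.2 18.4   # RX 7800 XT, Vulkan vs CPU:  prints 5.34
speedup 98.2 95.7   # RX 7800 XT, Vulkan vs ROCm: prints 1.03
speedup 72.1 14.8   # Arc A770,   Vulkan vs CPU:  prints 4.87
```

So on these figures Vulkan is roughly 5x faster than the CPU on both cards, and within a few percent of ROCm on the RX 7800 XT.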

Limitations

Vulkan compute shaders can't always match hand-tuned CUDA kernels: on NVIDIA cards, expect roughly 15% lower throughput than the CUDA backend
Requires a GPU and driver that support Vulkan 1.2 or later
Some exotic quantization types are not yet supported
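The Vulkan 1.2+ requirement is easy to check before building. The snippet below is a minimal sketch: `vulkan_at_least_1_2` is a hypothetical helper that only compares a version string, and it assumes you have already read that string yourself, for example from the apiVersion field that `vulkaninfo --summary` reports for each device (the exact output format varies between vulkaninfo releases, so no parsing is attempted here).

```shell
#!/bin/sh
# Succeeds if a Vulkan API version string (e.g. "1.3.260") is at least 1.2.
# Hypothetical helper; obtaining the string is left to the caller.
vulkan_at_least_1_2() {
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}   # text before the next dot
    [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 2 ]; }
}

# Example with a made-up version string.
if vulkan_at_least_1_2 "1.3.260"; then
    echo "ok"
else
    echo "too old"
fi
```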

Bottom line

If you have an AMD or Intel GPU, this is a game-changer. A two-command build once the Vulkan SDK is installed, no driver drama, and performance that finally makes local LLMs practical outside the NVIDIA ecosystem.