Skip to content

MLX for text: Apple's secret weapon for local AI

Apple's MLX framework was initially focused on image models. With the latest release, text generation is now a first-class citizen — and the results on Apple Silicon are stunning.

Why MLX matters

Unlike llama.cpp (which uses Metal shaders) or Ollama (which wraps llama.cpp), MLX is a native Apple Silicon framework from Apple's ML research team. It uses:

  • Unified memory: No CPU/GPU copies — weights live in one place
  • Lazy computation: Builds a compute graph, then optimizes and executes
  • Multi-device: Single API for CPU, GPU, and ANE (Neural Engine)

The result? Performance that sometimes exceeds llama.cpp by 20-30% on the same hardware.

Benchmarks

Tested on MacBook Pro M3 Pro (18GB) with Mistral 7B Q4:

Framework Prompt tok/s Gen tok/s Memory
MLX (text) 145 55 6.2 GB
llama.cpp 112 42 6.8 GB
Ollama 110 41 6.8 GB

Getting started

pip install mlx mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

response = generate(
    model, tokenizer,
    prompt="Explain quantum computing in simple terms",
    max_tokens=256,
    temp=0.7,
)

print(response)

The catch

MLX only runs on Apple Silicon (M1/M2/M3/M4). There's no x86 or CUDA support. It's also younger than llama.cpp, so the model ecosystem is smaller. But for Mac developers, it's already the fastest option.

Bottom line

If you develop on a Mac, MLX deserves a serious look. The performance advantage is real, and Apple's backing means it will only improve.