
Does DiffusionGemma Work with llama.cpp? The Actual Status

The short answer: standard llama.cpp cannot run DiffusionGemma. Support exists only in pull request #24423, which is unmerged as of this writing. That PR adds a dedicated binary called llama-diffusion-cli. Running the standard llama-cli against a DiffusionGemma GGUF fails with error loading model: unknown model architecture: 'diffusion-gemma'.
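One way to tell in advance whether a GGUF file will hit this error is to read the general.architecture key from its metadata. The following is a minimal sketch, assuming the GGUF v2/v3 header layout (little-endian, 64-bit counts) and assuming the file stores the same architecture string that appears in the error message. read_gguf_architecture is a hypothetical helper written for this guide, not part of llama.cpp.

```python
import struct
from typing import BinaryIO, Optional

# Byte sizes of GGUF scalar metadata types, keyed by type id
# (uint8, int8, uint16, int16, uint32, int32, float32, bool, uint64, int64, float64).
_SCALAR_SIZES = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
_STRING, _ARRAY = 8, 9


def _read(f: BinaryIO, fmt: str) -> int:
    """Read one little-endian integer described by a struct format string."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f: BinaryIO) -> str:
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    data = f.read(length)
    if len(data) != length:
        raise ValueError("unexpected end of file")
    return data.decode("utf-8")


def _skip_value(f: BinaryIO, vtype: int) -> None:
    """Advance past one metadata value without decoding it."""
    if vtype in _SCALAR_SIZES:
        f.seek(_SCALAR_SIZES[vtype], 1)
    elif vtype == _STRING:
        f.seek(_read(f, "<Q"), 1)
    elif vtype == _ARRAY:
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        if elem_type in _SCALAR_SIZES:
            f.seek(_SCALAR_SIZES[elem_type] * count, 1)
        else:
            for _ in range(count):
                _skip_value(f, elem_type)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_architecture(f: BinaryIO) -> Optional[str]:
    """Return the general.architecture string, or None if the key is absent."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "<I")
    if version < 2:
        raise ValueError("GGUF v1 uses 32-bit counts; not handled in this sketch")
    _read(f, "<Q")  # tensor count, unused here
    kv_count = _read(f, "<Q")
    for _ in range(kv_count):
        key = _read_string(f)
        vtype = _read(f, "<I")
        if key == "general.architecture" and vtype == _STRING:
            return _read_string(f)
        _skip_value(f, vtype)
    return None
```

Open the model with open(path, "rb") and pass the file object in. If the returned string is one your llama.cpp build does not recognize, loading will fail regardless of quantization.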

Why DiffusionGemma needs its own binary

DiffusionGemma is not a renamed Gemma 4 checkpoint. It uses discrete text diffusion: instead of predicting one token at a time left-to-right, it starts with a fully masked 256-token canvas and repeatedly denoises the whole block in parallel. This requires bidirectional attention during generation, custom sampling behavior at each denoising step, and a fundamentally different model runner — none of which exist in the standard llama.cpp autoregressive path.
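The loop described above can be illustrated with a toy sketch. This is not DiffusionGemma's actual sampler: toy_predict is a hypothetical stand-in for the model that returns a random token and confidence for every position, and the "commit the most confident positions at each step" schedule is one common masked-diffusion strategy, assumed here purely for illustration.

```python
import random

MASK = -1  # hypothetical sentinel id for a still-masked position


def toy_predict(canvas, vocab_size, rng):
    """Stand-in for the model: propose a (token, confidence) pair for every
    position at once. A real model would attend bidirectionally over the
    whole canvas, including positions that are already filled in."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in canvas]


def denoise(canvas_len=256, steps=128, vocab_size=1000, seed=0):
    """Start from a fully masked canvas and unmask it over a fixed number of steps."""
    rng = random.Random(seed)
    canvas = [MASK] * canvas_len
    for step in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        proposals = toy_predict(canvas, vocab_size, rng)
        # Spread the remaining masked positions evenly over the remaining steps
        # (ceiling division, so the final step always finishes the canvas).
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)
        # Commit only the k proposals the "model" is most confident about.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:k]:
            canvas[i] = proposals[i][0]
    return canvas
```

The point of the sketch is the control flow: every step scores the whole block in parallel, which is why the autoregressive token-by-token runner in llama-cli cannot simply be reused.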

PR #24423 implements this as a separate binary (llama-diffusion-cli) rather than patching the existing llama-cli. Until that PR merges into main, no official llama.cpp release will contain it.

What is PR #24423 and how do you use it?

PR #24423 is authored by danielhanchen (an Unsloth co-founder) and adds the diffusion-gemma architecture to the llama.cpp codebase. The PR has been active since DiffusionGemma's launch on June 10, 2026, and community members have published unofficial prebuilt binaries for Linux/WSL2 CUDA and Windows CPU while the PR is pending.

To build from the PR branch yourself:

# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Fetch and check out the PR branch
git fetch origin pull/24423/head:diffusion-gemma-pr
git checkout diffusion-gemma-pr

# Build (CUDA example)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# The new binary will be at:
./build/bin/llama-diffusion-cli

For CPU-only builds, omit the -DGGML_CUDA=ON flag.

Running a model

Download a trusted DiffusionGemma GGUF (Unsloth publishes the most widely used ones at unsloth/diffusiongemma-26B-A4B-it-GGUF on Hugging Face), then:

./build/bin/llama-diffusion-cli \
  -m ./diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -p "Explain the difference between diffusion and autoregressive text generation." \
  --diffusion-steps 128

The --diffusion-steps parameter controls how many denoising passes the model runs. More steps generally mean higher quality but slower generation. Start with 128 and adjust from there.
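To compare step counts on your own hardware, a small wrapper can time the same prompt at several settings. This is a minimal sketch that uses only the flags shown in the command above (-m, -p, --diffusion-steps); build_command and time_run are hypothetical helper names, and the measured wall-clock time includes model loading, so treat it as a rough comparison rather than a benchmark.

```python
import subprocess
import time


def build_command(binary, model_path, prompt, steps):
    """Assemble the argument list for one llama-diffusion-cli invocation."""
    return [binary, "-m", model_path, "-p", prompt, "--diffusion-steps", str(steps)]


def time_run(binary, model_path, prompt, steps):
    """Run the CLI once and return (elapsed_seconds, stdout).
    Elapsed time includes model load, not just generation."""
    cmd = build_command(binary, model_path, prompt, steps)
    start = time.perf_counter()
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return time.perf_counter() - start, result.stdout
```

Calling time_run with steps of 64, 128, and 256 on the same prompt gives a first impression of how the quality and speed trade off on a given machine.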

Memory requirements

The model is based on the Gemma 4 26B A4B MoE architecture, so its memory requirements match that model's:

  • Q4_K_M: approximately 14.4–18 GB (Unsloth's practical measurement is 18 GB)
  • Q8: approximately 28 GB
  • BF16: approximately 52–58 GB

The model activates only ~3.8B parameters per forward pass, but all 26B parameters must still be loaded in memory. The MoE sparsity reduces compute per pass, not the memory footprint.
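The figures above can be roughly reproduced from the parameter count and the bits stored per weight. This is a back-of-the-envelope sketch under stated assumptions: a nominal 26e9 total parameters, about 4.85 bits per weight for Q4_K_M (an approximate figure that varies with the tensor mix), 8.5 bits per weight for Q8_0 (assumed to be what "Q8" refers to), and 16 for BF16. It covers weights only and ignores activations and runtime overhead, which is one plausible reason practical measurements come out higher.

```python
# Approximate bits stored per weight for each format (assumptions, see lead-in).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "BF16": 16.0}


def weight_gb(total_params, quant):
    """Estimate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return total_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{weight_gb(26e9, quant):.1f} GB")
```

Note that the estimate uses the total parameter count, not the active count: every expert has to be resident even though only a few are used per token.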

Speed: what the numbers actually mean

Google claims up to 4x faster generation than standard Gemma 4 and over 1,000 tokens per second on a single H100. Those numbers are real on the hardware they describe. What they do not tell you:

The speed advantage is conditional. DiffusionGemma's parallel generation shifts the computational profile from memory-bandwidth-bound (typical of autoregressive decoding) to compute-bound (typical of diffusion decoding). On high-end NVIDIA GPUs with abundant compute (RTX 3090, 4090, A100, H100), this works in DiffusionGemma's favor. On lower-end GPUs (RTX 3060, 4060) and on Apple Silicon, compute becomes the bottleneck and the speed advantage may disappear entirely. Check your hardware against a benchmark before expecting the headline numbers.

Quality is lower. Google explicitly states that DiffusionGemma's overall output quality is below standard Gemma 4. This is not a temporary limitation — it is the fundamental speed-quality tradeoff of the diffusion approach.

Runtime comparison table

Runtime | DiffusionGemma status (June 2026)
llama.cpp (main) | Not supported. Fails with unknown model architecture: 'diffusion-gemma'
llama.cpp (PR #24423) | Supported via llama-diffusion-cli. Must build from PR branch.
Unsloth Studio | Supported as of v0.1.463-beta / 2026.6.6. Easiest local path.
Ollama | Not supported. Issue #16664 open.
LM Studio | Not supported. Bundled runtime does not include PR #24423.
vLLM | Fully supported since June 10, 2026. Best path for serving.
HF Transformers | Supported via official Google release.

Which path to use

If you want a local GUI with minimal setup: Use Unsloth Studio. It supports DiffusionGemma natively as of its June 12 release and handles the inference parameters automatically.

If you are comfortable with the command line: Build from PR #24423 and use llama-diffusion-cli directly. This gives you the most control over diffusion parameters.

If you are running a Python environment: Use Hugging Face Transformers with the official google/diffusiongemma-26B-A4B-it weights.

If you need to serve multiple users: Use vLLM. It was the first inference engine to fully integrate DiffusionGemma and supports it natively.

If you use Ollama or standard LM Studio: Wait. Both are blocked on the same underlying PR and there is no workaround that does not involve building custom binaries.

FAQ

Can I just update llama.cpp from main and get DiffusionGemma support?
No. PR #24423 is not merged into main. Updating from the official repo will not add diffusion-gemma architecture support.

Is there a prebuilt llama-diffusion-cli binary I can download?
Unofficial community prebuilds exist for Linux/WSL2 CUDA (sm_86, RTX 30-series) and Windows CPU. Search GitHub for "llama-diffusion-cli-prebuilt". These are not official Google or ggml-org releases, so verify the source before running them.

Does DiffusionGemma produce better output than regular Gemma 4?
No. Google explicitly says output quality is lower than standard Gemma 4. The advantage is speed, particularly for code infilling and inline editing workflows where you can accept a quality trade.

Why does Ollama fail even though it wraps llama.cpp?
Ollama bundles its own version of llama.cpp that lags behind upstream. Even if you update Ollama, the bundled runtime does not include PR #24423.
