How to Run Gemma 4 with llama.cpp: GGUF Setup, Hardware & Quantization Guide

Updated Apr 4, 2026 · 10 min read
Tags: gemma 4, llama.cpp, local llm, gguf, setup guide, quantization

Gemma 4 launched on April 2, 2026 with first-day llama.cpp support. If you already know you want llama.cpp — not Ollama, not LM Studio — this guide gives you the exact commands and hardware numbers to get a stable first run, then scale up from there.

If you are still deciding which local runtime to use, read the "When llama.cpp makes sense" section first.


Gemma 4 model sizes at a glance

Gemma 4 ships in four variants. Before you download anything, match your hardware against the table below — this is the single most common source of problems.

| Variant | Architecture | Context | Modalities | 4-bit RAM | 8-bit RAM | FP16 RAM |
|---|---|---|---|---|---|---|
| E2B | Dense + PLE | 128K | Text, Image, Audio | ~4 GB | ~5–8 GB | ~10 GB |
| E4B | Dense + PLE | 128K | Text, Image, Audio | ~5.5–6 GB | ~9–12 GB | ~16 GB |
| 26B-A4B | MoE (4B active) | 256K | Text, Image | ~16–18 GB | ~28–30 GB | ~52 GB |
| 31B | Dense | 256K | Text, Image | ~17–20 GB | ~34–38 GB | ~62 GB |

RAM here means total available memory — the sum of your VRAM plus system RAM if you are offloading layers, or unified memory on Apple Silicon. If your total falls short of the 4-bit column, llama.cpp can still run the model using partial disk offload, but generation speed will drop noticeably.
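As a rough sketch of that rule, the helper below adds VRAM and system RAM and compares the total against a figure from the table. The function name fits_in_memory is hypothetical, the inputs are whole gigabytes, and the check ignores KV cache and operating-system overhead, so treat a narrow "fits" as a warning rather than a guarantee.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of llama.cpp): compare total memory
# (VRAM + system RAM, whole GB) with a model's approximate requirement.
fits_in_memory() {
  local vram_gb=$1 ram_gb=$2 need_gb=$3
  local total=$(( vram_gb + ram_gb ))
  if (( total >= need_gb )); then
    echo "fits: ${total} GB available, ${need_gb} GB needed"
  else
    echo "short: ${total} GB available, ${need_gb} GB needed"
  fi
}

# Example: 8 GB VRAM + 16 GB system RAM against the upper 4-bit
# estimate for 26B-A4B from the table (18 GB).
fits_in_memory 8 16 18
```

On Apple Silicon, pass 0 for VRAM and the unified memory size as system RAM.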

Quick picks:

  • Mac mini M4 (16 GB unified memory): E4B at Q8_0. 26B-A4B at Q4 needs ~16–18 GB, so it only runs with partial disk offload and at noticeably slower speeds.
  • 16 GB VRAM (e.g. RTX 4080): E4B at Q8_0 comfortably; 26B-A4B at Q4 is a tight fit and may need a few layers in system RAM.
  • 24 GB VRAM (RTX 3090 / 4090): 26B-A4B or 31B at Q4. 26B-A4B at Q8_0 (~28–30 GB) needs part of the model in system RAM.
  • 8 GB VRAM: E2B or E4B at Q4 only.

26B-A4B vs 31B: The MoE 26B activates only 4 billion parameters per token, so it generates faster than the dense 31B. All 26B parameters still have to be loaded, which is why its memory footprint in the table is only slightly below the 31B's. Choose 26B-A4B when speed matters and your RAM is tight; choose 31B when you want the highest quality and have headroom.


When llama.cpp makes sense

llama.cpp is a good fit when you want:

  • Raw control — custom sampling parameters, KV cache tuning, server mode with OpenAI-compatible endpoints, grammar-constrained generation.
  • CPU-primary inference: llama.cpp is among the most optimized C++ runtimes for CPU-only workloads, with AVX2/AVX-512 code paths on x86 and NEON on ARM. On Macs, Metal additionally provides GPU offload.
  • Scripting and CI pipelines — a single binary with no Python dependency makes integration straightforward.
  • Multimodal inference via the llama-mtmd-cli and llama-server with --mmproj.

If you want the easiest possible first run — a one-command download and chat — Ollama or LM Studio are lower-friction entry points. Come back here when you need more control.


Step 1 — Build llama.cpp

Clone the repository first. Always use the master branch, because tagged releases lag behind CUDA and Metal fixes. Stay in the parent directory afterwards: every command below uses paths that start with llama.cpp/.

git clone https://github.com/ggml-org/llama.cpp

Linux with NVIDIA GPU (CUDA)

Make sure your CUDA toolkit is installed (nvcc --version to check), then:

apt-get update
apt-get install -y pciutils build-essential cmake curl libcurl4-openssl-dev

cmake llama.cpp -B llama.cpp/build \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_CUDA=ON

cmake --build llama.cpp/build \
  --config Release -j --clean-first \
  --target llama-cli llama-mtmd-cli llama-server llama-gguf-split

cp llama.cpp/build/bin/llama-* llama.cpp/

Verify GPU offload is working after the build:

./llama.cpp/llama-cli -m your-model.gguf -p "Hello" -n 5 --n-gpu-layers 99

If you see offloaded 0/N layers, the binary was compiled without CUDA — clean the build/ directory and rebuild from scratch.
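To make that check scriptable, a small filter can read the log and report what it finds. This is a sketch that assumes the log contains a phrase of the form "offloaded X/Y layers"; the function name check_offload and its three output labels are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: classify GPU offload from llama.cpp log text read on stdin.
# Assumes the log contains a phrase of the form "offloaded X/Y layers".
check_offload() {
  local line n
  # Keep the last matching phrase; "|| true" tolerates logs with no match.
  line=$(grep -Eo 'offloaded [0-9]+/[0-9]+ layers' | tail -n 1) || true
  if [ -z "$line" ]; then
    echo "unknown"
    return 2
  fi
  n=${line#offloaded }   # strip the leading word, leaving "X/Y layers"
  n=${n%%/*}             # keep only X
  if [ "$n" -eq 0 ]; then
    echo "cpu-only"
    return 1
  fi
  echo "gpu:$n"
}

# Usage (not run here):
#   ./llama.cpp/llama-cli -m your-model.gguf -p "Hello" -n 5 \
#     --n-gpu-layers 99 2>&1 | check_offload
```

The non-zero return codes let a build script stop early when the binary turns out to be CPU-only.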

macOS (Apple Silicon — Metal)

Metal is enabled by default on macOS. You do not need -DGGML_CUDA=ON. Just build normally:

brew install cmake

cmake llama.cpp -B llama.cpp/build \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_CUDA=OFF

cmake --build llama.cpp/build \
  --config Release -j --clean-first \
  --target llama-cli llama-mtmd-cli llama-server

cp llama.cpp/build/bin/llama-* llama.cpp/

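On Apple Silicon, "VRAM" and system RAM are the same unified memory pool, so a 24 GB M4 Pro shares one 24 GB pool between model weights, the KV cache, macOS, and your other apps. Plan for somewhat less than the full 24 GB being available to the model.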

CPU only (no GPU)

cmake llama.cpp -B llama.cpp/build \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_CUDA=OFF

cmake --build llama.cpp/build \
  --config Release -j$(nproc) \
  --target llama-cli llama-server

cp llama.cpp/build/bin/llama-* llama.cpp/

CMake automatically detects AVX2/AVX-512 on your host CPU and enables the appropriate optimizations. CPU inference is slower but fully functional.


Step 2 — Choose a GGUF and download it

Which quantization to pick

| Quantization | File size (approx.) | Quality | Best for |
|---|---|---|---|
| Q8_0 | Roughly 1.7–2× the Q4 size | Closest to FP16 | E2B and E4B when you have the RAM headroom |
| Q4_K_M | Medium | Good balance | 26B-A4B and 31B on 24 GB VRAM |
| UD-Q4_K_XL | Slightly larger than Q4_K_M | Better than Q4_K_M | 26B-A4B and 31B; Unsloth's Dynamic quantization |
| Q2_K | Smallest | Noticeable quality drop | Only if you have no other option |
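For a back-of-the-envelope file size, multiply the parameter count by the average bits per weight and divide by 8. The sketch below does that with awk. The helper name gguf_size_gb is hypothetical. Q8_0 stores 8.5 bits per weight (32 8-bit values plus one 16-bit scale per block); the 4.85 figure for Q4_K_M is an approximation, and real files differ because some tensors are kept at other precisions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate GGUF size in GB (10^9 bytes).
#   $1 = parameter count in billions
#   $2 = average bits per weight for the quantization
gguf_size_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# 31B dense model at Q8_0 (8.5 bits per weight)
gguf_size_gb 31 8.5
# 31B dense model at Q4_K_M (assumed ~4.85 bits per weight)
gguf_size_gb 31 4.85
```

The estimates cover weights only; the RAM figures in the model-size table are higher because they also include runtime overhead.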

The recommended starting points from Unsloth (who maintain the primary GGUF collection):

  • E2B / E4B → start with Q8_0
  • 26B-A4B / 31B → start with UD-Q4_K_XL

Download via Hugging Face CLI

Install the CLI once:

pip install huggingface_hub hf_transfer

Then download your chosen model. For example, the 26B-A4B at UD-Q4_K_XL:

export LLAMA_CACHE="unsloth/gemma-4-26B-A4B-it-GGUF"

huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
  --local-dir unsloth/gemma-4-26B-A4B-it-GGUF \
  --include "*UD-Q4_K_XL*"

For multimodal inference (images), also download the projector file:

huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
  --local-dir unsloth/gemma-4-26B-A4B-it-GGUF \
  --include "*mmproj-BF16*" "*UD-Q4_K_XL*"

All four GGUF collections:

  • unsloth/gemma-4-E2B-it-GGUF
  • unsloth/gemma-4-E4B-it-GGUF
  • unsloth/gemma-4-26B-A4B-it-GGUF
  • unsloth/gemma-4-31B-it-GGUF

Step 3 — Run text inference

llama.cpp automatically sets the context length — you do not need to pass -c. Use the parameters below, which match Google's official recommended defaults.

Interactive chat with llama-cli

E4B (Q8_0):

export LLAMA_CACHE="unsloth/gemma-4-E4B-it-GGUF"

./llama.cpp/llama-cli \
  -hf unsloth/gemma-4-E4B-it-GGUF:Q8_0 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64 \
  -cnv

26B-A4B (UD-Q4_K_XL):

export LLAMA_CACHE="unsloth/gemma-4-26B-A4B-it-GGUF"

./llama.cpp/llama-cli \
  -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64 \
  -cnv

31B (UD-Q4_K_XL):

export LLAMA_CACHE="unsloth/gemma-4-31B-it-GGUF"

./llama.cpp/llama-cli \
  -hf unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64 \
  -cnv

OpenAI-compatible server (llama-server)

Start a local server on port 8080 that any tool with an OpenAI client can call:

./llama.cpp/llama-server \
  -m unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64 \
  --port 8080

Then test with curl:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4",
    "messages": [{"role": "user", "content": "Explain attention in one paragraph."}]
  }'

Recommended inference parameters

| Parameter | Value | Notes |
|---|---|---|
| --temp | 1.0 | Google's official default |
| --top-p | 0.95 | Google's official default |
| --top-k | 64 | Google's official default |
| --repeat-penalty | 1.0 (disabled) | Enable only if you see looping |
| Context length | Auto | llama.cpp sets this automatically |

Context limits: E2B and E4B support up to 128K tokens. 26B-A4B and 31B support up to 256K. In practice, start with 32K (for example, -c 32768) for better responsiveness, and increase it only if your use case requires long documents.
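If you script the launch, a small helper can clamp the requested context to the variant's limit before it is passed to -c. This is a sketch: the function name ctx_flag is hypothetical, and it assumes 128K and 256K mean 131072 and 262144 tokens.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a "-c N" argument, clamped to the variant's
# maximum context. Assumes 128K = 131072 and 256K = 262144 tokens.
ctx_flag() {
  local variant=$1 want=$2 max
  case "$variant" in
    E2B|E4B)      max=131072 ;;
    26B-A4B|31B)  max=262144 ;;
    *) echo "unknown variant: $variant" >&2; return 1 ;;
  esac
  if (( want > max )); then
    want=$max
  fi
  echo "-c $want"
}

# Usage (not run here); the unquoted substitution splits into two arguments:
#   ./llama.cpp/llama-server -m your-model.gguf $(ctx_flag E4B 32768) --port 8080
```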

Enabling thinking mode

Gemma 4 supports a reasoning/thinking mode. To enable it, add <|think|> at the start of your system prompt. To disable it when using the server:

./llama.cpp/llama-server \
  -m your-model.gguf \
  --chat-template-kwargs '{"enable_thinking":false}'

On Windows PowerShell, escape the quotes:

--chat-template-kwargs "{\"enable_thinking\":false}"
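To enable thinking through the server API instead, the system message in the request can start with <|think|>. The sketch below only builds the JSON body. The function name think_payload and the system prompt text are placeholders, and the question is inserted without JSON escaping, so it must not contain double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a chat request whose system prompt starts
# with <|think|>. No JSON escaping is done on the question.
think_payload() {
  local question=$1
  printf '{"model":"gemma-4","messages":[{"role":"system","content":"<|think|>%s"},{"role":"user","content":"%s"}]}\n' \
    "You are a concise assistant." "$question"
}

# Usage (not run here), assuming llama-server is listening on port 8080:
#   think_payload "Why is the sky blue?" |
#     curl http://localhost:8080/v1/chat/completions \
#       -H "Content-Type: application/json" -d @-
```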

Step 4 — Multimodal (image) inference

Gemma 4 supports image inputs in llama.cpp from day one, but it requires a second GGUF file: the multimodal projector (mmproj). The projector handles image encoding before the language model sees it.

What you need

  1. The language model GGUF (same as text inference)
  2. The mmproj-BF16.gguf file from the same Hugging Face repo

You already downloaded both if you used the --include "*mmproj-BF16*" flag above.

Run with llama-mtmd-cli (CLI)

./llama.cpp/llama-mtmd-cli \
  --model unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --mmproj unsloth/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64

In the interactive chat, load an image with the /image path/to/image.jpg command before asking about it. For a one-shot run, pass --image path/to/image.jpg together with -p.

Run with llama-server (API)

./llama.cpp/llama-server \
  --model unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
  --mmproj unsloth/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64 \
  --port 8080

Note on audio: E2B and E4B support audio inputs natively, but audio support in llama.cpp is still being actively developed as of April 2026. Text and image inference are fully stable.


Troubleshooting common issues

"offloaded 0 layers" after build

The CUDA build was not linked correctly. Clean the build directory entirely and rebuild:

rm -rf llama.cpp/build
# Then repeat the cmake steps with -DGGML_CUDA=ON

Out of memory (OOM) at load time

Your total memory is below the model size even after quantization. Options:

  1. Switch to a smaller quantization (Q4_K_M → Q2_K, or UD-Q4_K_XL → Q4_K_M).
  2. Drop to a smaller model variant (31B → 26B-A4B, or 26B-A4B → E4B).
  3. Add --n-gpu-layers N with a lower N to offload fewer layers to VRAM — the rest uses system RAM at reduced speed.
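For option 3, a first guess at N can be derived from the share of the model that fits in VRAM. The sketch below assumes all layers are the same size and reserves 1 GB of VRAM for the KV cache and buffers by default. The function name and the 48-layer count in the example are hypothetical; read the real layer count from the load log.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate how many layers fit in VRAM.
#   $1 = VRAM in MB, $2 = model file size in MB, $3 = total layer count,
#   $4 = VRAM to keep free in MB (default 1024).
# Assumes equally sized layers, which is only approximately true.
estimate_gpu_layers() {
  local vram_mb=$1 model_mb=$2 n_layers=$3 reserve_mb=${4:-1024}
  local usable=$(( vram_mb - reserve_mb ))
  if (( usable <= 0 )); then
    echo 0
    return
  fi
  local n=$(( usable * n_layers / model_mb ))
  if (( n > n_layers )); then
    n=$n_layers
  fi
  echo "$n"
}

# Example: 12 GB VRAM, an 18 GB model file, 48 layers (illustrative count)
estimate_gpu_layers 12288 18432 48
```

Start from the printed value and lower it further if loading still fails, since long contexts need more VRAM for the KV cache.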

GGML_ASSERT or crash with --image-min-tokens

Do not pass --image-min-tokens with Gemma 4. This flag conflicts with Gemma 4's non-causal attention architecture and causes an assertion failure. Use the default image token budget.

Generation loops or repeats

Add --repeat-penalty 1.05 to break out of repetition loops. Keep it at 1.0 (disabled) in normal operation — Gemma 4's architecture does not require it by default.

Slow generation on macOS despite Metal

Confirm the binary is using Metal:

./llama.cpp/llama-cli -m your-model.gguf -p "hi" -n 1 --verbose

Look for Metal in the backend line. If you see CPU only, set --n-gpu-layers 99 explicitly to force offload.


FAQ

Does llama.cpp officially support Gemma 4? Yes. Gemma 4 support was included at launch on April 2, 2026, with contributions tracked in the llama.cpp repository. All four model sizes work with llama-cli, llama-server, and llama-mtmd-cli.

Can I run Gemma 4 on a Mac mini? Yes. A Mac mini M4 with 16 GB unified memory can run E4B at Q8_0 comfortably; 26B-A4B at Q4 (~16–18 GB) only runs with partial disk offload, at reduced speed. The M4 Pro (24 GB) handles 26B-A4B at Q4 with headroom; Q8_0 (~28–30 GB) does not fit.

Do I need a GPU? No. llama.cpp runs on CPU only. GPU offload (CUDA or Metal) significantly improves tokens-per-second, but CPU inference is fully supported and practical for smaller models like E2B and E4B.

What is the difference between Q4_K_M and UD-Q4_K_XL? Q4_K_M is standard llama.cpp 4-bit quantization. UD-Q4_K_XL is Unsloth's Dynamic 4-bit format, which applies higher precision to the most important layers and lower precision to less critical ones. In practice, UD-Q4_K_XL is higher quality at a similar file size.

How do I use Gemma 4 with coding agents like Cursor or Continue? Start llama-server on port 8080 (or any port), then point your agent's OpenAI base URL to http://localhost:8080/v1. The /v1/chat/completions endpoint is fully OpenAI-compatible.


Next steps

Once text inference is stable, the natural next steps are:

  • Try the 26B-A4B for a significant quality jump over E4B if you have roughly 16–18 GB of memory for the 4-bit build.
  • Experiment with multimodal inputs using llama-mtmd-cli; all four variants accept images.
  • Compare llama.cpp with Ollama if you want a simpler day-to-day workflow and are comfortable trading some control for convenience.

The most common mistake is downloading the largest available model before confirming the smaller one runs well. A stable E4B setup is more useful than a 31B setup that runs at 1 token per second.
