How to Run Gemma 4 with llama.cpp: GGUF Setup, Hardware & Quantization Guide

Gemma 4 launched on April 2, 2026 with first-day llama.cpp support. If you already know you want llama.cpp — not Ollama, not LM Studio — this guide gives you the exact commands and hardware numbers to get a stable first run, then scale up from there.
If you are still deciding which local runtime to use, jump to "When llama.cpp makes sense" first.
Gemma 4 model sizes at a glance
Gemma 4 ships in four variants. Before you download anything, match your hardware against the table below; picking a model that does not fit in memory is the single most common source of problems.
| Variant | Architecture | Context | Modalities | 4-bit RAM | 8-bit RAM | FP16 RAM |
|---|---|---|---|---|---|---|
| E2B | Dense + PLE | 128K | Text, Image, Audio | ~4 GB | ~5–8 GB | ~10 GB |
| E4B | Dense + PLE | 128K | Text, Image, Audio | ~5.5–6 GB | ~9–12 GB | ~16 GB |
| 26B-A4B | MoE (4B active) | 256K | Text, Image | ~16–18 GB | ~28–30 GB | ~52 GB |
| 31B | Dense | 256K | Text, Image | ~17–20 GB | ~34–38 GB | ~62 GB |
RAM here means total available memory: the sum of your VRAM plus system RAM if you are offloading layers, or unified memory on Apple Silicon. If your total falls short of the 4-bit column, llama.cpp can still run the model because it memory-maps the GGUF file and pages weights in from disk, but generation speed will drop noticeably.
Quick picks:
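- Mac mini M4 (16 GB unified memory): E4B at Q8_0. 26B-A4B at Q4 needs ~16–18 GB, so it runs only with paging from disk and at slower speeds.
- 16 GB VRAM (for example RTX 4080): E4B at Q8_0 comfortably; 26B-A4B at Q4 only with some layers left in system RAM, since it needs ~16–18 GB.
- 24 GB VRAM (RTX 3090 / 4090): 26B-A4B or 31B at Q4. 26B-A4B at Q8_0 needs ~28–30 GB, so part of it would spill into system RAM.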
- 8 GB VRAM: E2B or E4B at Q4 only.
26B-A4B vs 31B: The MoE 26B activates only 4 billion parameters per forward pass, which makes it faster per token than the dense 31B. All 26B parameters still have to be loaded, so the memory saving is small (~16–18 GB vs ~17–20 GB at 4-bit). Choose 26B-A4B when speed matters and your RAM is tight; choose 31B when you want the highest quality and have headroom.
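As a rough illustration of the matching step, the sketch below turns the upper bounds of the 4-bit column into a lookup. The helper name pick_variant is hypothetical, the thresholds are the approximate figures from the table, and the result leaves no extra headroom for the KV cache, so treat it as a starting point only.

```shell
# pick_variant: print the largest Gemma 4 variant whose 4-bit footprint
# fits in the given amount of memory (whole GB).
# Hypothetical helper; thresholds are the approximate upper bounds of the
# "4-bit RAM" column in the table above.
pick_variant() {
  local total_gb=$1   # VRAM + system RAM, or unified memory
  if   [ "$total_gb" -ge 20 ]; then echo "31B"
  elif [ "$total_gb" -ge 18 ]; then echo "26B-A4B"
  elif [ "$total_gb" -ge 6 ];  then echo "E4B"
  elif [ "$total_gb" -ge 4 ];  then echo "E2B"
  else echo "none"
  fi
}
```

For example, pick_variant 16 prints E4B and pick_variant 24 prints 31B, which lines up with the quick picks above.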
When llama.cpp makes sense
llama.cpp is a good fit when you want:
- Raw control: custom sampling parameters, KV cache tuning, server mode with OpenAI-compatible endpoints, grammar-constrained generation.
- CPU-primary inference: llama.cpp is among the most optimized C++ runtimes for CPU-only workloads, with AVX2/AVX-512 code paths on x86, plus Metal GPU acceleration on Apple Silicon.
- Scripting and CI pipelines: a single binary with no Python dependency makes integration straightforward.
- Multimodal inference via llama-mtmd-cli and llama-server with --mmproj.
If you want the easiest possible first run — a one-command download and chat — Ollama or LM Studio are lower-friction entry points. Come back here when you need more control.
Step 1 — Build llama.cpp
Clone the repository first. Always use the master branch, because tagged releases lag behind CUDA and Metal fixes. Stay in the parent directory afterwards: the build commands below refer to llama.cpp/ as a subdirectory.
git clone https://github.com/ggml-org/llama.cpp
Linux with NVIDIA GPU (CUDA)
Make sure your CUDA toolkit is installed (nvcc --version to check), then:
apt-get update
apt-get install -y pciutils build-essential cmake curl libcurl4-openssl-dev
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_CUDA=ON
cmake --build llama.cpp/build \
--config Release -j --clean-first \
--target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp/
Verify GPU offload is working after the build:
./llama.cpp/llama-cli -m your-model.gguf -p "Hello" -n 5 --n-gpu-layers 99
If you see offloaded 0/N layers, the binary was compiled without CUDA — clean the build/ directory and rebuild from scratch.
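If you would rather not read the log by eye, the check can be scripted. The sketch below assumes the log contains a line of the form "offloaded X/N layers", as described above; check_offload is a hypothetical helper, and the exact log wording may differ between llama.cpp versions.

```shell
# check_offload: read llama.cpp log text on stdin and report whether any
# layers were offloaded to the GPU.
# Hypothetical helper; assumes a log line like "offloaded 33/33 layers".
# Returns 0 if layers were offloaded, 1 if zero layers, 2 if no line found.
check_offload() {
  local line gpu_layers
  line=$(grep -Eo 'offloaded [0-9]+/[0-9]+ layers' | tail -n 1)
  if [ -z "$line" ]; then
    echo "no offload line found"
    return 2
  fi
  gpu_layers=${line#offloaded }   # "33/33 layers"
  gpu_layers=${gpu_layers%%/*}    # "33"
  if [ "$gpu_layers" -eq 0 ]; then
    echo "CPU only: $line"
    return 1
  fi
  echo "GPU offload OK: $line"
}

# Usage (assuming the run exits on its own after generating):
#   ./llama.cpp/llama-cli -m your-model.gguf -p "Hello" -n 5 --n-gpu-layers 99 2>&1 | check_offload
```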
macOS (Apple Silicon — Metal)
Metal is enabled by default on macOS. You do not need -DGGML_CUDA=ON. Just build normally:
brew install cmake
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_CUDA=OFF
cmake --build llama.cpp/build \
--config Release -j --clean-first \
--target llama-cli llama-mtmd-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
On Apple Silicon, "VRAM" and system RAM are the same unified memory pool, so a 24 GB M4 Pro can use most of that 24 GB for model weights; macOS and other running apps take the rest.
CPU only (no GPU)
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_CUDA=OFF
cmake --build llama.cpp/build \
--config Release -j$(nproc) \
--target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
CMake automatically detects AVX2/AVX-512 on your host CPU and enables the appropriate optimizations. CPU inference is slower but fully functional.
Step 2 — Choose a GGUF and download it
Which quantization to pick
| Quantization | File size (approx.) | Quality | Best for |
|---|---|---|---|
| Q8_0 | ~2× the Q4 size | Closest to FP16 | E2B and E4B when you have the RAM headroom |
| Q4_K_M | Medium | Good balance | 26B-A4B and 31B on 24 GB VRAM |
| UD-Q4_K_XL | Slightly larger than Q4_K_M | Better than Q4_K_M | 26B-A4B and 31B; Unsloth's Dynamic quantization |
| Q2_K | Smallest | Noticeable quality drop | Only if you have no other option |
The recommended starting points from Unsloth (who maintain the primary GGUF collection):
- E2B / E4B → start with Q8_0
- 26B-A4B / 31B → start with UD-Q4_K_XL
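To sanity-check a download against your disk and memory, file size can be estimated as parameters × bits per weight ÷ 8. The sketch below is a back-of-the-envelope helper (gguf_size_gb is a hypothetical name). Q8_0 stores 8.5 bits per weight; the ~4.85 figure for Q4_K_M is an approximation that varies by model, and the parameter count should be the total stored parameters, which may differ from the number in the model name.

```shell
# gguf_size_gb: estimate GGUF file size in GB.
#   $1 = total parameters, in billions
#   $2 = average bits per weight (8.5 for Q8_0; roughly 4.85 for Q4_K_M,
#        which is an approximation and varies by model)
# Hypothetical helper; ignores metadata and runtime overhead such as the KV cache.
gguf_size_gb() {
  local params_billion=$1 bits_per_weight=$2
  awk -v p="$params_billion" -v b="$bits_per_weight" \
    'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Usage:
#   gguf_size_gb 31 8.5    # dense 31B at Q8_0
#   gguf_size_gb 31 4.85   # dense 31B at roughly Q4_K_M
```

The results land just below the 8-bit and 4-bit RAM columns in the sizing table, which is expected because the table also has to cover runtime overhead.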
Download via Hugging Face CLI
Install the CLI once:
pip install huggingface_hub hf_transfer
Then download your chosen model. For example, the 26B-A4B at UD-Q4_K_XL:
export LLAMA_CACHE="unsloth/gemma-4-26B-A4B-it-GGUF"
huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
--local-dir unsloth/gemma-4-26B-A4B-it-GGUF \
--include "*UD-Q4_K_XL*"
For multimodal inference (images), also download the projector file:
huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
--local-dir unsloth/gemma-4-26B-A4B-it-GGUF \
--include "*mmproj-BF16*" "*UD-Q4_K_XL*"
All four GGUF collections:
- unsloth/gemma-4-E2B-it-GGUF
- unsloth/gemma-4-E4B-it-GGUF
- unsloth/gemma-4-26B-A4B-it-GGUF
- unsloth/gemma-4-31B-it-GGUF
Step 3 — Run text inference
llama.cpp automatically sets the context length — you do not need to pass -c. Use the parameters below, which match Google's official recommended defaults.
Interactive chat with llama-cli
E4B (Q8_0):
export LLAMA_CACHE="unsloth/gemma-4-E4B-it-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/gemma-4-E4B-it-GGUF:Q8_0 \
--temp 1.0 \
--top-p 0.95 \
--top-k 64 \
-cnv
26B-A4B (UD-Q4_K_XL):
export LLAMA_CACHE="unsloth/gemma-4-26B-A4B-it-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
--temp 1.0 \
--top-p 0.95 \
--top-k 64 \
-cnv
31B (UD-Q4_K_XL):
export LLAMA_CACHE="unsloth/gemma-4-31B-it-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL \
--temp 1.0 \
--top-p 0.95 \
--top-k 64 \
-cnv
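The three commands above differ only in the repository name and quantization tag, so a small wrapper can remove the duplication. This is a sketch: hf_target is a hypothetical helper that encodes the repository naming and the recommended starting quantizations listed earlier in this guide.

```shell
# hf_target: print the "<repo>:<quant>" string to pass to -hf for a variant.
# Hypothetical helper; follows the starting points in this guide
# (Q8_0 for E2B/E4B, UD-Q4_K_XL for 26B-A4B/31B).
hf_target() {
  local variant=$1 quant
  case "$variant" in
    E2B|E4B)     quant="Q8_0" ;;
    26B-A4B|31B) quant="UD-Q4_K_XL" ;;
    *) echo "unknown variant: $variant" >&2; return 1 ;;
  esac
  echo "unsloth/gemma-4-${variant}-it-GGUF:${quant}"
}

# Usage:
#   ./llama.cpp/llama-cli -hf "$(hf_target E4B)" \
#     --temp 1.0 --top-p 0.95 --top-k 64 -cnv
```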
OpenAI-compatible server (llama-server)
Start a local server on port 8080 that any tool with an OpenAI client can call:
./llama.cpp/llama-server \
-m unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
--temp 1.0 \
--top-p 0.95 \
--top-k 64 \
--port 8080
Then test with curl:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-4",
"messages": [{"role": "user", "content": "Explain attention in one paragraph."}]
}'
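Large models take a while to load, and requests sent before loading finishes will fail. In scripts it can help to wait for the server first. The sketch below polls llama-server's /health endpoint, which answers with an error status until the model is ready; wait_for_server and its default retry values are assumptions for illustration.

```shell
# wait_for_server: poll llama-server's /health endpoint until it succeeds.
#   $1 = base URL (default http://localhost:8080)
#   $2 = maximum attempts (default 30)
#   $3 = seconds to sleep between attempts (default 2)
# Hypothetical helper; returns 0 once the server is ready, 1 on timeout.
wait_for_server() {
  local base_url=${1:-http://localhost:8080} tries=${2:-30} delay=${3:-2}
  local i=1
  while [ "$i" -le "$tries" ]; do
    # -f makes curl fail on HTTP errors such as 503 while the model loads
    if curl -sf --max-time 2 -o /dev/null "$base_url/health"; then
      echo "server ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "server not ready after $tries attempt(s)" >&2
  return 1
}

# Usage:
#   wait_for_server http://localhost:8080 60 2 && echo "safe to send requests"
```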
Recommended inference parameters
| Parameter | Value | Notes |
|---|---|---|
| --temp | 1.0 | Google's official default |
| --top-p | 0.95 | Google's official default |
| --top-k | 64 | Google's official default |
| --repeat-penalty | 1.0 (disabled) | Enable only if you see looping |
| Context length | Auto | llama.cpp sets this automatically |
Context limits: E2B and E4B support up to 128K tokens. 26B-A4B and 31B support up to 256K. In practice, start with 32K (pass -c 32768) for better responsiveness and lower memory use, and only increase it if your use case requires long documents.
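If you script the launch, a small guard keeps the requested context within each variant's limit. This is a sketch: ctx_flag is a hypothetical helper, and it assumes 128K and 256K mean 131072 and 262144 tokens.

```shell
# ctx_flag: print a "-c N" flag, clamped to the variant's maximum context.
#   $1 = variant (E2B, E4B, 26B-A4B, 31B)
#   $2 = requested context in tokens (default 32768)
# Hypothetical helper; assumes 128K = 131072 and 256K = 262144 tokens.
ctx_flag() {
  local variant=$1 requested=${2:-32768} max
  case "$variant" in
    E2B|E4B)     max=131072 ;;
    26B-A4B|31B) max=262144 ;;
    *) echo "unknown variant: $variant" >&2; return 1 ;;
  esac
  if [ "$requested" -gt "$max" ]; then
    requested=$max
  fi
  echo "-c $requested"
}

# Usage (left unquoted on purpose so it expands to two arguments):
#   ./llama.cpp/llama-server -m your-model.gguf $(ctx_flag E4B 32768) --port 8080
```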
Enabling thinking mode
Gemma 4 supports a reasoning/thinking mode. To enable it, add <|think|> at the start of your system prompt. To disable it when using the server:
./llama.cpp/llama-server \
-m your-model.gguf \
--chat-template-kwargs '{"enable_thinking":false}'
On Windows PowerShell, escape the quotes:
--chat-template-kwargs "{\"enable_thinking\":false}"
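To show where the <|think|> marker goes when you call the server, the sketch below builds a chat request body with the marker at the start of the system prompt. thinking_payload is a hypothetical helper, the rest of the system prompt is a placeholder, and no JSON escaping is done, so keep the question free of double quotes and backslashes.

```shell
# thinking_payload: print a chat-completions request body whose system
# prompt starts with <|think|>, as described above.
# Hypothetical helper. The system prompt text after the marker is a placeholder.
# NOTE: the question is inserted verbatim with no JSON escaping.
thinking_payload() {
  local question=$1
  printf '{"model":"gemma-4","messages":[{"role":"system","content":"<|think|>You are a helpful assistant."},{"role":"user","content":"%s"}]}\n' "$question"
}

# Usage:
#   thinking_payload "What is 17 * 23?" | \
#     curl -s http://localhost:8080/v1/chat/completions \
#       -H "Content-Type: application/json" -d @-
```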
Step 4 — Multimodal (image) inference
Gemma 4 supports image inputs in llama.cpp from day one, but it requires a second GGUF file: the multimodal projector (mmproj). The projector handles image encoding before the language model sees it.
What you need
- The language model GGUF (same as text inference)
- The mmproj-BF16.gguf file from the same Hugging Face repo
You already downloaded both if you used the --include "*mmproj-BF16*" flag above.
Run with llama-mtmd-cli (CLI)
./llama.cpp/llama-mtmd-cli \
--model unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
--mmproj unsloth/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf \
--temp 1.0 \
--top-p 0.95 \
--top-k 64
In chat mode, load an image with the /image path/to/image.jpg command before typing your prompt. For a one-shot run, pass --image path/to/image.jpg together with -p on the command line.
Run with llama-server (API)
./llama.cpp/llama-server \
--model unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
--mmproj unsloth/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf \
--temp 1.0 \
--top-p 0.95 \
--top-k 64 \
--port 8080
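With the server running and the projector loaded, images can be sent through the chat completions endpoint as a base64 data URL in an image_url content part. The sketch below builds such a request body; image_payload is a hypothetical helper, it assumes a JPEG file, and it does no JSON escaping of the prompt.

```shell
# image_payload: print a chat-completions request body containing a text
# prompt and one image embedded as a base64 data URL.
#   $1 = path to the image (assumed to be a JPEG)
#   $2 = text prompt (inserted verbatim, no JSON escaping)
# Hypothetical helper, not part of llama.cpp.
image_payload() {
  local image_path=$1 prompt=$2 b64
  if [ ! -r "$image_path" ]; then
    echo "cannot read image: $image_path" >&2
    return 1
  fi
  # Strip newlines so the encoded data is a single line
  b64=$(base64 < "$image_path" | tr -d '\n')
  printf '{"messages":[{"role":"user","content":[{"type":"text","text":"%s"},{"type":"image_url","image_url":{"url":"data:image/jpeg;base64,%s"}}]}]}\n' "$prompt" "$b64"
}

# Usage:
#   image_payload photo.jpg "Describe this image." | \
#     curl -s http://localhost:8080/v1/chat/completions \
#       -H "Content-Type: application/json" -d @-
```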
Note on audio: E2B and E4B support audio inputs natively, but audio support in llama.cpp is still being actively developed as of April 2026. Text and image inference are fully stable.
Troubleshooting common issues
"offloaded 0 layers" after build
The CUDA build was not linked correctly. Clean the build directory entirely and rebuild:
rm -rf llama.cpp/build
# Then repeat the cmake steps with -DGGML_CUDA=ON
Out of memory (OOM) at load time
Your total memory is below the model size even after quantization. Options:
- Switch to a smaller quantization (Q4_K_M → Q2_K, or UD-Q4_K_XL → Q4_K_M).
- Drop to a smaller model variant (31B → 26B-A4B, or 26B-A4B → E4B).
- Add --n-gpu-layers N with a lower N to offload fewer layers to VRAM; the rest stays in system RAM and runs at reduced speed.
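For the last option, a first guess for N can be derived from how much of the model fits in VRAM. The sketch below is a rough heuristic, not something llama.cpp provides: estimate_gpu_layers is a hypothetical helper, it assumes all layers are the same size, and it reserves a default 2 GB for the KV cache and buffers. The total layer count is the N printed in the "offloaded X/N layers" log line.

```shell
# estimate_gpu_layers: rough first guess for --n-gpu-layers.
#   $1 = total layers in the model (the N in "offloaded X/N layers")
#   $2 = model file size in GB
#   $3 = available VRAM in GB
#   $4 = GB to reserve for KV cache and buffers (default 2, an assumption)
# Hypothetical heuristic; assumes equally sized layers. Lower the result
# further if loading still runs out of memory.
estimate_gpu_layers() {
  local total_layers=$1 model_gb=$2 vram_gb=$3 reserve_gb=${4:-2}
  awk -v n="$total_layers" -v m="$model_gb" -v v="$vram_gb" -v r="$reserve_gb" 'BEGIN {
    usable = v - r
    if (usable < 0) usable = 0
    layers = int(n * usable / m)
    if (layers > n) layers = n
    print layers
  }'
}

# Usage (60 layers is an illustrative number, not a Gemma 4 specification):
#   ./llama.cpp/llama-cli -m your-model.gguf --n-gpu-layers "$(estimate_gpu_layers 60 18 12)"
```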
GGML_ASSERT or crash with --image-min-tokens
Do not pass --image-min-tokens with Gemma 4. This flag conflicts with Gemma 4's non-causal attention architecture and causes an assertion failure. Use the default image token budget.
Generation loops or repeats
Add --repeat-penalty 1.05 to break out of repetition loops. Keep it at 1.0 (disabled) in normal operation — Gemma 4's architecture does not require it by default.
Slow generation on macOS despite Metal
Confirm the binary is using Metal:
./llama.cpp/llama-cli -m your-model.gguf -p "hi" -n 1 --verbose
Look for Metal in the backend line. If you see CPU only, set --n-gpu-layers 99 explicitly to force offload.
FAQ
Does llama.cpp officially support Gemma 4?
Yes. Gemma 4 support was included at launch on April 2, 2026, with contributions tracked in the llama.cpp repository. All four model sizes work with llama-cli, llama-server, and llama-mtmd-cli.
Can I run Gemma 4 on a Mac mini?
Yes. A Mac mini M4 with 16 GB unified memory can run E4B at Q8_0 comfortably. 26B-A4B at Q4 needs ~16–18 GB, so on 16 GB it runs only with paging and at reduced speed. The M4 Pro (24 GB) runs 26B-A4B at Q4 comfortably, while Q8_0 (~28–30 GB) does not fit.
Do I need a GPU?
No. llama.cpp runs on CPU only. GPU offload (CUDA or Metal) significantly improves tokens-per-second, but CPU inference is fully supported and practical for smaller models like E2B and E4B.
What is the difference between Q4_K_M and UD-Q4_K_XL?
Q4_K_M is standard llama.cpp 4-bit quantization. UD-Q4_K_XL is Unsloth's Dynamic 4-bit format, which applies higher precision to the most important layers and lower precision to less critical ones. In practice, UD-Q4_K_XL is higher quality at a similar file size.
How do I use Gemma 4 with coding agents like Cursor or Continue?
Start llama-server on port 8080 (or any port), then point your agent's OpenAI base URL to http://localhost:8080/v1. The /v1/chat/completions endpoint is fully OpenAI-compatible.
Next steps
Once text inference is stable, the natural next steps are:
- Try the 26B-A4B for a significant quality jump over E4B, provided you have the ~16–18 GB it needs at 4-bit.
- Experiment with multimodal inputs using llama-mtmd-cli if your model is 26B-A4B or smaller.
- Compare llama.cpp with Ollama if you want a simpler day-to-day workflow and are comfortable trading some control for convenience.
The most common mistake is downloading the largest available model before confirming the smaller one runs well. A stable E4B setup is more useful than a 31B setup that runs at 1 token per second.
