
Gemma 4 Q4 vs Q8: Which Quantization to Actually Download


When you download a Gemma 4 GGUF, you are choosing a compression level. The number in the filename (Q4, Q5, Q8) is the nominal number of bits stored per model weight. Fewer bits mean a smaller file, less RAM, and some quality loss. More bits mean a larger file, more RAM, and output closer to the full-precision model.
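As a rough sanity check on the bits-per-weight idea, weight storage is approximately parameter count times bits, divided by 8. The sketch below is a back-of-envelope estimate only: it assumes one uniform bit width and ignores the per-block scales and mixed-precision tensors that make real GGUF files somewhat larger. The function name and the 7-billion-parameter figure are illustrative and not taken from any Gemma 4 release.

```python
def estimate_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes).

    Assumes every weight uses the same number of bits, which real
    GGUF quantizations only approximate.
    """
    if n_params < 0 or bits_per_weight <= 0:
        raise ValueError("n_params must be >= 0 and bits_per_weight > 0")
    return n_params * bits_per_weight / 8 / 1e9


# Hypothetical 7-billion-parameter model at three bit widths.
for bits in (4, 8, 16):
    print(f"{bits:>2}-bit: ~{estimate_weight_gb(7e9, bits):.1f} GB")
```

Halving the bit width halves the estimate, which is why Q8 files come out at roughly twice the size of Q4 files.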

The right choice for most people: start with Q4_K_M. Move to Q5_K_M if you want noticeably better output for reasoning or coding and your hardware has room. Move to Q8 only if you have confirmed that Q4 is not good enough for your task and memory is not a constraint.

There is also a newer option — QAT — that changes this calculus. More on that below.

The GGUF naming system

On Hugging Face, model files follow a pattern like Q4_K_M, Q5_K_S, Q8_0. What each part means:

  • Q4 = 4-bit quantization (4 bits stored per model weight)
  • K = K-quant format: mixed precision that keeps more sensitive layers at higher precision
  • M = medium variant (S is smaller, L is larger within the K-quant family)
  • Q8_0 = 8-bit legacy format: one scale per block of weights and no offset (that is what the _0 suffix indicates)
  • Q4_0 = 4-bit legacy format with the same scale-only layout (worse than Q4_K_M at a similar size)

The most important thing here: Q4_0 and Q4_K_M are not equivalent. K-quant formats use mixed precision across layer types. In practice, Q4_K_M produces noticeably better output than Q4_0 at a similar file size. For standard post-training quantization, always pick Q4_K_M when you have a choice between the two. QAT releases, covered later in this guide, are a separate case.
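The naming rules above can be turned into a small parser. This is a minimal sketch that only recognizes the two patterns described in this section (K-quants such as Q4_K_M and legacy tags such as Q8_0); it deliberately returns None for anything else, including IQ-style tags. The class name, function name, and example filenames are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class QuantTag(NamedTuple):
    bits: int               # nominal bits per weight
    k_quant: bool           # True for K-quant formats
    variant: Optional[str]  # "S"/"M"/"L"/... for K-quants, "0"/"1" for legacy


# Q<bits>_K_<variant> or Q<bits>_<digit>, not embedded in a longer token.
_TAG = re.compile(
    r"(?<![A-Za-z0-9])Q(\d)_(?:K_([A-Z]+)|(\d))(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def parse_quant_tag(filename: str) -> Optional[QuantTag]:
    """Extract the quantization tag from a GGUF filename, if one is present."""
    m = _TAG.search(filename)
    if m is None:
        return None
    bits = int(m.group(1))
    if m.group(2) is not None:
        return QuantTag(bits, True, m.group(2).upper())
    return QuantTag(bits, False, m.group(3))


# Hypothetical filenames, for illustration only.
for name in ("example-model-Q4_K_M.gguf", "example-model-Q8_0.gguf"):
    print(name, "->", parse_quant_tag(name))
```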

Memory requirements

The figures below are Google's official estimates, which include roughly 20% overhead. Unsloth's practical measurements put the 26B A4B Q4 load closer to 18 GB.

Model             Q4_K_M         Q8_0       BF16
Gemma 4 E2B       ~2.9 GB        ~5.7 GB    ~11.4 GB
Gemma 4 E4B       ~4.5 GB        ~8.9 GB    ~17.9 GB
Gemma 4 12B       ~6.7 GB        ~13.4 GB   ~26.7 GB
Gemma 4 26B A4B   ~14.4–18 GB    ~28 GB     ~52–58 GB
Gemma 4 31B       ~17.5 GB       ~34.9 GB   ~69.9 GB

These are model-load estimates. Add KV cache on top (grows with context length and batch size). For long-context use, KV cache memory can exceed model weight memory.
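To get a feel for how the KV cache grows, the usual estimate for standard attention is 2 (keys and values) × layers × KV heads × head dimension × context length × batch size × bytes per element. The sketch below applies that formula. The architecture numbers in the example are hypothetical placeholders, not Gemma 4's actual configuration, and models that use sliding-window or other cache-saving attention in some layers will need less than this estimate.

```python
def kv_cache_gb(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # 2 bytes for a 16-bit cache
) -> float:
    """Estimate KV cache size in GB for full attention in every layer."""
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len * batch_size
    return elems * bytes_per_elem / 1e9


# Hypothetical architecture: 32 layers, 8 KV heads, head dimension 128.
for ctx in (8_192, 32_768, 131_072):
    print(f"context {ctx:>7}: ~{kv_cache_gb(32, 8, 128, ctx):.1f} GB")
```

The estimate scales linearly with context length and batch size, which is how a long-context session can end up needing more memory for the cache than for the weights.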

Where the quality difference actually appears

Research on quantization quality is consistent: casual chat, summarization, and extraction are highly resilient to quantization. The perplexity difference between Q4_K_M and Q8 on conversational tasks is typically in the hundredths of a point, which is not perceptible in normal use.
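For reference, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The sketch below assumes you already have per-token natural-log probabilities for the same evaluation text from each quantization; how you obtain them depends on your runtime. The two lists are toy numbers chosen for illustration, not measurements of any model.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy log-probabilities for the same four tokens under two quantizations.
q8_logprobs = [-1.20, -0.45, -2.10, -0.80]
q4_logprobs = [-1.22, -0.46, -2.13, -0.81]

print(f"Q8 perplexity: {perplexity(q8_logprobs):.3f}")
print(f"Q4 perplexity: {perplexity(q4_logprobs):.3f}")
```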

The gap becomes visible in tasks where precision accumulates across many steps:

  • Multi-step reasoning chains (quantization error compounds at each step)
  • Complex code generation and refactoring (precise token predictions matter more)
  • Math-heavy tasks
  • Long-context work where earlier context must influence a later conclusion precisely
  • Structured output where every field must follow a strict schema
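One way to see why small per-step losses matter on long chains is a toy model in which every step must succeed and each step succeeds independently with the same probability. Real models do not behave this neatly, and the two per-step accuracies below are invented for illustration, but the shape of the curve is the point.

```python
def chain_success(per_step_accuracy: float, n_steps: int) -> float:
    """Probability that all n independent steps succeed (toy model)."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if n_steps < 0:
        raise ValueError("n_steps must be >= 0")
    return per_step_accuracy ** n_steps


# Invented per-step accuracies: a one-point gap per step.
for steps in (1, 10, 30):
    high = chain_success(0.99, steps)
    low = chain_success(0.98, steps)
    print(f"{steps:>2} steps: {high:.2f} vs {low:.2f}")
```

Under this simplification, a gap that is invisible on a single step widens as the chain gets longer, which matches where the list above says quantization differences tend to show up.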

For the majority of local use cases — chat, document Q&A, writing assistance, simple coding help — Q4_K_M is genuinely sufficient. If you are running a coding agent or a complex reasoning pipeline, it is worth testing Q8 before committing.

The underrated middle option: Q5_K_M

Q5_K_M sits between Q4 and Q8 and is often the right answer when:

  • Your system has memory headroom beyond what Q4 needs
  • You are doing coding or reasoning work where Q4 occasionally feels unreliable
  • You do not want the full 2× memory hit of Q8

For example: on a 32 GB system running 26B A4B, Q5_K_M uses roughly 20–22 GB and delivers noticeably better output than Q4_K_M for a manageable memory increase. Q8 would require ~28 GB, leaving little room for context.

If Q4 barely fits on your system, Q5 will not. But if you have comfortable headroom, Q5_K_M is worth considering before jumping directly to Q8.
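The headroom arithmetic behind that example is simple enough to script. This sketch uses only the load figures quoted in this guide for 26B A4B (with the midpoint of the 20–22 GB range for Q5_K_M); the function name and the optional reserved_gb parameter for the operating system and other applications are assumptions added for illustration.

```python
def context_headroom_gb(
    system_gb: float, model_load_gb: float, reserved_gb: float = 0.0
) -> float:
    """Memory left for the KV cache after loading weights.

    A negative result means the model does not fit at all.
    """
    return system_gb - model_load_gb - reserved_gb


# Load figures quoted in this guide for Gemma 4 26B A4B.
LOADS_GB = {"Q4_K_M": 18.0, "Q5_K_M": 21.0, "Q8_0": 28.0}

for quant, load in LOADS_GB.items():
    left = context_headroom_gb(32.0, load)
    print(f"{quant}: ~{left:.0f} GB left on a 32 GB system")
```

Set reserved_gb to whatever your operating system and other applications normally occupy to get a more realistic number.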

Which file to download for your hardware

Your setup            Start with
8 GB RAM laptop       E2B Q4_K_M, or E4B Q4_K_M if it fits
16 GB system          E4B Q4_K_M
24 GB GPU             26B A4B Q4_K_M
32 GB system          26B A4B Q4_K_M comfortably; try Q5_K_M if it fits
48 GB+                26B A4B Q8, or 31B Q4_K_M
64 GB+ workstation    31B Q8, or 26B A4B Q8

If the model barely fits at Q4, do not force Q8. Choose a smaller model at Q5 or Q6 instead. A properly sized model under no memory pressure consistently outperforms a larger model that is constantly swapping or running near its limit.

Gemma 4 QAT: the option that changes the math

Google released Quantization-Aware Training (QAT) versions of Gemma 4 on June 5, 2026. QAT models are trained with quantization simulation built into the training loop — the model learns to compensate for precision loss rather than having compression applied after the fact.

The result: a QAT Q4 model performs noticeably better than a standard post-training Q4 model of the same size, sometimes approaching standard Q8 quality.
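To make "quantization simulation" concrete, here is a minimal sketch of the forward-pass half of the idea, often called fake quantization: weights are rounded onto a low-bit integer grid and immediately converted back to floats, so the rest of the network sees the rounding error during training. This is a generic illustration, not Google's recipe. It uses one scale for the whole list, whereas real formats use per-block scales, and real QAT also needs a way to pass gradients through the rounding step (commonly a straight-through estimator), which is not shown.

```python
from typing import List


def fake_quantize(weights: List[float], bits: int = 4) -> List[float]:
    """Round-trip weights through a symmetric integer grid and back to floats."""
    if bits < 2:
        raise ValueError("bits must be at least 2")
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in weights]
    scale = max_abs / qmax  # single scale for the whole list (simplification)
    out = []
    for w in weights:
        q = max(-qmax - 1, min(qmax, round(w / scale)))  # clamp to integer range
        out.append(q * scale)  # dequantize back to float
    return out


# Illustrative weights only.
original = [0.70, -0.35, 0.12, 0.0]
for w, fq in zip(original, fake_quantize(original, bits=4)):
    print(f"{w:+.3f} -> {fq:+.3f} (error {fq - w:+.3f})")
```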

For GGUF use, there are two relevant paths:

  1. Google's official QAT GGUF (Q4_0 format): Available directly on Hugging Face under google/gemma-4-*-it-qat-q4_0-gguf. Note that naive conversion of the QAT checkpoint to llama.cpp's Q4_0 format loses some of the QAT quality benefit.

  2. Unsloth's UD-Q4_K_XL GGUFs: Unsloth applied their dynamic method to the QAT checkpoints and recovered 8–15 percentage points of top-1 accuracy compared to naive conversion, while also producing smaller files. Their files are named UD-Q4_K_XL and are published at unsloth/gemma-4-*-it-qat-GGUF.

If you are comparing standard Q4_K_M against Unsloth's QAT UD-Q4_K_XL: the QAT version is better at the same memory footprint. It is the first thing to try for 4-bit inference.
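If you script your downloads, that preference order can be encoded directly. The sketch below picks the first file in a repository listing that matches a list of preferred tags, then fetches it with huggingface_hub's list_repo_files and hf_hub_download. The helper names are hypothetical, the matching is a plain substring check, and you must supply the full repo_id yourself, since the repository patterns above contain a wildcard for the model size that this sketch does not fill in. Multi-part GGUF files are not handled.

```python
from typing import Iterable, Optional, Sequence


def pick_gguf(filenames: Iterable[str], preferred_tags: Sequence[str]) -> Optional[str]:
    """Return the first .gguf file matching the earliest preferred tag."""
    ggufs = [f for f in filenames if f.lower().endswith(".gguf")]
    for tag in preferred_tags:
        for name in ggufs:
            if tag.lower() in name.lower():
                return name
    return None


def download_preferred(repo_id: str, preferred_tags: Sequence[str]) -> str:
    """Download the preferred GGUF from a Hugging Face repo; returns the local path."""
    # Imported lazily so pick_gguf works without the package installed.
    from huggingface_hub import hf_hub_download, list_repo_files

    name = pick_gguf(list_repo_files(repo_id), preferred_tags)
    if name is None:
        raise FileNotFoundError(f"no GGUF matching {list(preferred_tags)} in {repo_id}")
    return hf_hub_download(repo_id=repo_id, filename=name)


# Hypothetical listing, for illustration only.
listing = ["README.md", "example-Q8_0.gguf", "example-UD-Q4_K_XL.gguf"]
print(pick_gguf(listing, ["UD-Q4_K_XL", "Q4_K_M"]))
```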

IQ4_XS: the size-optimized alternative

IQ4_XS is one of llama.cpp's i-quant formats. It is usually built with an importance matrix (imatrix): calibration data that identifies the most critical weights so they can be preserved more accurately within a smaller overall file. When properly calibrated, it can match Q4_K_M quality at roughly 9–10% smaller size. Look for files tagged "imatrix" from trusted publishers.

This is a secondary optimization. Start with Q4_K_M (or QAT) from a known publisher before hunting for imatrix versions.

What to avoid

Q3 and Q2: Quality degrades sharply below Q4 for most tasks. Arithmetic reasoning in particular shows a measurable accuracy cliff. Avoid unless you have a very specific memory-constrained reason.

Q8 "just to be safe": Q8 files are roughly 2× the size of Q4. If you are not sure whether Q8 helps your use case, test Q4 first and upgrade only if the output is not good enough.

Obscure publisher GGUFs: Stick to ggml-org, unsloth, bartowski, or mradermacher. Unknown publishers may produce GGUFs with incorrect quantization, wrong tokenizer configuration, or other issues that manifest as strange model behavior rather than obvious errors.

FAQ

Is Q8 always better than Q4?
Better in isolation, yes. But if Q8 forces your system to swap memory constantly, generation slows dramatically, and Q4 with comfortable headroom will give you a faster, more consistent experience. The best quantization is the one your hardware can run without pressure.

Should I use QAT or standard quantization?
If a QAT GGUF from Unsloth or Google is available for your model size, it is the better choice at the 4-bit level. QAT specifically trains the model to hold up at 4-bit precision.

What is the difference between Q4_0 and Q4_K_M?
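Q4_K_M uses mixed precision across different layer types, keeping sensitive layers at higher precision. Q4_0 treats all layers uniformly at 4 bits. For standard post-training quantization, Q4_K_M is almost always better, so choose it over Q4_0 when available. QAT releases are a separate case, because the checkpoint was trained with 4-bit quantization in mind.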

Does quantization affect context window length?
Indirectly. Lower-precision weights use less RAM, leaving more room for the KV cache. A smaller quantization level can support longer effective contexts on the same hardware before you run out of memory.
