Gemma 4 A4B vs E4B: What the Names Actually Mean and Which to Run

The naming confuses almost everyone. Both models have "4B" in the name, but the "4B" means different things in each case, and the two models have completely different architectures. Here is what you actually need to know.

What E4B means

The "E" in E4B stands for effective parameters, not edge or efficient. Google uses a technique called Per-Layer Embeddings (PLE): each decoder layer has its own small embedding table that feeds a residual signal into that layer's computation. These tables are large on disk but cheap to compute, which is why the model behaves like a 4.5B-parameter model at runtime while the total weight count reaches roughly 8B with embeddings included.

The result is a compact model that carries more representational capacity than its effective parameter count suggests. E4B is designed for phones and laptops and targets the 8–16 GB RAM range.

E4B also supports audio input natively, which the larger 26B A4B does not. If audio is part of your use case, E4B is currently the largest local model that handles it.

Context window: 128K tokens.

What 26B A4B means

The "A" in 26B A4B stands for active parameters. The 26B A4B is a Mixture-of-Experts (MoE) model with approximately 25.2 billion total parameters, of which only around 3.8 billion are active during any single inference step. At runtime it is almost as fast as a 4B model, but it draws on a much larger set of learned expert weights.

This is why A4B feels much stronger than E4B on complex tasks: the model stores far more learned knowledge in its total weights, even though inference speed is similar. The memory cost, however, is the full 26B worth of weights sitting in RAM. You have to load all of them even though only a fraction activates per token.
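The split between stored and active weights can be illustrated with a toy top-k router. This is a minimal sketch of one common MoE routing variant, not Gemma 4's actual implementation: the sizes (`DIM`, `NUM_EXPERTS`, `TOP_K`), the single-matrix "experts", and the function names are all hypothetical and chosen only to keep the example small.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a short list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, x):
    # Plain matrix-vector product on nested lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]

def moe_forward(x, experts, router, top_k):
    # The router scores every expert, but only the top_k experts are evaluated.
    scores = matvec(router, x)
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    gates = softmax([scores[i] for i in chosen])
    out = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        y = matvec(experts[idx], x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen

rng = random.Random(0)
DIM, NUM_EXPERTS, TOP_K = 8, 16, 2  # hypothetical toy sizes, not Gemma 4's

# Every expert must exist in memory, because any of them may be chosen next token.
experts = [[[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
           for _ in range(NUM_EXPERTS)]
router = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
x = [rng.uniform(-1, 1) for _ in range(DIM)]

out, chosen = moe_forward(x, experts, router, TOP_K)
stored_params = NUM_EXPERTS * DIM * DIM  # what you pay for in RAM
active_params = TOP_K * DIM * DIM        # what you pay for in compute per step
print(f"experts used this step: {chosen}")
print(f"stored expert params: {stored_params}, used this step: {active_params}")
```

The chosen experts change from token to token, which is why the full set has to stay loaded even though each step touches only a small slice of it.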

Context window: 256K tokens. No native audio input.

Memory requirements

These numbers are from Google's official model overview, with 20% overhead assumed. The Unsloth documentation reports the 26B A4B Q4 load at roughly 18 GB in practice, which is higher than Google's baseline estimate.

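| Model           | Q4          | Q8       | BF16      |
|-----------------|-------------|----------|-----------|
| Gemma 4 E2B     | ~2.9 GB     | ~5.7 GB  | ~11.4 GB  |
| Gemma 4 E4B     | ~4.5 GB     | ~8.9 GB  | ~17.9 GB  |
| Gemma 4 12B     | ~6.7 GB     | ~13.4 GB | ~26.7 GB  |
| Gemma 4 26B A4B | ~14.4–18 GB | ~28 GB   | ~52–58 GB |
| Gemma 4 31B     | ~17.5 GB    | ~34.9 GB | ~69.9 GB  |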
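For a rough sanity check on figures like these, weight memory can be approximated from the total parameter count, the bits per weight, and the 20% overhead mentioned above. The sketch below is an approximation under those assumptions only; the function name `weight_memory_gb` is hypothetical, and real quantization formats often store more than a flat 4 or 8 bits per weight, so it will not reproduce the table exactly.

```python
def weight_memory_gb(total_params_billion, bits_per_weight, overhead=0.20):
    """Approximate load size in GB: parameters x bytes per weight, plus overhead."""
    bytes_total = total_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * (1 + overhead) / 1e9

# 26B A4B: memory follows the ~25.2B TOTAL parameters, not the ~3.8B active ones.
for label, bits in [("Q4", 4), ("Q8", 8), ("BF16", 16)]:
    print(f"26B A4B {label}: ~{weight_memory_gb(25.2, bits):.1f} GB")

# Sizing from active parameters alone would badly underestimate the load.
print(f"active-only (wrong) Q4 estimate: ~{weight_memory_gb(3.8, 4):.1f} GB")
```

The point of the last line is the gap between the two numbers: the active-only estimate is a small fraction of what actually has to be resident in memory.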

Add context-length overhead on top of these figures. The KV cache grows with the number of tokens in context, so long prompts increase memory use significantly.
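To get a feel for how context length translates into memory, the standard full-attention KV cache formula can be sketched as follows. The layer count, KV head count, and head dimension used here are hypothetical placeholders, not Gemma 4's real configuration, and the formula assumes every layer caches the full context. Models that use sliding-window attention on some layers need less than this.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Full-attention KV cache size in GB for one sequence.

    The leading 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes a 16-bit cache.
    """
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 1e9

# Hypothetical model shape, for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

for ctx in (8_192, 32_768, 131_072):
    size = kv_cache_gb(N_LAYERS, N_KV_HEADS, HEAD_DIM, ctx)
    print(f"{ctx:>7} tokens: ~{size:.2f} GB of KV cache")
```

Under these assumptions the cache scales linearly with context, so a 16x longer prompt costs 16x the cache memory on top of the model weights.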

Quality differences in practice

E4B is a capable model for chat, summarization, extraction, and simple agents. It is not a weak model — it uses PLE to punch above its weight class. But 26B A4B consistently outperforms E4B on tasks that require multi-step reasoning, complex coding, and long-document understanding.

The gap shows up most clearly when:

  • A coding task requires tracking many interdependencies across a large file
  • A reasoning task requires multiple inference steps before reaching a conclusion
  • A document is long enough that earlier context meaningfully affects a later conclusion
  • Structured outputs need precise instruction-following across many constraints

For casual chat, quick summaries, and prompt exploration, the practical difference is often small enough that E4B is the better choice simply because it runs faster and with less memory pressure.

Which model for your hardware

| Your machine       | Start here                                           |
|--------------------|------------------------------------------------------|
| 8 GB RAM laptop    | E2B Q4, or E4B Q4 if it fits comfortably             |
| 16 GB Mac or PC    | E4B Q4; 26B A4B is too tight at this memory level    |
| 24 GB GPU          | 26B A4B Q4 fits; this is the intended hardware tier  |
| 32 GB system       | 26B A4B Q4 comfortably; more room for context        |
| 48 GB+             | 26B A4B Q8, or 31B Q4                                |
| 64 GB+ workstation | 31B Q8, or compare 26B A4B Q8 vs 31B Q4              |

Do not try to run 26B A4B on a 16 GB system at Q4 unless you understand what you are accepting: the model load alone uses most or all of your RAM, and context plus runtime overhead will then push you into slow memory swapping.
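The hardware table can be turned into a small lookup if you want to script the choice. This is a minimal sketch: the thresholds and suggestions are copied from the table above, the function name `suggest_model` is hypothetical, and it simplifies by treating GPU memory and system RAM as one number.

```python
# (minimum memory in GB, suggestion), ordered from largest tier to smallest.
TIERS = [
    (64, "31B Q8, or compare 26B A4B Q8 vs 31B Q4"),
    (48, "26B A4B Q8, or 31B Q4"),
    (32, "26B A4B Q4, with more room for context"),
    (24, "26B A4B Q4"),
    (16, "E4B Q4"),
    (8, "E2B Q4, or E4B Q4 if it fits comfortably"),
]

def suggest_model(memory_gb):
    """Return the starting suggestion for the largest tier that memory_gb reaches."""
    for min_gb, suggestion in TIERS:
        if memory_gb >= min_gb:
            return suggestion
    return "below the smallest tier covered in this guide"

for mem in (8, 16, 24, 36, 64):
    print(f"{mem} GB -> {suggest_model(mem)}")
```

Treat the result as a starting point only, since context length and whatever else is running on the machine also compete for the same memory.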

Speed

Because only ~3.8B parameters are active per inference step, 26B A4B actually runs at roughly the speed of a 4B dense model despite having 26B total parameters. On the same hardware, it is typically faster than the dense 31B and significantly faster than any dense 26B would be.
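One way to see why active parameters drive speed is a common rule of thumb: token generation is often limited by how fast weights can be read from memory, so the ceiling on tokens per second is roughly memory bandwidth divided by the bytes read per token. The sketch below applies that rule under stated assumptions. The 400 GB/s bandwidth is a hypothetical figure, the function name is made up for illustration, and the estimate ignores KV cache reads, routing, and compute limits, so real throughput will be lower.

```python
def decode_ceiling_tokens_per_sec(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on tokens/sec if decoding is purely memory-bandwidth bound."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token

BANDWIDTH_GB_S = 400  # hypothetical memory bandwidth, not a measured value

moe = decode_ceiling_tokens_per_sec(3.8, 4, BANDWIDTH_GB_S)     # 26B A4B, ~3.8B active
dense = decode_ceiling_tokens_per_sec(31.0, 4, BANDWIDTH_GB_S)  # dense 31B reads every weight

print(f"26B A4B Q4 ceiling:   ~{moe:.0f} tokens/sec")
print(f"dense 31B Q4 ceiling: ~{dense:.0f} tokens/sec")
print(f"ratio: ~{moe / dense:.1f}x")
```

The absolute numbers depend entirely on the assumed bandwidth, but the ratio does not: under this model it is simply the ratio of weights read per token.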

E4B is faster still in wall-clock time, simply because it is a smaller model and loads faster.

Which to choose

If you are trying Gemma 4 for the first time and your machine has 8–16 GB of RAM: start with E4B Q4. It loads quickly, handles most everyday tasks well, and lets you learn whether Gemma 4 fits your workflow.

If you have a 24 GB GPU or more, and you need stronger reasoning, coding assistance, or long-context work: use 26B A4B Q4.

If quality is your top priority and memory is not a constraint: 31B is still the best model in the family.

The 26B A4B is not a compromise model. It is the recommended choice for local power users who have enough memory. E4B is the recommended choice for everyone on consumer laptops and phones.

FAQ

Does E4B have audio support?
Yes. E4B (and E2B and 12B) support audio input natively. 26B A4B and 31B do not.

Why does E4B need more memory than its parameter count suggests?
Because of Per-Layer Embeddings. The embedding tables add to disk size and memory footprint even though they do not count in the "effective" parameter number that Google advertises.

Why does 26B A4B say 26B if only ~3.8B are active?
Because the model has 26B total parameters stored across many expert networks. You load all of them into memory, but only a subset activates during each forward pass. That is how Mixture-of-Experts models work.

Can I run 26B A4B on a 16 GB machine?
Technically possible in some configurations, but not recommended. At Q4, the model load alone approaches your memory ceiling before accounting for context, KV cache, or runtime overhead. You will likely see slow performance from memory swapping.
