Gemma 4 Model Comparison: E2B vs E4B vs 12B vs 26B A4B vs 31B

Google released the first Gemma 4 models in early April 2026, then expanded the family with Gemma 4 12B in June 2026. The current family has five main sizes sharing one name, each making different tradeoffs around memory, speed, modality support, and reasoning quality. Picking the wrong one means downloading gigabytes you cannot run, or running something underpowered when your hardware could handle more.
This guide decodes the naming system, lays out the real differences, and gives you a clear decision path before you pull a single weight file.
What the Names Actually Mean
The Gemma 4 naming convention confuses almost everyone the first time. Here is what each prefix and suffix actually encodes.
E2B and E4B — "Effective" parameters, built for edge
The "E" stands for effective parameters. E2B has 2.3 billion effective parameters during inference, but its total parameter count is 5.1 billion. E4B works the same way. The gap exists because Google uses a technique called Per-Layer Embeddings (PLE): each decoder layer carries its own small embedding table that feeds a residual signal into that layer's computation. Those tables are large on disk but cheap to compute, which is why the model behaves like a 2B at runtime while technically weighing more. The result is a model sized for phones and laptops that carries more representational depth than the parameter count suggests.
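The "large on disk, cheap to compute" point can be made concrete with a toy sketch. It follows only the description above: one lookup table per decoder layer, with the looked-up row added to the hidden state as a residual. The function names and the tiny dimensions are hypothetical, and Google's actual PLE implementation may differ in its details.

```python
def ple_parameter_counts(num_layers, vocab_size, ple_dim):
    """Return (parameters stored, parameters read per token) for the lookup tables."""
    stored = num_layers * vocab_size * ple_dim  # every row of every table lives in memory
    read_per_token = num_layers * ple_dim       # but one token touches one row per layer
    return stored, read_per_token


def apply_ple(hidden, token_id, layer_table):
    """Add the token's row from this layer's table to the hidden state (a residual)."""
    row = layer_table[token_id]  # a table lookup, not a matrix multiply
    return [h + r for h, r in zip(hidden, row)]


# Hypothetical toy sizes, chosen only to keep the numbers readable.
stored, read = ple_parameter_counts(num_layers=4, vocab_size=1000, ple_dim=8)
print(stored, read)
```

The stored count grows with the vocabulary size while the per-token count does not, which is the gap between total and effective parameters in miniature.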
12B — Unified multimodal middle ground
The June 2026 addition is Gemma 4 12B, a unified multimodal model that replaces separate vision and audio encoders with direct input projections. In practical terms, 12B fills the gap between the edge-focused E models and the workstation-focused 26B/31B models: it is much easier to host than the largest models, but it gives teams a stronger audio/video-capable option than E4B.
26B A4B — "Active" parameters, MoE architecture
The "A" stands for active parameters. The 26B A4B is a Mixture-of-Experts (MoE) model with 25.2 billion total parameters but only 3.8 billion active during any single inference step. Google built this model with 128 small experts, activating 8 plus one shared always-on expert per token. The practical result: it runs almost as fast as a 4B dense model, but produces quality much closer to the 31B. The "26B" tells you storage requirements; the "A4B" tells you compute cost.
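The routing described above (128 experts, 8 selected per token, plus one shared expert) can be sketched as a top-k selection over router scores. This is an illustrative sketch, not Google's implementation: the function name, the placeholder router scores, and the choice to apply a softmax over only the selected scores are all assumptions.

```python
import math


def route_token(router_logits, top_k=8):
    """Pick the top_k experts for one token and normalise their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Assumption: softmax over the selected logits only.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


NUM_EXPERTS = 128  # from the text
# Deterministic stand-in for a learned router's output on one token.
logits = [math.sin(i) for i in range(NUM_EXPERTS)]

routed = route_token(logits)
experts_run = len(routed) + 1  # the shared expert always runs as well
active_fraction = 3.8 / 25.2   # active vs total parameters, from the text
print(experts_run, round(active_fraction, 2))
```

Only 9 of the 128 experts do any work for a given token, which is why compute tracks the "A4B" figure while memory tracks the "26B" figure.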
31B — Dense, no tricks
Every parameter fires on every forward pass. You pay the full compute bill, but you get the simplest behavior, the highest quality ceiling in the family, and the cleanest base for fine-tuning.
The Five Models at a Glance
| | E2B | E4B | 12B | 26B A4B | 31B |
|---|---|---|---|---|---|
| Architecture | Dense / PLE edge | Dense / PLE edge | Unified multimodal | Mixture-of-Experts | Dense |
| Effective / Active params | ~2.3B | ~4B | 12B | ~3.8B active | 30.7B |
| Total params | 5.1B | ~9B | 12B | 25.2B | 30.7B |
| Context window | 128K | 128K | 256K | 256K | 256K |
| Audio input | Yes | Yes | Yes | No | No |
| Image / Video input | Yes | Yes | Yes | Yes | Yes |
| Target hardware | Phone / IoT | Laptop | Mid-range GPU / server | Consumer GPU | H100 / High-end GPU |
| Memory (Q4) | ~2.9 GB | ~4.5 GB | ~6.7 GB | ~14.4 GB | ~17.5 GB |
| Memory (8-bit / BF16) | ~5.7 / 11.4 GB | ~8.9 / 17.9 GB | ~13.4 / 26.7 GB | ~28.8 / 57.7 GB | ~34.9 / 69.9 GB |
| Primary role | Smallest edge option | Stronger edge option | Balanced multimodal middle | Efficient high-end option | Highest dense quality |
Memory figures are approximate official planning values for loading model weights. Real usage varies with context length, quantization method, runtime, and system overhead.
Benchmark Numbers
All scores below are from Google's official Gemma 4 model card and refer to the instruction-tuned variants unless noted. The benchmarks include AIME 2026, LiveCodeBench v6, and MMLU Pro, which are newer versions than the tests used for Gemma 3, so direct generational comparisons should be read as directional.
31B Dense
| Benchmark | Score |
|---|---|
| AIME 2026 (math) | 89.2% |
| LiveCodeBench v6 (coding) | 80.0% |
| GPQA Diamond (science reasoning) | 84.3% |
| MMLU Pro (knowledge) | 85.2% |
| MMMU Pro (vision) | 76.9% |
| MATH-Vision | 85.6% |
| Codeforces ELO | 2,150 |
| Multi-needle retrieval (long context) | 66.4% |
For context: Gemma 3 27B scored 20.8% on AIME and 29.1% on LiveCodeBench. The improvement is generational, not incremental.
26B A4B (MoE)
| Benchmark | Score |
|---|---|
| AIME 2026 | 88.3% |
| LiveCodeBench v6 | 77.1% |
| GPQA Diamond | 82.3% |
| MMLU Pro | 82.6% |
The 26B A4B achieves roughly 97% of the dense 31B model's quality while activating only 3.8B parameters per token — about 8× less compute per inference step. On the LMArena leaderboard it scores 1441 Elo versus 1452 for the 31B, a gap that will be invisible in most real-world tasks.
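The "roughly 97%" and "about 8×" figures can be checked against the two tables above. The sketch below uses only scores and parameter counts quoted in this guide; treating the unweighted mean of per-benchmark ratios as "quality" is an assumption made for illustration, not a method stated by Google.

```python
DENSE_31B = {"AIME 2026": 89.2, "LiveCodeBench v6": 80.0,
             "GPQA Diamond": 84.3, "MMLU Pro": 85.2}
MOE_26B_A4B = {"AIME 2026": 88.3, "LiveCodeBench v6": 77.1,
               "GPQA Diamond": 82.3, "MMLU Pro": 82.6}

# Score of the MoE model as a fraction of the dense model, per benchmark.
ratios = {name: MOE_26B_A4B[name] / DENSE_31B[name] for name in DENSE_31B}
mean_ratio = sum(ratios.values()) / len(ratios)

# Parameters used per token: 30.7B dense vs 3.8B active.
compute_ratio = 30.7 / 3.8

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%}")
print(f"mean: {mean_ratio:.1%}, compute ratio: {compute_ratio:.1f}x")
```

The per-benchmark ratios differ, so the single headline percentage is best read as a summary rather than a guarantee for any one task.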
E4B
| Benchmark | Score |
|---|---|
| AIME 2026 | 42.5% |
| LiveCodeBench v6 | 52.0% |
| MMLU Pro | 69.4% |
| MMMU Pro (vision) | 52.6% |
Strong for a model that runs on a T4 GPU or MacBook Air. The reasoning gap versus the workstation models is real, but E4B handles OCR, image grounding, and coding assistance at a level that justifies its position in an edge deployment.
E2B
| Benchmark | Score |
|---|---|
| AIME 2026 | 37.5% |
| LiveCodeBench v6 | 44.0% |
| MMLU Pro | 60.0% |
| MMMU Pro (vision) | 44.2% |
E2B is the floor of the family. It works on phones and Raspberry Pi-class hardware. Google's own tests show Gemma 4 E2B running on a Raspberry Pi 5 via LiteRT-LM at around 7.6 tokens per second decode speed — slow but functional for edge agent workflows.
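To judge whether that decode speed is workable for a given task, it helps to turn it into wall-clock time. The helper below is a hypothetical convenience function; it uses the 7.6 tokens per second figure quoted above and ignores prompt processing time, which is an assumption.

```python
def decode_seconds(num_tokens, tokens_per_second=7.6):
    """Estimate decode time for a response, ignoring prompt processing."""
    return num_tokens / tokens_per_second


for tokens in (50, 200, 1000):
    print(f"{tokens} tokens: ~{decode_seconds(tokens):.0f} s")
```

Short tool-call style outputs stay in the range of seconds, while long-form generation quickly runs into minutes.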
The Key Differences That Actually Matter
Audio is not a family-wide feature
E2B, E4B, and 12B support native audio input. The 26B A4B and 31B remain text-plus-visual models without native audio input. If your use case requires speech or audio understanding, the choice now starts with E2B, E4B, or 12B rather than only the two smallest models.
Context length splits the family in two
E2B and E4B top out at 128K tokens. The 12B, 26B A4B, and 31B reach 256K. This matters more than the raw number suggests. Gemma 3's 128K context was mostly theoretical: retrieval reliability broke down at long ranges. Gemma 4's 256K context is functional. Multi-needle retrieval rose from 13.5% for Gemma 3 to 66.4% for the Gemma 4 31B, meaning the model can actually find and reason over information buried deep in a long document, not just accept it.
MoE vs Dense is a speed and fine-tuning tradeoff
The 26B A4B runs at roughly the speed of a 4B dense model during inference because only 3.8B parameters activate per token. For agentic workflows where you are generating hundreds of tokens across many tool calls, that speed advantage compounds significantly. The 31B Dense is slower but offers more predictable behavior and is the stronger candidate for fine-tuning — every layer fires every time, which simplifies gradient flow during training.
Video support has hard limits
All five current models can process video, but video is handled as a sequence of frames at one frame per second, capped at 60 seconds. This is useful for short clips, UI recordings, or summarizing a short demo — not for real-time video analysis or long-form content.
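The one-frame-per-second, 60-second limit translates directly into how many frames of a clip the model can see. The sketch below is a hypothetical helper, not part of any Gemma tooling; taking the frame at the start of each second is an assumption.

```python
def sample_timestamps(duration_s, fps=1.0, max_seconds=60.0):
    """Return the timestamps (in seconds) of the frames that would be sampled."""
    usable = min(duration_s, max_seconds)  # anything past the cap is dropped
    count = int(usable * fps)
    return [i / fps for i in range(count)]


short_clip = sample_timestamps(12)    # a 12-second UI recording
long_clip = sample_timestamps(300)    # a 5-minute video
print(len(short_clip), len(long_clip))
```

A five-minute video yields the same number of frames as a one-minute one, so longer content needs to be split into clips or trimmed before it is sent to the model.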
Knowledge cutoff is January 2025
Gemma 4's pretraining data cuts off in January 2025. A 256K context window does not change this. For domains that have changed since then, you need retrieval augmentation or tool access rather than relying on the model's internal knowledge.
Hardware Requirements
These are approximate values for loading the model weights at each precision. Read them as total memory: RAM + VRAM combined on unified-memory systems (Apple Silicon, integrated setups), or available VRAM on discrete GPU setups.
| Model | Q4 | 8-bit | BF16 |
|---|---|---|---|
| E2B | ~2.9 GB | ~5.7 GB | ~11.4 GB |
| E4B | ~4.5 GB | ~8.9 GB | ~17.9 GB |
| 12B | ~6.7 GB | ~13.4 GB | ~26.7 GB |
| 26B A4B | ~14.4 GB | ~28.8 GB | ~57.7 GB |
| 31B | ~17.5 GB | ~34.9 GB | ~69.9 GB |
Practical translation:
- Phone or Raspberry Pi — E2B via LiteRT-LM or AI Edge Gallery
- MacBook Air (8 GB unified memory) — E4B at 4-bit runs comfortably
- Mid-range GPU / 16 GB unified memory — 12B at Q4 becomes the balanced audio/video option
- Laptop or desktop with 16 GB RAM — 26B A4B at 4-bit is the target, but only at the margin: the weights alone take ~14.4 GB before context overhead
- RTX 3090 / RTX 4090 (24 GB VRAM) — 26B A4B runs fully with 256K context; 31B at 4-bit is feasible
- NVIDIA H100 (80 GB) — 31B at full BF16 precision, no quantization needed
- NVIDIA DGX Spark (128 GB unified) — 31B at BF16 with headroom
One thing worth flagging: the 26B A4B's 25.2B total parameters still need to live in memory even though only 3.8B activate per step. You pay for storage once when loading; you pay for 3.8B worth of compute at each token. Budget for the former when sizing hardware.
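The sizing logic in this section can be sketched as a lookup against the table above. The memory values are copied from that table; the function name and the 10% headroom factor for context and runtime overhead are assumptions, so adjust the factor for your own runtime and context length.

```python
# Approximate weight memory in GB, copied from the hardware table above.
WEIGHT_MEMORY_GB = {
    "E2B":     {"q4": 2.9,  "int8": 5.7,  "bf16": 11.4},
    "E4B":     {"q4": 4.5,  "int8": 8.9,  "bf16": 17.9},
    "12B":     {"q4": 6.7,  "int8": 13.4, "bf16": 26.7},
    "26B A4B": {"q4": 14.4, "int8": 28.8, "bf16": 57.7},
    "31B":     {"q4": 17.5, "int8": 34.9, "bf16": 69.9},
}


def models_that_fit(available_gb, precision="q4", headroom=1.1):
    """List models whose weights, plus an assumed overhead factor, fit in memory."""
    return [name for name, sizes in WEIGHT_MEMORY_GB.items()
            if sizes[precision] * headroom <= available_gb]


print(models_that_fit(16))            # 16 GB laptop at 4-bit
print(models_that_fit(24, "int8"))    # 24 GB GPU at 8-bit
print(models_that_fit(80, "bf16"))    # 80 GB H100 at BF16
```

Note that the 26B A4B is sized by its full 26B-class weight footprint here, not by its 3.8B active parameters, which is exactly the storage-versus-compute distinction described above.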
Which Model Should You Run?
Choose E2B if: you are building on-device mobile applications, IoT agents, or anything that needs to run on a phone without a network connection. Also the right call if you need audio input and have severely limited memory.
Choose E4B if: you want audio input support with noticeably better reasoning than E2B, and you have an 8–16 GB laptop or mid-range GPU. This is the default edge choice for most developers who are not RAM-constrained to the absolute floor.
Choose 12B if: you want a balanced multimodal model with audio and video support, 256K context, and a much lower memory footprint than the 26B A4B or 31B. It is the new middle path for teams that need more capability than E4B without jumping straight to workstation-class models.
Choose 26B A4B if: you have a consumer GPU with 16–24 GB of memory and want near-31B quality with faster inference. This is the sweet spot for local agentic workflows, coding assistants, and document processing where speed matters. It is also the right pick for any deployment where latency affects user experience.
Choose 31B if: you want the highest quality output in the family, you are planning to fine-tune, or you are running on hardware that can handle it comfortably. Do not default to 31B just because more parameters sounds better — the 26B A4B is close enough in quality that many users will not notice the difference in practice.
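The decision path above can be condensed into a small function. This is one possible reading of the guidance, not an official selector: the function name, its parameters, and the memory thresholds (8, 16, and 24 GB, taken loosely from the ranges mentioned in this section) are assumptions.

```python
def choose_model(memory_gb, needs_audio=False, prioritize_quality=False):
    """Return a Gemma 4 size name for the given constraints (illustrative thresholds)."""
    if needs_audio:
        # Only E2B, E4B, and 12B accept native audio input.
        if memory_gb >= 16:
            return "12B"
        if memory_gb >= 8:
            return "E4B"
        return "E2B"
    if prioritize_quality and memory_gb >= 24:
        return "31B"  # highest quality and the fine-tuning candidate
    if memory_gb >= 16:
        return "26B A4B"  # near-31B quality with faster inference
    if memory_gb >= 8:
        return "E4B"
    return "E2B"


print(choose_model(12, needs_audio=True))
print(choose_model(24))
print(choose_model(24, prioritize_quality=True))
```

The audio check comes first because it removes the two largest models from consideration regardless of how much memory is available.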
Where to Access Gemma 4
- Google AI Studio — hosted access for selected Gemma 4 variants, no local setup required
- Google AI Edge Gallery — mobile-oriented testing for smaller models
- Hugging Face — official Google repositories for E2B, E4B, 12B, 26B A4B, and 31B
- Ollama — ollama run gemma4:e4b, gemma4:26b, etc.
- LM Studio — GUI-based local setup for 26B and 31B
- llama.cpp — cross-platform CPU/GPU inference
- MLX — Apple Silicon optimized inference
All weights are Apache 2.0 licensed — no MAU limits, no usage restrictions, commercial use permitted without additional terms.
FAQ
What does "A4B" mean in Gemma 4 26B A4B? The "A" stands for active parameters. The 26B A4B is a Mixture-of-Experts model with 25.2 billion total parameters, but only 3.8 billion activate per inference step. It runs like a 4B model in terms of compute while delivering quality close to the dense 31B.
What do "E2B" and "E4B" mean? The "E" stands for effective parameters. These models use Per-Layer Embeddings (PLE), a technique where each decoder layer has its own small embedding table. The models have more total parameters than their "E" number suggests, but their runtime compute footprint matches the effective parameter count. E2B behaves like a 2B model at runtime.
Does Gemma 4 support audio? Yes, but not on every size. E2B, E4B, and 12B support native audio input. The 26B A4B and 31B do not support native audio input.
How much VRAM does Gemma 4 31B need? The weights alone take roughly 17.5 GB at 4-bit, 34.9 GB at 8-bit, and 69.9 GB at BF16. Budget around 20 GB for 4-bit inference once context and runtime overhead are included, and a single 80 GB H100 for unquantized BF16. For most local users, 4-bit on an RTX 3090 or RTX 4090 (24 GB VRAM) is the practical path.
What does the new 12B model change? 12B adds a middle tier: native audio, image, and video input plus 256K context without the memory footprint of the 26B A4B or 31B. If E4B feels too small but 26B/31B feel too heavy, 12B is now the first model to evaluate.
Can I run Gemma 4 26B A4B on a laptop with 16 GB RAM? Only at the margin. The 4-bit weights take about 14.4 GB, and with context and runtime overhead the practical total is approximately 18 GB. On a 16 GB system, including one that splits the model between RAM and a discrete GPU, you may be at the margin. On Apple Silicon with 24 GB unified memory, it runs comfortably.
Is Gemma 4 better than Gemma 3? Significantly. AIME 2026 math scores went from 20.8% (Gemma 3 27B) to 89.2% (Gemma 4 31B). LiveCodeBench went from 29.1% to 80.0%. Long-context multi-needle retrieval went from 13.5% to 66.4%. These are generational improvements, not incremental ones.
Related guides
Continue through the Gemma 4 cluster with the next guide that matches your current decision.

Gemma 4 A4B vs E4B: What the Names Actually Mean and Which to Run
E stands for effective parameters, A stands for active parameters. They describe completely different architectures. Here is how to pick the right one for your machine.

Gemma 4 26B vs 31B: Which Model Should You Run?
A practical Gemma 4 26B vs 31B comparison for people deciding between the MoE sweet spot and the strongest dense model in the family.

Gemma 4 E2B vs E4B: Which Small Model Should You Choose?
A practical Gemma 4 E2B vs E4B guide for people choosing between the two small models, with real benchmark gaps and memory guidance.
