Gemma 4 Model Comparison: 31B vs 26B A4B vs E4B vs E2B

Google released Gemma 4 on April 3, 2026 — but it is not one model. It is four separate models sharing a family name, each making different tradeoffs around memory, speed, modality support, and reasoning quality. Picking the wrong one means downloading gigabytes you cannot run, or running something underpowered when your hardware could handle more.

This guide decodes the naming system, lays out the real differences, and gives you a clear decision path before you pull a single weight file.


What the Names Actually Mean

The Gemma 4 naming convention confuses almost everyone the first time. Here is what each prefix and suffix actually encodes.

E2B and E4B — "Effective" parameters, built for edge

The "E" stands for effective parameters. E2B has 2.3 billion effective parameters during inference, but its total parameter count is 5.1 billion. E4B works the same way. The gap exists because Google uses a technique called Per-Layer Embeddings (PLE): each decoder layer carries its own small embedding table that feeds a residual signal into that layer's computation. Those tables are large on disk but cheap to compute, which is why the model behaves like a 2B at runtime while technically weighing more. The result is a model sized for phones and laptops that carries more representational depth than the parameter count suggests.

26B A4B — "Active" parameters, MoE architecture

The "A" stands for active parameters. The 26B A4B is a Mixture-of-Experts (MoE) model with 25.2 billion total parameters but only 3.8 billion active during any single inference step. Google built this model with 128 small experts, activating 8 plus one shared always-on expert per token. The practical result: it runs almost as fast as a 4B dense model, but produces quality much closer to the 31B. The "26B" tells you storage requirements; the "A4B" tells you compute cost.
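The routing idea described above can be sketched as a toy in plain Python. This is an illustration of top-k routing with one always-on shared expert, not Google's implementation: the scalar "experts", the softmax over the chosen scores, and every name here (`route_token`, `moe_layer`, `shared_expert`) are assumptions made for the example. Only the 128 and 8 come from the text.

```python
import math
import random

NUM_EXPERTS = 128  # routed experts, per the description above
TOP_K = 8          # routed experts activated per token


def route_token(router_scores, top_k=TOP_K):
    """Return the indices of the top_k highest-scoring routed experts."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:top_k])


def moe_layer(token_value, router_scores, experts, shared_expert):
    """Add the shared expert's output to a weighted mix of the chosen experts."""
    chosen = route_token(router_scores)
    # Softmax over the chosen experts' scores only (an assumed, common choice).
    peak = max(router_scores[i] for i in chosen)
    weights = [math.exp(router_scores[i] - peak) for i in chosen]
    total = sum(weights)
    routed = sum(w / total * experts[i](token_value)
                 for w, i in zip(weights, chosen))
    return shared_expert(token_value) + routed, chosen


rng = random.Random(0)
# Hypothetical stand-ins: each "expert" just scales its input.
experts = [(lambda x, s=rng.uniform(-1, 1): s * x) for _ in range(NUM_EXPERTS)]
shared_expert = lambda x: 0.5 * x
scores = [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]

output, chosen = moe_layer(1.0, scores, experts, shared_expert)
print(f"{len(chosen)} of {NUM_EXPERTS} routed experts ran, plus the shared expert")
```

All 128 experts have to be loaded, but only the chosen 8 and the shared one do any work for a given token. That is the storage-versus-compute split the "26B" and "A4B" labels describe.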

31B — Dense, no tricks

Every parameter fires on every forward pass. You pay the full compute bill, but you get the simplest behavior, the highest quality ceiling in the family, and the cleanest base for fine-tuning.


The Four Models at a Glance

| | E2B | E4B | 26B A4B | 31B |
|---|---|---|---|---|
| Architecture | Dense (Edge) | Dense (Edge) | Mixture-of-Experts | Dense |
| Effective / Active params | ~2.3B | ~4B | ~3.8B active | 30.7B |
| Total params | 5.1B | ~9B | 25.2B | 30.7B |
| Context window | 128K | 128K | 256K | 256K |
| Audio input | Yes | Yes | No | No |
| Image / Video input | Yes | Yes | Yes | Yes |
| Target hardware | Phone / IoT | Laptop | Consumer GPU | H100 / High-end GPU |
| Memory (4-bit quant) | ~5 GB | ~8 GB | ~18 GB | ~20 GB |
| Memory (8-bit / 16-bit) | ~15 GB | | ~28 GB | ~34 GB |
| LMArena Elo (text) | | | 1441 | 1452 |
| Open model rank | | | #6 | #3 |

Memory figures are approximate planning values from Unsloth's deployment guide. Real usage varies with context length, quantization method, and system overhead.


Benchmark Numbers

All scores below come from Google's official Gemma 4 model card and refer to the instruction-tuned variants unless noted. The benchmarks (AIME 2026, LiveCodeBench v6, MMLU Pro) are newer versions than the tests used for Gemma 3, so treat direct generational comparisons as directional.

31B Dense

| Benchmark | Score |
|---|---|
| AIME 2026 (math) | 89.2% |
| LiveCodeBench v6 (coding) | 80.0% |
| GPQA Diamond (science reasoning) | 84.3% |
| MMLU Pro (knowledge) | 85.2% |
| MMMU Pro (vision) | 76.9% |
| MATH-Vision | 85.6% |
| Codeforces ELO | 2,150 |
| Multi-needle retrieval (long context) | 66.4% |

For context: Gemma 3 27B scored 20.8% on AIME and 29.1% on LiveCodeBench. The improvement is generational, not incremental.

26B A4B (MoE)

| Benchmark | Score |
|---|---|
| AIME 2026 | 88.3% |
| LiveCodeBench v6 | 77.1% |
| GPQA Diamond | 82.3% |
| MMLU Pro | 82.6% |

The 26B A4B achieves roughly 97% of the dense 31B model's quality while activating only 3.8B parameters per token — about 8× less compute per inference step. On the LMArena leaderboard it scores 1441 Elo versus 1452 for the 31B, a gap that will be invisible in most real-world tasks.
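The two ratios quoted above can be checked against the tables with a few lines of Python. This is a sketch under one assumption: that "quality" means the unweighted mean of the per-benchmark score ratios for the four benchmarks both models report. The article does not say how the 97% figure was derived.

```python
# Scores copied from the 31B and 26B A4B tables above.
scores_31b = {"AIME 2026": 89.2, "LiveCodeBench v6": 80.0,
              "GPQA Diamond": 84.3, "MMLU Pro": 85.2}
scores_26b = {"AIME 2026": 88.3, "LiveCodeBench v6": 77.1,
              "GPQA Diamond": 82.3, "MMLU Pro": 82.6}

# Per-benchmark ratio of the MoE score to the dense score.
ratios = {name: scores_26b[name] / scores_31b[name] for name in scores_31b}
mean_ratio = sum(ratios.values()) / len(ratios)

# Parameters used per token, in billions: all of the 31B vs. the active 3.8B.
DENSE_31B = 30.7
ACTIVE_26B = 3.8
compute_ratio = DENSE_31B / ACTIVE_26B

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%} of the 31B score")
print(f"mean ratio: {mean_ratio:.1%}")
print(f"parameters used per token: {compute_ratio:.1f}x fewer")
```

Under that assumption the mean lands between 97% and 98%, and the parameter ratio comes out just above 8, in line with the figures in the text.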

E4B

| Benchmark | Score |
|---|---|
| AIME 2026 | 42.5% |
| LiveCodeBench v6 | 52.0% |
| MMLU Pro | 69.4% |
| MMMU Pro (vision) | 52.6% |

Strong for a model that runs on a T4 GPU or MacBook Air. The reasoning gap versus the workstation models is real, but E4B handles OCR, image grounding, and coding assistance at a level that justifies its position in an edge deployment.

E2B

| Benchmark | Score |
|---|---|
| AIME 2026 | 37.5% |
| LiveCodeBench v6 | 44.0% |
| MMLU Pro | 60.0% |
| MMMU Pro (vision) | 44.2% |

E2B is the floor of the family. It works on phones and Raspberry Pi-class hardware. Google's own tests show Gemma 4 E2B running on a Raspberry Pi 5 via LiteRT-LM at around 7.6 tokens per second decode speed — slow but functional for edge agent workflows.


The Key Differences That Actually Matter

Audio is not a family-wide feature

Only E2B and E4B support audio input — speech recognition and audio-to-text translation. Audio is capped at 30 seconds per clip. The 26B A4B and 31B do not support audio at all. If your use case requires speech input, the choice is made for you before you look at anything else.
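If your recordings run longer than the 30-second cap, one option is to split them into clips before sending them to the model. The helper below is a minimal sketch of that bookkeeping. The name `clip_ranges` is hypothetical, and cutting at fixed boundaries (rather than at pauses in speech) is an assumption made to keep it simple.

```python
def clip_ranges(total_seconds, max_clip_seconds=30.0):
    """Split a recording into consecutive (start, end) ranges of at most max_clip_seconds."""
    total = float(total_seconds)
    ranges = []
    start = 0.0
    while start < total:
        end = min(start + max_clip_seconds, total)
        ranges.append((start, end))
        start = end
    return ranges


# A 75-second recording becomes three clips, the last one shorter.
for start, end in clip_ranges(75):
    print(f"clip from {start:.0f}s to {end:.0f}s")
```

Fixed boundaries can cut a word in half, so a real pipeline would probably cut at silences or overlap adjacent clips slightly.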

Context length splits the family in two

E2B and E4B top out at 128K tokens. The 26B A4B and 31B reach 256K. This matters more than the raw number suggests. Gemma 3's 128K context was mostly theoretical, because retrieval reliability broke down at long ranges. Gemma 4's 256K context is functional: multi-needle retrieval rose from 13.5% (Gemma 3 27B) to 66.4% (Gemma 4 31B), meaning the model can actually find and reason over information buried deep in a long document, not just accept it.

MoE vs Dense is a speed and fine-tuning tradeoff

The 26B A4B runs at roughly the speed of a 4B dense model during inference because only 3.8B parameters activate per token. For agentic workflows where you are generating hundreds of tokens across many tool calls, that speed advantage compounds significantly. The 31B Dense is slower but offers more predictable behavior and is the stronger candidate for fine-tuning — every layer fires every time, which simplifies gradient flow during training.

Video support has hard limits

All four models can process video, but video is handled as a sequence of frames at one frame per second, capped at 60 seconds. This is useful for short clips, UI recordings, or summarizing a short demo — not for real-time video analysis or long-form content.
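The sampling rule above translates into a simple frame budget. The sketch below lists the timestamps that would be sampled at one frame per second under the 60-second cap. The function name `frame_timestamps` is hypothetical, and the code assumes you truncate longer clips yourself, since the article does not say whether a runtime truncates or rejects them.

```python
def frame_timestamps(duration_seconds, fps=1.0, max_seconds=60.0):
    """Return the sample times (in seconds) for a clip, honoring the length cap."""
    usable = min(float(duration_seconds), max_seconds)
    frame_count = int(usable * fps)
    return [i / fps for i in range(frame_count)]


# A 90-second clip is cut to the first 60 seconds, i.e. 60 frames.
print(len(frame_timestamps(90)))
# A 3-second clip yields frames at 0s, 1s and 2s.
print(frame_timestamps(3))
```

At most 60 frames reach the model, so anything that happens between samples or after the first minute is invisible to it.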

Knowledge cutoff is January 2025

Gemma 4's pretraining data cuts off in January 2025. A 256K context window does not change this. For domains that have changed since then, you need retrieval augmentation or tool access rather than relying on the model's internal knowledge.


Hardware Requirements

These are approximate planning values for inference. "Memory" means the single shared pool on unified-memory systems (Apple Silicon, integrated GPUs), or available VRAM on discrete-GPU setups.

| Model | 4-bit quantized | 8-bit quantized | Unquantized (BF16) |
|---|---|---|---|
| E2B | ~5 GB | ~15 GB | |
| E4B | ~8 GB | | |
| 26B A4B | ~18 GB | ~28 GB | |
| 31B | ~20 GB | ~34 GB | ~80 GB (single H100) |

Practical translation:

  • Phone or Raspberry Pi: E2B via LiteRT-LM or AI Edge Gallery
  • MacBook Air (8 GB unified memory): E4B at 4-bit is the target, though at ~8 GB it leaves little headroom; E2B is the safer fit
  • Laptop or desktop with 16 GB RAM: 26B A4B at 4-bit is the target, though at ~18 GB it is at the margin; E4B fits comfortably
  • RTX 3090 / RTX 4090 (24 GB VRAM): 26B A4B runs fully with 256K context; 31B at 4-bit is feasible
  • NVIDIA H100 (80 GB): 31B at full BF16 precision, no quantization needed
  • NVIDIA DGX Spark (128 GB unified): 31B at BF16 with headroom

One thing worth flagging: the 26B A4B's 25.2B total parameters still need to live in memory even though only 3.8B activate per step. You pay for storage once when loading; you pay for 3.8B worth of compute at each token. Budget for the former when sizing hardware.
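The storage side of that budget can be estimated from the total parameter count alone. The sketch below computes a weight-only lower bound (parameters times bits per weight, divided by 8). The helper name `weight_memory_gb` is hypothetical, and the estimate deliberately ignores the KV cache, quantization metadata, and runtime overhead.

```python
def weight_memory_gb(total_params_billion, bits_per_weight):
    """Weight-only footprint in GB (1 GB = 1e9 bytes): params * bits / 8."""
    return total_params_billion * bits_per_weight / 8


# Total (not active) parameter counts in billions, from the tables above.
TOTAL_PARAMS = {"E2B": 5.1, "26B A4B": 25.2, "31B": 30.7}

for name, total in TOTAL_PARAMS.items():
    for bits in (4, 8, 16):
        size = weight_memory_gb(total, bits)
        print(f"{name} at {bits}-bit: ~{size:.1f} GB of weights")
```

For the 26B A4B at 4-bit this gives about 12.6 GB of weights, below the ~18 GB planning value above. The difference would be the overhead the formula leaves out, so treat the formula as a floor and the planning values as the number to budget for.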


Which Model Should You Run?

Choose E2B if: you are building on-device mobile applications, IoT agents, or anything that needs to run on a phone without a network connection. Also the right call if you need audio input and have severely limited memory.

Choose E4B if: you want audio input support with noticeably better reasoning than E2B, and you have an 8–16 GB laptop or mid-range GPU. This is the default edge choice for most developers who are not RAM-constrained to the absolute floor.

Choose 26B A4B if: you have a consumer GPU with 16–24 GB of memory and want near-31B quality with faster inference. This is the sweet spot for local agentic workflows, coding assistants, and document processing where speed matters. It is also the right pick for any deployment where latency affects user experience.

Choose 31B if: you want the highest quality output in the family, you are planning to fine-tune, or you are running on hardware that can handle it comfortably. Do not default to 31B just because more parameters sounds better — the 26B A4B is close enough in quality that many users will not notice the difference in practice.
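The decision path above can be condensed into a small function. This is one possible encoding, not an official selector: the names `pick_model`, `FOUR_BIT_GB`, and `prefer_max_quality` are hypothetical, and the thresholds are the approximate 4-bit planning figures from this guide with no extra headroom added.

```python
# Approximate 4-bit memory needs in GB, taken from the tables in this guide.
FOUR_BIT_GB = {"E2B": 5, "E4B": 8, "26B A4B": 18, "31B": 20}


def pick_model(memory_gb, needs_audio=False, prefer_max_quality=False):
    """Return the first model in preference order that fits, or None if none does."""
    if needs_audio:
        # Only the edge models accept audio input.
        order = ["E4B", "E2B"]
    elif prefer_max_quality:
        # Highest quality ceiling or fine-tuning: try the dense 31B first.
        order = ["31B", "26B A4B", "E4B", "E2B"]
    else:
        # Default: the MoE model is the speed/quality sweet spot.
        order = ["26B A4B", "E4B", "E2B"]
    for name in order:
        if FOUR_BIT_GB[name] <= memory_gb:
            return name
    return None


print(pick_model(24))                           # 24 GB GPU, default preference
print(pick_model(24, prefer_max_quality=True))  # same GPU, quality first
print(pick_model(16, needs_audio=True))         # audio forces an edge model
```

Because the thresholds carry no safety margin, a result that only just fits (for example E4B on exactly 8 GB) should be read as "at the margin" rather than "comfortable".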


Where to Access Gemma 4

  • Google AI Studio: hosted 31B and 26B A4B, no local setup required
  • Google AI Edge Gallery: hosted E4B and E2B, optimized for mobile testing
  • Hugging Face: all four models as google/gemma-4-31B-it, google/gemma-4-26B-A4B-it, google/gemma-4-E4B-it, google/gemma-4-E2B-it
  • Ollama: ollama run gemma4:e4b, gemma4:26b, etc.
  • LM Studio: GUI-based local setup for 26B and 31B
  • llama.cpp: cross-platform CPU/GPU inference
  • MLX: Apple Silicon optimized inference

All weights are Apache 2.0 licensed — no MAU limits, no usage restrictions, commercial use permitted without additional terms.
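Once a model has been pulled with Ollama, you can also call it from code through Ollama's local HTTP API. The sketch below uses only the Python standard library and assumes an Ollama server on its default port (11434) with the `gemma4:e4b` tag from the list above already downloaded. The helper names `build_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Ollama's default local endpoint for single-turn generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="gemma4:e4b"):
    """Build a non-streaming generate request for a locally running Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="gemma4:e4b", timeout=120):
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Explain MoE in one sentence.")` should return the model's reply as a string. Swapping the `model` argument for `gemma4:26b` targets the larger model, hardware permitting.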


FAQ

What does "A4B" mean in Gemma 4 26B A4B? The "A" stands for active parameters. The 26B A4B is a Mixture-of-Experts model with 25.2 billion total parameters, but only 3.8 billion activate per inference step. It runs like a 4B model in terms of compute while delivering quality close to the dense 31B.

What do "E2B" and "E4B" mean? The "E" stands for effective parameters. These models use Per-Layer Embeddings (PLE), a technique where each decoder layer has its own small embedding table. The models have more total parameters than their "E" number suggests, but their runtime compute footprint matches the effective parameter count. E2B behaves like a 2B model at runtime.

Does Gemma 4 support audio? Only on E2B and E4B. Both support audio input for speech recognition and audio-to-text translation, up to 30 seconds per clip. The 26B A4B and 31B do not support audio input.

How much VRAM does Gemma 4 31B need? Around 20 GB for 4-bit quantized inference, 34 GB for 8-bit, and a single 80 GB H100 for unquantized BF16. For most local users, 4-bit on an RTX 3090 or RTX 4090 (24 GB VRAM) is the practical path.

What is the difference between E2B and E4B? Both are edge models with audio support and a 128K context window. E4B has more capacity: it scores 69.4% vs 60.0% on MMLU Pro, 52.0% vs 44.0% on LiveCodeBench, and 52.6% vs 44.2% on MMMU Pro (vision). E4B requires roughly 8 GB at 4-bit versus 5 GB for E2B. If your hardware can handle E4B, it is the better default.

Can I run Gemma 4 26B A4B on a laptop with 16 GB RAM? Only marginally. At 4-bit quantization it needs approximately 18 GB of total memory, so a 16 GB system can fit it only when a discrete GPU's VRAM adds to the pool, and even then you may be at the margin. On Apple Silicon with 24 GB unified memory, it runs comfortably.

Is Gemma 4 better than Gemma 3? Significantly. AIME 2026 math scores went from 20.8% (Gemma 3 27B) to 89.2% (Gemma 4 31B). LiveCodeBench went from 29.1% to 80.0%. Long-context multi-needle retrieval went from 13.5% to 66.4%. These are generational improvements, not incremental ones.
