Gemma 4 Hardware Requirements: RAM, VRAM, and Model Size Guide


Google DeepMind released Gemma 4 on April 2, 2026 — four open-weight models under the Apache 2.0 license, built from the same research behind Gemini 3. Before you download anything, the single most important question is: which model fits your hardware?

This guide answers that directly. You will find memory tables by model and quantization, VRAM scaling data by context length, real GPU benchmarks, and a simple decision tree so you know which variant to start with.


Gemma 4 Model Family Overview

Gemma 4 ships in four sizes, each available as a base and instruction-tuned variant:

| Model | Architecture | Total Params | Active Params | Context Window | Modalities |
|---|---|---|---|---|---|
| E2B | Dense (PLE) | ~5.1B | ~2.3B | 128K | Text, Image, Audio, Video |
| E4B | Dense (PLE) | ~5.1B | ~4B | 128K | Text, Image, Audio, Video |
| 26B A4B | MoE | 26B | 4B | 256K | Text, Image, Video |
| 31B | Dense | 31B | 31B | 256K | Text, Image, Video |

The "E" prefix stands for Effective parameters. E2B and E4B use Per-Layer Embeddings (PLE), a technique that gives them the representational depth of a much larger model while keeping memory usage low. The 26B A4B is Gemma's first Mixture-of-Experts (MoE) model. It activates only 4 billion of its 26 billion parameters per token, which cuts compute per token and speeds up generation. All 26 billion parameters still have to be loaded, so its memory footprint is that of a 26B model.


Gemma 4 VRAM Requirements by Model and Quantization

This is the table most people are looking for. These figures represent the minimum memory needed to load the model — your actual runtime usage will be higher depending on context length and system overhead.

| Model | 4-bit (Q4) | 8-bit (Q8) | BF16 (full precision) |
|---|---|---|---|
| E2B | ~2 GB | ~5 GB | ~15 GB |
| E4B | ~5 GB | ~8 GB | ~15 GB |
| 26B A4B | ~18 GB | ~28 GB | ~52 GB |
| 31B | ~20 GB | ~34 GB | ~62 GB |

Note: The BF16 31B weights fit on a single 80 GB NVIDIA H100 GPU. For consumer local inference, quantized versions (Q4 or Q8) are the practical choice.

Quick takeaways:

  • E2B and E4B at 4-bit run on laptops with 8 GB of RAM or unified memory — including entry-level Apple Silicon Macs.
  • 26B A4B at Q4 needs approximately 18 GB. Its MoE efficiency shows up as speed rather than memory savings: all experts must be loaded, but only about 4B parameters are computed per token.
  • 31B at Q4 needs approximately 20 GB to load; a 24 GB GPU can run it at shorter context lengths.
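
As a rough sanity check on the table above, weight memory can be approximated as parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate, not a measurement. The function name and the `overhead` factor are illustrative assumptions, and real Q4 files (for example Q4_K_M) store more than 4 bits per weight on average, which is one reason the table's Q4 figures sit above the naive 4-bit estimate.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.0) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes).

    overhead is a hypothetical multiplier for runtime buffers; 1.0 means weights only.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9


# Total parameter counts come from the model family table; MoE models count all experts.
for name, params in [("26B A4B", 26), ("31B", 31)]:
    for label, bits in [("4-bit", 4), ("8-bit", 8), ("BF16", 16)]:
        print(f"{name} {label}: ~{weight_memory_gb(params, bits):.1f} GB")
```

The naive BF16 estimates (52 GB and 62 GB) line up with the table, while the 4-bit and 8-bit rows in the table are higher than the naive numbers. Note that the 26B A4B is sized by its 26B total parameters, not its 4B active ones.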

Gemma 4 26B A4B: VRAM Requirements by Context Length

The 26B A4B is the standout model for local users. Its hybrid attention architecture makes context scaling far more efficient than in previous generations: going from 4K to 256K context adds only about 5 GB of VRAM.

26B A4B @ Q4 — VRAM by context length (measured with llama.cpp on Debian 12, CUDA 12.8):

| Context Length | VRAM Required |
|---|---|
| 4K | 17.98 GB |
| 8K | 18 GB |
| 16K | 18 GB |
| 32K | 18 GB |
| 64K | 19 GB |
| 128K | 20 GB |
| 256K | 23 GB |

A 24 GB GPU (RTX 3090, RTX 4090) can fit the full 256K context window, although at about 23 GB that leaves only around 1 GB of headroom. That is unusual for a model of this quality, and it is the key reason the 26B A4B is the top recommendation for most local users.
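
If your target context length falls between the measured points, a simple linear interpolation over the table gives a workable estimate. This is a minimal sketch: the function names are made up for illustration, and linear interpolation between measurements is an assumption, not something the benchmark guarantees.

```python
from bisect import bisect_left

# (context length in K tokens, VRAM in GB) for 26B A4B @ Q4, copied from the table above.
VRAM_26B_Q4 = [(4, 17.98), (8, 18.0), (16, 18.0), (32, 18.0), (64, 19.0), (128, 20.0), (256, 23.0)]


def estimate_vram_gb(context_k: float, table=VRAM_26B_Q4) -> float:
    """Linearly interpolate VRAM between the measured context lengths."""
    xs = [x for x, _ in table]
    if not xs[0] <= context_k <= xs[-1]:
        raise ValueError("context length is outside the measured range")
    i = bisect_left(xs, context_k)
    if xs[i] == context_k:
        return table[i][1]  # exact measured point
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) * (context_k - x0) / (x1 - x0)


def headroom_gb(gpu_gb: float, context_k: float) -> float:
    """VRAM left over on a card of gpu_gb at the given context length."""
    return gpu_gb - estimate_vram_gb(context_k)


print(f"96K context: ~{estimate_vram_gb(96):.1f} GB")
print(f"Headroom on a 24 GB card at 256K: ~{headroom_gb(24, 256):.1f} GB")
```

A negative result from `headroom_gb` means the configuration is not expected to fit on that card.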


Gemma 4 31B: VRAM Requirements by Context Length

The 31B is a fully dense model — every parameter is active during inference. Memory usage scales more aggressively with context length compared to the MoE 26B.

31B @ Q4 — VRAM by context length:

| Context Length | VRAM Required |
|---|---|
| 4K | 20 GB |
| 8K | 21 GB |
| 16K | 21 GB |
| 32K | 22 GB |
| 64K | 25 GB |
| 128K | 30 GB |
| 256K | 40 GB |

A 24 GB GPU can run the 31B at context lengths up to roughly 45K tokens before hitting its VRAM ceiling. For the full 256K context on the 31B, you need 40 GB or more — that means a 48 GB workstation GPU, a dual-GPU setup, or an Apple Silicon Mac with 48–64 GB of unified memory.
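
The same table can be read in reverse to estimate the longest context that fits a given card. The sketch below interpolates linearly between measured points. Taken at face value, 24 GB works out to roughly 53K tokens; reserving part of the card for the desktop and runtime pulls that down toward the ~45K figure quoted above. The `reserve_gb` parameter and the 0.8 GB value used in the example are illustrative assumptions, not numbers from the benchmark.

```python
# (context length in K tokens, VRAM in GB) for 31B @ Q4, copied from the table above.
VRAM_31B_Q4 = [(4, 20.0), (8, 21.0), (16, 21.0), (32, 22.0), (64, 25.0), (128, 30.0), (256, 40.0)]


def max_context_k(gpu_gb: float, reserve_gb: float = 0.0, table=VRAM_31B_Q4) -> float:
    """Largest context (in K tokens) whose interpolated VRAM fits in gpu_gb - reserve_gb.

    Returns 0.0 when even the shortest measured context does not fit.
    """
    budget = gpu_gb - reserve_gb
    if budget < table[0][1]:
        return 0.0
    best = float(table[0][0])
    for (x0, y0), (x1, y1) in zip(table, table[1:]):
        if budget >= y1:
            best = float(x1)  # the whole segment fits
        else:
            if y1 > y0:
                # The budget runs out partway through this segment.
                best = x0 + (x1 - x0) * (budget - y0) / (y1 - y0)
            break
    return best


print(f"24 GB, no reserve:     ~{max_context_k(24):.0f}K tokens")
print(f"24 GB, 0.8 GB reserve: ~{max_context_k(24, reserve_gb=0.8):.0f}K tokens")
print(f"32 GB, no reserve:     ~{max_context_k(32):.0f}K tokens")
```

Treat the output as a starting point for choosing a context size, then confirm against actual VRAM usage on your own system.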


GPU Performance Benchmarks

The benchmarks below were run with llama.cpp (build 8639) on a single test system (AMD EPYC 7513, 64 GB RAM, Debian 12, CUDA 12.8). pp = prompt processing in tokens/sec, tg = text generation in tokens/sec.

26B A4B @ Q4

| GPU | Context | pp (t/s) | tg (t/s) |
|---|---|---|---|
| RTX 3090 | 4K | 3,625 | 119 |
| RTX 3090 | 128K | 1,147 | 82 |
| RTX 3090 | 256K | 671 | 64 |
| RTX 5090 | 4K | 8,799 | 180 |
| RTX 5090 | 128K | 2,839 | 130 |
| RTX 5090 | 256K | 1,707 | 106 |
| RTX PRO 6000 Blackwell | 4K | 9,437 | 196 |
| RTX PRO 6000 Blackwell | 256K | 2,245 | 112 |

The 26B A4B delivers over 1,000 tokens/sec prompt processing at 128K context on the RTX 3090 — fast enough for practical agent workflows.

31B @ Q4

| GPU | Context | pp (t/s) | tg (t/s) |
|---|---|---|---|
| RTX 3090 | 4K | 1,155 | 34 |
| RTX 3090 | 32K | 723 | 31 |
| RTX 3090 | ~45K | 629 | 30 |
| RTX 5090 | 4K | 3,395 | 61 |
| RTX 5090 | 64K | 1,459 | 51 |
| RTX 5090 | 128K | 900 | 43 |
| RTX PRO 6000 Blackwell | 4K | 3,749 | 61 |
| RTX PRO 6000 Blackwell | 256K | 506 | 34 |

The 31B is significantly slower than the 26B — generation on an RTX 3090 sits around 30–34 tokens/sec versus 64–119 for the MoE model. If speed matters for your workflow, the 26B A4B is the better choice on consumer hardware.
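
To turn these throughput numbers into something you can feel, estimate the wall-clock time for one request as prompt tokens divided by pp plus output tokens divided by tg. This is a simplified sketch: it assumes throughput stays constant at the benchmarked rate for the whole request, and the function and dictionary names are illustrative.

```python
# (pp t/s, tg t/s) keyed by (model, GPU, context), copied from the benchmark tables above.
BENCH = {
    ("26B A4B", "RTX 3090", "4K"): (3625, 119),
    ("31B", "RTX 3090", "4K"): (1155, 34),
    ("26B A4B", "RTX 5090", "128K"): (2839, 130),
    ("31B", "RTX 5090", "128K"): (900, 43),
}


def response_seconds(model: str, gpu: str, context: str,
                     prompt_tokens: int, output_tokens: int) -> float:
    """Estimated time to process the prompt and then generate the output."""
    pp, tg = BENCH[(model, gpu, context)]
    return prompt_tokens / pp + output_tokens / tg


# Example request: a 3,000-token prompt and a 500-token answer on an RTX 3090.
for model in ("26B A4B", "31B"):
    t = response_seconds(model, "RTX 3090", "4K", prompt_tokens=3000, output_tokens=500)
    print(f"{model}: ~{t:.1f} s")
```

For this example request the estimate is about 5 seconds on the 26B A4B versus about 17 seconds on the 31B, with generation time dominating in both cases.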


Hardware Recommendations by Setup

By GPU / Memory Size

| Your Hardware | Recommended Model | Notes |
|---|---|---|
| 6–8 GB VRAM (GTX 1080, RTX 3070, entry laptops) | E2B or E4B @ Q4 | These run well on CPU+RAM too, just slower |
| 10–16 GB VRAM (RTX 3080, M2 Pro 16 GB) | E4B @ Q8 or E2B @ BF16 | 26B A4B still too large at Q4 |
| 20–24 GB VRAM (RTX 3090, RTX 4090) | 26B A4B @ Q4 (full 256K context) | Sweet spot for most local users |
| 24 GB VRAM | 31B @ Q4 (up to ~45K context) | Limited context; 26B A4B is usually better here |
| 32 GB VRAM (RTX 5090) | 31B @ Q4 (up to 128K context) | Comfortable 31B experience |
| 48–96 GB VRAM (RTX PRO 6000 / multi-GPU) | 31B @ Q4 or Q8 (full 256K context) | Full context, maximum quality |
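
The table above can be expressed as a small decision function. This is a sketch with some assumptions: the table leaves gaps between tiers (for example between 24 GB and 32 GB), so the code uses each tier's lower bound as the cutoff, and the fallback for cards under 6 GB is an extrapolation from the note that the small models also run on CPU+RAM.

```python
def recommend(vram_gb: float) -> str:
    """Map available VRAM (or unified memory) to the recommended starting variant."""
    if vram_gb >= 48:
        return "31B @ Q4 or Q8 (full 256K context)"
    if vram_gb >= 32:
        return "31B @ Q4 (up to 128K context)"
    if vram_gb >= 20:
        return "26B A4B @ Q4 (full 256K context)"
    if vram_gb >= 10:
        return "E4B @ Q8 or E2B @ BF16"
    if vram_gb >= 6:
        return "E2B or E4B @ Q4"
    # Not covered by the table: assumed fallback to the smallest model on CPU+RAM.
    return "E2B @ Q4 on CPU+RAM"


for gb in (8, 16, 24, 32, 96):
    print(f"{gb:>3} GB -> {recommend(gb)}")
```

On a 24 GB card this picks the 26B A4B rather than the 31B, matching the note that the 31B is context-limited at that size.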

Apple Silicon

Apple Silicon uses unified memory shared between CPU and GPU, which makes it well-suited for local LLM inference. All Gemma 4 models support MLX and llama.cpp with Metal acceleration.

| Mac Configuration | Recommended Model |
|---|---|
| M1 / M2 (8 GB) | E2B or E4B @ Q4 |
| M2 Pro / M3 Pro (18–36 GB) | 26B A4B @ Q4 |
| M2 Max / M3 Max (48–64 GB) | 31B @ Q4 or Q8 |
| M2 Ultra / M3 Ultra (96–192 GB) | 31B @ BF16 (full precision) |

Real-world note: The 26B A4B on a Mac Mini with 24 GB unified memory (Q4_K_M via Ollama, ~9.6 GB) runs well with headroom to spare. Running the 26B at full size on a 24 GB Mac can leave the system barely responsive under concurrent requests — stay at Q4 and leave memory headroom.


How to Actually Run Gemma 4 Locally

Three tools cover most local setups:

Ollama — easiest for getting started:

ollama run gemma4:e4b          # E4B (default Q4_K_M)
ollama run gemma4:26b-a4b      # 26B MoE
ollama run gemma4:31b          # 31B Dense

llama.cpp — best for CPU inference and custom quantization:

# Download and build llama.cpp, then:
llama-cli -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL

Unsloth Studio — open-source web UI, works on macOS/Windows/Linux with one-line install:

# macOS / Linux
curl -fsSL https://unsloth.ai/install.sh | sh
unsloth studio -H 0.0.0.0 -p 8888

LM Studio also supports Gemma 4 GGUF files out of the box and is a good option if you prefer a GUI without any terminal setup.
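
Once a model is pulled, you can also call it from code and set the context window per request, which is what drives the VRAM numbers in this guide. The sketch below builds a request for Ollama's local REST endpoint (`/api/generate` on the default port 11434) using only the standard library. The model tag is taken from the Ollama commands above, the helper names are illustrative, and the 32K default for `num_ctx` is an arbitrary choice for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt: str, model: str = "gemma4:26b-a4b",
                  num_ctx: int = 32768) -> urllib.request.Request:
    """Build a non-streaming generate request with an explicit context window."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # context window in tokens
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

With the Ollama server running, `generate("Explain what a KV cache is in one sentence.", num_ctx=8192)` returns the model's reply as a string. Raise `num_ctx` only as far as your VRAM budget allows.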


26B A4B vs 31B: Which Should You Pick?

The choice comes down to hardware budget and what you value more.

Pick the 26B A4B if:

  • You have a 24 GB GPU and want full 256K context
  • Speed matters: it generates roughly 3× more tokens per second than the 31B on the same GPU and context length (119 vs 34 t/s on an RTX 3090 at 4K)
  • You are running agent workflows, coding assistants, or anything with long context traces

Pick the 31B if:

  • You have 32 GB+ VRAM or a large unified memory Mac
  • You want a fully dense model with predictable behavior
  • You are fine-tuning and need full-parameter access
  • Raw output quality at shorter context is your top priority

For most local users on consumer hardware, the 26B A4B is the clear winner. It fits cleanly on a 24 GB GPU, scales to the full 256K context window, and delivers throughput that makes agentic workflows feel responsive.


Frequently Asked Questions

Can I run Gemma 4 without a GPU? Yes. All variants run on CPU-only via llama.cpp. Performance drops to roughly 5–10 tokens/second for text generation, which is usable for testing but slow for regular use. E2B and E4B are the most practical choices for CPU-only setups.

What is the difference between Q4 and Q8 quantization? Going by the table above, Q4 (4-bit) cuts memory by roughly 65–70% compared to BF16, and Q8 (8-bit) by roughly 45–50% (for the 31B: 62 GB at BF16, 34 GB at Q8, 20 GB at Q4). Q4 loses a small amount of accuracy (approximately 2–5% on benchmarks) but makes models far more accessible. For most inference tasks, Q4_K_M is the recommended starting point. Use Q8 if you have the VRAM and want closer-to-full-precision output.

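Does Gemma 4 support fine-tuning on consumer hardware? Yes, using QLoRA (Quantized LoRA) via Unsloth or TRL. QLoRA keeps the base weights in 4-bit, so plan for at least the Q4 load size from the table above plus room for adapters, gradients, and activations: a 16 GB GPU suits the E2B and E4B, while the 31B (about 20 GB at 4-bit) needs a 24 GB card or larger. Full fine-tuning requires significantly more, at least 80 GB of VRAM for the 31B.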

What is the difference between E2B, E4B, and the larger models? E2B and E4B are designed for on-device and mobile use. They use Per-Layer Embeddings (PLE) to punch above their parameter count and support audio input (up to 30 seconds). The 26B and 31B are designed for workstations and servers, with 256K context and stronger reasoning. All four models support images and video input.

Is Gemma 4 free for commercial use? Yes. Gemma 4 is released under the Apache 2.0 license, which allows free commercial use, fine-tuning, redistribution, and modification without MAU caps or use-case restrictions.

Do I need to add extra VRAM for the context window? Yes. The figures in the model-and-quantization table cover loading the model weights only; running with a larger context window adds to that. For the 26B A4B, the addition is modest (18 GB at 4K → 23 GB at 256K). For the 31B, the increase is larger (20 GB at 4K → 40 GB at 256K). Always leave at least 2–4 GB of headroom above the model size for the runtime, KV cache, and system overhead.
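
To compare how quickly the two models consume memory as context grows, you can divide the VRAM increase by the added context. The sketch below uses only the 4K and 256K endpoints quoted above, so it gives an average cost per 1K tokens and hides the fact that growth is not perfectly linear in between. The function name is illustrative.

```python
def gb_per_1k_tokens(vram_short_gb: float, vram_long_gb: float,
                     ctx_short_k: float, ctx_long_k: float) -> float:
    """Average extra VRAM (GB) per additional 1K tokens of context."""
    return (vram_long_gb - vram_short_gb) / (ctx_long_k - ctx_short_k)


moe = gb_per_1k_tokens(18, 23, 4, 256)    # 26B A4B @ Q4
dense = gb_per_1k_tokens(20, 40, 4, 256)  # 31B @ Q4

print(f"26B A4B: ~{moe * 1000:.0f} MB per 1K tokens")
print(f"31B:     ~{dense * 1000:.0f} MB per 1K tokens")
print(f"31B grows ~{dense / moe:.0f}x faster with context")
```

On these endpoints the 31B's context cost is about four times that of the 26B A4B, which is why the same 24 GB card reaches 256K on one model and only about 45K on the other.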


Summary

If you are deciding where to start, here is the short version:

  • Lightweight machine (8 GB RAM/VRAM): Start with E2B or E4B at Q4.
  • Mid-range machine (16–20 GB): E4B at Q8 or try 26B A4B at aggressive quantization.
  • 24 GB GPU (RTX 3090 / 4090): 26B A4B at Q4, which fits the full 256K context in about 23 GB. This is the sweet spot.
  • 32 GB GPU (RTX 5090) or 48 GB+ Mac: 31B at Q4 for maximum quality. Expect up to 128K context on 32 GB; the full 256K context needs 40 GB or more.

The Gemma 4 family is one of the most hardware-efficient open model releases to date. The 26B MoE in particular makes full 256K context inference accessible on hardware that previously could not get close to those numbers.
