How to Run Gemma 4 in Ollama: Tags, Hardware, and First Run

Quick answer
Yes, Ollama supports Gemma 4. Support landed with Ollama v0.20.0 on April 3, 2026 — the same day Google released the model. Two commands get you running:
ollama pull gemma4
ollama run gemma4
The default tag is gemma4:e4b — a 9.6 GB model that fits comfortably on most developer machines. If you want a different size, see the tag table below before pulling anything.
All Gemma 4 Ollama tags
This is the most common question in search data, so it goes first.
| Tag | Size on disk | Context window | Architecture | Audio input | Best for |
|---|---|---|---|---|---|
| gemma4:e2b | 7.2 GB | 128K | Dense (2.3B effective) | Yes | Laptops, edge, lowest hardware bar |
| gemma4:e4b (default) | 9.6 GB | 128K | Dense (4.5B effective) | Yes | Most developers, best starting point |
| gemma4:26b | 18 GB | 256K | MoE (3.8B active) | No | Best quality-per-GB, fast inference |
| gemma4:31b | 20 GB | 256K | Dense (30.7B) | No | Maximum quality, coding, reasoning |
A few things worth noting:
- The "E" in E2B and E4B stands for "effective" parameters. These are the edge-first models designed for laptops and mobile devices.
- gemma4:26b is a Mixture-of-Experts model. Only 3.8 billion parameters activate during inference, so it runs faster than its total size suggests, often comparable in speed to a 4B dense model while delivering quality closer to a 13B model.
- gemma4:latest resolves to gemma4:e4b. When you run ollama run gemma4 without a tag, that is what you get.
Prerequisite: Ollama version check
Gemma 4 requires Ollama v0.20.0 or later. Earlier builds will fail to pull the model. Check your version first:
ollama --version
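If you want to script this check, the sketch below compares the reported version against the v0.20.0 minimum stated above. It assumes only that the output of ollama --version contains a version number in X.Y.Z form; the exact wording of that output is not relied on, and the function names are illustrative.

```python
import re
import subprocess

MIN_VERSION = (0, 20, 0)  # minimum Ollama release stated in this guide


def parse_version(text: str) -> tuple:
    """Extract the first X.Y.Z version number found in `text`."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def supports_gemma4(version_output: str) -> bool:
    """True when the reported version is at least MIN_VERSION."""
    # Tuples compare numerically, so 0.9.0 correctly sorts below 0.20.0.
    return parse_version(version_output) >= MIN_VERSION


def installed_ollama_supports_gemma4() -> bool:
    """Run `ollama --version` locally and compare it against MIN_VERSION."""
    result = subprocess.run(
        ["ollama", "--version"], capture_output=True, text=True, check=True
    )
    return supports_gemma4(result.stdout)
```

Comparing integer tuples rather than strings avoids the common mistake of treating "0.9.0" as newer than "0.20.0".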
If you are on an older version, update before trying to pull:
# macOS (Homebrew)
brew upgrade ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
On Windows, download the latest installer from ollama.com.
Hardware requirements
Check these numbers before downloading. A model that barely fits is usually worse than a smaller model that runs smoothly.
| Model | Minimum RAM / VRAM | Comfortable setup | Notes |
|---|---|---|---|
| gemma4:e2b | 8 GB | 16 GB | Best for CPU-only machines |
| gemma4:e4b | 10 GB VRAM or 16 GB unified memory | 16–24 GB | Default model, fits most consumer GPUs |
| gemma4:26b | 20 GB RAM or unified memory | 24–32 GB | MoE: active inference is lighter than size suggests |
| gemma4:31b | 24 GB VRAM or 32 GB unified memory | 32 GB+ | Quality-first, not a casual first download |
On Apple Silicon (M1/M2/M3/M4), unified memory works well for all sizes. A Mac with 16 GB handles e4b comfortably. The 26b model fits on 24 GB but leaves little headroom — treat it as the ceiling, not the target.
On NVIDIA GPUs, treat the VRAM numbers above as the practical limits. The model should fit entirely in VRAM for full GPU-accelerated speed. If it does not fit, Ollama offloads part of the model to the CPU and system RAM, which is significantly slower.
CPU-only machines can run Gemma 4, but expect roughly 1–3 tokens per second on e4b. Use e2b for better CPU performance.
Which model should you pick?
Start with the smallest model that fits your hardware comfortably, not the largest one that technically loads.
- Under 16 GB RAM / VRAM → start with gemma4:e2b
- 16 GB RAM or 10+ GB VRAM → gemma4:e4b is the right default
- 24+ GB unified memory or VRAM → gemma4:26b gives significantly better quality with MoE efficiency
- 32 GB+, quality matters most → gemma4:31b for coding, reasoning, and document-scale tasks
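The thresholds in this list can be written as a small helper. This is a rough sketch, not an official sizing rule: the function name and parameters are hypothetical, and it simplifies the list by treating system or unified memory and dedicated VRAM as two numbers you supply yourself.

```python
def pick_tag(memory_gb: float, vram_gb: float = 0.0, prefer_quality: bool = False) -> str:
    """Suggest a Gemma 4 tag from the guide's rule-of-thumb thresholds.

    memory_gb      -- system RAM or unified memory, in GB
    vram_gb        -- dedicated GPU memory in GB (0 if none)
    prefer_quality -- opt in to the 31B model when 32 GB or more is available
    """
    largest_pool = max(memory_gb, vram_gb)
    if largest_pool >= 32 and prefer_quality:
        return "gemma4:31b"
    if largest_pool >= 24:
        return "gemma4:26b"
    if memory_gb >= 16 or vram_gb >= 10:
        return "gemma4:e4b"
    return "gemma4:e2b"
```

For example, pick_tag(16) suggests gemma4:e4b, and pick_tag(8, vram_gb=12) does too because of the 10+ GB VRAM rule. The 31B model is only suggested when prefer_quality is set, mirroring the "quality matters most" condition.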
For most developers doing local experimentation, e4b is the right answer. Only move up after confirming the first run feels stable and responsive.
Pull and run commands
Pull without running (recommended for large models):
ollama pull gemma4 # pulls e4b (default, 9.6 GB)
ollama pull gemma4:e2b # 7.2 GB
ollama pull gemma4:26b # 18 GB
ollama pull gemma4:31b # 20 GB
Run interactively:
ollama run gemma4 # starts e4b
ollama run gemma4:e2b
ollama run gemma4:26b
ollama run gemma4:31b
Check what you have installed:
ollama list
Check which models are currently loaded in memory:
ollama ps
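The same installed-model check can be done from code through the local API's /api/tags endpoint. The sketch below assumes the response is a JSON object with a "models" list whose entries carry a "name" field, and that a bare name such as gemma4 is stored as gemma4:latest; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address used in this guide


def installed_tags(tags_response: dict) -> list:
    """Return the model names from a parsed /api/tags response."""
    return [entry["name"] for entry in tags_response.get("models", [])]


def is_installed(tags_response: dict, tag: str) -> bool:
    """Check for `tag`; a bare name such as "gemma4" is treated as "gemma4:latest"."""
    wanted = tag if ":" in tag else f"{tag}:latest"
    return wanted in installed_tags(tags_response)


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """GET /api/tags from a running Ollama server (performs a network call)."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as response:
        return json.load(response)
```

With the server running, is_installed(fetch_tags(), "gemma4:26b") tells you whether a pull is still needed before a script tries to use the larger model.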
Using the local API
Ollama exposes a local REST API at http://localhost:11434 whenever the Ollama server is running. You can call it from any HTTP client, with no cloud dependency and no API key.
curl (generate)
curl http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4",
"prompt": "Explain the difference between MoE and dense transformer architectures.",
"stream": false
}'
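The request above sets "stream": false to get a single JSON reply. If you leave that field out, /api/generate streams newline-delimited JSON objects, each carrying a fragment in "response" and a final object with "done": true. A minimal sketch for reassembling such a stream, with an illustrative function name:

```python
import json
from typing import Iterable, Union


def collect_stream(lines: Iterable[Union[bytes, str]]) -> str:
    """Join the "response" fragments of a streamed /api/generate reply."""
    parts = []
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the reply
    return "".join(parts)
```

Any iterable of lines works as input, so the function can be fed an open HTTP response object directly or a list of captured lines when testing.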
curl (chat, OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4",
"messages": [
{"role": "user", "content": "Write a Python function to parse JSON safely."}
]
}'
Python (ollama library)
from ollama import chat
response = chat(
model='gemma4',
messages=[{'role': 'user', 'content': 'What is mixture of experts?'}],
)
print(response.message.content)
Python (OpenAI SDK, drop-in compatible)
Because Ollama's API is OpenAI-compatible, you can point the official OpenAI SDK at your local instance:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama", # required by the SDK but unused by Ollama
)
response = client.chat.completions.create(
model="gemma4",
messages=[
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a function to flatten a nested list in Python."}
]
)
print(response.choices[0].message.content)
JavaScript
import ollama from 'ollama'
const response = await ollama.chat({
model: 'gemma4',
messages: [{role: 'user', content: 'Hello!'}],
})
console.log(response.message.content)
What Gemma 4 can do that Gemma 3 could not
These are not incremental improvements — the benchmark gaps are substantial:
| Benchmark | Gemma 4 31B | Gemma 4 E4B | Gemma 3 27B |
|---|---|---|---|
| AIME 2026 (math reasoning) | 89.2% | 42.5% | 20.8% |
| LiveCodeBench v6 (coding) | 80.0% | 52.0% | 29.1% |
| Codeforces ELO | 2150 | 940 | 110 |
| MMLU Pro (knowledge) | 85.2% | 69.4% | 67.6% |
| GPQA Diamond (science) | 84.3% | 58.6% | 42.4% |
Beyond benchmarks, Gemma 4 adds capabilities that were absent from Gemma 3:
- Native function calling: all four variants support structured tool use out of the box, returning valid JSON matching your schema
- Thinking modes: you can enable or disable chain-of-thought reasoning per request using the <|think|> token in the system prompt
- 256K context on the 26B and 31B models (up from 128K in Gemma 3 27B)
- Audio input on E2B and E4B: speech recognition and understanding alongside text and image
- 140+ languages natively supported
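To make the function-calling item concrete, here is a minimal sketch of the client side: a tool definition in the JSON-schema style that Ollama's chat API accepts, plus a dispatcher that runs whichever tool the model asks for. The get_weather tool and the registry are hypothetical and exist only for illustration; nothing here is specific to Gemma 4.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool for illustration; a real one would query a weather service."""
    return f"Sunny in {city}"


# Tool description passed to the model so it knows the name and argument schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}

TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(name: str, arguments) -> str:
    """Dispatch one tool call requested by the model to a local function."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"model asked for unknown tool {name!r}")
    # Arguments may arrive as a dict or as a JSON string, depending on the API used.
    if isinstance(arguments, str):
        arguments = json.loads(arguments)
    return TOOL_REGISTRY[name](**arguments)
```

With the ollama library you would pass tools=[WEATHER_TOOL] to chat() and feed each entry of the reply's tool_calls (its function name and arguments) into run_tool_call. Check the library documentation for the exact response fields in your installed version.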
Thinking mode
Gemma 4 supports configurable chain-of-thought reasoning. To enable it, include the <|think|> token at the start of your system prompt:
from ollama import chat
response = chat(
model='gemma4:31b',
messages=[
{
'role': 'system',
'content': '<|think|> Think step by step before answering.'
},
{
'role': 'user',
'content': 'What is the integral of x^2 from 0 to 3?'
}
],
)
print(response.message.content)
To disable thinking, remove the <|think|> token from the system prompt. For E2B and E4B, thinking is fully off when the token is absent. For 26B and 31B, the model still generates the thought tags but with an empty thought block.
For simple lookups or casual chat, skip thinking. For math, complex coding, or document analysis, enable it — the quality difference is significant on the larger models.
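Since thinking is toggled purely by the presence of the <|think|> token, a small helper can switch it per request. This is a sketch with an illustrative function name; it only builds the messages list and leaves the actual call to whichever client you use.

```python
THINK_TOKEN = "<|think|>"  # token described in this guide


def build_messages(user_prompt: str, think: bool, system_prompt: str = "") -> list:
    """Build a chat message list, placing the think token first when enabled."""
    system = system_prompt.strip()
    if think:
        # The token goes at the very start of the system prompt.
        system = f"{THINK_TOKEN} {system}".strip()
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user_prompt})
    return messages
```

The result can be passed as the messages argument in any of the chat examples above, for instance with think=True for a math question and think=False for a quick lookup.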
Common errors and fixes
Error: gemma4:e4b requires a newer version of Ollama
Your Ollama build predates v0.20.0. Run the update command for your OS (see Prerequisite section above), then try again.
Out of memory / model fails to load
Run ollama ps to see how much memory each loaded model is using and whether it is running on GPU or CPU. If the model is too large for your machine, switch to a smaller tag. gemma4:e2b (7.2 GB) is the lightest official option.
Slow responses (1–5 tokens/second)
If Ollama is not using your GPU, the model is running on CPU. Check that your GPU drivers are current and that Ollama can see your GPU. On Apple Silicon, make sure you are on a recent Ollama build — MLX acceleration support was added in v0.20.0.
Port 11434 already in use
Another Ollama instance is running, or something else has taken the port. You can set a custom port:
# 127.0.0.1 keeps the API local; 0.0.0.0 would expose it on every network interface
OLLAMA_HOST=127.0.0.1:11435 ollama serve
Then update your API calls to use port 11435.
Responses cut off before completing
The context window may be too small for your prompt. Increase it per request:
curl http://localhost:11434/api/generate \
-d '{
"model": "gemma4",
"prompt": "...",
"options": {"num_ctx": 32768}
}'
gemma4:26b barely fits but feels slow
The 26B model on a 24 GB machine leaves very little memory headroom. Other processes competing for memory will degrade performance significantly. Close other GPU-heavy applications, or drop to e4b if you need consistent responsiveness.
What to check before blaming the model
If the output quality feels worse than expected, run through this list before switching to a larger model:
- Confirm you are on the model size you intended: ollama list shows what is installed
- Check that GPU inference is active: ollama ps shows which processor is being used
- Try enabling thinking mode if the task involves reasoning or math
- Check that your context window is large enough for the full prompt
- Use the recommended sampling settings: temperature=1.0, top_p=0.95, top_k=64
In most cases, e4b with thinking mode enabled handles tasks that initially seemed to require 31b.
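The sampling settings and context window from the checklist can be sent per request through the options field of /api/generate. A minimal sketch, with a hypothetical helper name, that builds such a payload:

```python
import json

# Sampling settings recommended in the checklist above.
RECOMMENDED_OPTIONS = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}


def build_generate_payload(prompt: str, model: str = "gemma4", num_ctx=None, **overrides) -> dict:
    """Build an /api/generate request body with the recommended options."""
    options = {**RECOMMENDED_OPTIONS, **overrides}
    if num_ctx is not None:
        options["num_ctx"] = num_ctx  # context window size in tokens
    return {"model": model, "prompt": prompt, "stream": False, "options": options}


def to_json(payload: dict) -> str:
    """Serialize the payload for use as an HTTP request body."""
    return json.dumps(payload)
```

For example, build_generate_payload("Summarize this file", num_ctx=32768) keeps the recommended sampling values and raises the context window in the same request.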
Next steps
If Ollama is not the right fit for your setup, two common alternatives:
- LM Studio — a GUI-first local runtime, good if you prefer not working in the terminal
- llama.cpp — more configuration control, better for CPU-heavy or constrained environments
If you want to try Gemma 4 without any local setup, Google AI Studio offers hosted access to the 31B and 26B models.
Related guides
Continue through the Gemma 4 cluster with the next guide that matches your current decision.

Does LM Studio Support Gemma 4? Compatibility, Model List, and Requirements
A clear answer to whether LM Studio supports Gemma 4, with the supported model list, minimum memory, and practical setup expectations.

Gemma 4 26B A4B VRAM Requirements: Q4, Q8, F16, and 24 GB GPU Fit
A focused Gemma 4 26B A4B VRAM requirements guide with exact GGUF sizes, planning ranges, and why the 26B is the local sweet spot.

Gemma 4 31B VRAM Requirements: Q4, Q8, F16, and Practical Hardware
A focused Gemma 4 31B VRAM requirements guide with exact GGUF sizes, planning ranges, and honest advice on what hardware makes sense.
