How to Run Gemma 4 in Ollama: Tags, Hardware, and First Run

Quick answer

Yes, Ollama supports Gemma 4. Support landed with Ollama v0.20.0 on April 3, 2026 — the same day Google released the model. Two commands get you running:

ollama pull gemma4
ollama run gemma4

The default tag is gemma4:e4b, a 9.6 GB model that runs comfortably on a machine with 16 GB of memory or a GPU with at least 10 GB of VRAM. If you want a different size, see the tag table below before pulling anything.

All Gemma 4 Ollama tags

Four official tags are available. They differ in download size, context window, and architecture:

Tag	Size on disk	Context window	Architecture	Audio input	Best for
`gemma4:e2b`	7.2 GB	128K	Dense (2.3B effective)	Yes	Laptops, edge, lowest hardware bar
`gemma4:e4b` (default)	9.6 GB	128K	Dense (4.5B effective)	Yes	Most developers, best starting point
`gemma4:26b`	18 GB	256K	MoE (3.8B active)	No	Best quality-per-GB, fast inference
`gemma4:31b`	20 GB	256K	Dense (30.7B)	No	Maximum quality, coding, reasoning

A few things worth noting:

The "E" in E2B and E4B stands for "effective" parameters — these are the edge-first models designed for laptops and mobile devices.
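gemma4:26b is a Mixture-of-Experts model. Only 3.8 billion parameters activate per token during inference, so it runs faster than its total size suggests, often comparable in speed to a 4B dense model while delivering quality closer to a 13B model. All 18 GB of weights must still fit in memory, which is why its hardware requirements follow the download size rather than the active parameter count.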
gemma4:latest resolves to gemma4:e4b. When you run ollama run gemma4 without a tag, that is what you get.

Prerequisite: Ollama version check

Gemma 4 requires Ollama v0.20.0 or later. Earlier builds will fail to pull the model. Check your version first:

ollama --version

If you are on an older version, update before trying to pull:

# macOS (Homebrew)
brew upgrade ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

On Windows, download the latest installer from ollama.com.
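If you script your setup, the version check can be automated. The following is a minimal sketch, not an official tool: the helper names are hypothetical, and it assumes the output of ollama --version contains a plain X.Y.Z number. Versions are compared as integer tuples because a plain string comparison would rank 0.9.6 above 0.20.0.

```python
import re
import subprocess

MIN_VERSION = (0, 20, 0)  # minimum Ollama version stated in this guide


def parse_version(text):
    """Return the first X.Y.Z number found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in: {text!r}")
    return tuple(int(part) for part in match.groups())


def supports_gemma4(version_output):
    """True if the reported version is at least MIN_VERSION."""
    # Pre-release suffixes such as "-rc1" are ignored by this sketch.
    return parse_version(version_output) >= MIN_VERSION


def installed_version_output():
    """Run `ollama --version` and return its raw text (needs Ollama on PATH)."""
    result = subprocess.run(
        ["ollama", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout or result.stderr
```

Calling supports_gemma4(installed_version_output()) before a pull lets a setup script stop early with a clear message instead of failing partway through.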

Hardware requirements

Check these numbers before downloading. A model that barely fits is usually worse than a smaller model that runs smoothly.

Model	Minimum RAM / VRAM	Comfortable setup	Notes
`gemma4:e2b`	8 GB	16 GB	Best for CPU-only machines
`gemma4:e4b`	10 GB VRAM or 16 GB unified memory	16–24 GB	Default model, fits most consumer GPUs
`gemma4:26b`	20 GB RAM or unified memory	24–32 GB	MoE — active inference is lighter than size suggests
`gemma4:31b`	24 GB VRAM or 32 GB unified memory	32 GB+	Quality-first, not a casual first download

On Apple Silicon (M1/M2/M3/M4), unified memory works well for all sizes. A Mac with 16 GB handles e4b comfortably. The 26b model fits on 24 GB but leaves little headroom — treat it as the ceiling, not the target.

On NVIDIA GPUs, treat the VRAM numbers above as hard limits. The model needs to fit entirely in VRAM for full GPU acceleration. If it does not fit, Ollama offloads the remaining layers to system RAM and the CPU, which is significantly slower.

CPU-only machines can run Gemma 4, but expect roughly 1–3 tokens per second on e4b. Use e2b for better CPU performance.

Which model should you pick?

Start with the smallest model that fits your hardware comfortably, not the largest one that technically loads.

Under 16 GB RAM, or under 10 GB VRAM → start with gemma4:e2b
16 GB RAM or 10+ GB VRAM → gemma4:e4b is the right default
24+ GB unified memory or VRAM → gemma4:26b gives significantly better quality with MoE efficiency
32 GB+, quality matters most → gemma4:31b for coding, reasoning, and document-scale tasks

For most developers doing local experimentation, e4b is the right answer. Only move up after confirming the first run feels stable and responsive.
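The selection rules above can be written as a small function. This is an illustrative sketch only: pick_gemma4_tag and its parameters are hypothetical names, and the thresholds are copied from this guide rather than from any Ollama API. Pass the memory the model will actually run in: VRAM for a discrete GPU, unified memory on Apple Silicon, or system RAM on a CPU-only machine.

```python
def pick_gemma4_tag(memory_gb, discrete_gpu=False, quality_first=False):
    """Suggest a Gemma 4 tag from the thresholds listed in this guide.

    memory_gb     -- VRAM (discrete GPU) or RAM / unified memory otherwise
    discrete_gpu  -- True when memory_gb is dedicated GPU VRAM
    quality_first -- True to prefer the 31B model when 32 GB+ is available
    """
    if memory_gb >= 32 and quality_first:
        return "gemma4:31b"
    if memory_gb >= 24:
        return "gemma4:26b"
    # e4b needs 10 GB of VRAM, or 16 GB of RAM / unified memory.
    e4b_floor = 10 if discrete_gpu else 16
    if memory_gb >= e4b_floor:
        return "gemma4:e4b"
    return "gemma4:e2b"


if __name__ == "__main__":
    print(pick_gemma4_tag(16))                      # 16 GB laptop
    print(pick_gemma4_tag(12, discrete_gpu=True))   # 12 GB GPU
    print(pick_gemma4_tag(32, quality_first=True))  # 32 GB workstation
```

The function deliberately returns the smaller model at each boundary, matching the advice to start with what fits comfortably.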

Pull and run commands

Pull without running (recommended for large models):

ollama pull gemma4          # pulls e4b (default, 9.6 GB)
ollama pull gemma4:e2b      # 7.2 GB
ollama pull gemma4:26b      # 18 GB
ollama pull gemma4:31b      # 20 GB

Run interactively:

ollama run gemma4           # starts e4b
ollama run gemma4:e2b
ollama run gemma4:26b
ollama run gemma4:31b

Check what you have installed:

ollama list

Check which models are currently loaded in memory:

ollama ps
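The same information as ollama list is available from a script through Ollama's local HTTP API (introduced in the next section). The sketch below assumes the GET /api/tags endpoint and its "models" list with "name" and "size" (bytes) fields; the function names are hypothetical and only the standard library is used.

```python
import json
from urllib.request import urlopen


def fetch_installed_models(base_url="http://localhost:11434"):
    """Query the local Ollama server for installed models (server must be running)."""
    with urlopen(f"{base_url}/api/tags") as resp:
        return json.load(resp)


def summarize_models(tags_response):
    """Turn an /api/tags response into 'name<TAB>size' lines, size in decimal GB."""
    rows = []
    for entry in tags_response.get("models", []):
        size_gb = entry["size"] / 1e9
        rows.append(f"{entry['name']}\t{size_gb:.1f} GB")
    return rows


if __name__ == "__main__":
    for row in summarize_models(fetch_installed_models()):
        print(row)
```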

Using the local API

Ollama exposes a local REST API at http://localhost:11434 whenever the Ollama server is running; the model itself loads on the first request. You can call it from any HTTP client, with no cloud dependency and no API key.

curl (generate)

curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "prompt": "Explain the difference between MoE and dense transformer architectures.",
    "stream": false
  }'
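The request above sets "stream" to false to get one JSON object back. If you leave streaming on, /api/generate returns one JSON object per line, each carrying a fragment in "response" and a final object with "done" set to true. The following sketch shows one way to consume that stream with the standard library; the function names are hypothetical, and the field names are assumed from Ollama's generate API.

```python
import json
from urllib.request import Request, urlopen


def iter_stream_text(lines):
    """Yield text fragments from newline-delimited JSON chunks until done."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)  # accepts str or bytes
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            return


def stream_generate(prompt, model="gemma4", base_url="http://localhost:11434"):
    """Print a streamed completion as it arrives (needs a running server)."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    request = Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as resp:
        for fragment in iter_stream_text(resp):
            print(fragment, end="", flush=True)
    print()
```

Streaming matters most on slower hardware, where the first tokens can be shown while the rest is still being generated.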

curl (chat, OpenAI-compatible)

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "messages": [
      {"role": "user", "content": "Write a Python function to parse JSON safely."}
    ]
  }'

Python (ollama library)

from ollama import chat

response = chat(
    model='gemma4',
    messages=[{'role': 'user', 'content': 'What is mixture of experts?'}],
)
print(response.message.content)

Python (OpenAI SDK, drop-in compatible)

Because Ollama also serves an OpenAI-compatible API under /v1, you can point the official OpenAI SDK at your local instance:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK but unused by Ollama
)

response = client.chat.completions.create(
    model="gemma4",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a function to flatten a nested list in Python."}
    ]
)

print(response.choices[0].message.content)

JavaScript

import ollama from 'ollama'

const response = await ollama.chat({
  model: 'gemma4',
  messages: [{role: 'user', content: 'Hello!'}],
})
console.log(response.message.content)

What Gemma 4 can do that Gemma 3 could not

These are not incremental improvements — the benchmark gaps are substantial:

Benchmark	Gemma 4 31B	Gemma 4 E4B	Gemma 3 27B
AIME 2026 (math reasoning)	89.2%	42.5%	20.8%
LiveCodeBench v6 (coding)	80.0%	52.0%	29.1%
Codeforces Elo	2150	940	110
MMLU Pro (knowledge)	85.2%	69.4%	67.6%
GPQA Diamond (science)	84.3%	58.6%	42.4%

Beyond benchmarks, Gemma 4 adds capabilities that were absent from Gemma 3:

Native function calling — all four variants support structured tool use out of the box, returning valid JSON matching your schema
Thinking modes — you can enable or disable chain-of-thought reasoning per request using the <|think|> token in the system prompt
256K context on the 26B and 31B models (up from 128K in Gemma 3 27B)
Audio input on E2B and E4B — speech recognition and understanding alongside text and image
140+ languages natively supported
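To make the function-calling item concrete, here is a hedged sketch of a tool round trip with the ollama Python library. The get_weather tool, its stubbed return value, and the helper names are hypothetical. It assumes the library's chat() accepts a tools list of JSON-schema function definitions and returns calls under response.message.tool_calls, so check the library documentation for your installed version.

```python
import json

# JSON-schema description of one tool the model may call (hypothetical example).
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}


def get_weather(city):
    """Stub implementation; a real tool would call a weather service."""
    return {"city": city, "forecast": "sunny", "temp_c": 21}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(name, arguments):
    """Execute a tool requested by the model and return its result as JSON text."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"model requested unknown tool: {name}")
    if isinstance(arguments, str):  # some clients deliver arguments as a JSON string
        arguments = json.loads(arguments)
    return json.dumps(TOOL_REGISTRY[name](**arguments))


def ask_with_tools(question, model="gemma4"):
    """Send a question with the tool attached (needs `pip install ollama` and a running server)."""
    from ollama import chat

    response = chat(
        model=model,
        messages=[{"role": "user", "content": question}],
        tools=[WEATHER_TOOL],
    )
    calls = response.message.tool_calls or []
    return [run_tool_call(c.function.name, c.function.arguments) for c in calls]
```

Keeping the registry explicit means the model can only trigger functions you have listed, and an unexpected tool name fails loudly instead of being ignored.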

Thinking mode

Gemma 4 supports configurable chain-of-thought reasoning. To enable it, include the <|think|> token at the start of your system prompt:

from ollama import chat

response = chat(
    model='gemma4:31b',
    messages=[
        {
            'role': 'system',
            'content': '<|think|> Think step by step before answering.'
        },
        {
            'role': 'user',
            'content': 'What is the integral of x^2 from 0 to 3?'
        }
    ],
)
print(response.message.content)

To disable thinking, remove the <|think|> token from the system prompt. For E2B and E4B, thinking is fully off when the token is absent. For 26B and 31B, the model still generates the thought tags but with an empty thought block.

For simple lookups or casual chat, skip thinking. For math, complex coding, or document analysis, enable it — the quality difference is significant on the larger models.
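If you switch thinking on and off per request, a small helper keeps the token handling in one place. This is a sketch under the assumption stated above, that the <|think|> token at the start of the system prompt is what enables thinking; build_messages is a hypothetical name, not part of any library.

```python
THINK_TOKEN = "<|think|>"


def build_messages(user_prompt, think=False, system_prompt=""):
    """Build a chat message list, adding or removing the thinking token."""
    # Strip any existing token first so it is never duplicated or left behind.
    system_prompt = system_prompt.replace(THINK_TOKEN, "").strip()
    if think:
        system_prompt = f"{THINK_TOKEN} {system_prompt}".strip()
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


if __name__ == "__main__":
    # Thinking on for a math question, off for casual chat.
    print(build_messages("What is the integral of x^2 from 0 to 3?", think=True))
    print(build_messages("Hello!"))
```

The returned list can be passed directly as the messages argument in the chat examples earlier in this guide.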

Common errors and fixes

Error: gemma4:e4b requires a newer version of Ollama

Your Ollama build predates v0.20.0. Run the update command for your OS (see Prerequisite section above), then try again.

Out of memory / model fails to load

Run ollama ps to see how much memory each loaded model occupies and whether it is split between CPU and GPU. If the model is too large for your VRAM or unified memory, switch to a smaller tag. gemma4:e2b (7.2 GB) is the lightest official option.

Slow responses (1–5 tokens/second)

Speeds in this range usually mean the model is running on the CPU instead of the GPU. Run ollama ps to see which processor is in use, then check that your GPU drivers are current and that Ollama can see your GPU. On Apple Silicon, make sure you are on a recent Ollama build; MLX acceleration support was added in v0.20.0.
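The CPU/GPU split can also be checked from a script. The sketch below is an assumption-laden illustration: it relies on the GET /api/ps endpoint returning a "models" list whose entries carry "size" and "size_vram" in bytes, and the function names are hypothetical. Verify the field names against your Ollama version before depending on them.

```python
import json
from urllib.request import urlopen


def gpu_share(model_entry):
    """Fraction of a loaded model that sits in GPU memory (0.0 to 1.0)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return min(model_entry.get("size_vram", 0) / size, 1.0)


def describe_placement(model_entry):
    """Summarize where a loaded model is running."""
    share = gpu_share(model_entry)
    if share >= 1.0:
        return "100% GPU"
    if share <= 0.0:
        return "100% CPU"
    gpu_pct = round(share * 100)
    return f"{100 - gpu_pct}%/{gpu_pct}% CPU/GPU"


def fetch_loaded_models(base_url="http://localhost:11434"):
    """Return the list of currently loaded models (needs a running server)."""
    with urlopen(f"{base_url}/api/ps") as resp:
        return json.load(resp).get("models", [])
```

Anything other than "100% GPU" on a machine with a capable GPU suggests the model is too large for available VRAM or that the GPU is not being detected.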

Port 11434 already in use

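Another Ollama instance is running, or something else has taken the port. If it is another Ollama instance (for example the desktop app), use that one or quit it first. Otherwise, start the server on a different port:

OLLAMA_HOST=127.0.0.1:11435 ollama serve

Then update your API calls to use port 11435, and set the same OLLAMA_HOST value when you run ollama CLI commands so they reach the right server. Binding to 0.0.0.0 instead of 127.0.0.1 also exposes the API to other machines on your network, so use it only when that is what you want.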
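Client code can pick up a custom port automatically by reading the same environment variable. The helper below is a hypothetical sketch using only the standard library: it handles the host:port and full URL forms, falls back to port 11434 when none is given, and rewrites the 0.0.0.0 bind address to 127.0.0.1, since 0.0.0.0 is a listen address and not something a client should connect to. IPv6 addresses are not handled.

```python
import os
from urllib.parse import urlsplit

DEFAULT_BASE_URL = "http://localhost:11434"


def ollama_base_url(host_value=None):
    """Derive a client base URL from an OLLAMA_HOST-style value."""
    if host_value is None:
        host_value = os.environ.get("OLLAMA_HOST", "")
    raw = host_value.strip()
    if not raw:
        return DEFAULT_BASE_URL
    if "://" not in raw:
        raw = "http://" + raw  # bare host:port form
    parts = urlsplit(raw)
    host = parts.hostname or "localhost"
    if host == "0.0.0.0":
        host = "127.0.0.1"  # connect to loopback, not the bind-all address
    port = parts.port or 11434
    return f"{parts.scheme}://{host}:{port}"


if __name__ == "__main__":
    print(ollama_base_url())  # reflects OLLAMA_HOST if it is set
```

For the OpenAI SDK example earlier in this guide, append /v1 to the returned value and pass it as base_url.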

Responses cut off before completing

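The context window may be too small for your prompt. Ollama typically loads a model with a context window well below the 128K or 256K maximum listed in the tag table, so long prompts can be truncated. Increase it per request with num_ctx: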

curl http://localhost:11434/api/generate \
  -d '{
    "model": "gemma4",
    "prompt": "...",
    "options": {"num_ctx": 32768}
  }'

gemma4:26b barely fits but feels slow

The 26B model on a 24 GB machine leaves very little memory headroom. Other processes competing for memory will degrade performance significantly. Close other GPU-heavy applications, or drop to e4b if you need consistent responsiveness.

What to check before blaming the model

If the output quality feels worse than expected, run through this list before switching to a larger model:

Confirm you are on the model size you intended — ollama list shows what is installed
Check that GPU inference is active — ollama ps shows which processor is being used
Try enabling thinking mode if the task involves reasoning or math
Check that your context window is large enough for the full prompt
Use the recommended sampling settings: temperature=1.0, top_p=0.95, top_k=64
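The sampling settings and context size from this checklist can be bundled into one request body. This is a minimal sketch: build_generate_payload is a hypothetical helper, and it assumes the "options" keys temperature, top_p, top_k, and num_ctx as used by Ollama's /api/generate endpoint.

```python
import json

# Sampling values recommended in the checklist above.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}


def build_generate_payload(prompt, model="gemma4", num_ctx=None, **overrides):
    """Build an /api/generate request body with the recommended options."""
    options = dict(RECOMMENDED_SAMPLING)
    if num_ctx is not None:
        options["num_ctx"] = num_ctx  # context window in tokens
    options.update(overrides)  # explicit keyword arguments win
    return {"model": model, "prompt": prompt, "stream": False, "options": options}


if __name__ == "__main__":
    payload = build_generate_payload("Summarize this file.", num_ctx=32768)
    print(json.dumps(payload, indent=2))
```

The printed JSON can be used as the -d body of the curl commands shown earlier, which makes it easy to confirm that a quality problem is not simply a settings problem.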

In many cases, e4b with thinking mode enabled handles tasks that initially seemed to require 31b, so try that before downloading the larger model.

Next steps

If Ollama is not the right fit for your setup, two common alternatives:

LM Studio — a GUI-first local runtime, good if you prefer not working in the terminal
llama.cpp — more configuration control, better for CPU-heavy or constrained environments

If you want to try Gemma 4 without any local setup, Google AI Studio offers hosted access to the 31B and 26B models.