Gemma 4 Guides
How to Fine-Tune Gemma 4 with Unsloth: Step-by-Step Guide

Unsloth has day-zero support for all four Gemma 4 variants: E2B, E4B, 26B-A4B, and 31B. It trains models 2x faster and uses up to 70% less VRAM than standard Hugging Face training, which makes it the practical choice for anyone working on consumer hardware.
This guide covers everything from picking the right model variant through exporting your finished adapter for use in Ollama, llama.cpp, or LM Studio.
Which Gemma 4 model should you fine-tune?
The answer depends on your hardware and what you want to accomplish.
| Model | VRAM for training | Best for |
|---|---|---|
| E4B | ~8–10 GB (16-bit LoRA) | Laptops, RTX 3060/4060, free Colab |
| E2B | ~6–8 GB (16-bit LoRA) | Multimodal + audio tasks, tightest budgets |
| 26B-A4B | ~24+ GB (16-bit LoRA) | Speed/quality balance, RTX 3090/4090 |
| 31B | ~20 GB (QLoRA 4-bit) | Maximum quality, NVIDIA A100 / dual GPU |
A few notes that matter:
E4B is the right starting point for most people. It runs on free Google Colab (T4 GPU), fits on any RTX GPU with 12 GB or more, and supports both text and multimodal (image + audio) fine-tuning. Unsloth provides free Colab notebooks for it.
Avoid QLoRA on the 26B-A4B MoE. Because the 26B-A4B is a Mixture-of-Experts model, Unsloth recommends 16-bit LoRA instead of 4-bit QLoRA. The MoE routing and 4-bit quantization interact poorly. Use load_in_16bit=True rather than load_in_4bit=True for this variant.
The 31B is a good fine-tuning target when quality matters most. It is a dense model (QLoRA works well) and currently ranks #3 on the Arena AI text leaderboard among open models.
Before you start: install and update Unsloth
If you do not have Unsloth installed, start here. There are two ways to use it: Unsloth Studio (a no-code web UI) and Unsloth Core (the Python library for code-based workflows).
Option A: Unsloth Studio (no-code, recommended for beginners)
macOS, Linux, or WSL:
curl -fsSL https://unsloth.ai/install.sh | sh
Windows PowerShell:
irm https://unsloth.ai/install.ps1 | iex
Once installed, launch the Studio:
unsloth studio -H 0.0.0.0 -p 8888
Then open http://localhost:8888 in your browser. On first launch you will create a password, then use the wizard to search for Gemma 4, select a model size, and pick a dataset. Unsloth Studio handles the rest; you monitor training progress from the UI and export when done.
Option B: Unsloth Core (code-based)
If you already have an existing Unsloth installation, update it first:
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
For a fresh install with auto-detected PyTorch backend:
pip install unsloth --torch-backend=auto
Preparing your dataset
Gemma 4 uses the standard conversational format with user and model roles, not the older Gemma-specific formats from Gemma 1/2/3. Your training data should look like this:
{
  "messages": [
    {
      "role": "user",
      "content": "Classify the sentiment of this review: 'Shipping was late but the product is excellent.'"
    },
    {
      "role": "model",
      "content": "Mixed: negative sentiment toward shipping, positive toward the product."
    }
  ]
}
A few formatting rules specific to Gemma 4:
- Use "role": "model" (not "assistant") for Gemma 4 responses. This matches the tokenizer's chat template.
- Native system prompt support is new in Gemma 4. You can include a {"role": "system", "content": "..."} message at the start of each conversation.
- To enable thinking mode during training, put <|think|> at the start of your system prompt. If you want to preserve the model's reasoning ability, mix reasoning-style examples with direct answers; Unsloth recommends keeping at least 75% reasoning examples if you care about that capability.
- For multi-turn conversations, only include the final visible answer in the training target. Do not feed earlier thought blocks back into subsequent turns.
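To make the schema concrete, here is a small helper that assembles one training record in this format. The function name and signature are hypothetical (not part of Unsloth or any Gemma 4 tooling); it only illustrates the role names and the <|think|> convention described above.

```python
# Sketch: build one Gemma 4-style training record.
# Hypothetical helper, shown only to illustrate the schema.
def to_gemma4_example(user_text, model_text, system_text=None, thinking=False):
    messages = []
    if system_text is not None:
        # Gemma 4 has native system prompt support; putting <|think|> at the
        # start of the system prompt enables thinking mode during training.
        prefix = "<|think|>" if thinking else ""
        messages.append({"role": "system", "content": prefix + system_text})
    messages.append({"role": "user", "content": user_text})
    # Note the "model" role, not "assistant".
    messages.append({"role": "model", "content": model_text})
    return {"messages": messages}

example = to_gemma4_example(
    "Classify: 'Great battery, weak camera.'",
    "Mixed: positive on battery, negative on camera.",
    system_text="You are a sentiment classifier.",
    thinking=True,
)
```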
Dataset size guidelines:
- Style or tone fine-tuning: 200–1,000 high-quality examples
- Domain adaptation (medical, legal, technical): 10,000–50,000 examples
- Instruction following: 5,000–20,000 examples
- Classification or extraction tasks: 500–5,000 examples
Always hold out 10–20% of your data for evaluation. One noisy or mislabeled example can undo the benefit of dozens of clean ones.
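A minimal way to carve out that evaluation split, in plain Python with a seeded shuffle so the split is reproducible (if you are already using the datasets library, Dataset.train_test_split does the same job):

```python
import random

# Sketch: hold out a fraction of examples for evaluation.
# A seeded shuffle keeps the split reproducible across runs.
def holdout_split(examples, eval_fraction=0.1, seed=3407):
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_eval = max(1, int(len(examples) * eval_fraction))
    return examples[n_eval:], examples[:n_eval]  # (train, eval)

data = [{"id": i} for i in range(100)]
train_set, eval_set = holdout_split(data, eval_fraction=0.15)
```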
Text fine-tuning: complete code walkthrough
This example fine-tunes Gemma 4 E4B with LoRA for a text task. Replace the dataset URL with your own data source.
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

max_seq_length = 2048  # Start conservative. Scale up once the pipeline works.

# Load your dataset: it needs a "text" column, or use a chat-formatted dataset
dataset = load_dataset("json", data_files={"train": "your_dataset.jsonl"}, split="train")

# Load the model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "google/gemma-4-e4b-it",
    max_seq_length = max_seq_length,
    load_in_4bit = False,   # For E4B and 31B dense: QLoRA (4-bit) also works
    load_in_16bit = True,   # bf16 LoRA: recommended starting point
    full_finetuning = False,
)

# Attach LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # LoRA rank: higher = more capacity, more VRAM
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # Required for long context or tight VRAM
    random_state = 3407,
    max_seq_length = max_seq_length,
)

# Train
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    tokenizer = tokenizer,
    args = SFTConfig(
        max_seq_length = max_seq_length,
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 100,  # Replace with num_train_epochs for a real run
        logging_steps = 1,
        output_dir = "outputs_gemma4",
        optim = "adamw_8bit",
        seed = 3407,
        dataset_num_proc = 1,
    ),
)
trainer.train()
Running out of memory? Two things to try first: drop per_device_train_batch_size to 1, and reduce max_seq_length. Keep use_gradient_checkpointing="unsloth"; it is specifically designed to reduce VRAM usage and extend the effective context length during training.
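One knob worth understanding before changing these settings: the batch size the optimizer effectively sees is per_device_train_batch_size times gradient_accumulation_steps, so dropping the per-device batch to 1 does not have to shrink the effective batch if you raise accumulation to compensate. Plain arithmetic, nothing Unsloth-specific:

```python
# Effective batch size = per-device batch size x gradient accumulation steps.
# The config above (1 x 4) trains with an effective batch of 4; halving
# max_seq_length roughly halves activation memory per step.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
effective_batch = per_device_train_batch_size * gradient_accumulation_steps

max_seq_length = 2048
tokens_per_optimizer_step = effective_batch * max_seq_length
```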
Fine-tuning the 26B-A4B MoE model
The MoE model needs a slightly different loader. Use FastModel instead of FastLanguageModel, and keep 16-bit LoRA:
import os
import torch
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/Gemma-4-26B-A4B-it",
    max_seq_length = 2048,
    load_in_4bit = False,   # QLoRA not recommended for MoE
    load_in_16bit = True,   # bf16 LoRA
    full_finetuning = False,
)
Once loaded, attach LoRA adapters and train the same way as the E4B example above. For MoE fine-tuning, Unsloth recommends starting with rank 16 and shorter context lengths, then scaling up only after your pipeline is stable.
Multimodal fine-tuning (E2B and E4B)
E2B and E4B are the Gemma 4 variants designed for multimodal tasks; they natively process images and audio. If your fine-tuning task involves images, use FastVisionModel:
from unsloth import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
model, tokenizer = FastVisionModel.from_pretrained(
    model_name = "google/gemma-4-e4b-it",
    max_seq_length = 2048,
    load_in_4bit = False,
    use_gradient_checkpointing = "unsloth",
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers = False,  # Start with text-only to save VRAM
    finetune_language_layers = True,
    finetune_attention_modules = True,
    finetune_mlp_modules = True,
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    target_modules = "all-linear",
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = UnslothVisionDataCollator(model, tokenizer),
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        learning_rate = 2e-4,
        output_dir = "outputs_gemma4_vision",
    ),
)
trainer.train()
Important: In Gemma 4 multimodal prompts, the image must come before the text instruction in the message content list. The format looks like this:
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image", "image": "path/to/image.jpg"},
        {"type": "text", "text": "Describe what is happening in this image."}
      ]
    },
    {
      "role": "model",
      "content": "..."
    }
  ]
}
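If you build these records programmatically, it is easy to get the ordering wrong. A small helper that enforces image-before-text (hypothetical, illustrative only):

```python
# Sketch: build a multimodal user turn, guaranteeing the image entry
# precedes the text entry as Gemma 4 expects. Hypothetical helper.
def multimodal_user_turn(image_path, instruction):
    return {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},  # image first
            {"type": "text", "text": instruction},   # text second
        ],
    }

turn = multimodal_user_turn("path/to/image.jpg", "Describe this image.")
```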
For audio fine-tuning (E2B / E4B only): keep audio clips short and task-specific. Unsloth recommends following their Vision RL notebook examples for audio workflows.
Saving and exporting your fine-tuned model
After training, you have several export paths depending on where you plan to deploy.
Save as LoRA adapter (fastest, smallest)
model.save_pretrained("gemma4_e4b_adapter")
tokenizer.save_pretrained("gemma4_e4b_adapter")
This saves only the adapter weights (a few hundred MB rather than the full model). You need the base model at inference time.
Merge and export to GGUF (for llama.cpp / Ollama / LM Studio)
model.save_pretrained_gguf(
    "gemma4_e4b_finetuned",
    tokenizer,
    quantization_method = "q4_k_m",  # or "q8_0", "f16"
)
The resulting .gguf file can be loaded directly in llama.cpp, imported into Ollama with a custom Modelfile, or loaded in LM Studio.
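For Ollama specifically, the import boils down to a one-line Modelfile pointing at the exported GGUF. A minimal sketch; the exact .gguf filename depends on what save_pretrained_gguf wrote to your output directory, so treat the path below as a placeholder:

```
FROM ./gemma4_e4b_finetuned.gguf
```

Then run ollama create gemma4-finetuned -f Modelfile and test it with ollama run gemma4-finetuned.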
Export from Unsloth Studio
If you used the Studio UI, go to the Export tab after training completes. Select GGUF, safetensors, or both. The Studio handles merging automatically.
One thing to watch: if your exported model behaves unexpectedly in another runtime (Ollama, llama.cpp), the most common cause is a mismatched chat template or EOS token. Always use the same chat template at inference that you used during training.
Free Colab notebooks
Unsloth provides free Google Colab notebooks for all Gemma 4 sizes. The E2B/E4B notebooks run on a free T4 GPU. The 26B-A4B and 31B notebooks require an A100 (Colab Pro).
| Model | Task | Link |
|---|---|---|
| E2B | Text | Open in Colab |
| E2B | Vision | Open in Colab |
| E2B | Audio | Open in Colab |
| 26B-A4B | Vision (A100) | Open in Colab |
| 31B | Vision (A100) | Open in Colab |
Common issues
Out of memory during training. Reduce max_seq_length first, then reduce per_device_train_batch_size to 1. Make sure use_gradient_checkpointing="unsloth" is set; it is Unsloth's own checkpointing implementation, designed to extend context length and cut VRAM, not a generic flag.
Using QLoRA on 26B-A4B. The MoE architecture and 4-bit quantization interact poorly. Stick to 16-bit LoRA (load_in_16bit=True) for the MoE model.
Chat template mismatch after export. If your exported model responds incorrectly in Ollama or llama.cpp, check that the inference runtime is using the same Gemma 4 chat template you trained with. This is the most common cause of degraded post-export behavior.
Reasoning ability degraded after fine-tuning. If you want to preserve Gemma 4's built-in reasoning capability, mix reasoning-style examples with direct-answer examples in your training data. Unsloth recommends at least 75% reasoning examples to maintain the thinking behavior.
For the official training reference, see Unsloth documentation.
Related guides
Continue through the Gemma 4 cluster with the next guide that matches your current decision.

Gemma 4 Unsloth Guide: When It Makes Sense and What to Watch
Use this guide to understand where Unsloth fits into a Gemma 4 workflow and what to decide before you jump into tuning.

Gemma 4 GGUF Download Guide: Safe Sources, Quant Tips, and Local Setup
Use this Gemma 4 GGUF download guide to pick a trusted source, choose the right file, and get from download to first local response with less guesswork.

How to Run Gemma 4 with llama.cpp: GGUF Setup, Hardware & Quantization Guide
Everything you need to get Gemma 4 running locally with llama.cpp: hardware tables, copy-paste build commands, quantization guide, and multimodal setup.
Still deciding what to read next?
Go back to the guide hub to browse model comparisons, setup walkthroughs, and hardware planning pages.
