Gemma 4 Guides
How to Fine-Tune Gemma 4 with Unsloth: Step-by-Step Guide

Unsloth has day-zero support for all four Gemma 4 variants: E2B, E4B, 26B-A4B, and 31B. It trains models 2x faster and uses up to 70% less VRAM than standard Hugging Face training, which makes it the practical choice for anyone working on consumer hardware.
This guide covers everything from picking the right model variant through exporting your finished adapter for use in Ollama, llama.cpp, or LM Studio.
Which Gemma 4 model should you fine-tune?
The answer depends on your hardware and what you want to accomplish.
| Model | VRAM for training | Best for |
|---|---|---|
| E4B | ~8–10 GB (16-bit LoRA) | Laptops, RTX 3060/4060, free Colab |
| E2B | ~6–8 GB (16-bit LoRA) | Multimodal + audio tasks, tightest budgets |
| 26B-A4B | ~24+ GB (16-bit LoRA) | Speed/quality balance, RTX 3090/4090 |
| 31B | ~20 GB (QLoRA 4-bit) | Maximum quality, NVIDIA A100 / dual GPU |
A few notes that matter:
E4B is the right starting point for most people. It runs on free Google Colab (T4 GPU), fits on any RTX GPU with 12 GB or more, and supports both text and multimodal (image + audio) fine-tuning. Unsloth provides free Colab notebooks for it.
Avoid QLoRA on the 26B-A4B MoE. Because the 26B-A4B is a Mixture-of-Experts model, Unsloth recommends 16-bit LoRA instead of 4-bit QLoRA. The MoE routing and 4-bit quantization interact poorly. Use load_in_16bit=True rather than load_in_4bit=True for this variant.
The 31B is a good fine-tuning target when quality matters most. It is a dense model (QLoRA works well) and currently ranks #3 on the Arena AI text leaderboard among open models.
Before you start: install and update Unsloth
If you do not have Unsloth installed, start here. There are two ways to use it: Unsloth Studio (a no-code web UI) and Unsloth Core (the Python library for code-based workflows).
Option A: Unsloth Studio (no-code, recommended for beginners)
macOS, Linux, or WSL:
curl -fsSL https://unsloth.ai/install.sh | sh
Windows PowerShell:
irm https://unsloth.ai/install.ps1 | iex
Once installed, launch the Studio:
unsloth studio -H 0.0.0.0 -p 8888
Then open http://localhost:8888 in your browser. On first launch you will create a password, then use the wizard to search for Gemma 4, select a model size, and pick a dataset. Unsloth Studio handles the rest; you monitor training progress from the UI and export when done.
Option B: Unsloth Core (code-based)
If you already have an existing Unsloth installation, update it first:
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
For a fresh install with auto-detected PyTorch backend:
pip install unsloth --torch-backend=auto
Preparing your dataset
Gemma 4 uses the standard conversational format with user and model roles, not the older Gemma-specific formats from Gemma 1/2/3. Your training data should look like this:
{
  "messages": [
    {
      "role": "user",
      "content": "Classify the sentiment of this review: 'Shipping was late but the product is excellent.'"
    },
    {
      "role": "model",
      "content": "Mixed: negative sentiment toward shipping, positive toward the product."
    }
  ]
}
A few formatting rules specific to Gemma 4:
- Use "role": "model" (not "assistant") for Gemma 4 responses. This matches the tokenizer's chat template.
- Native system prompt support is new in Gemma 4. You can include a {"role": "system", "content": "..."} message at the start of each conversation.
- To enable thinking mode during training, put <|think|> at the start of your system prompt. If you want to preserve the model's reasoning ability, mix reasoning-style examples with direct answers; Unsloth recommends keeping at least 75% reasoning examples if you care about that capability.
- For multi-turn conversations, only include the final visible answer in the training target. Do not feed earlier thought blocks back into subsequent turns.
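To make the schema concrete, here is a small helper that assembles one training record in this format. The function name and signature are hypothetical (not part of Unsloth or any Gemma 4 tooling); it only illustrates the role names and the <|think|> convention described above.

```python
# Sketch: build one Gemma 4-style training record.
# Hypothetical helper, shown only to illustrate the schema.
def to_gemma4_example(user_text, model_text, system_text=None, thinking=False):
    messages = []
    if system_text is not None:
        # Gemma 4 has native system prompt support; putting <|think|> at the
        # start of the system prompt enables thinking mode during training.
        prefix = "<|think|>" if thinking else ""
        messages.append({"role": "system", "content": prefix + system_text})
    messages.append({"role": "user", "content": user_text})
    # Note the "model" role, not "assistant".
    messages.append({"role": "model", "content": model_text})
    return {"messages": messages}

example = to_gemma4_example(
    "Classify: 'Great battery, weak camera.'",
    "Mixed: positive on battery, negative on camera.",
    system_text="You are a sentiment classifier.",
    thinking=True,
)
```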
Dataset size guidelines:
- Style or tone fine-tuning: 200–1,000 high-quality examples
- Domain adaptation (medical, legal, technical): 10,000–50,000 examples
- Instruction following: 5,000–20,000 examples
- Classification or extraction tasks: 500–5,000 examples
Always hold out 10–20% of your data for evaluation. One noisy or mislabeled example can undo the benefit of dozens of clean ones.
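A minimal way to carve out that evaluation split, in plain Python with a seeded shuffle so the split is reproducible (if you are already using the datasets library, Dataset.train_test_split does the same job):

```python
import random

# Sketch: hold out a fraction of examples for evaluation.
# A seeded shuffle keeps the split reproducible across runs.
def holdout_split(examples, eval_fraction=0.1, seed=3407):
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_eval = max(1, int(len(examples) * eval_fraction))
    return examples[n_eval:], examples[:n_eval]  # (train, eval)

data = [{"id": i} for i in range(100)]
train_set, eval_set = holdout_split(data, eval_fraction=0.15)
```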
Text fine-tuning: complete code walkthrough
This example fine-tunes Gemma 4 E4B with LoRA for a text task. Replace the dataset URL with your own data source.
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

max_seq_length = 2048  # Start conservative. Scale up once the pipeline works.

# Load your dataset: it needs a "text" column, or use a chat-formatted dataset
dataset = load_dataset("json", data_files={"train": "your_dataset.jsonl"}, split="train")

# Load the model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "google/gemma-4-e4b-it",
    max_seq_length = max_seq_length,
    load_in_4bit = False,   # For E4B and 31B dense: QLoRA (4-bit) also works
    load_in_16bit = True,   # bf16 LoRA: recommended starting point
    full_finetuning = False,
)

# Attach LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # LoRA rank: higher = more capacity, more VRAM
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # Required for long context or tight VRAM
    random_state = 3407,
    max_seq_length = max_seq_length,
)

# Train
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    tokenizer = tokenizer,
    args = SFTConfig(
        max_seq_length = max_seq_length,
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 100,  # Replace with num_train_epochs for a real run
        logging_steps = 1,
        output_dir = "outputs_gemma4",
        optim = "adamw_8bit",
        seed = 3407,
        dataset_num_proc = 1,
    ),
)
trainer.train()
Running out of memory? Two things to try first: drop per_device_train_batch_size to 1, and reduce max_seq_length. Keep use_gradient_checkpointing="unsloth"; it is specifically designed to reduce VRAM usage and extend the effective context length during training.
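One knob worth understanding before changing these settings: the batch size the optimizer effectively sees is per_device_train_batch_size times gradient_accumulation_steps, so dropping the per-device batch to 1 does not have to shrink the effective batch if you raise accumulation to compensate. Plain arithmetic, nothing Unsloth-specific:

```python
# Effective batch size = per-device batch size x gradient accumulation steps.
# The config above (1 x 4) trains with an effective batch of 4; halving
# max_seq_length roughly halves activation memory per step.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
effective_batch = per_device_train_batch_size * gradient_accumulation_steps

max_seq_length = 2048
tokens_per_optimizer_step = effective_batch * max_seq_length
```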
Fine-tuning the 26B-A4B MoE model
The MoE model needs a slightly different loader. Use FastModel instead of FastLanguageModel, and keep 16-bit LoRA:
import os
import torch
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/Gemma-4-26B-A4B-it",
    max_seq_length = 2048,
    load_in_4bit = False,   # QLoRA not recommended for MoE
    load_in_16bit = True,   # bf16 LoRA
    full_finetuning = False,
)
Once loaded, attach LoRA adapters and train the same way as the E4B example above. For MoE fine-tuning, Unsloth recommends starting with rank 16 and shorter context lengths, then scaling up only after your pipeline is stable.
Multimodal fine-tuning (E2B and E4B)
E2B and E4B are the Gemma 4 variants designed for multimodal tasks; they natively process images and audio. If your fine-tuning task involves images, use FastVisionModel:
from unsloth import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
model, tokenizer = FastVisionModel.from_pretrained(
    model_name = "google/gemma-4-e4b-it",
    max_seq_length = 2048,
    load_in_4bit = False,
    use_gradient_checkpointing = "unsloth",
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers = False,  # Start with text-only to save VRAM
    finetune_language_layers = True,
    finetune_attention_modules = True,
    finetune_mlp_modules = True,
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    target_modules = "all-linear",
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = UnslothVisionDataCollator(model, tokenizer),
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        learning_rate = 2e-4,
        output_dir = "outputs_gemma4_vision",
    ),
)
trainer.train()
Important: In Gemma 4 multimodal prompts, the image must come before the text instruction in the message content list. The format looks like this:
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image", "image": "path/to/image.jpg"},
        {"type": "text", "text": "Describe what is happening in this image."}
      ]
    },
    {
      "role": "model",
      "content": "..."
    }
  ]
}
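If you build these records programmatically, it is easy to get the ordering wrong. A small helper that enforces image-before-text (hypothetical, illustrative only):

```python
# Sketch: build a multimodal user turn, guaranteeing the image entry
# precedes the text entry as Gemma 4 expects. Hypothetical helper.
def multimodal_user_turn(image_path, instruction):
    return {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},  # image first
            {"type": "text", "text": instruction},   # text second
        ],
    }

turn = multimodal_user_turn("path/to/image.jpg", "Describe this image.")
```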
For audio fine-tuning (E2B / E4B only): keep audio clips short and task-specific. Unsloth recommends following their Vision RL notebook examples for audio workflows.
Saving and exporting your fine-tuned model
After training, you have several export paths depending on where you plan to deploy.
Save as LoRA adapter (fastest, smallest)
model.save_pretrained("gemma4_e4b_adapter")
tokenizer.save_pretrained("gemma4_e4b_adapter")
This saves only the adapter weights (a few hundred MB rather than the full model). You need the base model at inference time.
Merge and export to GGUF (for llama.cpp / Ollama / LM Studio)
model.save_pretrained_gguf(
    "gemma4_e4b_finetuned",
    tokenizer,
    quantization_method = "q4_k_m",  # or "q8_0", "f16"
)
The resulting .gguf file can be loaded directly in llama.cpp, imported into Ollama with a custom Modelfile, or loaded in LM Studio.
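For Ollama specifically, the import boils down to a one-line Modelfile pointing at the exported GGUF. A minimal sketch; the exact .gguf filename depends on what save_pretrained_gguf wrote to your output directory, so treat the path below as a placeholder:

```
FROM ./gemma4_e4b_finetuned.gguf
```

Then run ollama create gemma4-finetuned -f Modelfile and test it with ollama run gemma4-finetuned.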
Export from Unsloth Studio
If you used the Studio UI, go to the Export tab after training completes. Select GGUF, safetensors, or both. The Studio handles merging automatically.
One thing to watch: if your exported model behaves unexpectedly in another runtime (Ollama, llama.cpp), the most common cause is a mismatched chat template or EOS token. Always use the same chat template at inference that you used during training.
Free Colab notebooks
Unsloth provides free Google Colab notebooks for all Gemma 4 sizes. The E2B/E4B notebooks run on a free T4 GPU. The 26B-A4B and 31B notebooks require an A100 (Colab Pro).
| Model | Task | Link |
|---|---|---|
| E2B | Text | Open in Colab |
| E2B | Vision | Open in Colab |
| E2B | Audio | Open in Colab |
| 26B-A4B | Vision (A100) | Open in Colab |
| 31B | Vision (A100) | Open in Colab |
Common issues
Out of memory during training. Reduce max_seq_length first, then reduce per_device_train_batch_size to 1. Make sure use_gradient_checkpointing="unsloth" is set; it is Unsloth's own checkpointing implementation, designed to extend context length and cut VRAM, not a generic flag.
Using QLoRA on 26B-A4B. The MoE architecture and 4-bit quantization interact poorly. Stick to 16-bit LoRA (load_in_16bit=True) for the MoE model.
Chat template mismatch after export. If your exported model responds incorrectly in Ollama or llama.cpp, check that the inference runtime is using the same Gemma 4 chat template you trained with. This is the most common cause of degraded post-export behavior.
Reasoning ability degraded after fine-tuning. If you want to preserve Gemma 4's built-in reasoning capability, mix reasoning-style examples with direct-answer examples in your training data. Unsloth recommends at least 75% reasoning examples to maintain the thinking behavior.
For the official training reference, see Unsloth documentation.
Related guides
Continue through the Gemma 4 cluster with the next guide that matches your current decision.

Gemma 4 Unsloth Guide: When It Makes Sense and What to Watch
Use this guide to understand where Unsloth fits into a Gemma 4 workflow and what to decide before you jump into tuning.

Gemma 4 GGUF Download Guide: Safe Sources, Quant Tips, and Local Setup
Use this Gemma 4 GGUF download guide to pick a trusted source, choose the right file, and get from download to first local response with less guesswork.

How to Run Gemma 4 with llama.cpp: GGUF Setup, Hardware & Quantization Guide
Everything you need to get Gemma 4 running locally with llama.cpp: hardware tables, copy-paste build commands, quantization guide, and multimodal setup.
Still deciding what to read next?
Go back to the guide hub to browse model comparisons, setup walkthroughs, and hardware planning pages.
