Gemma 3

Google's open, multimodal, multilingual long-context model family.

Context window

128K tokens (32K for the 1B variant)

Input / 1M tokens

Free

Output / 1M tokens

Free

Provider

Google DeepMind

Open-weight model under the Gemma license - free to download and self-host. Running costs depend on your own hardware or inference provider pricing (e.g., Together AI, Ollama, Hugging Face). · Data verified 2026-07-02

Gemma 3 is Google's open-model family released March 12, 2025, in 1B, 4B, 12B, and 27B sizes (a 270M variant was added later). The 4B, 12B, and 27B models are multimodal, accepting image and text input (and video for larger sizes) with text output, while the 1B is text-only. It supports a 128K-token context window (32K for 1B), understands over 140 languages, and adds improved math, reasoning, structured outputs, and function calling. It is released under the Gemma license and is widely used for local and on-prem deployment.
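As a rough illustration, the variant details above can be captured in a small lookup table for choosing a checkpoint programmatically. This is a sketch under stated assumptions: `GemmaVariant` and `pick_variants` are hypothetical names invented here, and "128K" and "32K" are assumed to mean 131,072 and 32,768 tokens. The 270M variant is left out because its context window is not stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GemmaVariant:
    name: str            # size label as used in this article
    context_tokens: int  # assumed: "K" means 1,024 tokens
    accepts_images: bool


# Values transcribed from the description above, not read from any official config.
VARIANTS = [
    GemmaVariant("1B", 32 * 1024, accepts_images=False),
    GemmaVariant("4B", 128 * 1024, accepts_images=True),
    GemmaVariant("12B", 128 * 1024, accepts_images=True),
    GemmaVariant("27B", 128 * 1024, accepts_images=True),
]


def pick_variants(prompt_tokens: int, needs_images: bool) -> list[str]:
    """Return the size labels whose stated limits cover the request."""
    return [
        v.name
        for v in VARIANTS
        if prompt_tokens <= v.context_tokens
        and (v.accepts_images or not needs_images)
    ]


print(pick_variants(prompt_tokens=20_000, needs_images=False))  # ['1B', '4B', '12B', '27B']
print(pick_variants(prompt_tokens=50_000, needs_images=False))  # ['4B', '12B', '27B']
```

A table like this only filters on the limits listed in this article; quality and hardware fit still need to be judged separately.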

Capability index

Relative estimates (0-100) to place this model against its peers, grounded in published benchmarks.

Categories scored: Coding, Reasoning, Math, Multimodal, Long context, Speed, Cost efficiency

How to access it

Download the open weights from Hugging Face (e.g., google/gemma-3-27b-it) or Kaggle and run via Transformers, Ollama, or vLLM. Also available through hosted inference providers.
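For the Ollama route, a locally running server can be called over HTTP. The sketch below assumes Ollama is listening on its default local address (`http://localhost:11434`) and that the `gemma3:27b` tag has already been pulled. `build_chat_request` and `chat` are hypothetical helper names, and only the Python standard library is used.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local address


def build_chat_request(prompt: str, model: str = "gemma3:27b") -> urllib.request.Request:
    """Build (but do not send) a non-streaming chat request for a local Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str = "gemma3:27b") -> str:
    """Send the request and return the reply text; needs a running Ollama server."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        body = json.load(resp)
    return body["message"]["content"]


# Usage, with the server running:
#   print(chat("Describe transformers in one line."))
```

Keeping request construction separate from sending makes the payload easy to inspect without a server; hosted providers will use different URLs and authentication.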

Strengths

✓Open weights, self-hostable across a range of sizes (1B-27B)
✓Multimodal (image + text) on 4B and larger
✓128K-token context window
✓Strong multilingual coverage (140+ languages)
✓Function calling and structured outputs

Best for

Local and on-prem multimodal assistants
Multilingual applications
Fine-tuning on custom data
Privacy-sensitive deployments

When to choose it (and when not to)

Reach for Gemma 3 when...

→You need an open, self-hostable model with a size to match your hardware
→You run multilingual or multimodal tasks on-prem
→Your deployment is privacy-sensitive or offline
→You plan to fine-tune on your own data

Look elsewhere if...

✕You need frontier-level quality (superseded by Gemma 4 and hosted frontier models)
✕You want a fully managed API without running infrastructure
✕You need very hard reasoning that the smaller variants can't handle

How to use it

›Pick the smallest size that meets your quality bar to save compute (e.g., 4B for edge, 27B for quality)
›Use the '-it' instruction-tuned checkpoints for chat
›Apply the Gemma chat template for correct turn formatting
›Quantize (4-8 bit) to fit smaller GPUs
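To make the sizing and quantization advice concrete, weight memory can be estimated as parameters times bits per weight, divided by eight. The sketch below is an approximation under stated assumptions: it treats the size labels as exact parameter counts, and it ignores the KV cache, activations, and runtime overhead, so real usage will be higher. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Size labels are treated as exact parameter counts, which is an assumption.
for size in (1, 4, 12, 27):
    row = ", ".join(
        f"{bits}-bit: {weight_memory_gb(size, bits):.1f} GB" for bits in (16, 8, 4)
    )
    print(f"{size}B -> {row}")
```

Under these assumptions the 27B weights shrink from about 54 GB at 16-bit to about 13.5 GB at 4-bit, which is the kind of gap that decides whether a model fits on a single GPU.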

Quickstart

Python

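from transformers import pipeline

# The multimodal -it checkpoints load under the "image-text-to-text" task, even for text-only prompts.
pipe = pipeline("image-text-to-text", model="google/gemma-3-27b-it", device_map="auto")
messages = [{"role": "user", "content": [{"type": "text", "text": "Describe transformers in one line."}]}]
# Chat input returns the whole conversation; the last message holds the model's reply.
output = pipe(text=messages, max_new_tokens=128)
print(output[0]["generated_text"][-1]["content"])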

Install `transformers` (4.50 or later) and `accelerate`, accept the Gemma license on Hugging Face, and run `huggingface-cli login`. Alternatively, run the model locally with `ollama run gemma3:27b`.

API model id: google/gemma-3-27b-it
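The pipeline call above applies the Gemma chat template automatically. For illustration only, the turn layout it produces can be sketched by hand as below. `format_gemma_chat` is a hypothetical helper that assumes plain-string message contents and only user and assistant roles; in real code, the tokenizer's `apply_chat_template` method is the safer choice because it always matches the checkpoint.

```python
def format_gemma_chat(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Hand-rolled approximation of Gemma's turn format for text-only chats."""
    parts = ["<bos>"]
    for message in messages:
        # Gemma labels assistant turns "model" rather than "assistant".
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n")
    if add_generation_prompt:
        # An open model turn cues the model to write its reply next.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


print(format_gemma_chat([{"role": "user", "content": "Describe transformers in one line."}]))
```

Seeing the raw layout helps when debugging odd completions, since a missing or malformed turn marker is a common cause of rambling output.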

Benchmarks

Benchmark	Score	Notes
MATH (27B)	89%	Per Google DeepMind Gemma 3 page
MMMU (27B, multimodal)	64.9%	Per Google DeepMind Gemma 3 page

Source: Google DeepMind - Gemma 3

Compare Gemma 3 with

Gemma 3 vs Llama 4

Meta - Up to 10M tokens (Scout); ~1M tokens (Maverick) ctx

Gemma 3 vs Qwen 3

Alibaba (Qwen Team) - 128K tokens (32K for 0.6B/1.7B/4B dense variants) ctx

Gemma 3 vs Mistral Large

Mistral AI - 128K tokens ctx
