Distillation

Knowledge distillation transfers capability from a large "teacher" model to a smaller "student" model. Rather than learning only from ground-truth labels, the student is trained to match the teacher's output distribution: the full probability vector over tokens, not just the top-1 prediction. This richer training signal helps the student generalize better than it would if trained from scratch on the same data.
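Matching the teacher's output distribution can be sketched as a loss function. The version below is a minimal illustration, not any particular lab's recipe: it assumes the common formulation in which both sets of logits are softened with a temperature and compared with KL divergence. The temperature parameter, the T^2 scaling, and the function names are assumptions introduced here for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.

    Assumption: the loss is scaled by temperature**2, a common convention
    that keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution (the target)
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Toy logits over a three-token vocabulary (made-up numbers).
teacher = [4.0, 1.0, 0.5]
student = [2.0, 2.0, 1.0]
print(distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so the student is penalized for misranking the teacher's second and third choices as well as the top one.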

Why distillation matters in practice

Many smaller models that punch above their weight were distilled from larger ones. GPT-4o mini, Gemini 2.5 Flash, Claude 3.5 Haiku, and most sub-10B coding models are reported or widely believed to use some form of distillation. The result is a model that can be 10-50x cheaper to run while retaining much of the larger model's capability on targeted tasks.

Distillation has become a standard practice in production ML. When Anthropic alleged in 2024 that Alibaba had distilled Claude's capabilities without permission, it underscored how valuable distilled models are to enterprises and how seriously AI labs treat unauthorized model extraction.

Distillation approaches

Response distillation: Generate teacher outputs on a dataset; train the student to reproduce them. Simple and effective.
Feature distillation: Match intermediate layer activations. More complex but can transfer richer representations.
Reasoning distillation: The teacher generates chain-of-thought reasoning; the student learns to produce similar reasoning traces. Used to train smaller reasoning-focused models.
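The three approaches above differ mainly in what the student is asked to match. The sketch below illustrates each under simplifying assumptions: `teacher_generate` and `teacher_reason` are hypothetical callables standing in for a real teacher model, the "Answer:" target format is invented for the example, and the feature loss assumes teacher and student activations already have the same dimension (in practice a learned projection is often needed).

```python
from typing import Callable, Dict, List, Sequence, Tuple

def build_response_dataset(
    prompts: Sequence[str],
    teacher_generate: Callable[[str], str],  # hypothetical teacher interface
) -> List[Dict[str, str]]:
    """Response distillation: the teacher's outputs become the training targets."""
    return [{"prompt": p, "target": teacher_generate(p)} for p in prompts]

def build_reasoning_dataset(
    prompts: Sequence[str],
    teacher_reason: Callable[[str], Tuple[str, str]],  # returns (trace, answer)
) -> List[Dict[str, str]]:
    """Reasoning distillation: the target includes the chain-of-thought trace."""
    examples = []
    for p in prompts:
        trace, answer = teacher_reason(p)
        # Assumed target format: reasoning trace followed by the final answer.
        examples.append({"prompt": p, "target": f"{trace}\nAnswer: {answer}"})
    return examples

def feature_matching_loss(
    teacher_acts: Sequence[float],
    student_acts: Sequence[float],
) -> float:
    """Feature distillation: mean squared error between layer activations."""
    if len(teacher_acts) != len(student_acts):
        raise ValueError("activations must have the same dimension")
    squared = [(t - s) ** 2 for t, s in zip(teacher_acts, student_acts)]
    return sum(squared) / len(squared)

# Toy stand-ins for a teacher model, for illustration only.
def toy_generate(prompt: str) -> str:
    return prompt.upper()

def toy_reason(prompt: str) -> Tuple[str, str]:
    return ("2 plus 2 is 4", "4")

print(build_response_dataset(["hello"], toy_generate))
print(build_reasoning_dataset(["What is 2+2?"], toy_reason))
print(feature_matching_loss([1.0, 2.0], [1.0, 4.0]))
```

In the first two cases the student is then fine-tuned on the resulting prompt/target pairs with an ordinary supervised objective; the feature loss would instead be added as an extra term during training.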

Limits of distillation

A student generally does not exceed the teacher on the capabilities being distilled. Student performance also tends to degrade on out-of-distribution inputs that the teacher handled through general reasoning. For creative or open-ended tasks, the performance gap between student and teacher tends to be larger than for structured tasks. Distillation works best when the student targets specific capabilities rather than attempting broad replication.
