Distillation
Training a smaller, cheaper model to mimic the outputs of a larger, more capable model.
Knowledge distillation transfers capability from a large "teacher" model to a smaller "student" model. Instead of (or in addition to) training on ground-truth labels alone, the student is trained to match the teacher's output distribution: the full probability vector over tokens, not just the top-1 prediction. This richer signal shows how the teacher ranks the alternatives it did not pick, which typically helps the student generalize better than training from scratch on the same data.
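The distribution matching described above is commonly written as a KL divergence between temperature-softened teacher and student distributions. The following is a minimal sketch using only the standard library, not any particular framework's implementation; the function names, the toy logits, and the temperature of 2.0 are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over the softened next-token distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Scaling by T^2 is a common convention that keeps gradient
    # magnitudes comparable across temperatures.
    return kl * temperature ** 2

# Toy logits over a 4-token vocabulary (made-up values for illustration).
teacher = [4.0, 2.5, 0.5, -1.0]
student = [3.0, 3.0, 0.0, -0.5]

loss = distillation_loss(teacher, student)       # positive: distributions differ
identical = distillation_loss(teacher, teacher)  # zero: nothing left to match
```

A higher temperature flattens both distributions, so the student also receives signal from tokens the teacher considered unlikely. In practice this term is often mixed with an ordinary cross-entropy loss on ground-truth labels.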
Why distillation matters in practice
Many small models that punch above their weight class were distilled from larger ones. GPT-4o mini, Gemini 2.5 Flash, and many sub-10B coding models are reported or widely believed to use some form of distillation. The result can be a model that is 10-50x cheaper to run while retaining much of the larger model's capability on targeted tasks.
Distillation approaches
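- Response distillation: Generate teacher outputs on a dataset; train the student to reproduce them. Simple and effective, and it needs only the teacher's text output rather than its probabilities.
- Feature distillation: Train the student to match the teacher's intermediate-layer activations. More complex, and it requires access to the teacher's internals, but it can transfer richer representations.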
- Reasoning distillation: The teacher generates chain-of-thought reasoning; the student learns to produce similar reasoning traces. Used to train smaller "thinking" models.
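Response and reasoning distillation differ mainly in what the student's training target contains. The sketch below illustrates that difference under stated assumptions: `toy_teacher` is a hypothetical stand-in for a call to a real teacher model, and the `input`/`target` record layout is an illustrative choice, not a standard format. Feature distillation is not shown, since it works on internal activations rather than generated text.

```python
def toy_teacher(prompt):
    """Hypothetical stand-in for a large teacher model (adds two integers)."""
    a, b = (int(t) for t in prompt.split("+"))
    reasoning = f"Add {a} and {b} to get {a + b}."
    return {"reasoning": reasoning, "answer": str(a + b)}

def build_examples(prompts, teacher, include_reasoning=False):
    """Turn teacher outputs into (input, target) pairs for student fine-tuning."""
    examples = []
    for prompt in prompts:
        out = teacher(prompt)
        if include_reasoning:
            # Reasoning distillation: the target includes the teacher's trace.
            target = f"{out['reasoning']}\nAnswer: {out['answer']}"
        else:
            # Response distillation: the target is the final output only.
            target = out["answer"]
        examples.append({"input": prompt, "target": target})
    return examples

prompts = ["2+3", "10+7"]
response_set = build_examples(prompts, toy_teacher)
reasoning_set = build_examples(prompts, toy_teacher, include_reasoning=True)
```

The student would then be fine-tuned on these pairs with an ordinary next-token objective; the only change between the two approaches is whether the reasoning trace is part of the target.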
Limits of distillation
A student rarely exceeds its teacher on the capabilities being distilled, and it tends to degrade on out-of-distribution inputs that the teacher handled through general reasoning. For creative or open-ended tasks, the gap between student and teacher tends to be larger than for structured tasks.