Fine-tuning

A pre-trained LLM has broad knowledge but generic behavior. Fine-tuning updates the model's weights on a curated dataset to shift it toward a specific format, style, domain, or task without starting training from scratch.

When to fine-tune vs prompt-engineer

Fine-tuning makes sense when:

The task requires a consistently specific output format that prompting alone struggles to maintain
Domain-specific language or jargon needs to be baked in (medical notes, legal contracts, internal terminology)
Latency or cost is critical - fine-tuned smaller models often outperform large models on narrow tasks
You have hundreds or thousands of labeled examples

If you have fewer than 50 examples, start with few-shot prompting and retrieval first.

Fine-tuning approaches

Full fine-tuning: Updates all weights. Most expensive but most flexible. Requires significant GPU memory.
LoRA / QLoRA: Adds small adapter layers; only those are updated. 10-100x cheaper than full fine-tuning. The dominant approach for open-source models.
RLHF / DPO: Trains the model to prefer outputs matching human preferences. Used by Anthropic, OpenAI, and others to improve chat behavior.

Fine-tuning services

OpenAI, Anthropic, Mistral, and Together AI all offer fine-tuning APIs. For open-source models, Hugging Face's PEFT library with LoRA is the standard toolkit.

Common mistakes

Overfitting on too small a dataset causes catastrophic forgetting, where the model loses general capabilities. Data quality matters far more than quantity - 200 high-quality examples beat 10,000 noisy ones.

When to fine-tune vs prompt-engineer

Fine-tuning approaches

Fine-tuning services

Common mistakes

Related terms

Models relevant to Fine-tuning