RLHF (Reinforcement Learning from Human Feedback)
A training method that uses human preference ratings to shape a model's behavior, making it more helpful, honest, and safe.
Pre-trained LLMs are good at predicting text; they're not inherently good at being helpful assistants. RLHF is the training technique that turns raw language models into the chatbots and assistants we use today.
The RLHF pipeline
- Supervised fine-tuning (SFT): Fine-tune the pre-trained model on a curated dataset of high-quality demonstrations. This teaches basic instruction following.
- Reward model training: Human annotators compare pairs of model outputs and label which is better. A separate neural network (the reward model) is trained to predict human preferences.
- RL optimization: The language model generates responses and is optimized (via PPO or a similar algorithm) to maximize reward model scores, typically with a KL-divergence penalty that keeps it from drifting too far from the SFT model.
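The two quantities at the heart of the last two steps can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation: it assumes the reward model outputs one scalar score per response, uses the common Bradley-Terry style pairwise loss for reward model training, and approximates the "stay close to the SFT model" constraint with a per-sample log-probability difference. The function names and the `beta` value are hypothetical, chosen for this example.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Reward-model loss for one labeled pair: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    response higher than the rejected one.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(
    reward_score: float,
    logp_policy: float,
    logp_reference: float,
    beta: float = 0.1,  # assumed penalty strength, for illustration only
) -> float:
    """Reward used in the RL step: reward-model score minus a drift penalty.

    logp_policy - logp_reference is a single-sample estimate of the KL
    divergence between the current policy and the reference (SFT) model.
    """
    return reward_score - beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    # The reward model agrees with the annotator: low loss.
    print(pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0))
    # The reward model disagrees with the annotator: high loss.
    print(pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0))
    # A response the policy now favors more than the SFT model did is penalized.
    print(kl_penalized_reward(reward_score=1.0, logp_policy=-1.0, logp_reference=-2.0))
```

In a real pipeline these values would be computed over batches of token log-probabilities and passed to an optimizer such as PPO; the sketch only shows the shape of the objective.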
RLHF and model alignment
RLHF is one of the main tools that lead models like Claude and GPT-4 to refuse harmful requests, acknowledge uncertainty, and respond coherently across a multi-turn conversation. Without it, a pre-trained model simply generates whatever text is most probable given the context, regardless of safety or helpfulness.
DPO: a simpler alternative
Direct Preference Optimization (DPO) achieves similar results to RLHF without a separate reward model and without a reinforcement learning loop. Instead, it trains the language model directly on the preference pairs with a simpler classification-like objective. Many recent open-source models are trained with DPO rather than full RLHF because of its stability and simplicity.
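To make the "classification-like objective" concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes you already have the summed log-probability of each response under the model being trained (the policy) and under a frozen reference model, usually the SFT model. The function name, argument names, and the `beta` default are hypothetical, chosen for this example.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed temperature, for illustration only
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Computes -log(sigmoid(beta * (chosen_logratio - rejected_logratio))),
    where each log-ratio measures how much more likely the policy makes a
    response compared with the reference model.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable softplus(-logits), which equals -log(sigmoid(logits)).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The log-ratios play the role of an implicit reward, which is why no separate reward model is needed. The reference model terms keep the policy from drifting arbitrarily far from its starting point, much as the KL penalty does in RLHF.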