RLHF (Reinforcement Learning from Human Feedback)

Pre-trained language models excel at predicting text based on statistical patterns, but they don't inherently follow instructions or refuse harmful requests. RLHF is a training technique that turns raw language models into usable assistants by aligning their outputs with human preferences.

The RLHF pipeline

Supervised fine-tuning (SFT): Fine-tune the pre-trained model on a curated dataset of high-quality demonstrations. This teaches basic instruction following and establishes a behavioral baseline.
Reward model training: Human annotators compare pairs of model outputs and label which is better. A separate neural network (the reward model) learns to predict these human preferences.
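RL optimization: The language model generates responses and is optimized with a reinforcement learning algorithm such as Proximal Policy Optimization (PPO) to maximize the reward model's score. A KL divergence penalty keeps the model close to the SFT model, so that it does not drift into outputs that exploit flaws in the reward model.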
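
The reward model and RL stages above can be sketched numerically. This is a minimal illustration, not the method of any particular lab: it assumes a Bradley-Terry style pairwise loss for the reward model (one common formulation) and a per-response KL estimate based on log-probability differences. The function names, the scalar inputs, and the default `beta` are hypothetical choices made for the example.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Assumed reward model loss: -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large when the order is reversed.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(
    reward: float,
    logp_policy: float,
    logp_sft: float,
    beta: float = 0.1,  # hypothetical penalty strength
) -> float:
    """Reward used in the RL stage: reward model score minus a KL penalty.

    logp_policy and logp_sft are the log-probabilities that the current
    policy and the frozen SFT model assign to the same sampled response.
    Their difference is a single-sample estimate of the KL divergence.
    """
    return reward - beta * (logp_policy - logp_sft)


if __name__ == "__main__":
    # Reward model agrees with the annotator: low loss.
    print(pairwise_preference_loss(2.0, 0.0))
    # Reward model disagrees with the annotator: high loss.
    print(pairwise_preference_loss(0.0, 2.0))
    # The policy has drifted from the SFT model, so the reward is reduced.
    print(kl_penalized_reward(reward=1.0, logp_policy=-5.0, logp_sft=-6.0))
```

In a real system the scores and log-probabilities would come from neural networks and the penalty is often applied per token; the scalars here only show how the quantities combine.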

RLHF and model alignment

RLHF became the standard approach for aligning models like Claude, GPT-4, and Llama 2. It enables models to refuse harmful requests, express uncertainty, and maintain conversational coherence. Without this step, pre-trained models generate whatever text is most probable given the context, regardless of safety or practical utility.

The technique's success comes from explicitly encoding human judgment into the optimization objective. Rather than relying solely on next-token prediction loss, models learn to optimize for outcomes humans actually prefer.

Recent alternatives and refinements

Direct Preference Optimization (DPO) and related methods such as Identity Preference Optimization (IPO) can achieve comparable results without training a separate reward model or running an explicit reinforcement learning loop. These approaches train the model directly on preference pairs with a classification-style loss, which reduces computational overhead and training instability. Many recent open-source models use DPO or similar techniques instead of full RLHF.
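
To make the classification framing concrete, the DPO loss for a single preference pair can be written in a few lines. This is a sketch under stated assumptions: the inputs are summed log-probabilities of each full response under the model being trained and under a frozen reference model (typically the SFT model), and the function name, argument names, and default `beta` are hypothetical.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # hypothetical strength of the implicit KL constraint
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each log-ratio measures how much more likely the trained model makes a
    response compared with the reference model. The loss is a logistic
    (binary classification) loss on the difference of the two log-ratios.
    """
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), written as softplus(-logits) for stability.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # Model identical to the reference: the loss is log(2), about 0.693.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Model has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

No reward model appears anywhere in the computation: the log-ratios against the reference model play the role of an implicit reward, which is why the separate reward model and RL loop can be dropped.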

Other developments include online RLHF, where preference data is collected on outputs sampled from the current model rather than drawn from a static dataset, and constitutional AI approaches, in which feedback is guided by written principles rather than relying purely on preference-based ranking. Variants such as reward model distillation and preference model approaches continue to evolve as the field optimizes for cost, stability, and alignment quality.
