RLHF

The primary algorithm used in early RLHF systems is Proximal Policy Optimization (PPO). In this setup, a reward model is first trained from human preference data, and PPO then updates the language model to maximize that learned reward while constraining each update so the policy does not drift too far from its previous behavior. This stabilized training for complex behaviors like reasoning, but PPO-based pipelines were complex, expensive to run, and sensitive to hyperparameters. Direct Preference Optimization (DPO) simplified this process.
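For intuition, here is a minimal sketch of the PPO clipped objective behind that update rule. It assumes you already have per-token log-probabilities from the current policy and the policy that generated the samples, plus advantages derived from reward-model scores; the function and tensor names are illustrative, not taken from any particular RLHF library.

```python
# Minimal sketch of the PPO clipped surrogate loss used in RLHF.
# Inputs are assumed to come from rollouts scored by a learned reward model.
import torch

def ppo_clip_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss: limits how far one update can move the policy."""
    ratio = torch.exp(logprobs_new - logprobs_old)              # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                # maximize the surrogate

# Toy usage with random numbers standing in for real rollout data.
torch.manual_seed(0)
logprobs_old = torch.randn(8)                        # log-probs when samples were drawn
logprobs_new = logprobs_old + 0.1 * torch.randn(8)   # log-probs after an update step
advantages = torch.randn(8)                          # reward-model-based advantages
print(ppo_clip_loss(logprobs_new, logprobs_old, advantages))
```

The clipping is what "limits drastic behavior changes": once the probability ratio leaves the `[1 - eps, 1 + eps]` band, the gradient stops pushing it further.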

Instead of learning a separate reward model, DPO removes the explicit RL step and trains directly on pairs of preferred and rejected outputs, encouraging the model to assign higher probability to the preferred response. This cuts training complexity and instability, and it makes it easier to align models with preferences for structured reasoning and clear chain-of-thought outputs. There have been many advancements in this direction, which I will delve into in the next blog.
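As a rough sketch, the DPO loss on a single preference pair looks like the snippet below. It assumes sequence log-probabilities have already been computed under both the policy being trained and a frozen reference model; the names are illustrative.

```python
# Minimal sketch of the DPO loss on preference pairs (chosen vs. rejected outputs).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Push the policy to favor the chosen output more than the reference model does."""
    chosen_margin = policy_chosen_lp - ref_chosen_lp
    rejected_margin = policy_rejected_lp - ref_rejected_lp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage: each value is a summed sequence log-probability (hence negative).
policy_chosen_lp = torch.tensor([-12.3, -8.7])
policy_rejected_lp = torch.tensor([-11.9, -9.5])
ref_chosen_lp = torch.tensor([-12.5, -9.0])
ref_rejected_lp = torch.tensor([-11.5, -9.1])
print(dpo_loss(policy_chosen_lp, policy_rejected_lp, ref_chosen_lp, ref_rejected_lp))
```

Because the loss only needs log-probabilities from the policy and a reference model, no reward model or on-policy sampling loop is required.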

While DPO optimizes preferences over final outputs, researchers observed that reasoning quality itself often matters independently of the final answer. Two responses can be equally correct but differ significantly in clarity, structure, or logical consistency. Thought Preference Optimization (TPO) extends preference optimization to reasoning paths or chains of thought. Instead of preferring one answer over another, TPO prefers one reasoning process over another. This encourages models to generate clearer intermediate steps, avoid unnecessary assumptions, and produce more coherent multi-step explanations—especially important for math, science, coding, and educational use cases.
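To make the idea concrete, here is an illustrative sketch of what thought-level preference data might look like and how it could feed a DPO-style objective over full reasoning traces. The data structure, field names, and the reuse of a DPO-style loss are assumptions for illustration, not the exact recipe of any specific TPO implementation.

```python
# Illustrative thought-level preference pair: both traces reach the same correct
# answer, but the preference is over the clarity of the reasoning itself.
from dataclasses import dataclass

@dataclass
class ThoughtPreferencePair:
    prompt: str
    chosen_thought: str    # clearer, better-structured reasoning
    chosen_answer: str
    rejected_thought: str  # same final answer, but muddier intermediate steps
    rejected_answer: str

pair = ThoughtPreferencePair(
    prompt="What is 15% of 80?",
    chosen_thought="15% is 0.15, and 0.15 * 80 = 12.",
    chosen_answer="12",
    rejected_thought="80 minus 15 is 65... wait, take 10% (8) plus 5% (4), so 12.",
    rejected_answer="12",
)

# Training would score the full "thought + answer" sequences under the policy and a
# reference model, then apply a DPO-style objective so the clearer trace gets higher
# probability, even though both answers are correct.
chosen_text = pair.chosen_thought + "\nAnswer: " + pair.chosen_answer
rejected_text = pair.rejected_thought + "\nAnswer: " + pair.rejected_answer
print(chosen_text, rejected_text, sep="\n---\n")
```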

