What does "Thinking" really mean when you ask an LLM a certain question?

Have you noticed that when you ask ChatGPT certain questions, it goes into a "Thinking..." mode, whereas for other questions you get an immediate response?
This behavior is not accidental; it is how GPT-5 was designed.
GPT-5 operates as a unified system rather than a single fixed model. It includes:
- A fast, general-purpose model for most questions
- A deeper reasoning model (GPT‑5 thinking) for more complex tasks
- A real-time router that decides which mode to use based on the prompt’s complexity, required tools, and explicit instructions/intent like “think step by step”
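To make the routing idea concrete, here is a toy sketch of a heuristic router. It is purely illustrative: the model names and the keyword/length heuristics are my own assumptions, while the real router is a learned component that also weighs conversation type, tool needs, and user intent.

```python
# A minimal conceptual sketch of prompt routing, NOT OpenAI's actual router.
# Model names and heuristics below are illustrative assumptions.

REASONING_HINTS = ("think step by step", "prove", "derive", "debug", "plan")

def route(prompt: str) -> str:
    """Pick a model tier based on rough signals of prompt complexity."""
    text = prompt.lower()
    # Explicit intent: the user asked for deliberate reasoning.
    if any(hint in text for hint in REASONING_HINTS):
        return "deep-reasoning-model"
    # Crude complexity proxy: long, multi-part prompts lean toward reasoning.
    if len(text.split()) > 150 or text.count("?") > 2:
        return "deep-reasoning-model"
    # Default: fast general-purpose model for everyday questions.
    return "fast-general-model"

print(route("What is the capital of France?"))           # fast-general-model
print(route("Think step by step: is 2**61 - 1 prime?"))  # deep-reasoning-model
```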
During pre-training, LLMs learn to predict the next token. While this builds their strong linguistic ability, it does not reliably produce structured reasoning. Researchers noted that in multi-step problems, models often jump straight to answers without intermediate steps, accumulating errors along the way. Their explanations were also frequently inconsistent or misleading.
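To make the pre-training objective concrete, here is a minimal sketch of next-token prediction using the small open GPT-2 model via the Hugging Face transformers library (my choice for illustration; production models are far larger, but the objective is the same):

```python
# Minimal next-token prediction demo with GPT-2 (illustrative model choice).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits           # shape: (1, seq_len, vocab_size)

next_token_logits = logits[0, -1]             # scores for the token after the prompt
top5 = torch.topk(next_token_logits, k=5).indices
print([tokenizer.decode(t) for t in top5])    # the most likely continuations
```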
As language models grew in size and complexity, so did their natural "emergent" reasoning capabilities. The goal was to activate this hidden potential. The key discovery was that simply adding a phrase such as "Let's think step by step" to the prompt, or sharing a few examples that contain intermediate reasoning, encouraged the model to break the problem into a logical sequence, much like a human showing their work. This sequential approach significantly improved accuracy and offered insight into the model's reasoning process.
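As an illustration, here is how the two prompting styles described above might look in practice. The question and the worked example are made up for demonstration, and no particular API is assumed; these are just prompt strings:

```python
# Illustrative CoT prompts only; the question and example are placeholders.
question = (
    "A bat and a ball cost $1.10 together. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

# Zero-shot CoT: a single trigger phrase invites intermediate steps.
zero_shot_cot = f"{question}\nLet's think step by step."

# Few-shot CoT: a worked example demonstrates the reasoning format.
few_shot_cot = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
    f"Q: {question}\n"
    "A:"
)

print(zero_shot_cot)
print(few_shot_cot)
```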
What is the idea behind the 'Thinking' model?
The thinking model is designed to produce structured, multi-step outputs. It employs Chain-of-Thought (CoT) techniques, which let the model generate intermediate steps before delivering the final answer. These intermediate steps improve performance on math, logic, and coding tasks, though they are still generated text rather than an actual cognitive process. CoT makes explicit the implicit reasoning a model works through before producing an output.
It serves as a sort of private workspace for working through a problem. In sophisticated reasoning models, this internal dialogue often resembles a systematic checklist written in everyday language. For example: "Step 1: Break the question into smaller components. Step 2: Recall relevant information or formulas from prior knowledge. Step 3: Work through each component in turn. Step 4: Combine the results into a coherent final answer."
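To underline that this "private workspace" is just generated text, here is a toy sketch that splits a response containing intermediate steps from its final answer. The "Final answer:" delimiter is an assumed convention for the example, not a standard interface:

```python
# Toy illustration: the "thinking" trace is plain text that can be separated
# from the answer shown to the user. The delimiter is an assumption.
model_response = """Step 1: The bat plus the ball cost $1.10, and the bat costs $1.00 more.
Step 2: Let the ball cost x; then the bat costs x + 1.00.
Step 3: x + (x + 1.00) = 1.10, so 2x = 0.10 and x = 0.05.
Final answer: The ball costs $0.05."""

reasoning, _, final_answer = model_response.rpartition("Final answer:")
print("Hidden reasoning trace:\n", reasoning.strip())
print("Shown to the user:\n", final_answer.strip())
```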
This created a new goal: not just correct answers, but better reasoning paths.
Enter Reinforcement Learning (RL) for Reasoning Alignment
Supervised fine-tuning alone was insufficient for shaping this reasoning behavior. While writing perfect reasoning examples at scale is difficult, humans are good at judging which of two reasoning traces is better. This is where Reinforcement Learning from Human Feedback (RLHF) came in:
- Humans ranked model outputs (including reasoning quality)
- A reward signal captured preferences such as clarity, correctness, and logical structure
- The model was optimized to favor outputs humans preferred
This allowed researchers to reinforce reasoning-style behaviors, including clearer CoT patterns.
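As a rough sketch of the reward-modeling step at the heart of RLHF, the snippet below trains a toy reward model on a single pairwise preference using a Bradley-Terry style loss. The tiny MLP and random embeddings are placeholders of my own; real systems score full responses using the language model's own representations.

```python
# A minimal sketch of RLHF reward modeling with a toy scorer (assumptions:
# random stand-in embeddings, a small MLP instead of a transformer head).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a response embedding to a scalar 'how good is this output' score."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-in embeddings for a batch of ranked response pairs:
# `chosen` was preferred by the human (clearer reasoning), `rejected` was not.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)

# Pairwise preference loss: push r(chosen) above r(rejected).
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
optimizer.step()
print(f"pairwise preference loss: {loss.item():.4f}")
```

The trained reward model then supplies the reward signal that the policy (the LLM itself) is optimized against, which is how preferences like clarity and logical structure get reinforced.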
In the next blog, I will dive deeper into the primary algorithms used in early RLHF systems. For now, I would like to point to a paper on Thinking LLMs for anyone interested in diving deeper into reasoning-focused models and optimizations. It expands on how reasoning behaviors emerge, how they are evaluated, and how future models may further improve structured reasoning. Link in the comments!
