Reasoning Model Blueprint (SFT + RL)

The combination of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) is widely considered the “blueprint” for building robust reasoning models, such as OpenAI’s o1 and the final version of DeepSeek-R1.

The Process

It combines two major training phases:

  1. Supervised Fine-Tuning (SFT): Training the model on high-quality examples of reasoning (input-output pairs where the “thought process” is provided).
  2. Reinforcement Learning (RL): Further refining the model using reward signals to encourage correct reasoning and self-correction.

This hybrid approach lets the model first imitate human demonstrations (SFT) and then improve beyond them through trial and error guided by rewards (RL).
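The two phases above can be illustrated with a deliberately tiny toy sketch: a one-parameter policy over a correct/incorrect answer, first trained on labeled demonstrations (a supervised log-likelihood update, standing in for SFT), then refined with a REINFORCE-style policy-gradient update driven by a 0/1 correctness reward (standing in for RL). This is an illustrative assumption of the general recipe, not the actual training code of any named model.

```python
import math
import random

random.seed(0)

def prob_correct(theta):
    """Probability the policy outputs the correct answer (sigmoid of theta)."""
    return 1.0 / (1.0 + math.exp(-theta))

theta = 0.0   # single policy parameter (toy stand-in for model weights)
lr = 0.5      # learning rate

# --- Phase 1: SFT on demonstrations (every example shows the correct answer) ---
for _ in range(20):
    p = prob_correct(theta)
    # Gradient of log-likelihood of the demonstrated label (1) w.r.t. theta is (1 - p).
    theta += lr * (1.0 - p)

p_after_sft = prob_correct(theta)

# --- Phase 2: RL via REINFORCE (sample an answer, reward correctness) ---
baseline = 0.0  # running reward baseline to reduce gradient variance
for _ in range(200):
    p = prob_correct(theta)
    action = 1 if random.random() < p else 0   # sample an answer from the policy
    reward = 1.0 if action == 1 else 0.0       # reward only correct answers
    # Policy gradient: grad log pi(action) w.r.t. theta is (action - p).
    theta += lr * (reward - baseline) * (action - p)
    baseline = 0.9 * baseline + 0.1 * reward

p_after_rl = prob_correct(theta)
print(f"P(correct) after SFT: {p_after_sft:.2f}, after RL: {p_after_rl:.2f}")
```

In this sketch SFT lifts the policy well above chance by imitating demonstrations, and the RL phase pushes it further using only the reward signal, mirroring the division of labor described above.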
