The combination of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) is widely considered the “blueprint” for building robust reasoning models, such as OpenAI’s o1 and the final version of DeepSeek-R1.
The Process
It combines two major training phases:
- Supervised Fine-Tuning (SFT): Training the model on high-quality examples of reasoning (input-output pairs where the “thought process” is provided).
- Reinforcement Learning (RL): Further refining the model using reward signals to encourage correct reasoning and self-correction.
This hybrid approach lets the model first imitate high-quality human examples (SFT) and then improve beyond them, self-optimizing through trial and error guided by reward signals (RL).
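The two-phase recipe can be illustrated with a deliberately tiny toy: a one-parameter “policy” that first fits labeled demonstrations via cross-entropy (the SFT phase), then continues improving from a reward signal alone using a REINFORCE-style policy-gradient update (the RL phase). This is a minimal sketch of the training dynamics, not how o1 or DeepSeek-R1 are actually implemented; all names and hyperparameters here are illustrative.

```python
import math
import random

random.seed(0)

# Toy "model": a single logit w; P(model outputs the correct answer) = sigmoid(w).
# This stands in for an LLM's policy over reasoning traces.
w = 0.0
lr = 0.5

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# --- Phase 1: SFT (supervised) ---
# Every demonstration shows the correct answer (label = 1), so the model
# is pushed to imitate the demonstrated behavior via cross-entropy loss.
for _ in range(50):
    p = sigmoid(w)
    # Gradient of the cross-entropy loss -log(p) w.r.t. w is (p - 1).
    w -= lr * (p - 1.0)

p_after_sft = sigmoid(w)

# --- Phase 2: RL (REINFORCE) ---
# No labels now: the model samples an answer, a reward signal (+1 if
# correct, 0 otherwise) scores it, and the policy-gradient update
# reinforces whatever actions earned reward.
for _ in range(200):
    p = sigmoid(w)
    action = 1 if random.random() < p else 0      # sample an answer
    reward = 1.0 if action == 1 else 0.0          # score it
    # REINFORCE: w += lr * reward * d/dw log pi(action)
    grad_logp = (1.0 - p) if action == 1 else -p
    w += lr * reward * grad_logp

p_after_rl = sigmoid(w)
print(f"P(correct) after SFT: {p_after_sft:.3f}, after RL: {p_after_rl:.3f}")
```

The SFT phase gets the policy most of the way there by imitation; the RL phase then sharpens it further without any labels, using only the reward signal, which is the same division of labor the full-scale recipe relies on.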
