Reinforcement Learning (RL) is a machine learning paradigm where an Agent learns to make decisions by performing actions in an Environment and receiving feedback in the form of Rewards (positive) or Penalties (negative).
Unlike Supervised Fine-Tuning (SFT), where the model is told exactly what to output, RL only tells the model how good its output was and lets it discover the optimal strategy on its own.
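To make the contrast concrete, here is a minimal, hypothetical sketch in plain Python (the probabilities and the reward rule are invented for illustration): SFT computes a per-token loss against a known reference answer, while RL only receives a single scalar score for the whole sampled output.

```python
import math

# SFT: the model is told exactly what to output, so the loss compares its
# predictions against the reference tokens (hypothetical probabilities below).
probs_of_reference_tokens = [0.9, 0.7, 0.8]
sft_loss = -sum(math.log(p) for p in probs_of_reference_tokens) / len(probs_of_reference_tokens)

# RL: the model only receives a scalar judgment of its *sampled* output.
sampled_output = "Paris is the capital of France."
reward = 1.0 if "Paris" in sampled_output else -1.0  # toy reward rule

print(f"SFT loss: {sft_loss:.3f}  |  RL reward: {reward}")
```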
Core Components
- Agent: The learner or decision maker (e.g., the LLM).
- Environment: The world the agent interacts with (e.g., the chat interface, a game, a simulator).
- Action: What the agent does (e.g., generating a token).
- State: The current situation (e.g., the conversation history).
- Reward: A scalar signal indicating how good or bad an outcome was (e.g., +1 for a helpful answer, -1 for a toxic one).
- Policy: The strategy or rule the agent follows (the mapping from State to Action).
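Putting the components together, the sketch below is a toy interaction loop (the environment, policy, and reward rule are all invented for illustration): the Agent observes the State, its Policy picks an Action, and the Environment returns a Reward and the next State.

```python
import random

def policy(state):
    """Toy Policy: map the current State to an Action (here, a random choice)."""
    return random.choice(["helpful", "unhelpful"])

def environment_step(state, action):
    """Toy Environment: return (reward, next_state) for the chosen action."""
    reward = 1.0 if action == "helpful" else -1.0  # invented reward rule
    next_state = state + [action]                  # State = history of actions so far
    return reward, next_state

state, episode_return = [], 0.0
for _ in range(5):                                   # one short episode
    action = policy(state)                           # Agent acts via its Policy
    reward, state = environment_step(state, action)  # Environment gives feedback
    episode_return += reward                         # signal the Agent would learn from

print("episode return:", episode_return)
```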
Role in LLMs
RL is critical in the Alignment and Reasoning phases of modern LLM training:
1. Alignment (RLHF)
- Goal: To align the model with human values (helpfulness, honesty, harmlessness).
- Method: Reinforcement Learning from Human Feedback (RLHF). A “Reward Model” is trained on human preference data to score the LLM’s outputs, and the LLM is then trained with algorithms such as Proximal Policy Optimization (PPO) to maximize that score.
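A minimal sketch of the two pieces named above, under simplifying assumptions (scalar scores, one pre-computed advantage value, and no actual neural networks): the reward model is fit with a pairwise preference loss that pushes the human-chosen response above the rejected one, and the policy update uses PPO's clipped surrogate objective.

```python
import math

def reward_model_pairwise_loss(score_chosen, score_rejected):
    """Preference loss: -log sigmoid(chosen - rejected), lower when the chosen answer scores higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate: caps how far a single update can move the policy."""
    ratio = math.exp(logp_new - logp_old)                      # new vs. old probability of the output
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)  # clip(ratio, 1 - eps, 1 + eps)
    return min(ratio * advantage, clipped * advantage)         # quantity to maximize

# Toy numbers (hypothetical): the reward model prefers the chosen answer, and the
# policy slightly increased the probability of an output with positive advantage.
print(reward_model_pairwise_loss(score_chosen=2.1, score_rejected=0.3))
print(ppo_clipped_objective(logp_new=-1.0, logp_old=-1.2, advantage=0.8))
```

In practice, RLHF also adds a KL penalty that keeps the updated policy close to the SFT model, so the LLM does not drift into degenerate text while chasing the reward.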
2. Reasoning (e.g., DeepSeek R1)
- Goal: To incentivize deep thinking and self-correction.
- Method: The model is rewarded for correct final answers (e.g., in math or code) and for producing valid “Chain of Thought” reasoning steps.
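A simplified sketch of such a rule-based reward (the tag format and the point values are illustrative assumptions, not the exact scheme used by any particular model): the output earns a small bonus for a well-formed reasoning block and the main reward for a correct final answer.

```python
import re

def reasoning_reward(output: str, ground_truth: str) -> float:
    """Rule-based reward: format bonus for visible reasoning plus accuracy on the final answer."""
    reward = 0.0
    # Format bonus: the output contains an explicit reasoning block (illustrative tags).
    if re.search(r"<think>.*?</think>", output, flags=re.DOTALL):
        reward += 0.1
    # Accuracy reward: the text after "Answer:" matches the ground truth exactly.
    match = re.search(r"Answer:\s*(.+)", output)
    if match and match.group(1).strip() == ground_truth.strip():
        reward += 1.0
    return reward

output = "<think>12 * 12 = 144, and half of 144 is 72.</think>\nAnswer: 72"
print(reasoning_reward(output, ground_truth="72"))  # 1.1
```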
