Pre-training
Pre-training is a technique in machine learning, specifically a form of Transfer Learning, where a model is first trained on a large dataset to learn general features and patterns before being adapted (or fine-tuned) for a specific downstream task.
The core intuition is that it is easier to solve a specific problem if you already have a general understanding of the domain.
Core Mechanism
The workflow typically consists of two stages:
- Pre-training: The model is trained on a massive amount of generic data (e.g., the entire internet, ImageNet) to learn broad representations. This is often the most computationally expensive phase.
- Fine-tuning: The pre-trained “base model” is then updated using a smaller, task-specific dataset to specialize its performance.
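To make this two-stage workflow concrete, here is a minimal PyTorch sketch: a shared backbone is first trained on a generic (here, toy self-supervised) objective over plentiful data, then reused with a new task-specific head on a smaller labeled dataset. Model sizes, data, and hyperparameters are placeholders, not a prescription.

```python
import torch
import torch.nn as nn

# Shared backbone that learns general representations during pre-training.
backbone = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))

# Stage 1: Pre-training with a generic head (toy reconstruction objective here).
pretrain_head = nn.Linear(256, 128)
pretrain_model = nn.Sequential(backbone, pretrain_head)
optimizer = torch.optim.AdamW(pretrain_model.parameters(), lr=1e-3)

for _ in range(100):                      # stands in for many steps over massive generic data
    x = torch.randn(32, 128)              # placeholder for a generic-data batch
    loss = nn.functional.mse_loss(pretrain_model(x), x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Stage 2: Fine-tuning — reuse the backbone, attach a task-specific head.
task_head = nn.Linear(256, 10)            # e.g., a 10-class downstream task
finetune_model = nn.Sequential(backbone, task_head)
optimizer = torch.optim.AdamW(finetune_model.parameters(), lr=1e-4)  # smaller LR is typical

for _ in range(20):                       # far fewer steps, smaller labeled dataset
    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
    loss = nn.functional.cross_entropy(finetune_model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```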
Types of Pre-training
1. Unsupervised / Self-Supervised
This is the dominant paradigm for Large Language Models. The model learns from the internal structure of the data without explicit labels.
- Method: Next Word Prediction (Causal Language Modeling) or Masked Language Modeling (like BERT).
- Goal: To learn the statistical structure of language, including grammar and world knowledge (both objectives are sketched below).
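As a rough illustration of the two self-supervised objectives, the sketch below computes a causal (next-token) loss and a masked-token loss over random placeholder tensors. In practice the logits come from a real model and masked inputs are corrupted with a [MASK] token; the 15% ratio is simply the BERT convention.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 1000, 16, 4
tokens = torch.randint(0, vocab_size, (batch, seq_len))   # placeholder token IDs
logits = torch.randn(batch, seq_len, vocab_size)          # stand-in for model outputs

# Causal language modeling: predict token t+1 from tokens <= t.
# Shift so position t's logits are scored against the token at t+1.
causal_loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)

# Masked language modeling (BERT-style): hide ~15% of tokens, predict only those.
mask = torch.rand(batch, seq_len) < 0.15
targets = tokens.clone()
targets[~mask] = -100                                      # ignore unmasked positions
mlm_loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),
    targets.reshape(-1),
    ignore_index=-100,
)
```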
2. Supervised
Common in older Computer Vision workflows (e.g., ResNet).
- Method: Training on a fully labeled dataset like ImageNet (classifying images into 1000 categories).
- Goal: To learn feature extractors (edges, textures, shapes) that can be transferred to other visual tasks.
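In practice this transfer often looks like the following sketch (assuming a recent torchvision release): load an ImageNet-pre-trained ResNet, freeze its feature extractor, and swap the 1000-way classifier for a head matching a hypothetical downstream task.

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet-18 whose backbone was pre-trained on ImageNet (1000 classes).
# (The weights argument assumes torchvision >= 0.13; older versions used pretrained=True.)
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the pre-trained feature extractor (edges, textures, shapes).
for param in model.parameters():
    param.requires_grad = False

# Replace the ImageNet classifier with a head for a hypothetical 5-class task.
# Only this new layer will be trained during fine-tuning.
model.fc = nn.Linear(model.fc.in_features, 5)
```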
Role in Large Language Models
In the context of LLMs, Pre-training is the phase where the model gains its “intelligence” or base capabilities.
- It consumes raw text from diverse sources (books, code, web data).
- It is purely probabilistic: The model learns to complete text plausibly, but does not yet follow instructions or act as an assistant (see the sketch below).
- The output of this stage is a Foundation Model (or Base Model), which is then refined via Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to become a helpful assistant. (Reasoning Model Blueprint (SFT + RL))
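To observe this base-model behavior directly, one can sample from a checkpoint that has only been pre-trained. The sketch below uses GPT-2 via Hugging Face transformers purely as an example of a model with no SFT or RL applied; it will continue the prompt as text rather than obey it as an instruction.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in for a pre-trained base model with no instruction tuning.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A base model treats the prompt as text to continue, not a request to fulfill.
prompt = "Write a poem about the ocean."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Typical output is a plausible continuation (e.g., more sentences that look like
# a writing-prompt list) rather than an actual poem, illustrating why SFT and RL
# are needed to turn a base model into an assistant.
```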
