Batch Size refers to the number of data samples processed by the model in one iteration before updating its internal parameters (weights).
Implementation
In LLM training, data is processed in batches (e.g., 4, 8, or 32 sequences at a time) rather than one sample at a time or the entire dataset at once.
- Batch Size = 1: Parameters are updated after every single sample (pure stochastic gradient descent). High gradient noise, slow wall-clock training.
- Batch Size > 1: More stable gradient estimates and better utilization of Parallel Computing resources (GPUs); see the training-loop sketch after this list.
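
A minimal PyTorch sketch of this loop. The toy linear model, random tensors, and batch_size=8 are illustrative stand-ins for a real model and tokenized data, not a specific recipe:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data standing in for tokenized sequences: 256 samples, 128 features each.
# (Dataset, model, and batch_size=8 are assumptions for illustration.)
inputs = torch.randn(256, 128)
targets = torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(inputs, targets), batch_size=8, shuffle=True)

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for x_batch, y_batch in loader:  # one iteration = one batch of 8 samples
    optimizer.zero_grad()
    loss = loss_fn(model(x_batch), y_batch)
    loss.backward()              # gradient is averaged over the batch
    optimizer.step()             # one parameter update per batch
```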
Trade-offs
- Memory: Larger batch sizes require more VRAM, since activations and gradients for every sample in the batch must be held at once; gradient accumulation (sketched after this list) is a common workaround.
- Speed: Larger batches are generally faster per epoch because the GPU processes the samples in parallel.
- Noise: Smaller batches produce noisier gradient estimates, which can sometimes improve generalization but can also make training less stable.
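
Because of the memory trade-off, a common technique is gradient accumulation: process several small micro-batches and call the optimizer only once per group, approximating a larger batch size within a fixed VRAM budget. A minimal sketch, reusing the model, loader, optimizer, and loss_fn from the sketch above (accum_steps=4 is an illustrative value):

```python
# Simulate an effective batch size of 8 * 4 = 32 using micro-batches of 8.
accum_steps = 4

optimizer.zero_grad()
for step, (x_batch, y_batch) in enumerate(loader):
    loss = loss_fn(model(x_batch), y_batch) / accum_steps  # scale to average
    loss.backward()                  # gradients accumulate across calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()             # one update per 4 micro-batches
        optimizer.zero_grad()
```

Scaling each micro-batch loss by 1/accum_steps keeps the accumulated gradient equal to the average over the full effective batch, so the update matches what a single large batch would produce (up to batch-dependent layers such as batch normalization).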
