The end-of-text token, written <|endoftext|>, is a special token that marks the boundary between unrelated text sources (e.g., different documents or books) during training.
Notes:
- When working with multiple text sources, an end-of-text token is typically added between the texts.
- It acts as a marker signaling the start or end of a particular segment.
- Helps prevent the LLM from mixing up the contexts of independent documents.
- Token ID: In the GPT-2 tokenizer, it is assigned the token ID 50256.
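
As a minimal sketch of the idea in the notes above, the snippet below joins already-tokenized documents into one training stream with the end-of-text ID between them. The helper name join_with_eot and the sample token IDs are hypothetical and used only for illustration; only the ID 50256 comes from the notes.

```python
from typing import Iterable, List

# GPT-2's ID for <|endoftext|>, as stated in the notes above.
EOT_ID = 50256


def join_with_eot(docs: Iterable[List[int]], eot_id: int = EOT_ID) -> List[int]:
    """Concatenate tokenized documents, inserting eot_id between consecutive ones.

    Hypothetical helper for illustration; not part of any library.
    """
    stream: List[int] = []
    for i, doc in enumerate(docs):
        if i > 0:
            # Mark the boundary between two unrelated documents.
            stream.append(eot_id)
        stream.extend(doc)
    return stream


# Placeholder token IDs (not real GPT-2 encodings), just for illustration.
doc_a = [10, 11, 12]
doc_b = [20, 21]

stream = join_with_eot([doc_a, doc_b])
print(stream)  # [10, 11, 12, 50256, 20, 21]
```

Some pipelines append the token after every document, including the last one, rather than only between documents; either convention marks the boundary. With a real tokenizer such as tiktoken's GPT-2 encoding, the string <|endoftext|> generally has to be explicitly allowed as a special token when encoding.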
