EndOfText Token

The EndOfText Token, represented as <|endoftext|>, is a special token used to mark the boundary between unrelated text sources (e.g., different documents or books) during training.

Notes:

“When we are working with multiple text sources, we typically add an end-of-text token between the texts.”
It acts as a marker signaling the start or end of a particular segment.
It helps prevent the LLM from mixing up the contexts of independent documents.
Token ID: In the GPT-2 tokenizer, it is assigned the token ID 50256.
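
Example: the sketch below illustrates the idea from the notes above by joining two unrelated texts with <|endoftext|> and mapping that token to ID 50256. The whitespace tokenizer, the helper names (join_documents, toy_encode), and the sample sentences are assumptions made for illustration only; the real GPT-2 tokenizer uses byte pair encoding, not whitespace splitting.

```python
EOT = "<|endoftext|>"
EOT_ID = 50256  # ID of <|endoftext|> in the GPT-2 tokenizer


def join_documents(docs):
    """Concatenate unrelated documents, separated by the end-of-text token."""
    return f" {EOT} ".join(docs)


def toy_encode(text, vocab):
    """Hypothetical whitespace tokenizer (NOT GPT-2's BPE).

    Ordinary words get the next free ID in `vocab`;
    the end-of-text token always maps to EOT_ID.
    """
    ids = []
    for word in text.split():
        if word == EOT:
            ids.append(EOT_ID)
        else:
            ids.append(vocab.setdefault(word, len(vocab)))
    return ids


# Two independent text sources (sample sentences, assumed for illustration)
docs = [
    "Hello, do you like tea?",
    "In the sunlit terraces of the palace.",
]

combined = join_documents(docs)
vocab = {}
ids = toy_encode(combined, vocab)

print(combined)
print(ids)
```

In the printed ID list, 50256 sits exactly where the first text ends and the second begins. With the tiktoken library, tiktoken.get_encoding("gpt2").encode(text, allowed_special={"<|endoftext|>"}) should yield the same ID for the token, assuming tiktoken is installed.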
