Tokenization
Tokenization is a way to break a string of continuous text into small individual chunks called tokens; each token can be a word, a character, or a sub-word. Different tokenization schemes split at different granularities depending on the trade-off you want between vocabulary size and sequence length.
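For instance, the same string can be split at all three granularities. Here is a minimal sketch; the sub-word split is hand-written for illustration (real sub-word tokenizers such as BPE learn their splits from data):

```python
text = "tokenization works"

word_tokens = text.split()                      # ['tokenization', 'works']
char_tokens = list(text)                        # ['t', 'o', 'k', 'e', 'n', ...]
subword_tokens = ["token", "ization", "works"]  # hypothetical sub-word split

print(word_tokens)
print(char_tokens)
print(subword_tokens)
```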
So let's say your tokenizer has divided the whole text corpus into tokens, i.e. the text is broken into chunks of words, characters, sub-words, etc. The number of unique tokens becomes the vocabulary size of your model. Each token is then assigned an integer token_id.
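A minimal sketch of how a vocabulary and the token_id mapping come about, using a toy word-level tokenizer on a made-up corpus:

```python
# Toy corpus (made up for illustration)
corpus = ["the cat sat", "the dog sat"]

# Word-level tokenization of every sentence
tokens = [tok for sentence in corpus for tok in sentence.split()]

# The set of unique tokens is the vocabulary; its size is the vocab size
vocab = sorted(set(tokens))
token_to_id = {tok: i for i, tok in enumerate(vocab)}

print("vocabulary size:", len(vocab))  # 4
print(token_to_id)                     # {'cat': 0, 'dog': 1, 'sat': 2, 'the': 3}
```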