Tokenization

Tokenization is a way to break a string of continuous text down into small individual chunks called tokens; each token can be a word, a character, or a sub-word.

Different tokenization schemes can be used depending on the granularity you want: word-level, character-level, or sub-word tokenization, as sketched below.
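As an illustrative sketch (not tied to any particular tokenizer library), here is how the same string might be split under word-level, character-level, and a toy sub-word scheme; the `toy_subword` chunking rule is a made-up assumption for demonstration only.

```python
text = "tokenizers split text"

# Word-level: split on whitespace
word_tokens = text.split()            # ['tokenizers', 'split', 'text']

# Character-level: every character (including spaces) is a token
char_tokens = list(text)              # ['t', 'o', 'k', 'e', 'n', ...]

# Toy sub-word scheme (hypothetical): chop each word into chunks of at most 4 characters
def toy_subword(word, size=4):
    return [word[i:i + size] for i in range(0, len(word), size)]

subword_tokens = [piece for w in text.split() for piece in toy_subword(w)]
# ['toke', 'nize', 'rs', 'spli', 't', 'text']
```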

So let's say your tokenizer breaks the whole text corpus down into T distinct tokens, i.e. the text is covered by T unique chunks (words, characters, sub-words, etc.). This T becomes the vocabulary size of your model.

Each token has a token_id ∈ [1, T].
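A minimal sketch of what this looks like in code, assuming a simple word-level tokenizer and the 1-indexed ids described above (real tokenizers typically use 0-indexed ids and a fixed, pre-trained vocabulary):

```python
corpus = "the cat sat on the mat"

# Collect the distinct tokens; their count T is the vocabulary size
tokens = corpus.split()
vocab = sorted(set(tokens))
T = len(vocab)                                   # vocabulary size

# Assign each distinct token a token_id in [1, T]
token_to_id = {tok: i + 1 for i, tok in enumerate(vocab)}

print(T)                                         # 5
print(token_to_id)                               # {'cat': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print([token_to_id[t] for t in tokens])          # [5, 1, 4, 3, 5, 2]
```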
