Positional Encoding
Position Encoding
Position encoding adds positional information to the tokens, preserving the sequential nature of the input. The self-attention block is permutation invariant: if the order of the input tokens is shuffled, the output is still the same (up to the same shuffling), so we need to explicitly inject the position of each token. To do that we compute a position encoding, with the same dimension as the word embedding, for each position in the sequence. Remember that the position encoding is independent of which token it is: two different tokens at the same position in the sequence will have the same position encoding.
Remember, there is no learning happening here - just a fixed function that maps a position to a vector representation.
Different types of position encoding:
Absolute - sine and cosine functions - used in the original Transformer paper (see the sketch after this list).
Relative
Rotary
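As a concrete illustration of the absolute (sine/cosine) case, here is a minimal NumPy sketch of the encoding used in the original Transformer paper; the function and variable names are my own, not from any specific library.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of fixed positional encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # (1, d_model/2)
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)    # one frequency per dimension pair
    angles = positions * angle_rates                          # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

# Same position -> same vector, regardless of which token sits there.
pe = sinusoidal_position_encoding(seq_len=100, d_model=512)
print(pe.shape)  # (100, 512)
```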
The positional encodings are added to the word embeddings to get the final input to the transformer model. Note that we could instead concatenate the position encoding to the word embedding rather than adding them, but this would increase the model size - the dimension of the vector per token that is input to the model.
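A quick PyTorch sketch of the two options (shapes and names are just illustrative):

```python
import torch

seq_len, d_model = 100, 512
word_emb = torch.randn(seq_len, d_model)        # token embeddings
pos_enc = torch.randn(seq_len, d_model)         # positional encodings, same shape

# Option 1: add them - the model input keeps dimension d_model.
x_add = word_emb + pos_enc                      # (100, 512)

# Option 2: concatenate them - the per-token dimension doubles.
x_cat = torch.cat([word_emb, pos_enc], dim=-1)  # (100, 1024)
```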
Position Embedding
In the case of position embeddings, they are learned vector representations, similar to how we learn word embeddings. The difference is that for word embeddings we use the token id to index the table and get the corresponding vector, whereas for position embeddings we use the position in the sequence to index the table and get the corresponding vector.
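A minimal sketch of this in PyTorch, assuming `nn.Embedding` tables for both tokens and positions (the sizes are made up for illustration):

```python
import torch
import torch.nn as nn

vocab_size, max_seq_len, d_model = 50_000, 100, 512

token_emb = nn.Embedding(vocab_size, d_model)   # indexed by token id
pos_emb = nn.Embedding(max_seq_len, d_model)    # indexed by position, learned like any other weight

token_ids = torch.randint(0, vocab_size, (1, 100))          # (batch, seq_len)
positions = torch.arange(token_ids.shape[1]).unsqueeze(0)   # 0, 1, ..., seq_len-1

x = token_emb(token_ids) + pos_emb(positions)   # (1, 100, 512) - final model input
```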
Let's say you train your LLM with a maximum sequence length of 100. You can still test it with a sequence length of 200, but the model might not do well: the positional embeddings added for the extra tokens are ones the LLM has never seen during training. These unseen positional inputs can cause a generalization error when testing on longer sequences.
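With a learned table the problem is even more direct: there is simply no row for positions beyond the trained maximum, as this self-contained sketch shows (sinusoidal encodings can at least be computed for longer positions, though the model may still generalize poorly to them).

```python
import torch
import torch.nn as nn

max_seq_len, d_model = 100, 512
pos_emb = nn.Embedding(max_seq_len, d_model)    # trained with positions 0..99 only

long_positions = torch.arange(200).unsqueeze(0) # a 200-token sequence at test time
try:
    pos_emb(long_positions)                     # positions 100..199 have no rows in the table
except IndexError as err:
    print("No learned embedding beyond position 99:", err)
```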