Positional Encoding

Position Encoding

Position encoding adds positional information to the tokens, which preserves the sequential nature of the input. A self-attention block has no built-in notion of order: if the input tokens are shuffled, each token still gets the same output vector, just in the shuffled order (strictly speaking, the block is permutation equivariant). Hence we need to inject the position of each token as well. For that we use position encoding: for each position in the sequence we calculate a vector of dimension $d$, which is the same dimension as the word embedding. Remember that the position encoding is independent of which token it is, i.e. two different tokens at the same position in the sequence will have the same position encoding.
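To make the "order does not matter to self-attention" point concrete, here is a minimal NumPy sketch. The single-head `self_attention` function, the random weights and the sizes are all assumptions made up for illustration, not taken from any particular model.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Row-wise softmax (subtract the row max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d = 5, 8
x = rng.normal(size=(seq_len, d))        # hypothetical token embeddings
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(seq_len)          # shuffle the token order
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Each token gets the same output vector; only the row order changes.
same_up_to_order = np.allclose(out_shuffled, out[perm])
print(same_up_to_order)
```

Without position information the block cannot tell "dog bites man" from "man bites dog": the per-token outputs are identical in both orderings.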

Remember there is no learning happening here (for a fixed encoding such as sin and cos). It is just a function that maps the position to a vector representation.

Different types of position encoding:

Absolute - sin and cos functions - used in the original Transformer paper.
Relative
Rotary
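The absolute sin and cos variant from the original Transformer paper uses $PE(pos, 2i) = \sin(pos / 10000^{2i/d})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d})$. A minimal NumPy sketch of that formula follows. The function name `sinusoidal_position_encoding` and the even-`d` restriction are choices made here for illustration.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d, base=10000.0):
    """Return a (seq_len, d) matrix; row `pos` is the encoding of position `pos`."""
    assert d % 2 == 0, "this sketch assumes an even dimension d"
    positions = np.arange(seq_len)[:, None]      # (seq_len, 1)
    i = np.arange(d // 2)[None, :]               # (1, d/2)
    angles = positions / base ** (2 * i / d)     # (seq_len, d/2)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

pe = sinusoidal_position_encoding(seq_len=6, d=8)
print(pe.shape)   # (6, 8)
```

Nothing here depends on the token at a position, and there are no trainable parameters: the same position always maps to the same vector.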

The positional encodings are added to the word embeddings to get the final input to the transformer model. Note that we could also concatenate the position encoding to the word embedding instead of adding them, but this would increase the model size, because the dimension of the vector per token that's input to the model grows.
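The difference in input size between adding and concatenating can be seen directly from the array shapes. In this sketch the random arrays are only stand-ins for real word embeddings and position encodings.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 4, 8
word_emb = rng.normal(size=(seq_len, d))   # hypothetical word embeddings
pos_enc = rng.normal(size=(seq_len, d))    # stand-in for any position encoding

added = word_emb + pos_enc                                   # stays (seq_len, d)
concatenated = np.concatenate([word_emb, pos_enc], axis=-1)  # grows to (seq_len, 2d)

print(added.shape)         # (4, 8)
print(concatenated.shape)  # (4, 16)
```

With concatenation, every layer that consumes the input has to accept vectors of size $2d$ instead of $d$, which is where the extra parameters come from.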

Transformer Architecture: The Positional Encoding - Amirhossein Kazemnejad's Blog

Position Embedding

In the case of position embedding, the vectors are learned representations, similar to how we learn word embeddings. The only difference is that for word embeddings we use the token id to index the table and get the corresponding vector, whereas for position embeddings we use the position in the sequence to index the table.
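The two table lookups can be sketched as follows. In a real model both tables would be trainable parameters; here they are plain random NumPy arrays, and the sizes and token ids are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d = 50, 16, 8

# Stand-ins for the two learned tables.
token_table = rng.normal(size=(vocab_size, d))
position_table = rng.normal(size=(max_len, d))

token_ids = np.array([7, 3, 7, 42])          # hypothetical input sequence
positions = np.arange(len(token_ids))        # 0, 1, 2, 3

token_vecs = token_table[token_ids]          # indexed by token id
position_vecs = position_table[positions]    # indexed by position in the sequence
model_input = token_vecs + position_vecs     # shape (4, 8)
```

Token id 7 appears at positions 0 and 2. Its word vector is identical in both places, but the summed model input differs because the position rows differ.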

What is the difference between position embedding vs positional encoding in BERT? - Cross Validated

Why not Concatenating Positional Encoding instead of Summing

Why add positional embedding instead of concatenate? · Issue #1591 · tensorflow/tensor2tensor - GitHub

Why using sine and cosine

Linear Relationships in the Transformer’s Positional Encoding
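The property usually given as the motivation for sin and cos is that, for any fixed offset $k$, the encoding of position $pos + k$ is a linear function of the encoding of position $pos$. This follows from the angle-addition identities: each (sin, cos) pair is rotated by an angle that depends only on $k$. The sketch below checks this numerically. The helper names `sinusoidal_position_encoding` and `shift_matrix` are made up for illustration.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d, base=10000.0):
    """Return the (seq_len, d) encoding matrix and the d/2 frequencies used."""
    positions = np.arange(seq_len)[:, None]
    freqs = 1.0 / base ** (2 * np.arange(d // 2) / d)
    angles = positions * freqs
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe, freqs

def shift_matrix(k, freqs):
    """Block-diagonal matrix M_k such that PE(pos + k) = M_k @ PE(pos)."""
    d = 2 * len(freqs)
    m = np.zeros((d, d))
    for i, w in enumerate(freqs):
        c, s = np.cos(w * k), np.sin(w * k)
        # 2x2 rotation acting on the (sin, cos) pair of frequency w.
        m[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return m

seq_len, d, k = 50, 8, 7
pe, freqs = sinusoidal_position_encoding(seq_len, d)
m_k = shift_matrix(k, freqs)

# The same matrix works for every starting position.
shift_holds = all(
    np.allclose(m_k @ pe[pos], pe[pos + k]) for pos in range(seq_len - k)
)
print(shift_holds)
```

Note that `m_k` depends on the offset `k` only, not on `pos`, which is what would let a model express "k tokens away" with a single linear map.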

PE as bottleneck in increased context length during testing

Let's say you train your LLM with a max sequence length of 100. If you then test with a sequence length of 200, you can do so, but the model might not do well, because the positional inputs for the extra tokens beyond the training length are ones your LLM hasn't seen during training. These unseen positional inputs can cause generalization errors when testing on longer sequences.
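A small sketch of what happens at the unseen positions, assuming 0-indexed positions and a training length of 100. The random `position_table` stands in for a learned embedding table, and `sinusoidal_row` is a hypothetical helper for the fixed sin and cos encoding.

```python
import numpy as np

rng = np.random.default_rng(0)
train_max_len, d = 100, 8

# Learned table: one row per position seen in training, nothing beyond that.
position_table = rng.normal(size=(train_max_len, d))

test_positions = np.arange(200)
try:
    position_table[test_positions]
    lookup_failed = False
except IndexError:
    lookup_failed = True   # positions 100..199 have no row at all

def sinusoidal_row(pos, d, base=10000.0):
    """Fixed sin/cos encoding for a single position."""
    angles = pos / base ** (2 * np.arange(d // 2) / d)
    row = np.empty(d)
    row[0::2] = np.sin(angles)
    row[1::2] = np.cos(angles)
    return row

# A fixed function can still be evaluated at position 150, but the model
# was never trained on this vector, so the output quality is not guaranteed.
unseen = sinusoidal_row(150, d)
print(lookup_failed, unseen.shape)
```

So a learned table simply has no entry for the longer positions, while a fixed function produces a valid vector that the rest of the model has never had to interpret.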

Rotary Position Embedding

