Positional Encoding

Position Encoding

Remember, there is no learning happening here: position encoding is just a fixed function that maps a position to a vector representation.

Different types of position encoding:

  • Absolute - sine and cosine functions - used in the original Transformer paper (see the sketch after this list).

  • Relative

  • Rotary
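
As a concrete reference for the absolute (sinusoidal) case, the original Transformer paper defines PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Below is a minimal sketch, assuming PyTorch and an even d_model; the function name and argument names are just illustrative:

```python
import math
import torch

def sinusoidal_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed (non-learned) absolute positional encoding; d_model is assumed even."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    dims = torch.arange(0, d_model, 2, dtype=torch.float32)               # the "2i" in the formula
    inv_freq = torch.exp(-math.log(10000.0) * dims / d_model)             # 1 / 10000^(2i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * inv_freq)                         # even dimensions: sine
    pe[:, 1::2] = torch.cos(positions * inv_freq)                         # odd dimensions: cosine
    return pe                                                             # (seq_len, d_model)
```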

The positional encodings are added to the word embeddings to get the final input to the transformer model. Note that we could instead concatenate the positional encoding to the word embedding rather than adding it, but that would increase the model size, i.e. the dimension of the vector per token that is fed into the model.
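
A rough sketch of the two options, assuming PyTorch and reusing the `sinusoidal_encoding` helper sketched above (the tensor sizes are hypothetical); notice how concatenation doubles the per-token dimension:

```python
import torch

seq_len, d_model = 10, 512
word_emb = torch.randn(seq_len, d_model)            # stand-in for real word embeddings
pos_enc = sinusoidal_encoding(seq_len, d_model)     # from the sketch above

x_added = word_emb + pos_enc                        # shape stays (10, 512)
x_concat = torch.cat([word_emb, pos_enc], dim=-1)   # shape grows to (10, 1024)
```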

Position Embedding

In the case of position embeddings, they are learned vector representations, similar to how we learn word embeddings. The difference is the index: in a word embedding we use the token id to index the table and get the corresponding vector, whereas in a position embedding we use the position in the sequence to index the table and get the corresponding vector.
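
A minimal sketch of the two lookup tables, assuming PyTorch; the vocabulary size, maximum length, and token ids are hypothetical, and the only real difference is what index is used:

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 30000, 512, 768
word_emb = nn.Embedding(vocab_size, d_model)   # learned table indexed by token id
pos_emb = nn.Embedding(max_len, d_model)       # learned table indexed by position in the sequence

token_ids = torch.tensor([[101, 7592, 2088, 102]])            # (batch=1, seq_len=4), arbitrary ids
positions = torch.arange(token_ids.size(1)).unsqueeze(0)      # [[0, 1, 2, 3]]

x = word_emb(token_ids) + pos_emb(positions)   # (1, 4, d_model); both tables are learned end to end
```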

Why Not Concatenate the Positional Encoding Instead of Summing

Why Use Sine and Cosine

PE as a Bottleneck for Increased Context Length During Testing

Let's say you train your LLM with a max sequence length of 100. If you then test with a sequence length of 200, you can do so, but the model might not do well: the positional encodings added for the extra tokens correspond to positions your LLM has never seen during training. These unseen positional inputs can cause generalization errors when testing on longer sequences.
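
As a hedged illustration of this (hypothetical sizes, assuming PyTorch): a learned position-embedding table built for 100 positions simply has no rows for positions 100-199, while a fixed function like the sinusoidal one can still be evaluated there, even though the model may still generalize poorly to those unseen positions.

```python
import torch
import torch.nn as nn

max_train_len, d_model = 100, 64
pos_emb = nn.Embedding(max_train_len, d_model)   # learned table sized for training-time max length

test_positions = torch.arange(200)               # test sequence of length 200
# pos_emb(test_positions) raises an IndexError: positions 100-199 have no learned row

pe_200 = sinusoidal_encoding(200, d_model)       # the fixed function from the sketch above can be
                                                 # evaluated at any position, but positions 100-199
                                                 # were still never seen during training
```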

Rotary Position Embedding
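
This section is still a stub, but the core idea of RoPE is to rotate each pair of query/key dimensions by an angle proportional to the token's position, so the attention dot product between two tokens depends on their relative positions. A rough sketch, assuming PyTorch, an even feature dimension, and the convention of pairing adjacent dimensions:

```python
import math
import torch

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate pairs of feature dimensions of x with shape (seq_len, d) by position-dependent angles."""
    seq_len, d = x.shape
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # (seq_len, 1)
    theta = torch.exp(-math.log(base) * torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    angles = positions * theta                                                   # (seq_len, d/2)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]                                       # split into pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x_even * torch.cos(angles) - x_odd * torch.sin(angles)        # 2D rotation of each pair
    out[:, 1::2] = x_even * torch.sin(angles) + x_odd * torch.cos(angles)
    return out   # applied to queries and keys before the attention dot product
```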
