Attention Block
Scaled dot-product attention

Each of Q, K, and V is a linear projection of the token embeddings.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
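A minimal NumPy sketch of the formula above, assuming Q, K, and V have already been produced by linear projections of the token embeddings; the shapes, the random stand-in projection matrices, and the function name are illustrative assumptions rather than anything fixed by the note.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (seq_len, d_k), K: (seq_len, d_k), V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    # Similarity scores between every query and every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len)
    # Row-wise softmax so each query's weights sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the value vectors
    return weights @ V                                   # (seq_len, d_v)

# Toy usage: random projections standing in for learned ones
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                             # 5 tokens, embedding dim 16
W_q, W_k, W_v = (rng.normal(size=(16, 16)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)                                         # (5, 16)
```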
Multihead Attention

Now the same attention block is applied multiple times in parallel, and the results of the heads are concatenated.
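A sketch of that parallel-heads-then-concatenate step, reusing scaled_dot_product_attention from the block above. It assumes num_heads divides the model dimension and, as in the original Transformer, applies a final output projection W_o after concatenation (the projection is not mentioned in the note above); all matrices here are random stand-ins for learned parameters.

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """x: (seq_len, d_model); W_q/W_k/W_v/W_o: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        # Slice out this head's share of each projection
        sl = slice(h * d_head, (h + 1) * d_head)
        Q, K, V = x @ W_q[:, sl], x @ W_k[:, sl], x @ W_v[:, sl]
        heads.append(scaled_dot_product_attention(Q, K, V))
    # Concatenate the per-head outputs, then mix them with the output projection
    return np.concatenate(heads, axis=-1) @ W_o          # (seq_len, d_model)

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 16))
W_q, W_k, W_v, W_o = (rng.normal(size=(16, 16)) for _ in range(4))
print(multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads=4).shape)  # (5, 16)
```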