Each of $Q, K, V$ is a linear projection of the token embeddings.
Now the same attention block is applied multiple times in parallel (multi-head attention), and the results are concatenated.
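
A minimal sketch of both steps in PyTorch, assuming one sequence with embedding size `d_model` split evenly across `n_heads` heads (the function name and the plain-weight-matrix interface are illustrative, not a specific library API):

```python
import torch
import torch.nn.functional as F

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """x: (seq_len, d_model) token embeddings; each w_*: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # Q, K, V are linear projections of the token embeddings.
    q = x @ w_q  # (seq_len, d_model)
    k = x @ w_k
    v = x @ w_v

    # Split each projection into heads: (n_heads, seq_len, d_head).
    # These are the parallel copies of the attention block.
    q = q.view(seq_len, n_heads, d_head).transpose(0, 1)
    k = k.view(seq_len, n_heads, d_head).transpose(0, 1)
    v = v.view(seq_len, n_heads, d_head).transpose(0, 1)

    # Scaled dot-product attention, computed per head in parallel.
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5   # (n_heads, seq_len, seq_len)
    out = F.softmax(scores, dim=-1) @ v                # (n_heads, seq_len, d_head)

    # Concatenate the heads back into (seq_len, d_model),
    # then mix them with a final output projection.
    out = out.transpose(0, 1).reshape(seq_len, d_model)
    return out @ w_o

# Usage: 4 tokens, d_model = 8, 2 heads -> output shape (4, 8).
x = torch.randn(4, 8)
ws = [torch.randn(8, 8) for _ in range(4)]
y = multi_head_attention(x, *ws, n_heads=2)
```

Note the `.reshape` after the final transpose: the tensor is no longer contiguous there, so `.view` would fail.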