Multi-head Attention Block

Attention Block

Each of Q, K, V is a linear projection of the token embeddings.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
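
A minimal sketch of this block in PyTorch, assuming illustrative dimensions (`d_model`, `d_k`) and weight names that are not from the original note. Q, K, V are produced by linear projections of the token embeddings, then combined with the softmax-scaled dot product above.

```python
# Sketch of scaled dot-product attention; d_model, d_k and the W_* names
# are illustrative assumptions.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (..., seq_q, seq_k)
    weights = F.softmax(scores, dim=-1)             # attention weights
    return weights @ V                              # (..., seq_q, d_v)

# Q, K, V as linear projections of token embeddings x: (batch, seq, d_model)
d_model, d_k = 512, 64
W_q = torch.nn.Linear(d_model, d_k, bias=False)
W_k = torch.nn.Linear(d_model, d_k, bias=False)
W_v = torch.nn.Linear(d_model, d_k, bias=False)

x = torch.randn(2, 10, d_model)                     # dummy token embeddings
out = scaled_dot_product_attention(W_q(x), W_k(x), W_v(x))
print(out.shape)                                    # torch.Size([2, 10, 64])
```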

Multi-head Attention

Now the same attention block is applied multiple times in parallel, each head with its own projections, and the results are concatenated, as in the sketch below.
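
A minimal multi-head attention sketch under the same assumptions (head count, `d_model`, and the output projection `W_o` are illustrative, following the standard transformer layout): each head runs the attention block above in parallel, the head outputs are concatenated, and a final linear projection maps back to the model dimension.

```python
# Sketch of multi-head attention; num_heads, d_model and W_o are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)  # output projection

    def forward(self, x):
        batch, seq, d_model = x.shape

        # Project, then split the last dimension into (num_heads, d_k)
        # so every head attends in parallel.
        def split(t):
            return t.view(batch, seq, self.num_heads, self.d_k).transpose(1, 2)

        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5    # per-head scores
        heads = F.softmax(scores, dim=-1) @ V                 # (batch, h, seq, d_k)

        # Concatenate the heads back into d_model, then project.
        concat = heads.transpose(1, 2).reshape(batch, seq, d_model)
        return self.W_o(concat)

mha = MultiHeadAttention()
x = torch.randn(2, 10, 512)
print(mha(x).shape)  # torch.Size([2, 10, 512])
```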
