Multi-head Attention Block

Attention Block

Each of Q, K, V is a linear projection of the token embeddings.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
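
A minimal sketch of this block in PyTorch, assuming illustrative dimensions (`d_model`, `d_k`) and weight names that are not from the original note. Q, K, V are produced by linear projections of the token embeddings, then combined with the softmax-scaled dot product above.

```python
# Sketch of scaled dot-product attention; d_model, d_k and the W_* names
# are illustrative assumptions.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (..., seq_q, seq_k)
    weights = F.softmax(scores, dim=-1)             # attention weights
    return weights @ V                              # (..., seq_q, d_v)

# Q, K, V as linear projections of token embeddings x: (batch, seq, d_model)
d_model, d_k = 512, 64
W_q = torch.nn.Linear(d_model, d_k, bias=False)
W_k = torch.nn.Linear(d_model, d_k, bias=False)
W_v = torch.nn.Linear(d_model, d_k, bias=False)

x = torch.randn(2, 10, d_model)                     # dummy token embeddings
out = scaled_dot_product_attention(W_q(x), W_k(x), W_v(x))
print(out.shape)                                    # torch.Size([2, 10, 64])
```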

Multi-head Attention

Now the same attention block is applied multiple times in parallel, each head with its own projections, and the results are concatenated, as in the sketch below.
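
A minimal multi-head attention sketch under the same assumptions (head count, `d_model`, and the output projection `W_o` are illustrative, following the standard transformer layout): each head runs the attention block above in parallel, the head outputs are concatenated, and a final linear projection maps back to the model dimension.

```python
# Sketch of multi-head attention; num_heads, d_model and W_o are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)  # output projection

    def forward(self, x):
        batch, seq, d_model = x.shape

        # Project, then split the last dimension into (num_heads, d_k)
        # so every head attends in parallel.
        def split(t):
            return t.view(batch, seq, self.num_heads, self.d_k).transpose(1, 2)

        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5    # per-head scores
        heads = F.softmax(scores, dim=-1) @ V                 # (batch, h, seq, d_k)

        # Concatenate the heads back into d_model, then project.
        concat = heads.transpose(1, 2).reshape(batch, seq, d_model)
        return self.W_o(concat)

mha = MultiHeadAttention()
x = torch.randn(2, 10, 512)
print(mha(x).shape)  # torch.Size([2, 10, 512])
```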
