One encoder layer is basically:
Multi-head self-attention
Add and Norm (residual connection, then layer normalization)
Feed forward network - 2 linear layers with ReLU in between
Add and Norm
In the original paper, this encoder layer is stacked 6 times.
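A rough NumPy sketch of the stack described above may help make the data flow concrete. It is only an illustration under stated assumptions: each of the two sublayers (self-attention and feed forward) is followed by a residual add and layer normalization, the learned scale and shift of layer norm, dropout, and positional encodings are left out, and the weights are random. The helper names (`init_layer`, `encoder_layer`, `encoder`) and the tiny sizes are made up for the example; the paper itself uses a model width of 512, 8 heads, and a feed-forward width of 2048.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean, unit variance.
    # The learned gain and bias are omitted in this sketch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, p, num_heads):
    # x: (seq_len, d_model). Self-attention: queries, keys, values all come from x.
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["wq"]), split(x @ p["wk"]), split(x @ p["wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores) @ v                          # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ p["wo"]

def feed_forward(x, p):
    # Two linear layers with ReLU in between, applied to every position.
    return np.maximum(0.0, x @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]

def encoder_layer(x, p, num_heads):
    x = layer_norm(x + multi_head_attention(x, p, num_heads))  # Add and Norm
    x = layer_norm(x + feed_forward(x, p))                     # Add and Norm
    return x

def init_layer(rng, d_model, d_ff, scale=0.1):
    # Hypothetical helper: random weights for one layer, for illustration only.
    w = lambda rows, cols: rng.normal(0.0, scale, (rows, cols))
    return {
        "wq": w(d_model, d_model), "wk": w(d_model, d_model),
        "wv": w(d_model, d_model), "wo": w(d_model, d_model),
        "w1": w(d_model, d_ff), "b1": np.zeros(d_ff),
        "w2": w(d_ff, d_model), "b2": np.zeros(d_model),
    }

def encoder(x, layers, num_heads):
    # Stack of identical layers, each with its own weights.
    for p in layers:
        x = encoder_layer(x, p, num_heads)
    return x

rng = np.random.default_rng(0)
layers = [init_layer(rng, d_model=8, d_ff=32) for _ in range(6)]
x = rng.normal(size=(5, 8))          # 5 tokens, 8 features each
out = encoder(x, layers, num_heads=2)
print(out.shape)                     # (5, 8)
```

The output has the same shape as the input, which is what lets the same layer be stacked repeatedly.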