In the second part of this workshop, we went over classical attention mechanisms (additive and dot-product) and the implementation of attention within the Transformer architecture.
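To make the contrast between the two classical mechanisms concrete, here is a minimal NumPy sketch of both score functions. The function names, weight shapes, and the single-query setup are illustrative assumptions, not code from the workshop slides.

```python
import numpy as np

def additive_attention_scores(query, keys, W_q, W_k, v):
    # Additive (Bahdanau-style) score: v^T tanh(W_q q + W_k k_t) for each key k_t.
    # query: (d_q,), keys: (T, d_k), W_q: (d_h, d_q), W_k: (d_h, d_k), v: (d_h,)
    hidden = np.tanh(W_q @ query + keys @ W_k.T)  # (T, d_h), broadcast over keys
    return hidden @ v                              # (T,) unnormalized scores

def dot_product_attention_scores(query, keys):
    # Scaled dot-product score: q . k_t / sqrt(d_k), as used in the Transformer.
    return keys @ query / np.sqrt(keys.shape[-1])  # (T,)

def softmax(x):
    # Numerically stable softmax turning scores into attention weights.
    e = np.exp(x - x.max())
    return e / e.sum()
```

Either set of scores is passed through the softmax to obtain attention weights that sum to one; the weighted sum of the values is then the attention output.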
Key papers:
Clarification on the attention mechanism table from slide 14:
The purpose of this table is to compare additive and Transformer attention. However, the comparison can only be made without expanding on where the Keys, Queries, and Values come from in Transformer attention. This is because the Transformer architecture applies multi-head attention (MHA) in different contexts: cross-attention (i.e. encoder-decoder attention) and self-attention. Feel free to check out the Transformer paper for more details, although we might cover this paper in more depth in a future session! See updated slide below:

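The distinction between the two contexts comes down to where Q, K, and V are projected from. A single-head sketch (multi-head attention would run several of these in parallel and concatenate the results) is shown below; the function names and weight shapes are illustrative assumptions, not the workshop's reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, with a row-wise stable softmax.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def self_attention(x, W_q, W_k, W_v):
    # Self-attention: Q, K, V are all projections of the SAME sequence x.
    return scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)

def cross_attention(dec, enc, W_q, W_k, W_v):
    # Cross-attention (encoder-decoder): Q comes from the decoder states,
    # while K and V come from the encoder outputs.
    return scaled_dot_product_attention(dec @ W_q, enc @ W_k, enc @ W_v)
```

Note that in cross-attention the output has one row per decoder position, even though the keys and values come from the (possibly longer) encoder sequence.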