Transformer Architecture
See RNN and Transformers (MIT 6.S191 2022) for the link to the video lecture.
Transformer architecture
- Identify parts to attend to
- Extract features with high attention
Attention has been used in:
- AlphaFold2: Uses Self-Attention
- BERT, GPT-3
- Vision Transformers in Computer Vision
1. Identifying parts to attend to is similar to a search problem (see the sketch after Figure 1)
- Enter a query (\(Q\)) for the search
- Extract key information \(K_i\) for each search result
- Compute how similar each key is to the query: the attention mask
- Extract the required information from the search, i.e. the value \(V_i\)
Figure 1: Attention as Search
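The search analogy can be made concrete in a few lines of NumPy: one query vector is compared against every key by a dot product, the similarities are passed through a softmax, and the resulting weights combine the values. This is only an illustrative sketch with made-up toy vectors, not code from the lecture.

```python
# Sketch of the "attention as search" analogy (toy vectors, assumed for illustration).
import numpy as np

query = np.array([1.0, 0.0, 1.0])           # what we are looking for (Q)
keys = np.array([[1.0, 0.0, 1.0],           # key K_i for each "search result"
                 [0.0, 1.0, 0.0],
                 [1.0, 1.0, 0.0]])
values = np.array([[10.0, 0.0],             # value V_i: the information to extract
                   [0.0, 10.0],
                   [5.0, 5.0]])

# Similarity between the query and each key (dot product), turned into
# attention weights with a softmax.
scores = keys @ query
weights = np.exp(scores) / np.exp(scores).sum()

# Output: the attention-weighted combination of the values.
output = weights @ values
print(weights, output)
```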
2. Self-Attention in Sequence Modelling
Goal: Identify and attend to the most important features in the input.
We want to eliminate recurrence, because recurrence is what gave rise to the limitations of RNNs. Without recurrence, though, we need to encode position information explicitly.
Figure 2: Position-Aware Encoding (@ 0:48:32)
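The lecture does not spell out the exact encoding here; one common choice (from the original Transformer paper, Vaswani et al. 2017) is a sinusoidal positional encoding added to the word embeddings. A minimal sketch, assuming that choice; the function name and dimensions are mine, not from the lecture:

```python
# Sketch of sinusoidal positional encoding; added to the word embeddings so the
# model sees token order without recurrence.
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                    # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                 # even dims: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                 # odd dims: cosine
    return pe

# Position-aware encoding = word embedding + positional encoding, e.g.
# x = embeddings + positional_encoding(seq_len, d_model)
```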
- Extract query, key, and value for the search
- Multiply the position-aware encoding by three learned weight matrices to get the query, key, and value encodings for each word
- Compute the attention weighting (a matrix of post-softmax attention scores)
Compute the pairwise similarity between each query and key via a dot product (0:51:01):
Attention Score = \(\frac{Q K^T}{\sqrt{d_k}}\), where \(\sqrt{d_k}\) is the scaling factor and \(d_k\) is the key dimension
- Apply a softmax to the attention scores to get values in \([0, 1]\)
- Extract features with high attention: multiply the attention weighting by the value matrix \(V\) (the full head is sketched in code after Figure 3)
Figure 3: Self-Attention Head
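Putting the steps above together, here is a minimal NumPy sketch of a single self-attention head. The weight matrices are random stand-ins for learned parameters, and the names and dimensions (d_model, d_k) are assumptions for illustration, not values from the lecture.

```python
# Minimal single self-attention head (scaled dot-product attention) in NumPy.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(x, W_q, W_k, W_v):
    """x: (seq_len, d_model) position-aware encodings of the input sequence."""
    Q = x @ W_q                                  # queries
    K = x @ W_k                                  # keys
    V = x @ W_v                                  # values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # pairwise query-key similarity, scaled
    A = softmax(scores, axis=-1)                 # post-softmax attention weighting
    return A @ V, A                              # features weighted by attention

# Toy usage with random, untrained weights (illustrative only).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, attn = self_attention_head(x, W_q, W_k, W_v)
print(out.shape, attn.shape)                     # (4, 8) (4, 4)
```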