
Date: [2023-03-17 Fri]

Transformer Architecture


See RNN and Transformers (MIT 6.S191 2022) for a link to the video lecture.

Transformer architecture

Attention has been used in:

1. Identifying parts to attend to is similar to a search problem

  • Enter a query (\(Q\)) for the search
  • Extract key information \(K_i\) for each search result
  • Compute how similar each key is to the query: the attention mask
  • Extract the required information from the search, i.e. the value \(V\) (a small numeric sketch follows Figure 1)

attention_as_search-20230316105659.png

Figure 1: Attention as Search
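
As a toy numeric sketch of the search analogy (all vectors and values below are made-up illustrations, not from the lecture): the query is scored against every key, the scores are softmax-normalised into an attention mask, and the values are blended by those weights.

  import numpy as np

  # Toy "search": one query is compared against the keys of three stored results.
  # All numbers here are invented purely for illustration.
  query = np.array([1.0, 0.0])                      # Q: what we are searching for
  keys = np.array([[0.9, 0.1],                      # K_1..K_3: key info of each result
                   [0.0, 1.0],
                   [0.7, 0.3]])
  values = np.array([10.0, 20.0, 30.0])             # V_1..V_3: content of each result

  scores = keys @ query                             # similarity of the query to each key
  weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention mask in [0, 1]
  result = weights @ values                         # blend the values by their attention

  print(weights)   # highest weight goes to the result whose key best matches the query
  print(result)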

2. Self-Attention in Sequence Modelling

Goal: Identify and attend to the most important features in the input

  1. We want to eliminate recurrence, because that is what gave rise to the limitations of RNNs. So we need another way to encode position information (a sketch of one common scheme follows Figure 2)

    position_aware_encoding-20230316113706.png

    Figure 2: Position-Aware Encoding (@ 0:48:32)
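
    One common choice for the position encoding is the fixed sinusoidal scheme from the original Transformer paper; the lecture may instead use a learned positional embedding, so treat this sketch as an assumption rather than the lecture's exact method.

      import numpy as np

      def sinusoidal_position_encoding(seq_len, d_model):
          """Return a (seq_len, d_model) matrix of fixed sinusoidal position encodings."""
          positions = np.arange(seq_len)[:, None]                     # (seq_len, 1)
          dims = np.arange(d_model)[None, :]                          # (1, d_model)
          angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
          angles = positions * angle_rates
          encoding = np.zeros((seq_len, d_model))
          encoding[:, 0::2] = np.sin(angles[:, 0::2])                 # even dimensions: sine
          encoding[:, 1::2] = np.cos(angles[:, 1::2])                 # odd dimensions: cosine
          return encoding

      # Position-aware encoding = word embedding + position encoding.
      # `word_embeddings` is a random stand-in for a real learned embedding lookup.
      word_embeddings = np.random.randn(6, 16)
      position_aware = word_embeddings + sinusoidal_position_encoding(6, 16)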

  2. Extract query, key, and value for the search
    • Multiply the position-aware encoding by three learned weight matrices to get the query, key, and value encodings for each word
  3. Compute the attention weighting (a matrix of post-softmax attention scores)
    • Compute the pairwise similarity between each query and key => Dot Product (0:51:01)

      Attention Score = \(\frac{Q \cdot K^T}{\sqrt{d_k}}\), where \(d_k\) is the key dimension

    • Apply softmax to the attention scores to get values in \([0, 1]\)
  4. Extract the features with high attention: multiply the attention weighting with the value \(V\) (steps 2-4 are sketched in code after Figure 3)

self_attention_head-20230316114501.png

Figure 3: Self-Attention Head
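
Putting steps 2-4 together, here is a minimal single-head self-attention sketch in NumPy (the weight matrices are random stand-ins for learned parameters, and the scaling uses the conventional \(\sqrt{d_k}\)):

  import numpy as np

  def softmax(x, axis=-1):
      x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
      e = np.exp(x)
      return e / e.sum(axis=axis, keepdims=True)

  def self_attention_head(X, W_q, W_k, W_v):
      """X: (seq_len, d_model) position-aware encodings -> (seq_len, d_v) attended features."""
      Q = X @ W_q                                    # step 2: query for each word
      K = X @ W_k                                    #         key for each word
      V = X @ W_v                                    #         value for each word
      d_k = K.shape[-1]
      scores = Q @ K.T / np.sqrt(d_k)                # step 3: pairwise similarity, scaled
      A = softmax(scores, axis=-1)                   #         post-softmax attention weighting
      return A @ V                                   # step 4: weight the values by attention

  # Random stand-ins just to show the shapes end to end.
  seq_len, d_model, d_k, d_v = 5, 16, 8, 8
  X = np.random.randn(seq_len, d_model)              # position-aware encodings
  W_q = np.random.randn(d_model, d_k)
  W_k = np.random.randn(d_model, d_k)
  W_v = np.random.randn(d_model, d_v)
  features = self_attention_head(X, W_q, W_k, W_v)   # (seq_len, d_v)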

