1.3 Single attention head
Building a context-specific representation of a token.
Diagram 1.3.0: The Transformer, Vaswani et al. (2017)
Self-attention purpose
When the model is being used, attention is concerned with finding how strongly tokens are interrelated, based on specific types of relations.
For example, in the context of translation, words cannot be translated word-by-word independently, due to differences in sentence structures across languages, such as English’s subject-verb-object pattern:
“The quick brown fox jumped over the lazy dog.”
Compare this to the subject-object-verb pattern found in other languages (e.g. Japanese): “素早い茶色のキツネが怠け者の犬を飛び越えました”
Note that the が particle marks the subject, を marks the object (which precedes the verb), and the sentence ends in the verb, ました being the polite past-tense conjugation.
As a result, the grammatical relations between the words must be captured. This information can then be encoded into each token’s vector.
Self-attention theory
The way this encoding is done is that the vector representation of each token up to this point, xi (note the arrows in diagram 1.3.0), is multiplied by 3 weight matrices:

$$q_i = x_i W_Q \qquad k_i = x_i W_K \qquad v_i = x_i W_V$$
This is repeated for all the vector representations (one representation per token) in the input sequence, which results in matrices Q, K, and V.
Mathematically, in the Vaswani et al. Transformer, a single self-attention head then calculates its output as follows:

$$\text{attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$
Inspecting each vector individually instead (i.e. qi rather than Q), and looking only at the core calculation between each xi and each vector xj that precedes it, attention(Q, K, V) can be thought of as many instances of the following computation occurring concurrently:

$$\text{attention}(q_i) = \sum_{j \leq i} \text{strength}(q_i, k_j)\, v_j \qquad \text{strength}(q_i, k_j) = \frac{\exp\left(q_i \cdot k_j / \sqrt{d_k}\right)}{\sum_{j' \leq i} \exp\left(q_i \cdot k_{j'} / \sqrt{d_k}\right)}$$
Here, strength() measures the strength of the relation between the vector representations (of the tokens), and dk is the dimensionality of the key vectors. The dot product between qi and kj indicates how closely related those two vectors are.
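As a rough, minimal sketch of this per-vector view (assuming NumPy, with hypothetical randomly generated weights and token vectors rather than any particular trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 4
W_Q, W_K, W_V = (rng.normal(size=(d_k, d_k)) for _ in range(3))

# Hypothetical 4-dimensional token representations; the last row is the
# token in focus (x_i), the earlier rows are the tokens that precede it.
xs = rng.normal(size=(3, d_k))
q_i = xs[-1] @ W_Q          # query for the token in focus
ks = xs @ W_K               # keys for it and its predecessors
vs = xs @ W_V               # values for it and its predecessors

# strength(): scaled dot products, normalised into a distribution
scores = ks @ q_i / np.sqrt(d_k)
strength = np.exp(scores - scores.max())
strength /= strength.sum()

context = strength @ vs     # context-specific representation of the token
print(strength, context)
```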
You may be wondering: what relation exactly? Well, different values of WQ, WK, and WV will emphasise and de-emphasise different parts of the vector representation xi and of the preceding representations xj. For example, an attention head may be oriented around identifying the subject and verb of the input sequence, with the verb querying the preceding tokens, for example by looking for metadata relating to nouns. The attention scores then emphasise or de-emphasise the data of the preceding tokens, in the form of V.
Diagram 1.3.1: a non-numerical example of how a weight matrix may extract relevant metadata from vectors.
To better suit the available hardware (SIMD processors, i.e. GPUs), the vectors are packed into matrices, and the operations are performed concurrently via matrix algebra.
As a result of the above, the matrix M, seen in the attention(Q, K, V) function, becomes relevant. M stands for masking; it causes tokens that come after qi to be ignored during the matrix algebra, by setting their scores to negative infinity, which in turn causes the softmax() function to assign those future tokens zero weight. An example of M can be seen in the upcoming step-by-step example.
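The whole head can be expressed compactly. Below is a minimal NumPy sketch of the masked single-head self-attention described above, not a reference implementation; the inputs at the bottom are purely hypothetical random values.

```python
import numpy as np

def attention_head(X, W_Q, W_K, W_V):
    """One masked self-attention head: softmax(QKᵀ/√dk + M)V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Causal mask M: negative infinity above the diagonal, so that each
    # token only attends to itself and the tokens before it.
    M = np.triu(np.full(scores.shape, -np.inf), k=1)
    scores = scores + M
    # Row-wise softmax: each row becomes a probability distribution.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Hypothetical usage: 6 tokens, 4 dimensions per token.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))
print(attention_head(X, W_Q, W_K, W_V).shape)  # (6, 4)
```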
Self-attention head application
This self-attention mechanism can be broken down into a step-by-step sequence. Note that the numbers in the following example sequence are randomly generated, so the sample matrices are illustrative rather than fully consistent with one another (it is difficult to generate or find examples of specific relations).
Example 1
- All the input vectors xi (such as those output from the positional encoding step) are stacked into a matrix X; for example, 6 input vectors of dimension 4 are arranged as a 6x4 matrix.
| 0.1 | -0.2 | 0.1 | 0.2 |
| 0.8 | 0.8 | -0.9 | 0.5 |
| -0.6 | 0.4 | 0.5 | -0.9 |
| 0.9 | -0.7 | 0.8 | 0.7 |
| 0.4 | 0.3 | 0.4 | 0.4 |
| 0.8 | 0.2 | 0.9 | -0.2 |
Sample matrix X, above, could represent an input sequence such as the sentence “List new ideas for a song”, one word per xi row, in the simple case of 4 dimensions per token.
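For concreteness, assuming NumPy, the sample matrix X above can be written down directly:

```python
import numpy as np

# Matrix X from the table above: one 4-dimensional row vector per token
# of "List new ideas for a song".
X = np.array([
    [ 0.1, -0.2,  0.1,  0.2],
    [ 0.8,  0.8, -0.9,  0.5],
    [-0.6,  0.4,  0.5, -0.9],
    [ 0.9, -0.7,  0.8,  0.7],
    [ 0.4,  0.3,  0.4,  0.4],
    [ 0.8,  0.2,  0.9, -0.2],
])
print(X.shape)  # (6, 4)
```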
- The matrix X is multiplied by 3 weight matrices of equal dimensions (in this example, 4x4): one matrix WQ representing query weights, one matrix WK representing key weights, and one matrix WV representing value weights, to generate three new matrices Q, K, and V.
Q = XWQ
K = XWK
V = XWV
| 3.4 | -2.1 | 0.8 | 1.9 |
| -1.2 | 0.5 | 3.7 | -0.8 |
| 2.9 | 1.1 | -3.4 | 0.6 |
| 0.7 | 2.8 | 1.3 | -2.5 |
Sample matrix WQ, above.
| 1.01 | 0.36 | -0.74 | -0.09 |
| -0.50 | -0.87 | 4.49 | -0.23 |
| 2.17 | -1.99 | -3.92 | 1.71 |
| 1.43 | 2.59 | 1.97 | -3.19 |
| 0.67 | 0.98 | 0.61 | -0.31 |
| 1.97 | 0.52 | 4.01 | -0.51 |
Sample matrix Q, above.
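The projection step can be sketched as follows, assuming NumPy. Only WQ is listed in this example, so WK and WV are random stand-ins here; and because the sample tables are randomly generated rather than derived from X and WQ, the computed Q will not match the sample Q above. The point is the operation, not the numbers.

```python
import numpy as np

X = np.array([[ 0.1, -0.2,  0.1,  0.2], [ 0.8,  0.8, -0.9,  0.5],
              [-0.6,  0.4,  0.5, -0.9], [ 0.9, -0.7,  0.8,  0.7],
              [ 0.4,  0.3,  0.4,  0.4], [ 0.8,  0.2,  0.9, -0.2]])
W_Q = np.array([[ 3.4, -2.1,  0.8,  1.9], [-1.2,  0.5,  3.7, -0.8],
                [ 2.9,  1.1, -3.4,  0.6], [ 0.7,  2.8,  1.3, -2.5]])
# W_K and W_V are not given in the example, so random stand-ins are used.
rng = np.random.default_rng(0)
W_K, W_V = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V  # each is a 6x4 matrix
print(Q)
```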
- The dot product of each query vector qi (a query vector being a row of Q) with every key vector kj is calculated via the transposed key matrix KT, to generate attention scores (QKT). A higher attention score between a given qi and a given kj indicates that the query is more similar to that key, and therefore that there is a stronger relation.
attention(Q, K) = QKT
| 0.15 | 1.94 | -0.98 | 2.53 | 1.24 | 2.18 |
| 0.16 | 1.44 | 0.64 | -2.33 | 0.81 | 0.44 |
| 0.25 | 0.35 | 1.25 | 2.24 | 1.14 | 3.25 |
| 0.28 | 1.15 | -0.81 | 1.93 | -0.81 | -0.28 |
Sample matrix KT, above.
| -0.01 | 2.12 | -1.62 | 3.45 | 1.83 | 3.29 |
| -1.37 | 0.51 | 7.15 | -3.39 | 2.35 | 2.99 |
| 3.49 | -6.19 | -9.39 | 11.19 | -3.19 | -5.67 |
| 2.23 | 3.19 | 2.35 | -0.67 | 1.35 | 4.37 |
| 0.83 | 1.42 | 0.95 | 0.27 | 0.69 | 1.83 |
| 3.35 | 1.35 | 10.93 | -0.95 | 3.35 | 5.95 |
Sample matrix QKT, above.
- These attention scores are then normalised by dividing by the square root of the key dimensionality dk (in this example, √4 = 2): QKT / √dk.
| -0.005 | 1.06 | -0.81 | 1.725 | 0.915 | 1.645 |
| -0.685 | 0.255 | 3.575 | -1.695 | 1.175 | 1.495 |
| 1.745 | -3.095 | -4.695 | 5.595 | -1.595 | -2.835 |
| 1.115 | 1.595 | 1.175 | -0.335 | 0.675 | 2.185 |
| 0.415 | 0.71 | 0.475 | 0.135 | 0.345 | 0.915 |
| 1.675 | 0.675 | 5.465 | -0.475 | 1.675 | 2.975 |
Sample matrix QKT, above, where all cells have been divided by √4.
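Assuming NumPy, this scaling step can be checked directly from the sample QKT matrix above; dividing by √4 reproduces the scaled table (up to floating-point display):

```python
import numpy as np

# Sample QKT matrix, copied from the table above.
raw_scores = np.array([[-0.01,  2.12, -1.62,  3.45,  1.83,  3.29],
                       [-1.37,  0.51,  7.15, -3.39,  2.35,  2.99],
                       [ 3.49, -6.19, -9.39, 11.19, -3.19, -5.67],
                       [ 2.23,  3.19,  2.35, -0.67,  1.35,  4.37],
                       [ 0.83,  1.42,  0.95,  0.27,  0.69,  1.83],
                       [ 3.35,  1.35, 10.93, -0.95,  3.35,  5.95]])
d_k = 4
print(raw_scores / np.sqrt(d_k))  # matches the scaled matrix above
```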
- In the decoder, a masking matrix M is used to set the attention scores between each queried token and all future tokens to negative infinity, such that only attention scores between a given token and the tokens previously generated are used: QKT / √dk + M.
| 0 | -∞ | -∞ | -∞ | -∞ | -∞ |
| 0 | 0 | -∞ | -∞ | -∞ | -∞ |
| 0 | 0 | 0 | -∞ | -∞ | -∞ |
| 0 | 0 | 0 | 0 | -∞ | -∞ |
| 0 | 0 | 0 | 0 | 0 | -∞ |
| 0 | 0 | 0 | 0 | 0 | 0 |
Sample matrix M, above.
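Assuming NumPy, this mask can be built in one line:

```python
import numpy as np

n = 6  # sequence length
# Negative infinity above the diagonal, 0 on and below it, so that each
# token only attends to itself and the tokens before it.
M = np.triu(np.full((n, n), -np.inf), k=1)
print(M)
```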
- Softmax is run on the attention scores, once per row, nullifying the infinitely negative scores and turning each row into a probability distribution over the tokens visible to that query: softmax(QKT / √dk + M).
| 1 | 0 | 0 | 0 | 0 | 0 |
| 0.281 | 0.719 | 0 | 0 | 0 | 0 |
| 0.991 | 0.008 | 0.002 | 0 | 0 | 0 |
| 0.256 | 0.413 | 0.271 | 0.060 | 0 | 0 |
| 0.196 | 0.264 | 0.208 | 0.148 | 0.183 | 0 |
| 0.020 | 0.007 | 0.878 | 0.002 | 0.020 | 0.073 |
Sample matrix after softmax() has been applied; note that the cells in each row add up to 1 (up to rounding).
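Assuming NumPy, this softmax step can be reproduced from the scaled scores and the mask; the printed weights should match the table above up to rounding:

```python
import numpy as np

# Scaled attention scores (QKT / √4), copied from the earlier table.
scaled = np.array([[-0.005,  1.06,  -0.81,   1.725,  0.915,  1.645],
                   [-0.685,  0.255,  3.575, -1.695,  1.175,  1.495],
                   [ 1.745, -3.095, -4.695,  5.595, -1.595, -2.835],
                   [ 1.115,  1.595,  1.175, -0.335,  0.675,  2.185],
                   [ 0.415,  0.71,   0.475,  0.135,  0.345,  0.915],
                   [ 1.675,  0.675,  5.465, -0.475,  1.675,  2.975]])
masked = scaled + np.triu(np.full(scaled.shape, -np.inf), k=1)

# Softmax along each row; the -inf entries become exactly 0.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 3))
```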
- Each row's probability distribution is then applied to the values within V, so that the input tokens with the largest influence on the token under focus contribute most to its new representation: softmax(QKT / √dk + M) V.
| 0.43 | 0.29 | 0.64 | 0.83 |
| 1.15 | 1.19 | -0.28 | 0.67 |
| -0.21 | 0.64 | 0.59 | -0.82 |
| 1.19 | -0.45 | 0.86 | 0.99 |
| 0.51 | 0.41 | 0.51 | 0.51 |
| 1.19 | 0.28 | 0.95 | -0.15 |
Sample matrix V, above.
| 0.430 | 0.290 | 0.640 | 0.830 |
| 0.948 | 0.937 | -0.021 | 0.715 |
| 0.435 | 0.298 | 0.633 | 0.826 |
| 0.600 | 0.712 | 0.260 | 0.326 |
| 0.614 | 0.513 | 0.395 | 0.409 |
| -0.068 | 0.604 | 0.610 | -0.697 |
Sample matrix of the final output of one self-attention head, computed from the rounded attention weights above; at this point in the Transformer, each token has both positional data and data regarding relations to the other tokens embedded into one vector.
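Assuming NumPy, this final step can be reproduced from the rounded attention weights and the sample V; the result matches the output table above up to rounding:

```python
import numpy as np

# Rounded attention weights from the softmax step above.
weights = np.array([[1.0,   0,     0,     0,     0,     0    ],
                    [0.281, 0.719, 0,     0,     0,     0    ],
                    [0.991, 0.008, 0.002, 0,     0,     0    ],
                    [0.256, 0.413, 0.271, 0.060, 0,     0    ],
                    [0.196, 0.264, 0.208, 0.148, 0.183, 0    ],
                    [0.020, 0.007, 0.878, 0.002, 0.020, 0.073]])
# Sample matrix V, copied from the table above.
V = np.array([[ 0.43,  0.29,  0.64,  0.83], [ 1.15,  1.19, -0.28,  0.67],
              [-0.21,  0.64,  0.59, -0.82], [ 1.19, -0.45,  0.86,  0.99],
              [ 0.51,  0.41,  0.51,  0.51], [ 1.19,  0.28,  0.95, -0.15]])
print(np.round(weights @ V, 3))  # one output row per token
```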
More algebraic, implicit examples of the matrix transformations are available.[1]
Note that, in real LLM implementations, the dimensionalities involved may be less uniform than in the above example (Llama-4, for instance, exposes them as configurable hyperparameters).[2][3]
References
[1] Transformers: a Primer - Columbia University
[2] Speech and Language Processing, Chapter 8: Transformers - Stanford
[3] Llama-4 documentation - HuggingFace