1.3 Single attention head
Building a context-specific representation of a token.
Diagram 1.3.0: The Transformer, Vaswani et al. (2017)
Self-attention purpose
When the model is being used, attention is concerned with finding how strongly tokens are interrelated, based on specific types of relations.
For example, in the context of translation, words cannot be translated word-by-word independently, due to differences in sentence structures across languages, such as English’s subject-verb-object pattern:
“The quick brown fox jumped over the lazy dog.”
Compare this to the subject-object-verb pattern found in other languages (e.g. Japanese): “素早い茶色のキツネが怠け者の犬を飛び越えました”
Note that the が particle marks the subject, を marks the object (which precedes the verb), and the sentence ends in the verb, ました being the polite past-tense conjugation.
As a result, the grammatical relations between the words must be captured. This information can then be encoded into each token’s vector.
Self-attention theory
The way this encoding is done is that the vector representation of each token up to this point, xi (note the arrows in diagram 1.3.0), is multiplied by 3 weight matrices:

$$q_i = x_i W_Q \qquad k_i = x_i W_K \qquad v_i = x_i W_V$$
This is repeated for all the vector representations (one representation per token) in the input sequence, which results in matrices Q, K, and V.
Mathematically, in the Vaswani et al. Transformer, a single self-attention head then calculates its output as follows:

$$\text{attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$
Inspecting each vector individually instead (i.e. qi rather than Q), and looking only at the core calculation between each xi and each vector xj that precedes it, attention(Q, K, V) can be thought of as many instances of the following computation occurring concurrently:

$$\text{attention}(q_i) = \sum_{j \leq i} \text{strength}(q_i, k_j)\, v_j \qquad \text{strength}(q_i, k_j) = \frac{\exp\left(q_i \cdot k_j / \sqrt{d_k}\right)}{\sum_{j' \leq i} \exp\left(q_i \cdot k_{j'} / \sqrt{d_k}\right)}$$
Here, strength() measures the strength of the relation between the vector representations (of the tokens), and dk is the dimensionality of the key vectors. The dot product between qi and kj indicates how closely related those two vectors are.
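As a rough, minimal sketch of this per-vector view (assuming NumPy, with hypothetical randomly generated weights and token vectors rather than any particular trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 4
W_Q, W_K, W_V = (rng.normal(size=(d_k, d_k)) for _ in range(3))

# Hypothetical 4-dimensional token representations; the last row is the
# token in focus (x_i), the earlier rows are the tokens that precede it.
xs = rng.normal(size=(3, d_k))
q_i = xs[-1] @ W_Q          # query for the token in focus
ks = xs @ W_K               # keys for it and its predecessors
vs = xs @ W_V               # values for it and its predecessors

# strength(): scaled dot products, normalised into a distribution
scores = ks @ q_i / np.sqrt(d_k)
strength = np.exp(scores - scores.max())
strength /= strength.sum()

context = strength @ vs     # context-specific representation of the token
print(strength, context)
```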
You may be wondering: what relation exactly? Well, different values of WQ, WK, and WV will emphasise and de-emphasise different parts of the vector representation xi and of the preceding representations xj. For example, an attention head may be oriented around identifying the subject and verb of the input sequence, with the verb querying the preceding tokens, for example by looking for metadata relating to nouns. The attention scores then emphasise or de-emphasise the data of the preceding tokens, in the form of V.
Diagram 1.3.1: a non-numerical example of how a weight matrix may extract relevant metadata from vectors.
To better suit the available hardware (SIMD processors, i.e. GPUs), the vectors are packed into matrices, and the operations are performed concurrently via matrix algebra.
As a result of the above, the matrix M, seen in the attention(Q, K, V) function, becomes relevant. M stands for masking; it causes tokens that come after qi to be ignored during the matrix algebra, by setting their scores to negative infinity, which in turn causes the softmax() function to assign those future tokens zero weight. An example of M can be seen in the upcoming step-by-step example.
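The whole head can be expressed compactly. Below is a minimal NumPy sketch of the masked single-head self-attention described above, not a reference implementation; the inputs at the bottom are purely hypothetical random values.

```python
import numpy as np

def attention_head(X, W_Q, W_K, W_V):
    """One masked self-attention head: softmax(QKᵀ/√dk + M)V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Causal mask M: negative infinity above the diagonal, so that each
    # token only attends to itself and the tokens before it.
    M = np.triu(np.full(scores.shape, -np.inf), k=1)
    scores = scores + M
    # Row-wise softmax: each row becomes a probability distribution.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Hypothetical usage: 6 tokens, 4 dimensions per token.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))
print(attention_head(X, W_Q, W_K, W_V).shape)  # (6, 4)
```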
Self-attention head application
This self-attention mechanism can be broken down into a step-by-step sequence. Note that the numbers in the following example sequence are randomly generated, so the sample matrices are illustrative rather than fully consistent with one another (it is difficult to generate or find examples of specific relations).
Example 1
- All the input vectors xi (such as those output from the positional encoding step) are stacked into a matrix X; for example, 6 input vectors of dimension 4 are arranged as a 6x4 matrix.
| 0.1 | -0.2 | 0.1 | 0.2 |
| 0.8 | 0.8 | -0.9 | 0.5 |
| -0.6 | 0.4 | 0.5 | -0.9 |
| 0.9 | -0.7 | 0.8 | 0.7 |
| 0.4 | 0.3 | 0.4 | 0.4 |
| 0.8 | 0.2 | 0.9 | -0.2 |
Sample matrix X, above, could represent an input sequence such as the sentence “List new ideas for a song”, one word per xi row, in the simple case of 4 dimensions per token.
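For concreteness, assuming NumPy, the sample matrix X above can be written down directly:

```python
import numpy as np

# Matrix X from the table above: one 4-dimensional row vector per token
# of "List new ideas for a song".
X = np.array([
    [ 0.1, -0.2,  0.1,  0.2],
    [ 0.8,  0.8, -0.9,  0.5],
    [-0.6,  0.4,  0.5, -0.9],
    [ 0.9, -0.7,  0.8,  0.7],
    [ 0.4,  0.3,  0.4,  0.4],
    [ 0.8,  0.2,  0.9, -0.2],
])
print(X.shape)  # (6, 4)
```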
- The matrix X is multiplied by 3 weight matrices of equal dimensions (in this example, 4x4): one matrix WQ representing query weights, one matrix WK representing key weights, and one matrix WV representing value weights, to generate three new matrices Q, K, and V.
Q = XWQ
K = XWK
V = XWV
| 3.4 | -2.1 | 0.8 | 1.9 |
| -1.2 | 0.5 | 3.7 | -0.8 |
| 2.9 | 1.1 | -3.4 | 0.6 |
| 0.7 | 2.8 | 1.3 | -2.5 |
Sample matrix WQ, above.
| 1.01 | 0.36 | -0.74 | -0.09 |
| -0.50 | -0.87 | 4.49 | -0.23 |
| 2.17 | -1.99 | -3.92 | 1.71 |
| 1.43 | 2.59 | 1.97 | -3.19 |
| 0.67 | 0.98 | 0.61 | -0.31 |
| 1.97 | 0.52 | 4.01 | -0.51 |
Sample matrix Q, above.
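The projection step can be sketched as follows, assuming NumPy. Only WQ is listed in this example, so WK and WV are random stand-ins here; and because the sample tables are randomly generated rather than derived from X and WQ, the computed Q will not match the sample Q above. The point is the operation, not the numbers.

```python
import numpy as np

X = np.array([[ 0.1, -0.2,  0.1,  0.2], [ 0.8,  0.8, -0.9,  0.5],
              [-0.6,  0.4,  0.5, -0.9], [ 0.9, -0.7,  0.8,  0.7],
              [ 0.4,  0.3,  0.4,  0.4], [ 0.8,  0.2,  0.9, -0.2]])
W_Q = np.array([[ 3.4, -2.1,  0.8,  1.9], [-1.2,  0.5,  3.7, -0.8],
                [ 2.9,  1.1, -3.4,  0.6], [ 0.7,  2.8,  1.3, -2.5]])
# W_K and W_V are not given in the example, so random stand-ins are used.
rng = np.random.default_rng(0)
W_K, W_V = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V  # each is a 6x4 matrix
print(Q)
```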
- The dot product of each query vector qi (a query vector being a row of Q) with every key vector kj is calculated via the transposed key matrix KT, to generate attention scores (QKT). A higher attention score between a given qi and a given kj indicates that the query is more similar to that key, and therefore that there is a stronger relation.
attention(Q, K) = QKT
| 0.15 | 1.94 | -0.98 | 2.53 | 1.24 | 2.18 |
| 0.16 | 1.44 | 0.64 | -2.33 | 0.81 | 0.44 |
| 0.25 | 0.35 | 1.25 | 2.24 | 1.14 | 3.25 |
| 0.28 | 1.15 | -0.81 | 1.93 | -0.81 | -0.28 |
Sample matrix KT, above.
| -0.01 | 2.12 | -1.62 | 3.45 | 1.83 | 3.29 |
| -1.37 | 0.51 | 7.15 | -3.39 | 2.35 | 2.99 |
| 3.49 | -6.19 | -9.39 | 11.19 | -3.19 | -5.67 |
| 2.23 | 3.19 | 2.35 | -0.67 | 1.35 | 4.37 |
| 0.83 | 1.42 | 0.95 | 0.27 | 0.69 | 1.83 |
| 3.35 | 1.35 | 10.93 | -0.95 | 3.35 | 5.95 |
Sample matrix QKT, above.
- These attention scores are then normalised by dividing by the square root of the key dimensionality dk (in this example, √4 = 2): QKT / √dk.
| -0.005 | 1.06 | -0.81 | 1.725 | 0.915 | 1.645 |
| -0.685 | 0.255 | 3.575 | -1.695 | 1.175 | 1.495 |
| 1.745 | -3.095 | -4.695 | 5.595 | -1.595 | -2.835 |
| 1.115 | 1.595 | 1.175 | -0.335 | 0.675 | 2.185 |
| 0.415 | 0.71 | 0.475 | 0.135 | 0.345 | 0.915 |
| 1.675 | 0.675 | 5.465 | -0.475 | 1.675 | 2.975 |
Sample matrix QKT, above, where all cells have been divided by √4.
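Assuming NumPy, this scaling step can be checked directly from the sample QKT matrix above; dividing by √4 reproduces the scaled table (up to floating-point display):

```python
import numpy as np

# Sample QKT matrix, copied from the table above.
raw_scores = np.array([[-0.01,  2.12, -1.62,  3.45,  1.83,  3.29],
                       [-1.37,  0.51,  7.15, -3.39,  2.35,  2.99],
                       [ 3.49, -6.19, -9.39, 11.19, -3.19, -5.67],
                       [ 2.23,  3.19,  2.35, -0.67,  1.35,  4.37],
                       [ 0.83,  1.42,  0.95,  0.27,  0.69,  1.83],
                       [ 3.35,  1.35, 10.93, -0.95,  3.35,  5.95]])
d_k = 4
print(raw_scores / np.sqrt(d_k))  # matches the scaled matrix above
```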
- In the decoder, a masking matrix M is used to set the attention scores between each queried token and all future tokens to negative infinity, such that only attention scores between a given token and the tokens previously generated are used: QKT / √dk + M.
| 0 | -∞ | -∞ | -∞ | -∞ | -∞ |
| 0 | 0 | -∞ | -∞ | -∞ | -∞ |
| 0 | 0 | 0 | -∞ | -∞ | -∞ |
| 0 | 0 | 0 | 0 | -∞ | -∞ |
| 0 | 0 | 0 | 0 | 0 | -∞ |
| 0 | 0 | 0 | 0 | 0 | 0 |
Sample matrix M, above.
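Assuming NumPy, this mask can be built in one line:

```python
import numpy as np

n = 6  # sequence length
# Negative infinity above the diagonal, 0 on and below it, so that each
# token only attends to itself and the tokens before it.
M = np.triu(np.full((n, n), -np.inf), k=1)
print(M)
```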
- Softmax is run on the attention scores, once per row, nullifying the infinitely negative scores and turning each row into a probability distribution over the tokens visible to that query: softmax(QKT / √dk + M).
| 1 | 0 | 0 | 0 | 0 | 0 |
| 0.281 | 0.719 | 0 | 0 | 0 | 0 |
| 0.991 | 0.008 | 0.002 | 0 | 0 | 0 |
| 0.256 | 0.413 | 0.271 | 0.060 | 0 | 0 |
| 0.196 | 0.264 | 0.208 | 0.148 | 0.183 | 0 |
| 0.020 | 0.007 | 0.878 | 0.002 | 0.020 | 0.073 |
Sample matrix after softmax() has been applied; note that the cells in each row add up to 1 (up to rounding).
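Assuming NumPy, this softmax step can be reproduced from the scaled scores and the mask; the printed weights should match the table above up to rounding:

```python
import numpy as np

# Scaled attention scores (QKT / √4), copied from the earlier table.
scaled = np.array([[-0.005,  1.06,  -0.81,   1.725,  0.915,  1.645],
                   [-0.685,  0.255,  3.575, -1.695,  1.175,  1.495],
                   [ 1.745, -3.095, -4.695,  5.595, -1.595, -2.835],
                   [ 1.115,  1.595,  1.175, -0.335,  0.675,  2.185],
                   [ 0.415,  0.71,   0.475,  0.135,  0.345,  0.915],
                   [ 1.675,  0.675,  5.465, -0.475,  1.675,  2.975]])
masked = scaled + np.triu(np.full(scaled.shape, -np.inf), k=1)

# Softmax along each row; the -inf entries become exactly 0.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 3))
```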
- Each row's probability distribution is then applied to the values within V, so that the input tokens with the largest influence on the token under focus contribute most to its new representation: softmax(QKT / √dk + M) V.
| 0.43 | 0.29 | 0.64 | 0.83 |
| 1.15 | 1.19 | -0.28 | 0.67 |
| -0.21 | 0.64 | 0.59 | -0.82 |
| 1.19 | -0.45 | 0.86 | 0.99 |
| 0.51 | 0.41 | 0.51 | 0.51 |
| 1.19 | 0.28 | 0.95 | -0.15 |
Sample matrix V, above.
| 0.430 | 0.290 | 0.640 | 0.830 |
| 0.948 | 0.937 | -0.021 | 0.715 |
| 0.435 | 0.298 | 0.633 | 0.826 |
| 0.600 | 0.712 | 0.260 | 0.326 |
| 0.614 | 0.513 | 0.395 | 0.409 |
| -0.068 | 0.604 | 0.610 | -0.697 |
Sample matrix of the final output of one self-attention head, computed from the rounded attention weights above; at this point in the Transformer, each token has both positional data and data regarding relations to the other tokens embedded into one vector.
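Assuming NumPy, this final step can be reproduced from the rounded attention weights and the sample V; the result matches the output table above up to rounding:

```python
import numpy as np

# Rounded attention weights from the softmax step above.
weights = np.array([[1.0,   0,     0,     0,     0,     0    ],
                    [0.281, 0.719, 0,     0,     0,     0    ],
                    [0.991, 0.008, 0.002, 0,     0,     0    ],
                    [0.256, 0.413, 0.271, 0.060, 0,     0    ],
                    [0.196, 0.264, 0.208, 0.148, 0.183, 0    ],
                    [0.020, 0.007, 0.878, 0.002, 0.020, 0.073]])
# Sample matrix V, copied from the table above.
V = np.array([[ 0.43,  0.29,  0.64,  0.83], [ 1.15,  1.19, -0.28,  0.67],
              [-0.21,  0.64,  0.59, -0.82], [ 1.19, -0.45,  0.86,  0.99],
              [ 0.51,  0.41,  0.51,  0.51], [ 1.19,  0.28,  0.95, -0.15]])
print(np.round(weights @ V, 3))  # one output row per token
```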
More algebraic, implicit examples of the matrix transformations are available.[1]
Note that, in real LLM implementations, the dimensionalities involved may be less uniform than in the above example (Llama-4, for instance, exposes them as configurable hyperparameters).[2][3]
References
[1] Transformers: a Primer - Columbia University
[2] Speech and Language Processing, Chapter 8: Transformers - Stanford
[3] Llama-4 documentation - HuggingFace