Introduction to the Transformer and modern LLMs
By Thomas Prior
LLM overview
LLM stands for Large Language Model. An LLM is a network of functions and parameters, connected so that the outputs of some functions become the inputs of others, which can produce coherent, natural language output when prompted with an initial input such as a question or instruction.
LLMs are generally developed by well-funded companies, which are able to spend large quantities of money on the data and computational power needed to build competitive models. Examples of LLMs include Llama-4 from Meta, ChatGPT-4 from OpenAI, and Gemini 3 from Google.
There are a few major stages within the lifetime of an LLM:
- Development - large quantities of data are run through the functions that make up the LLM, and the parameters are optimised so that inputs give expected output
- Adaptation - deploying more specific methods to fine-tune a model for specific use-cases
- Utilisation - invoking output from the trained model, for practical uses, by crafting prompts
Generally, development of a competitive LLM is outside of the budget of most companies - DeepSeek, a Chinese competitor to OpenAI, is said to have built the LLM DeepSeek-R1 for $6M USD. However, businesses across a large variety of industries increasingly benefit from the automation abilities of the utilisation stage that LLMs offer, such as for customer service, content creation, or information retrieval purposes.
Note that LLMs are prone to acting on predictions even when the reasoning behind them is not substantive, and so LLMs are regarded as experimental.
Take the following murder mystery I invented, for example:
8 guests resided in a rented house on the night 12th July 2015 - Margaret, Joe, James, Kyle, Carter, Judith, Janice, and Jillary. Margaret and Joe went out to dinner at 7pm, their attendance at a local restaurant confirmed by restaurant staff and a local couple, and left at 10pm. James and Kyle state that they watched an action movie in the attic of the house at 8pm, which was confirmed via their streaming account. Carter and Judith retired to their en-suite room at 9pm for the rest of the night. Janice went out for a moonlit walk along the riverside, and is confirmed to have spoken with some locals walking their dog. Jillary was found dead on the morning of 13th July 2015, killed at gunpoint. An unsilenced weapon was found by her bedside, with no fingerprints on it. No guests state that they heard a gunshot. Who was the murderer?
The responses from various LLMs are as follows:
- Llama-4 ruminated through a logical 12-step process then quickly blamed James, without explanation
- ChatGPT-4 blamed Janice due to an ambiguous alibi
- ChatGPT-5 blamed Carter or Judith, but specifically Carter (allegedly due to the subtlety)
- Gemini 2.5 Flash suspected suicide, an unexpected plot twist
- Gemini 2.5 Pro suspected Carter and Judith due to the decibels of an unsilenced gunshot which was claimed unheard
Note that I did not have any clear perpetrator in mind at the time of writing this murder mystery, though I do have awareness of statistical differences between crime rates of different demographics (e.g. male and female).
Furthermore, once trained, LLMs can only reliably retrieve information that was available during the training stage; newer data is unreachable. For example, ask an LLM what the current UK inflation rate is:
The current inflation rate in the UK for August 2025 is projected to be around 4%. This figure is based on trends observed in previous months, where inflation rose to 3.6% in June 2025, up from 3.4% in May.
- ChatGPT-4
This is verifiably inaccurate; as of August 2025, the Bank of England official website states the UK inflation rate to be 3.6%, roughly 10% lower (in relative terms) than the number provided.
However, ask Llama-4 something more general, e.g. “What are common problems among Computer Science departments (within universities)?” and we get a list of very coherent and topical responses:
- Keeping the curriculum up-to-date with rapidly evolving industry trends and technologies
- Meeting accreditation standards and requirements
- High student-to-faculty ratios, leading to limited individual attention and support
- Creating a welcoming and inclusive environment for underrepresented groups
- Encouraging collaboration between CS departments and other disciplines
This text attempts to give an overview of the theory behind how the major pieces of an LLM work, without getting too caught up in the currently most popular implementation details (tensors, Python, etc.).
1. Internal mechanisms of an LLM
On the topic of LLM architecture, a paper frequently cited is Vaswani et al. (2017). This paper pioneered the concept of a model known as a Transformer, which is currently the state-of-the-art building block of sequence-to-sequence modelling tasks.
Diagram 1.0: The Transformer, Vaswani et al. (2017)
The Transformer is made up of an encoder (left side of diagram 1) and a decoder (right side of diagram 1).
The encoder takes an input sequence and converts this into a contextual memory. The decoder takes the contextual memory and generates an output sequence.
This section will delve into the internals of a Transformer, from a bottom-up perspective.
1.1. Input Embedding - converting natural language to numerical representations
The inputs to the encoder are first split into tokens. A token may consist of a whole word, or a portion of a word.
These tokens are then mapped to vector representations (numerical representations of each token). An example of a vector representation would be a word embedding. A word embedding represents tokens in a dense way, such that similar tokens show high cosine similarity when compared in their vector form. This dense representation also allows models to handle a large vocabulary of different tokens.
Word embeddings may be developed via:
- neural networks and large quantities of unlabeled training data, by optimising a loss function based on tokens that are expected to be close together (e.g. Word2Vec from Google, 2013)
- global co-occurrence statistics (e.g. GloVe from Pennington et al., 2014); other approaches also incorporate subword information (e.g. fastText)
Modern word embeddings are typically of very high dimension; for example, Meta's Llama-3 uses 4096 dimensions in its 8B-parameter model, with larger variants using more.
An example of a 16-dimension word embedding, which represents the word ‘book’, could be:
| 0.02 | 0.15 | 0.31 | 0.08 | -0.26 | 0.21 | -0.58 | 1.12 | -0.38 | -0.91 | 0.52 | 0.87 | -0.17 | 0.73 | -0.38 | 0.18 |
|---|
Mapping a token to a word embedding is generally a deterministic process, including in the context of Llama-3.
Representing data as vectors is a method that can be applied to a variety of contexts, and so is independently a large domain even outside the context of LLMs. For example, representations may be utilised to find similarities between documents, for classification or information retrieval purposes.
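To make the idea of 'high cosine similarity' mentioned above concrete, here is a minimal Python (NumPy) sketch; the 4-dimension vectors are made up purely for illustration, as real embeddings are learned and far larger:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity: close to 1.0 for vectors pointing the same way, near 0 for unrelated ones."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 4-dimension embeddings, purely illustrative; real embeddings are learned and far larger
book  = np.array([0.02, 0.15, 0.31, 0.08])
novel = np.array([0.03, 0.14, 0.29, 0.10])
cloud = np.array([-0.51, 0.02, -0.12, 0.44])

print(cosine_similarity(book, novel))   # ~0.99: 'book' and 'novel' point in a similar direction
print(cosine_similarity(book, cloud))   # ~-0.04: 'book' and 'cloud' are largely unrelated
```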
1.2 Positional Encoding - incorporating position related data to word embeddings
As a Transformer based LLM processes the aforementioned tokens in parallel, a method is needed to embed the relative positional data of the tokens into vector representations. The original Vaswani et al. (2017) paper proposes using the sine and cosine functions as a means of representing positions, alternately for odd and even positions.
Positional encodings are then generated via the following functions:
$$P(k, 2i) = \sin\!\left(\frac{k}{n^{2i/d}}\right)$$
$$P(k, 2i+1) = \cos\!\left(\frac{k}{n^{2i/d}}\right)$$
k: the position of the token within the input sequence, starting from 0
d: set to the same value as the dimension of the word embeddings to be used
n: a scalar, set to 10,000 in Vaswani et al. (2017)
i: the index of the output pair within the final positional encoding vector for a specific token, whereby each adjacent pair of even and odd output positions shares the same i, such that 0 ≤ i < d/2
So, for example, generating 16-dimension positional encodings for the input sequence “classify the book”, with n set to 100, may look like the following:
| i | 0 | 0 | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 4 | 5 | 5 | 6 | 6 | 7 | 7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| classify k = 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| the k = 1 | 0.84 | 0.54 | 0.53 | 0.85 | 0.31 | 0.95 | 0.18 | 0.98 | 0.10 | 1.00 | 0.06 | 1.00 | 0.03 | 1.00 | 0.02 | 1.00 |
| book k = 2 | 0.91 | -0.42 | 0.90 | 0.41 | 0.59 | 0.81 | 0.35 | 0.94 | 0.20 | 0.98 | 0.11 | 0.99 | 0.06 | 1.00 | 0.04 | 1.00 |
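As a sanity check, the table above can be reproduced with a short Python (NumPy) sketch of the two formulas; this is a straightforward transcription of the equations rather than an optimised implementation:

```python
import numpy as np

def positional_encoding(seq_len, d, n=10000.0):
    """Sinusoidal positional encodings: P[k, 2i] = sin(k / n^(2i/d)), P[k, 2i+1] = cos(k / n^(2i/d))."""
    P = np.zeros((seq_len, d))
    for k in range(seq_len):             # position of the token within the input sequence
        for i in range(d // 2):          # index of the (even, odd) output pair
            denominator = n ** (2 * i / d)
            P[k, 2 * i] = np.sin(k / denominator)
            P[k, 2 * i + 1] = np.cos(k / denominator)
    return P

# Reproduce the 3-token, 16-dimension example above ("classify the book"), with n set to 100
print(np.round(positional_encoding(seq_len=3, d=16, n=100.0), 2))
```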
These positional encodings are then added to the word embeddings, for example:
| 0.02 | 0.15 | 0.31 | 0.08 | -0.26 | 0.21 | -0.58 | 1.12 | -0.38 | -0.91 | 0.52 | 0.87 | -0.17 | 0.73 | -0.38 | 0.18 |
|---|
+
| 0.91 | -0.42 | 0.90 | 0.41 | 0.59 | 0.81 | 0.35 | 0.94 | 0.20 | 0.98 | 0.11 | 0.99 | 0.06 | 1.00 | 0.04 | 1.00 |
|---|
=
| 0.93 | -0.27 | 1.21 | 0.49 | 0.33 | 1.02 | -0.23 | 2.06 | -0.18 | 0.07 | 0.63 | 1.86 | -0.11 | 1.73 | -0.34 | 1.18 |
|---|
This final result, above, is then input into the encoder of the Transformer.
The reasons that positions are encoded via sinusoidal functions are as follows:
- Sinusoidal functions allow the encoding for a relative position P+k to be derived, via a linear transformation, given the encoding for a specific position P
- Sinusoidal functions of alternating wavelengths map relatively closely to binary numbers (see diagrams 2 and 3), therefore allowing uniqueness (reductionistically, binary numbers are the simplest unique representation)
- Sinusoidal functions are of infinite length, allowing them to be adapted for any input sequence length
- Sinusoidal functions can maintain an output within a limited range, that does not introduce issues with gradient descent during optimisation via loss function
- Sinusoidal functions can be quickly computed, in a deterministic way, allowing efficiency and dependability
Diagram 2: representing the number 150 in binary format.
Diagram 3: representing the outputs of sinusoidal functions of varying frequency for position k = 150, which together identify the position much as the binary digits in diagram 2 do.
1.3 Multi-head attention - finding relations between tokens
During utilisation, attention is concerned with finding how strongly tokens are interrelated, based on specific types of relations.
For example, in the context of translation, words cannot be translated word-by-word independently, due to differences in sentence structures across languages, such as English’s subject-verb-object pattern compared to the subject-object-verb pattern found in Japanese. Instead, the grammatical relations between the words must first be understood.
Mathematically, a single self-attention mechanism head calculates attention scores as follows:
$$\text{attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$
Note the use of the dot product, which commonly features in vector comparisons: the input tokens are represented as vectors (the rows of matrix Q) in a space of potentially very high dimension (e.g. 2048 dimensions), and compared against the corresponding representations of the other tokens (the rows of matrix K), stored as vectors of the same dimension. The result of the dot product then indicates which token meanings (represented numerically in matrix V) match most strongly.
This self-attention mechanism can be broken down into a step-by-step sequence:
- All the input vectors xi (such as those output from the positional encoding step) are arranged into a matrix X; for example, 6 input vectors of dimension 4 would be arranged as a 6x4 matrix.
- The matrix X is multiplied by 3 weight matrices of equivalent dimensions (for the dimensions in this example: 4x4), one matrix WQ representing query weights, one matrix WK representing key weights, and one matrix WV representing value weights, to generate three new matrices Q, K, and V.
| x1 | ||||
|---|---|---|---|---|
| x2 | ||||
| x3 | ||||
| x4 | ||||
| x5 | ||||
| x6 |
Matrix X above could represent an input such as the sentence “List new ideas for a song”, one word per xi row, in the simple case of 4 dimensions per token.
Matrix K (one key vector ki per input token, obtained via X·WK)
| k1 | ||||
|---|---|---|---|---|
| k2 | ||||
| k3 | ||||
| k4 | ||||
| k5 | ||||
| k6 |
- The dot product of each query vector qi (a query vector being a row of Q) with every key ki is calculated, via the transposed matrix $K^T$, to generate attention scores; $QK^T$.
Higher attention scores between a given qi and a given ki indicate that the query qi is more similar to that ki, and therefore that there is a stronger relation.
Matrix $QK^T$
- These attention scores are then normalised by dividing by the square root of the key dimension $d_k$ (in this example, 4); $\frac{QK^T}{\sqrt{d_k}}$.
- In the decoder, a masking matrix M is used to force infinitely negative attention scores between the token being queried and future tokens, such that only attention scores between a given token and previously generated tokens are used; $\frac{QK^T}{\sqrt{d_k}} + M$.
| 0 | -∞ | -∞ | -∞ | -∞ | -∞ |
| 0 | 0 | -∞ | -∞ | -∞ | -∞ |
| 0 | 0 | 0 | -∞ | -∞ | -∞ |
| 0 | 0 | 0 | 0 | -∞ | -∞ |
| 0 | 0 | 0 | 0 | 0 | -∞ |
| 0 | 0 | 0 | 0 | 0 | 0 |
Matrix M, in the case of a 6-token input sequence: entries above the diagonal are -∞ (future tokens are hidden), and all other entries are 0.
- Softmax is run on the attention scores, nullifying the infinitely negative scores and converting each row of attention scores into a probability distribution; $\text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)$.
After softmax() is applied, the cells in each row of the resulting matrix will add up to 1
- The probability distributions are then applied to the values within V (each output row being a weighted sum of the rows of V), to find the input tokens with the largest influence on the token under focus.
| v1 | ||||
|---|---|---|---|---|
| v2 | ||||
| v3 | ||||
| v4 | ||||
| v5 | ||||
| v6 |
Above: matrix V; the final output of the attention head has the same dimensions (6x4 in this example)
Notes regarding the above example matrices:
- the matrix WK being of size 4x4 means that there are 4 keys stored within the self-attention mechanism, each with a level of detail of dimension 4
- nearly all of the cells have been left blank; although every cell would contain a number in a working implementation, the numbers hold no meaning to humans, and so they have been omitted for readability
- in modern LLM implementations, the dimensions of the key, value, and initial inputs to the Transformer may differ in dimensionality
More detailed, algebraic examples of the matrix transformations are available at:
https://www.columbia.edu/~jsl2239/transformers.html
https://web.stanford.edu/~jurafsky/slp3/9.pdf
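Putting the steps above together, a single self-attention head can be sketched in a few lines of Python (NumPy). The dimensions match the 6-token, 4-dimension example above, and the randomly initialised weight matrices are stand-ins for trained ones, so the outputs are illustrative only:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v, causal=True):
    """Single-head scaled dot-product attention over input rows X (one token per row)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # project inputs into query/key/value space
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # raw attention scores, one row per query token
    if causal:                                       # decoder-style mask M: hide future tokens
        scores = scores + np.triu(np.full_like(scores, -np.inf), k=1)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax; each row sums to 1
    return weights @ V                               # weighted sum of the value vectors

# 6 tokens of dimension 4, as in the example above; random weights stand in for trained ones
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)        # (6, 4)
```

The row-wise softmax means that each output row is a weighted average of the rows of V, weighted by how strongly that token attends to each earlier (non-masked) token.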
The multi-head attention layer works much like the single self-attention head above, but with multiple head instances there are multiple instances of the WQ, WK, and WV matrices, such that the weights within each set of (WQ, WK, WV) develop different focuses.
For example, one self-attention head may be oriented around finding grammatical relations between tokens, a different head may focus on finding tense-based relations between tokens, whilst a different head may focus on syntactic relations between tokens.
As a more specific example, a Q matrix in one head may be focusing on whether there are adjectives/adverbs relating to each token within the input sequence (presuming WQ was trained with such a focus).
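A minimal multi-head sketch in the same style (random stand-in weights, mask omitted for brevity) shows the structural point: each head has its own (WQ, WK, WV) set, the head outputs are concatenated, and a further trained matrix WO mixes them back down to the model dimension:

```python
import numpy as np

def single_head(X, W_q, W_k, W_v):
    """One self-attention head (mask omitted for brevity)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, heads, W_o):
    """Run each head with its own (W_q, W_k, W_v) set, concatenate, then mix with W_o."""
    outputs = [single_head(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                                                   # 6 tokens, 4 dimensions
heads = [tuple(rng.normal(size=(4, 4)) for _ in range(3)) for _ in range(2)]  # 2 heads, each with its own weights
W_o = rng.normal(size=(2 * 4, 4))                                             # mixes the concatenation back to 4 dims
print(multi_head_attention(X, heads, W_o).shape)                              # (6, 4)
```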
Transformers are typically trained end-to-end (all layers trained at once), using specific word-oriented problems. So, before training, the weights within the matrices of a self-attention head - WQ, WK, WV - are all initialised at random; specific training tasks and loss functions are then used to refine all parameters within the model, including the weights within each self-attention mechanism. These loss functions are covered in LLM Deep Dive from Springer.
Modern GPUs are able to derive attention scores for one input sequence across multiple heads simultaneously. This means that, for an input sequence of n tokens, there are O(n²) attention scores computed per head, each score signifying the relation between one token and another token within the input sequence.
1.5 Add & Norm - stabilising training across layers
There is a problem known as “internal covariate shift”, in which the distribution of inputs to a given layer keeps changing as the parameters of the layers before it are updated during training; this can slow training and decrease the accuracy of the LLM’s predictions. The ‘add & norm’ layer attempts to address this, together with the vanishing gradient problem described below, in the following ways.
The ‘add’ within ‘add & norm’ refers to a residual connection that adds the input of each layer to the output; f(x)+x.
To understand the purpose of the residual connection, we first need to understand the ‘vanishing gradient problem’.
Consider backpropagation: the product of a series of partial derivatives expresses how the loss function changes with respect to a specific weight, via an intermediate mix of activation functions and linear regression functions. These partial derivatives feature numerical factors from the following sources:
- differences between the model output and the expected output during supervised training
- trained weights
- initial inputs to the model
All of the above may lie within the range (-1, 1) and, therefore, as the number of the model's layers increases, their product may become very small. Therefore, when training a specific weight via backpropagation, the final partial derivative $\frac{\partial L}{\partial w_i}$ may be very small, and running gradient descent to train weights may become extremely slow: $w_{i+1} = w_i - \eta \nabla L(w_i)$, where $\nabla L(w_i) \approx 0$, $w_i$ is a given weight within an MLP, and $L(w_i)$ represents the loss function with respect to the given weight.
Seemingly, other layers in a Transformer may also suffer from similar issues when training, as layers increase and products of derivatives become small.
The idea of adding residual connections (a.k.a. skip connections) to an MLP was first formalised in He et al. (2016). If a layer within a Transformer is represented as f(x), for example the multi-head attention layer, where x is the input to the multi-head attention, a residual connection can be represented as f(x) + x.
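As a brief aside (standard calculus, not a result quoted from the paper itself), differentiating a residual block shows why the connection counteracts vanishing gradients:
$$\frac{d}{dx}\bigl(f(x) + x\bigr) = f'(x) + 1$$
so even if $f'(x) \approx 0$, a gradient of roughly 1 still flows back through the “+ x” path.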
There are also said to be advantages to residual connections in terms of retaining the original input data while it passes through layers that need not change it, for example in the following cases:
- some inputs follow a linear distribution while passing through an MLP layer, yet the model is designed to split the data up in a way that is not needed
- an input sequence is randomly generated, and therefore there is no relation to be deduced between the tokens during the multi-head attention layer
The ’norm’ within ‘add & norm’ refers to layer normalisation. Layer normalisation is a progression from a method known as batch normalisation (Ioffe and Szegedy, 2015), and both attempt to address the issue of weights in one layer of an MLP being heavily affected by whatever the output is of the previous layer, within the same MLP (recall that all layers cumulatively affect the difference between the expected and actual output, in the loss function).
Batch normalisation, for each neuron within a layer, seeks to rescale the inputs based on the variances and mean of the whole training data distribution. These variances and mean are actually estimated via samples from each small batch of training data being processed, via a probability distribution, as the computational load would be too high to use the full training dataset; see Ba et al. (2016) section 2 for further details. Then, during the utilisation stage, the inputs are normalised via a mean and variance based on all the means and variances generated during training; see Ioffe and Szegedy (2015) section 3.1.
Essentially, both batch normalisation and layer normalisation come back to the classical, secondary-school statistics way of normalising a value, but with the mean μ and standard deviation σ derived from different sources:
$$\hat{x} = \frac{x - \mu}{\sigma}$$
where x is the value being normalised, μ (mu) is the mean and σ (sigma) is the standard deviation of the population of values being normalised over; in practice, implementations add a small constant inside the denominator for numerical stability.
Layer normalisation, in contrast, computes the mean and variance across the units within each layer of an MLP, for each example individually. Mini-batches may still be used for training, but no limits are placed on the minimum size of the batch (a batch size of 1 is permitted).
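The two operations can be sketched together in Python (NumPy); this simplified view omits the learned gain and bias parameters that real layer normalisation implementations include:

```python
import numpy as np

def add_and_norm(x, sublayer_output, eps=1e-5):
    """Residual connection ('add') followed by layer normalisation ('norm'), per token vector."""
    residual = x + sublayer_output                     # the 'add': f(x) + x
    mu = residual.mean(axis=-1, keepdims=True)         # mean across each token's features
    var = residual.var(axis=-1, keepdims=True)         # variance across each token's features
    return (residual - mu) / np.sqrt(var + eps)        # classic standardisation, token by token

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))                            # 6 tokens of dimension 4, as in earlier examples
sublayer_output = rng.normal(size=(6, 4))              # e.g. the output of the multi-head attention layer
print(add_and_norm(x, sublayer_output).mean(axis=-1))  # each token's features now have mean ~0
```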
1.6 Feed forward network - deductions based on training and utilisation input
It is difficult to swiftly summarise the inner mechanisms of a neural network (a.k.a. feed forward network). These concepts are covered from the ground up in University of Manchester’s COMP24112, which builds up the concept of the neural network in the following steps:
- models - linear regression models and the value of the weights within them
- training - via a dataset setup for supervised learning, training a linear regression model by updating weights via gradient descent
- forward propagation - connecting neurons and deriving inputs, via the outputs of previous neurons after being processed through activation functions, to form an MLP
- backpropagation - training an MLP via backpropagation (multiplication of a series of partial derivatives to deduce exactly how a change in each weight affects the loss function) and gradient descent
The contents of COMP24112 are covered at: https://cs-notes.xza.fr
Essentially, a neural network is a mathematical model loosely inspired by the mechanisms of the human brain, in that, during utilisation, outputs are derived from inputs via a series of functions in the previous layer. These functions either manipulate the data, or pass the data on to the next layer within an MLP (or do not, if the output does not meet the conditions of the activation function deployed).
The introductory internals of how a neural network works are considered prerequisite knowledge for this text, and can be found already in resources such as Harvard’s Undergraduate Fundamentals of Machine Learning textbook, or in Appendix A of LLM Deep Dive from Springer.
Specifically however, in the context of the LLM, the feed forward network:
- inputs the normalised self-attention outputs, one feed forward network per token
- works independently on each token (unlike the self-attention mechanism, which slots them all into concurrent matrix operations), whilst using the newly embedded relational data to other tokens from the self-attention layer
- within the encoder, must output a data format compatible with the subsequent multi-head self-attention layer (i.e. a high dimensional vector representation of a token)
- scales the dimensionality of the input upwards, and then back downwards
Scaling dimensionality upwards can make it easier to group data into appropriate patterns, which is vital during training the model to generate patterns, and then again during utilisation to match the pre-generated intricate patterns. Imagine transitioning from 2D co-ordinates to 3D co-ordinates to generate or match the shape of a river. This is essentially the purpose of the feed forward neural network in the context of an LLM - finding the patterns in a text corpus during training, and then applying them to an input sequence during utilisation.
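A position-wise feed forward network of this kind can be sketched as follows in Python (NumPy), with illustrative sizes and random stand-in weights; real models use far larger, trained parameters:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed forward network: scale dimensionality up, apply ReLU, scale back down."""
    hidden = np.maximum(0, x @ W1 + b1)                  # up-projection followed by a non-linearity
    return hidden @ W2 + b2                              # down-projection back to the model dimension

rng = np.random.default_rng(0)
d_model, d_ff = 4, 16                                    # illustrative sizes only; real models are far larger
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

token_vector = rng.normal(size=(1, d_model))             # one token's vector from the add & norm step
print(feed_forward(token_vector, W1, b1, W2, b2).shape)  # (1, 4): the same format as its input
```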
1.7 Linear - converting contextual abstractions to vocabulary
The decoder will output a set of numerical vectors (one per token, covering both the input sequence and the tokens generated so far) of a prespecified, computationally efficient dimension (e.g. 5120 is the default configuration in the case of Llama-4). The linear layer, acting as a classifier, converts each vector into a new vector whose size equals the total known vocabulary of the model (potentially ~120,000 entries, roughly the size of a recent paperback dictionary).
Diagram 4: a vector representing a processed token (x1) of dimension 4 being run through an SLP of 5 neurons, meaning a model with a total known vocabulary of 5 words. Note that each neuron will be fed a different set of trainable weights.
The name linear is due to this stage consisting of a fully-connected linear layer of an SLP (without an activation function, or equivalently, an identity activation function). Recall that an SLP has no hidden layers - the inputs are combined with weights to generate a set of outputs. As with all parameters, these weights are trained when the Transformer is trained in its entirety. The vectors output by the SLP will each consist of a set of floats.
The linear layer can alternatively be thought of as a matrix of trainable weights.
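A minimal sketch of that view in Python (NumPy), using the tiny hypothetical sizes from diagram 4 (4-dimension token vectors, a 5-word vocabulary) and random, untrained weights:

```python
import numpy as np

# Hypothetical sizes matching diagram 4: 4-dimension token vectors, a 5-word vocabulary
d_model, vocab_size = 4, 5
rng = np.random.default_rng(0)
W_vocab = rng.normal(size=(d_model, vocab_size))   # the trainable weight matrix of the linear layer

decoder_output = rng.normal(size=(1, d_model))     # one processed token vector from the decoder
logits = decoder_output @ W_vocab                  # one raw score ("logit") per vocabulary entry
print(logits.shape)                                # (1, 5)
```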
https://jalammar.github.io/illustrated-transformer/
https://www.datacamp.com/tutorial/how-transformers-work
1.8 Softmax - selecting predictions via probability
Generally speaking, outside the context of ML and LLMs, a softmax function will take a set of numbers and convert them to a probability distribution. This is done by:
- raising e to the power of each number in the initial set (exponentiation)
- dividing each exponentiated value by the total of all the exponentiated values, so that the results sum to 1
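A minimal Python (NumPy) sketch of those two steps:

```python
import numpy as np

def softmax(numbers):
    """Convert a set of raw scores into a probability distribution that sums to 1."""
    exps = np.exp(numbers - np.max(numbers))   # exponentiate (shifted by the max for numerical stability)
    return exps / exps.sum()                   # divide each by the total of the exponentiated values

print(softmax(np.array([2.0, 1.0, 0.1])))      # approximately [0.66, 0.24, 0.10]
```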
Looking back at LLMs, within ML generally, the unprocessed predictions (such as those output by the linear layer) are often described as logits. The logits are then normalised by a softmax function, to produce a set of tokens with probability scores, the total probability for all the predictions adding up to 1.
For example (illustrative values only, not taken from any real model):
| token | probability |
|---|---|
| book | 0.62 |
| novel | 0.21 |
| story | 0.09 |
| paper | 0.05 |
| … | … |
User interfaces built on top of model utilisation/inference may add a temperature control; adjusting this temperature determines whether the model’s less likely or more likely predictions are favoured during selection. Some LLMs set a fixed temperature, with some element of randomness, so that the model consistently selects probable tokens but not always the same ones, adding some creativity.
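A hedged sketch of temperature-based sampling in the same NumPy style; the function and values are illustrative, not taken from any particular LLM implementation:

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=None):
    """Scale logits by temperature, then sample a token index from the resulting distribution."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits) / temperature   # low temperature sharpens the distribution, high flattens it
    probs = np.exp(scaled - scaled.max())
    probs = probs / probs.sum()
    return rng.choice(len(probs), p=probs)

logits = np.array([2.0, 1.0, 0.1])              # hypothetical scores for a 3-word vocabulary
print(sample_token(logits, temperature=0.5))    # usually picks the most probable token
print(sample_token(logits, temperature=2.0))    # more often picks a less probable token
```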
1.9 Connections - linking the encoder and decoder
An initial glimpse of the Transformer may raise questions about why both the encoder and the decoder have inputs. The following diagram, of an LLM used for machine translation, depicts why: the decoder blocks take as input the encoded initial input sequence, as well as the tokens that the Transformer itself has generated so far.
https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/encoder_decoder/EncoderDecoder.png
https://huggingface.co/blog/encoder-decoder
https://miro.medium.com/v2/resize:fit:640/format:webp/0*mRSagmAh2iiIC5jy.gif
Original source: https://www.youtube.com/watch?v=4Bdc55j80l8
Sources, ordered by chronological access during authorship
“Large Language Models: A Deep Dive” by Uday Kamath, Kevin Keenan, Garrett Somers, Sarah Sorenson; Springer, 2024
https://link.springer.com/book/10.1007/978-3-031-65647-7
“The Hundred-Page Language Models Book” by Andriy Burkov
https://thelmbook.com
“Attention Is All You Need” by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
https://arxiv.org/abs/1706.03762
Tensor2Tensor documentation by Google Brain team - Walkthrough
https://github.com/tensorflow/tensor2tensor
“What are word embeddings?” by Joel Barnard
https://www.ibm.com/think/topics/word-embeddings
“An intuitive introduction to text embeddings” by Kevin Henner
https://stackoverflow.blog/2023/11/09/an-intuitive-introduction-to-text-embeddings/
“A Gentle Introduction to Positional Encoding” by Mehreen Saeed
https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/
“You could have designed state of the art positional encoding” by Christopher Fleetwood
https://huggingface.co/blog/designing-positional-encoding
“Transformers - A Primer” by Justin Seonyong Lee
https://www.columbia.edu/~jsl2239/transformers.html
“Deep Residual Learning for Image Recognition” by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
https://arxiv.org/abs/1512.03385
“Layer Normalization” by Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
https://www.cs.utoronto.ca/~hinton/absps/LayerNormalization.pdf
“Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” by Sergey Ioffe, Christian Szegedy
https://arxiv.org/abs/1502.03167
“The Illustrated Transformer” by Jay Alammar
https://jalammar.github.io/illustrated-transformer/
Transformers documentation - Llama 4 by Meta employees & contributors
https://huggingface.co/docs/transformers/v4.53.2/en/model_doc/llama4
“What is a tensor?” by University of Cambridge department of materials science
https://www.doitpoms.ac.uk/tlplib/tensors/what_is_tensor.php
“Tensor vs Matrix: an example with computer vision” by Juan Zamora-Mora
https://www.doczamora.com/tensor-vs-matrix-an-example-with-computer-vision
“Single Layer Perceptron and Multi Layer Perceptron” by Abhishek Jain
https://medium.com/@abhishekjainindore24/68ce4e8db5ea
“What is an encoder-decoder model?” by Jacob Murel, Joshua Noble
https://www.ibm.com/think/topics/encoder-decoder-model
“Transformer-based Encoder-Decoder models” by Patrick von Platen
https://huggingface.co/blog/encoder-decoder
“Speech and Language Processing” by Daniel Jurafsky, James H. Martin
https://web.stanford.edu/~jurafsky/slp3/