1.0 Primer
Internal mechanisms of an LLM
On the topic of LLM architecture, a frequently cited paper is Vaswani et al. (2017). This paper introduced the Transformer, currently the state-of-the-art building block for sequence-to-sequence modelling, and the architecture on which all modern LLMs (such as Meta’s Llama-4) are based. The original Transformer was developed for translation tasks: the encoder processes the source language, and the decoder generates the target language, taking previously generated output as an additional input. Nowadays, common LLMs oriented around text generation (e.g. OpenAI’s GPT, Meta’s Llama) feature only the decoder.[1] However, an encoder remains valuable for tasks that make deductions from text, such as classification.
Diagram 1.0: The Transformer, Vaswani et al. (2017)
The Transformer is made up of an encoder (left side of Diagram 1.0) and a decoder (right side of Diagram 1.0).
The encoder takes an input sequence and converts it into a contextual memory. The decoder takes that contextual memory, together with the previously generated output, and generates the output sequence one element at a time.
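To make this flow concrete, the sketch below wires an encoder and decoder together using PyTorch’s nn.Transformer. The dimensions and random tensors are assumptions chosen purely for illustration; token embeddings, positional encodings, and masking are omitted to keep the data flow visible.

```python
import torch
import torch.nn as nn

# A toy encoder-decoder Transformer; all sizes chosen purely for illustration.
model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2)

src = torch.rand(10, 1, 64)  # input sequence: 10 positions, batch of 1, 64-dim vectors
tgt = torch.rand(7, 1, 64)   # previously generated output: 7 positions so far

memory = model.encoder(src)       # encoder: input sequence -> contextual memory
out = model.decoder(tgt, memory)  # decoder: memory + previous output -> output sequence
print(out.shape)                  # torch.Size([7, 1, 64])
```

In real use the inputs would be token embeddings plus positional encodings, and the decoder would receive a causal mask over tgt; those details are left out here so that the encoder-to-decoder handoff stays easy to see.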
Glossary
Large Language Model - a model typically oriented around a highly scaled Transformer
Transformer - a model oriented around a particular set of functions and parameters, which emphasises text processing (both generation and comprehension)
Model - the architecture of a trainable system: its functions and trainable parameters; the word is not used here to mean an architecture packaged with trained parameter values (that is described as a checkpoint)
Parameter - a position within a model, where a numerical value can be stored and adjusted during the training period
Training - the initial stage of a model’s lifetime, in which its parameters are assigned numerical values and then adjusted as data is run through the model, so that the model’s output moves closer to a target output (see the sketch after this glossary)
Loss function - a means of measuring the difference between the output of a model and the target output
Backpropagation - working backwards through a model to deduce how each parameter’s numerical value affects the output of the loss function
Layer - a part of a Transformer, e.g. feed-forward network
SIMD processor - Single Instruction Multiple Data: a processor that performs the same operation (such as addition or multiplication) on all the data it holds at once
GPU - a common type of SIMD processor, originally oriented around graphics processing, but equally useful for matrix operations in general
Tile - an m × n matrix of data, the smallest quantity of data that a GPU operates on at once (see the tiled multiplication sketch below)
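To make the Training, Loss function, and Backpropagation entries concrete, here is a minimal sketch in PyTorch that fits a single-parameter model to the target behaviour y = 3x. The starting value, learning rate, and step count are arbitrary assumptions for illustration.

```python
import torch

# One trainable parameter; the target behaviour is y = 3x.
w = torch.tensor(1.0, requires_grad=True)        # initial numerical value (arbitrary)
x, target = torch.tensor(2.0), torch.tensor(6.0)

for step in range(50):
    output = w * x                  # run data through the model
    loss = (output - target) ** 2   # loss function: squared difference from the target
    loss.backward()                 # backpropagation: compute d(loss)/dw
    with torch.no_grad():
        w -= 0.01 * w.grad          # adjust the parameter to reduce the loss
        w.grad.zero_()              # clear the gradient for the next step

print(w.item())  # converges towards 3.0
```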
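And to illustrate the Tile entry, the sketch below performs a matrix multiplication one small block at a time, loosely mirroring how a GPU schedules the work over tiles. The 2 × 2 tile size and the pure-NumPy setting are assumptions for illustration; real GPU kernels differ considerably in detail.

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 2) -> np.ndarray:
    """Compute A @ B one (tile x tile) block at a time."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n))
    for i in range(0, m, tile):           # rows of the output block
        for j in range(0, n, tile):       # columns of the output block
            for p in range(0, k, tile):   # walk along the shared dimension
                # Each small block below is one "tile": the unit of work to
                # which the same multiply-and-accumulate operation is applied.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A = np.random.rand(4, 6)
B = np.random.rand(6, 4)
assert np.allclose(tiled_matmul(A, B), A @ B)
```

Because every tile undergoes the same operation, the inner step maps naturally onto a SIMD processor, which applies one instruction across all elements of a tile at once.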