1.0 Primer

Internal mechanisms of an LLM

On the topic of LLM architecture, a frequently cited paper is Vaswani et al. (2017). This paper introduced the Transformer, which is currently the state-of-the-art building block for sequence-to-sequence modelling and the architecture upon which all modern LLMs (such as Meta’s Llama 4) are based. The original Transformer was developed for translation: the encoder processes the source-language sequence, and the decoder generates the target-language sequence, taking its own previously generated output as additional input. Nowadays, common LLMs oriented around text generation (e.g. OpenAI’s GPT, Meta’s Llama) feature only the decoder.[1] An encoder remains valuable, however, for tasks that involve making deductions from text rather than generating it.
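
As a concrete illustration of the decoder-only case, the short sketch below uses the HuggingFace transformers library with the small, publicly available gpt2 checkpoint (both are illustrative choices, not requirements of anything above): the model generates text simply by predicting the next token over and over, feeding its own output back in as input.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Decoder-only generation: no encoder is involved; the model repeatedly
# predicts the next token and appends it to its own input.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The Transformer was introduced in", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,  # gpt2 has no dedicated padding token
)
print(tokenizer.decode(output_ids[0]))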

Diagram 1.0: The Transformer, Vaswani et al. (2017)

The Transformer is made up of an encoder (left side of Diagram 1.0) and a decoder (right side of Diagram 1.0).

The encoder takes an input sequence and converts it into a contextual memory. The decoder takes that contextual memory, together with its own previously generated output, and generates the output sequence one token at a time.
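
A minimal sketch of that data flow, using PyTorch’s built-in Transformer layers (the dimensions and layer counts below are arbitrary illustrative choices, not the values used in Vaswani et al. (2017)):

import torch
import torch.nn as nn

d_model = 64   # width of each token's vector representation (illustrative)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

src = torch.randn(1, 10, d_model)   # input sequence (already embedded as vectors)
tgt = torch.randn(1, 7, d_model)    # previously generated output (already embedded)

memory = encoder(src)               # the contextual memory
out = decoder(tgt, memory)          # output attends to both tgt and the memory
print(out.shape)                    # torch.Size([1, 7, 64])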

Glossary

Large Language Model - a model typically oriented around a highly scaled Transformer

Transformer - a model built from a particular set of functions and parameters, oriented towards text processing (both generation and comprehension)

Model - the architecture of a trainable system: its functions and its trainable parameters; the word is not used here to mean an architecture packaged with trained parameter values (that is described as a checkpoint)

Parameter - a position within a model, where a numerical value can be stored and adjusted during the training period

Training - the initial stage of a model’s lifetime, in which its parameters are assigned numerical values that are then adjusted as data is run through the model, so that the model’s output moves closer to a target output (a minimal sketch follows this glossary)

Loss function - a means of measuring the difference between the output of a model and the target output

Backpropagation - the procedure for working out how a small change to each parameter’s numerical value would change the output of the loss function, working backwards from the loss through the layers of the model

Layer - a part of a Transformer, e.g. a feed-forward network

SIMD processor - Single Instruction, Multiple Data: a processor that performs the same operation (such as addition or multiplication) on all the data it holds at once

GPU - a common type of SIMD processor, originally oriented around graphics processing but equally useful for matrix operations in general

Tile - an m × n matrix of data; the smallest quantity of data that a GPU operates on at once (see the tiled multiplication sketch after this glossary)
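
To make the terms training, loss function and backpropagation concrete, here is a minimal, self-contained sketch in PyTorch. The single linear layer, the mean-squared-error loss and the SGD optimiser are arbitrary illustrative choices; a real LLM would use a Transformer, a loss over predicted tokens and a more sophisticated optimiser.

import torch
import torch.nn as nn

model = nn.Linear(4, 2)                          # a toy model: its parameters start at random values
loss_fn = nn.MSELoss()                           # loss function: distance between output and target
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

data = torch.randn(8, 4)                         # a small batch of input data
target = torch.randn(8, 2)                       # the target output for that batch

for step in range(100):                          # training: repeatedly adjust the parameters
    output = model(data)                         # run the data through the model
    loss = loss_fn(output, target)               # measure how far the output is from the target
    optimizer.zero_grad()
    loss.backward()                              # backpropagation: how does each parameter affect the loss?
    optimizer.step()                             # nudge each parameter so the loss shrinks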
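
And here is a small CPU-side sketch (in NumPy) of the SIMD and tile ideas: one operation applied to many values at once, and a large matrix product built up from small m × n blocks. The 4 × 4 tile size is an arbitrary choice for the example.

import numpy as np

a = np.arange(8.0)
b = a * 2.0                       # one multiply applied to all eight values at once (the SIMD idea)

TILE = 4
A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
C = np.zeros((8, 8))

# Blocked (tiled) matrix multiplication: the full product is assembled
# from small TILE x TILE sub-matrices, mirroring how GPU matrix hardware
# works through data one tile at a time.
for i in range(0, 8, TILE):
    for j in range(0, 8, TILE):
        for k in range(0, 8, TILE):
            C[i:i+TILE, j:j+TILE] += A[i:i+TILE, k:k+TILE] @ B[k:k+TILE, j:j+TILE]

assert np.allclose(C, A @ B)      # same result as the untiled product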

References

[1] How do Transformers work? - HuggingFace