1.7 Linear
Converting contextual abstractions to vocabulary.
Diagram 1.7.0: The Transformer, Vaswani et al. (2017)
The decoder outputs a set of numerical vectors (one per token, covering both the input sequence and the tokens generated so far), each of a prespecified, computationally efficient dimension. The linear layer, acting as a classifier, converts each of these vectors into a new vector whose size equals the total known vocabulary of the model (the Llama-4 default being 202,048,[1] which is larger than a recent paperback dictionary, at around 120,000 words).
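In code, this stage is a single fully-connected layer mapping the decoder's output dimension to the vocabulary size. The sketch below is a minimal illustration assuming PyTorch; the model dimension of 512 and the 7-token sequence length are arbitrary illustrative values, while the vocabulary size follows the Llama-4 default mentioned above.

```python
import torch
import torch.nn as nn

d_model = 512         # assumed illustrative decoder output dimension
vocab_size = 202_048  # the Llama-4 default vocabulary size mentioned above

# The linear layer: a trainable weight for every (input dimension, vocabulary
# entry) pair, with no activation function.
linear = nn.Linear(d_model, vocab_size)

# Decoder output: one vector per token (here, an arbitrary 7-token sequence).
decoder_output = torch.randn(7, d_model)

# Each token's vector becomes a vocabulary-sized vector of scores ("logits").
logits = linear(decoder_output)
print(logits.shape)  # torch.Size([7, 202048])
```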
Diagram 1.7.1: a vector representing a processed token (x1) of dimension 4 being run through an SLP of 5 neurons, meaning a model with a total known vocabulary of 5 words. Note that each neuron will be fed a different set of trainable weights.
| Output variable | Variable value | Word |
|---|---|---|
| p1 | 0.5 | Sunny |
| p2 | 1 | Cloudy |
| p3 | 2 | Rainy |
| p4 | 0.5 | Misty |
| p5 | 6 | Snowy |
The above table relates to Diagram 1.7.1. Suppose, for example, that the input sequence was “today’s weather?”, and that the LLM had learnt to make weather-related predictions, for instance from chat conversations and past weather data. The Transformer could then have generated vector x1 from the input token “weather?”, and the linear layer would have assigned the scores above, indicating which upcoming tokens are most likely. Once the Transformer has been trained, there is an injective relation (one-to-one mapping) between each pi variable and each token that the model is able to generate.
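A minimal numeric sketch of this, using NumPy: the vector x1 and the weight values below are hypothetical, chosen purely so that the five output variables reproduce the illustrative scores in the table; in a real model these weights would come from training.

```python
import numpy as np

# Hypothetical processed-token vector x1 (dimension 4), as in Diagram 1.7.1.
x1 = np.array([1.0, -0.5, 2.0, 0.5])

# Hypothetical trained weights: one row per output neuron (5 neurons, one per
# word in the 5-word vocabulary). Values chosen only to reproduce the table.
W = np.array([
    [ 0.5,  1.0, 0.25,  0.0],  # p1 -> "Sunny"
    [ 1.0,  0.0, 0.50, -2.0],  # p2 -> "Cloudy"
    [ 0.0, -2.0, 0.50,  0.0],  # p3 -> "Rainy"
    [-0.5,  1.0, 0.50,  1.0],  # p4 -> "Misty"
    [ 2.0, -2.0, 1.50,  0.0],  # p5 -> "Snowy"
])

vocabulary = ["Sunny", "Cloudy", "Rainy", "Misty", "Snowy"]

# Each output variable pi is a weighted sum of x1's components (no activation).
p = W @ x1
for word, score in zip(vocabulary, p):
    print(f"{word}: {score}")  # Sunny: 0.5, Cloudy: 1.0, Rainy: 2.0, Misty: 0.5, Snowy: 6.0
```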
The name linear comes from this stage consisting of the fully-connected linear layer of an SLP (without an activation function, or equivalently, with an identity activation function). Recall that an SLP has no hidden layers - the inputs are combined with weights to generate a set of outputs. As with all parameters, these weights are trained when the Transformer is trained in its entirety. The vectors output by the SLP will each consist of a set of floats.
Due to the simplicity of an SLP with no activation function, the linear layer can alternatively be thought of as a single matrix of trainable weights; this form is, again, readily optimisable for SIMD processors (GPUs).
Diagram 1.7.2: an equivalent representation of the SLP in Diagram 1.7.1.
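To make the equivalence in Diagram 1.7.2 concrete, the NumPy sketch below (reusing the hypothetical weights from earlier) computes each neuron's output as a separate weighted sum and then computes all five at once as a single matrix-vector product; the results are identical.

```python
import numpy as np

x1 = np.array([1.0, -0.5, 2.0, 0.5])  # processed-token vector, dimension 4
W = np.array([                         # same hypothetical weights as above
    [ 0.5,  1.0, 0.25,  0.0],
    [ 1.0,  0.0, 0.50, -2.0],
    [ 0.0, -2.0, 0.50,  0.0],
    [-0.5,  1.0, 0.50,  1.0],
    [ 2.0, -2.0, 1.50,  0.0],
])

# SLP view: each of the 5 neurons computes its own weighted sum of x1's components.
neuron_view = np.array([np.dot(row, x1) for row in W])

# Matrix view: the same 5 weighted sums as one matrix-vector product -
# the form that SIMD hardware (GPUs) executes efficiently.
matrix_view = W @ x1

assert np.allclose(neuron_view, matrix_view)
print(matrix_view)  # the five scores from the table: 0.5, 1.0, 2.0, 0.5, 6.0
```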