1.8 Softmax

Selecting predictions via probability.

Diagram 1.0: The Transformer, Vaswani et al. (2017)

Generally speaking, even outside the context of ML and LLMs, a softmax function takes a set of numbers and converts them into a probability distribution. This is done by:

  1. exponentiating each number in the set (raising Euler's number, e, to the power of that number)
  2. dividing each exponentiated value by the total of all the exponentiated values, so that the outputs are positive and sum to 1 (as sketched in the code below)
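
As a minimal sketch (plain Python, with no ML libraries assumed), a softmax can be implemented as follows; the logits are taken from the example table further below:

```python
import math

def softmax(logits):
    """Convert a list of raw scores (logits) into a probability distribution."""
    # Subtracting the maximum logit before exponentiating is a standard
    # numerical-stability trick; it does not change the resulting probabilities.
    max_logit = max(logits)
    exponentiated = [math.exp(x - max_logit) for x in logits]
    total = sum(exponentiated)
    return [value / total for value in exponentiated]

# Logits from the example table below.
probabilities = softmax([0.5, 1, 2, 0.5, 6])
print([round(p, 4) for p in probabilities])
# [0.004, 0.0065, 0.0177, 0.004, 0.9678] -- sums to 1
```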

Returning to LLMs, and ML more generally, the raw, unnormalised predictions (such as those output by the final linear layer) are referred to as logits. These logits are then normalised by a softmax function to produce a probability score for each candidate token, with the probabilities across all predictions summing to 1.

| Output variable | Variable value (logit) | Probability | Word |
|---|---|---|---|
| $p_1$ | 0.5 | 0.0040 | Sunny |
| $p_2$ | 1 | 0.0065 | Cloudy |
| $p_3$ | 2 | 0.0177 | Rainy |
| $p_4$ | 0.5 | 0.0040 | Misty |
| $p_5$ | 6 | 0.9678 | Snowy |

The probability column gives the probability that the associated token will be chosen as the next token generated by the LLM.
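
To make the pipeline concrete, here is a hedged sketch using PyTorch (any framework would do); the layer sizes are hypothetical and chosen only for illustration:

```python
import torch

vocab_size = 5     # tiny vocabulary, matching the five-word example above
hidden_size = 16   # hypothetical model dimension, for illustration only

# The final linear layer projects the model's hidden state to one logit per token.
final_linear = torch.nn.Linear(hidden_size, vocab_size)
hidden_state = torch.randn(hidden_size)

logits = final_linear(hidden_state)            # raw, unnormalised scores
probabilities = torch.softmax(logits, dim=-1)  # one probability per token
print(probabilities)
print(probabilities.sum())                     # tensor(1.) -- always sums to 1
```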

User interfaces built on top of model inference may expose a temperature setting. Raising the temperature flattens the probability distribution, so the model's less likely predictions are sampled more often; lowering it sharpens the distribution towards the most likely predictions. Some LLMs fix the temperature at a value that introduces an element of randomness, so that the model consistently favours probable tokens without always selecting the same ones, giving an impression of creativity.
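
As an illustrative sketch (not any particular model's implementation), temperature is typically applied by dividing the logits by the temperature value before the softmax, then sampling from the resulting distribution:

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0):
    """Sample a token index from logits, after scaling by temperature."""
    # Dividing the logits by the temperature flattens the distribution
    # (temperature > 1: less likely tokens chosen more often) or sharpens it
    # (temperature < 1: the most likely token dominates).
    scaled = [x / temperature for x in logits]
    max_logit = max(scaled)
    exponentiated = [math.exp(x - max_logit) for x in scaled]
    total = sum(exponentiated)
    probabilities = [value / total for value in exponentiated]
    # random.choices picks an index in proportion to the probabilities.
    return random.choices(range(len(logits)), weights=probabilities, k=1)[0]

logits = [0.5, 1, 2, 0.5, 6]
print(sample_with_temperature(logits, temperature=0.5))  # almost always index 4
print(sample_with_temperature(logits, temperature=2.0))  # other indices appear more often
```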