1.1 Input embeddings
Converting natural language to numerical representations.
Diagram 1.1.0: The Transformer, Vaswani et al. (2017)
The inputs to the encoder are first split into tokens. A token may consist of a whole word, or a portion of a word.
Each token is then mapped to a multidimensional vector (an ordered set of numbers suited to matrix operations). One common vector representation is a word embedding. A word embedding represents tokens densely, such that similar tokens have high cosine similarity when their vectors are compared. This dense representation also lets a model handle a large vocabulary of distinct tokens efficiently.
Diagram 1.1.1: A depiction of a sequence being converted to tokens, and then one token being converted to a 9-dimension vector.
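To make Diagram 1.1.1 concrete, here is a minimal sketch in Python/NumPy, assuming a made-up four-word vocabulary, a whitespace tokenizer and a randomly initialised 9-dimensional embedding table; real tokenizers and embedding tables are learned and far larger.

```python
import numpy as np

# Toy vocabulary and embedding table (9-dimensional, matching Diagram 1.1.1).
# In a real model both are learned; the values below are illustrative only.
rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "kitten": 3}
embedding_table = rng.normal(size=(len(vocab), 9))  # shape: (vocab_size, 9)

def tokenize(text: str) -> list[int]:
    """Naive whitespace 'tokenizer' mapping known words to token ids."""
    return [vocab[w] for w in text.lower().split() if w in vocab]

def embed(token_ids: list[int]) -> np.ndarray:
    """Look up one embedding row per token id."""
    return embedding_table[token_ids]  # shape: (num_tokens, 9)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vectors = embed(tokenize("the cat sat"))
print(vectors.shape)  # (3, 9)

# After training, similar tokens should show high cosine similarity;
# here the values are random, so the score is arbitrary.
print(cosine_similarity(embedding_table[vocab["cat"]],
                        embedding_table[vocab["kitten"]]))
```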
Word embeddings may be developed via:
- neural networks trained on large quantities of unlabeled text, optimising a loss function that pulls together the vectors of tokens appearing in similar contexts (e.g. Word2Vec from Google, 2013; a toy training sketch follows this list)
- global word-word co-occurrence statistics (e.g. GloVe from Pennington et al. 2014)
- subword (character n-gram) information combined with a Word2Vec-style objective (e.g. fastText from Bojanowski et al. 2017)
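As a rough illustration of the first approach, the snippet below trains a toy Word2Vec model with the gensim library (4.x API); the tiny corpus and the hyperparameters (vector_size, window, sg) are arbitrary placeholders, not recommended settings.

```python
from gensim.models import Word2Vec

# A tiny, made-up corpus of pre-tokenised sentences (real training uses
# huge quantities of unlabeled text).
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "kitten", "is", "a", "young", "cat"],
]

# Skip-gram (sg=1) optimises a loss that pushes the vectors of tokens
# appearing in similar contexts closer together.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"].shape)              # (50,)
print(model.wv.similarity("cat", "dog"))  # cosine similarity of two embeddings
```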
Modern word embeddings typically have very high dimensionality; for example, Meta's LLM Llama-4 uses 5120 dimensions[1] for its larger models. Once the embeddings have been finalised, mapping a token to its word embedding is a one-to-one lookup (see the sketch below).
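In a framework such as PyTorch, that lookup is just an embedding matrix indexed by token id. The sketch below assumes a hypothetical 128,000-token vocabulary and reuses the 5120-dimensional figure quoted above; the weights are randomly initialised rather than trained.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a 128k-token vocabulary mapped to 5120-dimensional vectors.
embedding = nn.Embedding(num_embeddings=128_000, embedding_dim=5120)

token_ids = torch.tensor([17, 512, 17])  # arbitrary example token ids
vectors = embedding(token_ids)           # shape: (3, 5120)

# The mapping is one-to-one: the same token id always returns the same row.
assert torch.equal(vectors[0], vectors[2])
print(vectors.shape)
```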
A theoretical example of a word embedding could be a 6-dimensional vector, where each dimension measures how strongly the token/word belongs to one of the following groups (a hand-written sketch follows this list):
- noun
- verb
- adjective
- adverb
- preposition
- connective
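Written out by hand, such a toy scheme might look like the following; the words and membership scores are invented purely for illustration, and no real embedding is built this way.

```python
# Hand-crafted 6-dimensional "embeddings": each position scores membership of
# (noun, verb, adjective, adverb, preposition, connective). Values are invented.
toy_embeddings = {
    "cat":     [0.9, 0.0, 0.1, 0.0, 0.0, 0.0],
    "run":     [0.2, 0.9, 0.0, 0.0, 0.0, 0.0],
    "quickly": [0.0, 0.1, 0.0, 0.9, 0.0, 0.0],
    "under":   [0.0, 0.0, 0.0, 0.0, 0.9, 0.1],
    "and":     [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
}
```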
Realistically, if generated via a neural network, each dimension of an embedding could represent membership of a grouping of tokens that share some connection, even though that connection may never have been formally defined by humans. Furthermore, multiple such dimensions can interrelate, so no single dimension maps cleanly onto one human-interpretable property.
Representing data as vectors is a technique that applies to many contexts, and is a large field in its own right even outside of LLMs. For example, vector representations can be used to measure similarity between documents for classification or information retrieval purposes, as in the sketch below.
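As one sketch of that idea, assuming a made-up five-word vocabulary with random vectors, two "documents" can be compared by averaging their token vectors and taking the cosine similarity of the results.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up 9-dimensional vectors for a tiny vocabulary (real ones are trained).
embeddings = {w: rng.normal(size=9)
              for w in ["cats", "dogs", "pets", "stocks", "bonds"]}

def doc_vector(tokens: list[str]) -> np.ndarray:
    """Represent a document as the mean of its token embeddings."""
    return np.mean([embeddings[t] for t in tokens if t in embeddings], axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = doc_vector(["cats", "dogs", "pets"])
doc_b = doc_vector(["stocks", "bonds"])
# With trained embeddings, unrelated topics would score low; here the
# vectors are random, so the score is arbitrary.
print(cosine(doc_a, doc_b))
```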