1.6 Feed-forward Network

Deductions drawn from the input, based on patterns learned during training and applied during utilisation.


Diagram 1.6.0: The Transformer, Vaswani et al. (2017)

Prerequisite knowledge

It is difficult to swiftly summarise the inner mechanisms of a neural network (also known as a feed-forward network), and the topic deserves a separate chapter. In Machine Learning courses, one method of teaching is to build up the concept of the neural network in the following steps:

  1. models - linear regression models and the value of the weights/parameters within them
  2. training - via a dataset set up for supervised learning, training a linear regression model by updating its weights via gradient descent (a minimal sketch of this step follows the list)
  3. forward propagation - connecting neurons, so that each neuron's inputs are the outputs of the previous layer's neurons after processing through activation functions, to form a neural network
  4. backpropagation - training a neural network via backpropagation (multiplying a chain of partial derivatives to deduce exactly how a change in each weight affects the loss function) and gradient descent
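
To make steps 1 and 2 concrete, below is a minimal sketch (in plain Python with NumPy; the synthetic dataset, learning rate, and variable names are illustrative assumptions, not taken from any particular course or library) of fitting y = mx + c to a small dataset via gradient descent:

```python
import numpy as np

# Synthetic supervised-learning dataset: y = 3x + 2 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=100)

# Parameters (weights) of the linear regression model y_hat = m*x + c
m, c = 0.0, 0.0
learning_rate = 0.1

for step in range(500):
    y_hat = m * x + c                      # forward pass
    loss = np.mean((y_hat - y) ** 2)       # mean squared error
    # Gradients of the loss with respect to each parameter
    grad_m = np.mean(2 * (y_hat - y) * x)
    grad_c = np.mean(2 * (y_hat - y))
    # Gradient descent: nudge each parameter against its gradient
    m -= learning_rate * grad_m
    c -= learning_rate * grad_c

print(f"learned m = {m:.2f}, c = {c:.2f}")  # expect roughly 3 and 2
```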

Theoretical overview


Diagram 1.6.1: Example of a basic feed-forward neural network, with only 3 inputs, 1 hidden layer, and 2 outputs.


Diagram 1.6.2: Continued example; inspection of inputs, contents, and outputs of one neuron within the hidden layer of Diagram 1.6.1.

What is a neural network?

  • A neural network is made up of layers
  • Each layer is made up of neurons
  • Each neuron within the same layer is assigned the same function, but different parameters
  • The parameters of the functions are adjusted during the training stage of a neural network
  • An example of a parameter would be the m and the c in y = mx + c
  • The types of functions involved are typically linear regression functions, with many parameters, such as in Diagram 1.6.2
  • A linear regression function within a neural network may contain any quantity of weights (wᵢ)
  • Inputs are taken from the outputs of the previous layer of functions, processed, and then passed on as inputs to the next layer of functions (a minimal sketch of this forward pass follows the list)
  • Some layers consist solely of activation functions, which may cause certain neurons within a layer to output zero unless a specific condition is met
  • The notion of the activation layer is loosely based on synapses in the human brain, in which a neuron only passes information on when it is sufficiently stimulated
  • In this sense, neural networks are loosely modelled on the mechanisms of the human brain
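
As a concrete companion to the points above, here is a minimal sketch of a single forward pass through a network shaped like Diagram 1.6.1 (3 inputs, 1 hidden layer, 2 outputs). The hidden-layer size of 4 neurons, the random weights, and the use of ReLU as the activation are assumptions made purely for illustration:

```python
import numpy as np

def relu(z):
    # Activation: a neuron outputs zero unless its weighted sum is positive
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

# Shapes follow Diagram 1.6.1: 3 inputs, one hidden layer (4 neurons assumed), 2 outputs
W1 = rng.normal(size=(4, 3))   # each row holds the weights w_i of one hidden neuron
b1 = np.zeros(4)
W2 = rng.normal(size=(2, 4))
b2 = np.zeros(2)

x = np.array([0.5, -1.0, 2.0])     # one input vector

hidden = relu(W1 @ x + b1)         # every hidden neuron: weighted sum, then activation
output = W2 @ hidden + b2          # output layer takes the hidden outputs as its inputs

print(hidden, output)
```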

Training overview


Diagram 1.6.3: Overview of the training process for a neural network with 1 hidden layer.

How is a neural network trained?

  1. A large quantity of data is collated, containing neural-network inputs paired with their expected outputs; this is known as training data
  2. A loss function is selected for the neural network, suitable both for the quantity of its inputs and outputs and for the type of data involved; it quantifies the difference between the expected output (taken from the training data) and the actual output of the neural network
  3. One instance from the training data is run through the neural network, and the difference between the expected and actual output is calculated via the loss function
  4. The parameters within the network are adjusted via a process known as backpropagation, which involves a chain of partial derivatives and can be automated via software libraries; steps 3 and 4 are repeated over the training data until the loss is acceptably low (a minimal sketch of this loop follows the list)
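
The steps above can be written out directly. The sketch below trains the same toy 3-4-2 network on a synthetic dataset with a mean-squared-error loss, deriving the backpropagation gradients by hand; in practice a library such as PyTorch or JAX automates step 4, so treat this purely as an illustration of the mechanics, with the dataset, sizes, and learning rate chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: training data - inputs and their expected outputs (toy target function)
X = rng.normal(size=(200, 3))
Y = np.stack([X.sum(axis=1), X[:, 0] - X[:, 2]], axis=1)

# Network parameters: 3 inputs -> 4 hidden neurons -> 2 outputs
W1, b1 = rng.normal(size=(4, 3)) * 0.5, np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)) * 0.5, np.zeros(2)
lr = 0.05

for epoch in range(200):
    for x, y in zip(X, Y):
        # Step 3: run one training instance through the network
        z1 = W1 @ x + b1
        h = np.maximum(0.0, z1)            # ReLU activation
        y_hat = W2 @ h + b2
        # Step 2: loss function - mean squared error between expected and actual output
        loss = np.mean((y_hat - y) ** 2)

        # Step 4: backpropagation - a chain of partial derivatives, layer by layer
        d_yhat = 2 * (y_hat - y) / y.size
        dW2 = np.outer(d_yhat, h)
        db2 = d_yhat
        d_h = W2.T @ d_yhat
        d_z1 = d_h * (z1 > 0)              # derivative of ReLU
        dW1 = np.outer(d_z1, x)
        db1 = d_z1

        # Gradient descent: adjust every parameter against its gradient
        W2 -= lr * dW2; b2 -= lr * db2
        W1 -= lr * dW1; b1 -= lr * db1

print("final loss:", loss)
```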

Relevance to the Transformer

Specifically, in the context of the Transformer, the feed-forward network scales the dimensionality of its input upwards through a hidden layer, and then back down to the original dimensionality (by default, from 5120 up to 8192 in the case of Llama-4, i.e. 8192 neurons within 1 hidden layer).[1]
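
As a rough sketch of this up-projection/down-projection structure, the block below uses the plain two-layer form from Vaswani et al. (2017) with a ReLU activation; Llama-4's actual feed-forward block uses a gated activation variant, so this is a simplification, and the 5120/8192 dimensions are simply the figures quoted above:

```python
import numpy as np

d_model, d_ff = 5120, 8192          # dimensions quoted above (reduce for a quick test)

rng = np.random.default_rng(0)
W_up = rng.normal(size=(d_ff, d_model)) * 0.02    # scales dimensionality upwards
W_down = rng.normal(size=(d_model, d_ff)) * 0.02  # scales back down to d_model

def feed_forward(x):
    # Position-wise FFN: applied independently to each token vector x
    hidden = np.maximum(0.0, W_up @ x)   # up-project, then activation (ReLU here)
    return W_down @ hidden               # down-project back to the original size

token = rng.normal(size=d_model)         # one token's representation
out = feed_forward(token)
print(out.shape)                         # (5120,) - same dimensionality as the input
```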

Scaling the dimensionality upwards can make it easier to group data into appropriate patterns, which is vital when training the model to learn those patterns, and again during utilisation when matching an input against the intricate patterns learned previously. Imagine transitioning from 2-dimensional coordinates to 3-dimensional coordinates to identify the shape of a river. This is essentially the purpose of the feed-forward neural network in the context of an LLM: finding the patterns in a text corpus during training, and then applying those patterns to an input sequence during utilisation.
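
The intuition that more dimensions can expose structure is demonstrated by a classic toy example (purely illustrative, and unrelated to any specific LLM): points on two concentric circles cannot be separated by a straight line in 2 dimensions, but after lifting them into 3 dimensions with an extra coordinate z = x² + y², a flat plane separates them perfectly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes of 2-D points: an inner circle (radius ~1) and an outer ring (radius ~3)
angles = rng.uniform(0, 2 * np.pi, size=200)
radii = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)]) + rng.normal(scale=0.1, size=200)
points_2d = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
labels = np.concatenate([np.zeros(100), np.ones(100)])

# Lift to 3-D by adding z = x^2 + y^2 - the extra dimension encodes distance from the origin
z = (points_2d ** 2).sum(axis=1)

# In 3-D a single threshold on z (a flat plane) now separates the two classes
predictions = (z > 4.0).astype(float)
print("accuracy after the lift:", (predictions == labels).mean())
```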

For further information on Machine Learning, see notes at: https://cs-notes.xza.fr

References

[1] Llama-4 documentation - HuggingFace