1. The Transformer
Diagram 1.0: The Transformer, Vaswani et al. (2017)
This chapter walks through the layers of the Transformer from the bottom up.
Sources consulted for this chapter, listed in the order they were first accessed during writing:
“Large Language Models: A Deep Dive” by Uday Kamath, Kevin Keenan, Garrett Somers, Sarah Sorenson; Springer, 2024
https://link.springer.com/book/10.1007/978-3-031-65647-7
“The Hundred-Page Language Models Book” by Andriy Burkov
https://thelmbook.com
“Attention Is All You Need” by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
https://arxiv.org/abs/1706.03762
Tensor2Tensor documentation (Walkthrough) by the Google Brain team
https://github.com/tensorflow/tensor2tensor
“What are word embeddings?” by Joel Barnard
https://www.ibm.com/think/topics/word-embeddings
“An intuitive introduction to text embeddings” by Kevin Henner
https://stackoverflow.blog/2023/11/09/an-intuitive-introduction-to-text-embeddings/
“A Gentle Introduction to Positional Encoding” by Mehreen Saeed
https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/
“You could have designed state of the art positional encoding” by Christopher Fleetwood
https://huggingface.co/blog/designing-positional-encoding
“Transformers - A Primer” by Justin Seonyong Lee
https://www.columbia.edu/~jsl2239/transformers.html
“Deep Residual Learning for Image Recognition” by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
https://arxiv.org/abs/1512.03385
“Layer Normalization” by Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
https://www.cs.utoronto.ca/~hinton/absps/LayerNormalization.pdf
“Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” by Sergey Ioffe, Christian Szegedy
https://arxiv.org/abs/1502.03167
“The Illustrated Transformer” by Jay Alammar
https://jalammar.github.io/illustrated-transformer/
Transformers documentation - Llama 4 by Meta employees and contributors
https://huggingface.co/docs/transformers/v4.53.2/en/model_doc/llama4
“What is a tensor?” by the University of Cambridge Department of Materials Science
https://www.doitpoms.ac.uk/tlplib/tensors/what_is_tensor.php
“Tensor vs Matrix: an example with computer vision” by Juan Zamora-Mora
https://www.doczamora.com/tensor-vs-matrix-an-example-with-computer-vision
“Single Layer Perceptron and Multi Layer Perceptron” by Abhishek Jain
https://medium.com/@abhishekjainindore24/68ce4e8db5ea
“What is an encoder-decoder model?” by Jacob Murel, Joshua Noble
https://www.ibm.com/think/topics/encoder-decoder-model
“Transformer-based Encoder-Decoder models” by Patrick von Platen
https://huggingface.co/blog/encoder-decoder
“Speech and Language Processing” by Daniel Jurafsky, James H. Martin
https://web.stanford.edu/~jurafsky/slp3/