In 2017 a paper titled "Attention Is All You Need" introduced the transformer. Within a few years it had displaced most earlier sequence architectures and become the shared foundation of modern AI. Understanding it goes a long way toward understanding the field.
01 One architecture to rule them
What is remarkable about the transformer is its reach. The same basic design, scaled and adapted, handles text, images, audio, protein structures and code. Learn it once and the whole landscape of modern AI becomes legible.
02 The problem before
Earlier language models, chiefly recurrent networks, read text one word at a time, passing a running summary forward. This was slow, because each step had to wait for the previous one and so could not be parallelised, and it was forgetful: by the end of a long paragraph, the start had faded. Long-range connections, like a pronoun referring back many sentences, were precisely what these models handled worst.
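The fading can be illustrated with a toy recurrence. This is a deliberately simplified sketch, not a real recurrent network: the inputs are single numbers, and the `decay` parameter and both function names are assumptions made up for the example. It only shows the shape of the problem: each step depends on the previous one, and the first input's share of the summary shrinks as the sequence grows.

```python
def run_recurrent(inputs, decay=0.5):
    """Toy sequential reader: fold numbers into one running summary, in order."""
    state = 0.0
    states = []
    for x in inputs:  # step t cannot start until step t-1 has finished
        state = decay * state + (1.0 - decay) * x
        states.append(state)
    return states


def first_token_influence(length, decay=0.5):
    """How much the final summary changes when only the first input changes."""
    baseline = [0.0] * length
    bumped = [1.0] + [0.0] * (length - 1)
    return run_recurrent(bumped, decay)[-1] - run_recurrent(baseline, decay)[-1]


# The influence of the first input shrinks as the sequence gets longer.
for n in (2, 8, 32):
    print(n, first_token_influence(n))
```

With `decay=0.5` the first input's influence halves at every step, so after 32 steps it is smaller than after 8, which is the "fading" described above in miniature.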
03 Attention, the core idea
The transformer's answer is attention. Instead of a fading summary, every word can look directly at every other word and decide how much each one matters for understanding it. When processing "it", the model can attend strongly to the noun "it" refers to, however far back that noun sits.
Figure: how much the word "it" attends to each earlier word. The weight concentrates on "cat", its referent, letting the model resolve meaning across distance.
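A minimal sketch of that weighting, using scaled dot-product attention in plain Python. The example sentence and the two-dimensional vectors are hand-picked assumptions, not learned values: the vector for "cat" is simply built to align with the query for "it". In a trained model these vectors would come from learned projections.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all positions."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    # The output is the weighted average of the value vectors.
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, mixed


# Hypothetical toy data: one vector per word, chosen by hand.
words = ["the", "cat", "sat", "because", "it"]
keys = [[0.1, 0.0], [2.0, 2.0], [0.0, 0.3], [0.2, 0.1], [0.5, 0.5]]
values = keys  # reuse the keys as values to keep the sketch small
query_for_it = [2.0, 2.0]

weights, mixed = attend(query_for_it, keys, values)
for word, weight in zip(words, weights):
    print(f"{word:8s} {weight:.3f}")
```

Because the weights come from a softmax they always sum to 1, so the output for "it" is a blend of the other words' vectors, dominated here by "cat".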
04 Why parallelism mattered
Because attention compares all words at once rather than marching through them in order, a transformer can process an entire sequence in parallel during training. That fits modern GPUs, which excel at doing many calculations simultaneously. Models could suddenly be trained on far more data, far faster, and scale turned out to be what unlocked their capability.
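The property that makes this possible is that each position's attention output depends only on the input vectors, never on another position's output. The sketch below checks that by computing every position once in order and once through a thread pool and comparing the results. The function names and vectors are assumptions for illustration, and Python threads will not show a real speed-up here; on a GPU the same independence is exploited with large matrix multiplications.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def attend_one(i, vectors):
    """Attention output for position i: a softmax-weighted average of all vectors."""
    query = vectors[i]
    scale = math.sqrt(len(query))
    scores = [sum(a * b for a, b in zip(query, key)) / scale for key in vectors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [
        sum(e / total * v[d] for e, v in zip(exps, vectors))
        for d in range(len(query))
    ]


def attend_in_order(vectors):
    """Compute positions one after another."""
    return [attend_one(i, vectors) for i in range(len(vectors))]


def attend_concurrently(vectors, workers=4):
    """Hand each position to a worker; no position waits on another's result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: attend_one(i, vectors), range(len(vectors))))


# Hypothetical toy inputs.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
print(attend_in_order(vectors) == attend_concurrently(vectors))
```

A sequential reader could not be split up this way, because each step would need the state produced by the step before it.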
05 Building a transformer
A transformer stacks the same block many times. Each block has an attention layer that mixes information between positions, followed by a feed-forward network that processes each position independently. Inputs are first turned into embeddings with position information added, then passed up through the stack to produce the output.
The block is deliberately simple and repeated — sometimes dozens or hundreds of times. Depth plus scale, not architectural complexity, is where the capability comes from.
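The structure described in this section can be sketched in a few dozen lines. This is a stripped-down illustration under stated assumptions, not a faithful implementation: all weights are hand-written rather than learned, attention uses the inputs directly as queries, keys and values, and the learned projections, multiple heads, residual connections and normalisation found in real transformers are left out. Every name below is hypothetical.

```python
import math


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_layer(xs):
    """Mix information between positions (xs serves as queries, keys and values)."""
    scale = math.sqrt(len(xs[0]))
    out = []
    for query in xs:
        weights = softmax([sum(a * b for a, b in zip(query, k)) / scale for k in xs])
        out.append([sum(w * v[d] for w, v in zip(weights, xs)) for d in range(len(query))])
    return out


def feed_forward(x, w_in, w_out):
    """Process one position on its own: expand, apply ReLU, project back."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)


def block(xs, w_in, w_out):
    """One block: attention across positions, then feed-forward per position."""
    mixed = attention_layer(xs)
    return [feed_forward(x, w_in, w_out) for x in mixed]


def transformer(token_ids, embeddings, positions, layers):
    """Embed tokens, add position information, then pass up through the stack."""
    xs = [
        [e + p for e, p in zip(embeddings[tok], positions[i])]
        for i, tok in enumerate(token_ids)
    ]
    for w_in, w_out in layers:
        xs = block(xs, w_in, w_out)
    return xs


# Hypothetical toy parameters: 3 token types, width 2, hidden width 3.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
positions = [[0.0, 0.1], [0.1, 0.0], [0.1, 0.1]]
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
w_out = [[0.5, 0.0, 0.5], [0.0, 0.5, -0.5]]
layers = [(w_in, w_out)] * 3  # the same block shape repeated three times

output = transformer([0, 1, 2], embeddings, positions, layers)
print(output)
```

Making the model deeper only means lengthening `layers`; the code for a single block does not change, which is the sense in which the design is simple and repeated.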
06 Why it spread everywhere
The transformer won because it is general and scalable. Treat anything as a sequence of tokens — words, image patches, audio frames — and the same machinery applies. It parallelises beautifully and keeps improving as you add data and compute. That combination of generality and scalability is why one 2017 idea now sits under almost everything.
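As a rough sketch of what "treat anything as a sequence of tokens" can mean, the helpers below turn text, a small image and an audio signal into lists of tokens. The function names, the whitespace splitting and the patch and frame sizes are all assumptions for illustration; real systems typically use subword tokenisers and learned projections of each patch or frame.

```python
def text_to_tokens(text):
    """Split text into lowercase word tokens."""
    return text.lower().split()


def image_to_patches(image, size):
    """Cut a 2D grid of pixel values into flattened size x size patches, row by row."""
    rows, cols = len(image), len(image[0])
    patches = []
    for top in range(0, rows - size + 1, size):
        for left in range(0, cols - size + 1, size):
            patch = [
                image[r][c]
                for r in range(top, top + size)
                for c in range(left, left + size)
            ]
            patches.append(patch)
    return patches


def audio_to_frames(samples, frame_len):
    """Chop a list of samples into consecutive, non-overlapping frames."""
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


# Hypothetical toy inputs.
image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(text_to_tokens("The cat sat"))
print(image_to_patches(image, 2))
print(audio_to_frames([1, 2, 3, 4, 5], 2))
```

In each case the result is an ordered list of items, which is all the attention and feed-forward machinery needs once the items are mapped to vectors.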
What to remember
- The transformer (2017) is the shared architecture behind modern AI.
- Older models read sequentially — slow, and forgetful over long spans.
- Attention lets every token look at every other and weigh its relevance.
- Processing all positions at once enables GPU parallelism and scale.
- A transformer stacks simple blocks: attention plus a feed-forward layer.
- Generality plus scalability is why it spread across every modality.
