AI Development Milestone

The Transformer Revolution

In 2017, Google researchers introduced a radically new neural network architecture that fundamentally changed the direction of artificial intelligence development. Explore interactively how it works.

Explore the architecture

"We don't understand intelligence. We don't know how the brain works… and we don't really understand these neural networks either. It's like looking into a Petri dish."

— Connor Leahy, AI researcher

2017
Transformer introduced
8
Authors (Google)
175B
GPT-3 parameters
100K+
Citations

Milestones in AI Development

From deep learning to the Transformer revolution and today's generative AI.

2012
AlexNet and the Deep Learning Breakthrough
Alex Krizhevsky's deep neural network wins the ImageNet competition with half the error rate of the runner-up. This moment launches the deep learning revolution.
2014
The Emergence of the Attention Mechanism
Bahdanau and colleagues introduce the attention mechanism into RNN-based translation models, enabling selective focus on relevant parts of the input sequence.
2017 — The Turning Point
"Attention Is All You Need"
Vaswani and colleagues present the Transformer architecture, built entirely on self-attention, abandoning RNNs and convolutions. This paper is one of the most-cited publications in AI history.
2018
GPT-1 and BERT
OpenAI creates the GPT-1 model (117M parameters) and Google creates BERT. Both build on the Transformer architecture but with different approaches: GPT is decoder-based, BERT is encoder-based.
2020
GPT-3: 175 Billion Parameters
OpenAI's GPT-3 model demonstrates remarkable text understanding and generation without task-specific fine-tuning. This model makes AI capabilities visible to a broader audience.
2022-2023
ChatGPT and the Generative AI Explosion
The launch of ChatGPT (GPT-3.5/GPT-4) sparks explosive interest in large language models. Transformer-based models now generate not only text but also images, audio, and video (DALL-E, Stable Diffusion, Sora).

The Structure of the Transformer

Click on individual components to view detailed explanations. The architecture is draggable and zoomable.

Interactive Diagram

Transformer Architecture

The Transformer architecture is an encoder-decoder structure introduced by the Google Brain team in 2017 in the paper "Attention Is All You Need."

Its most important innovation is the self-attention mechanism, which allows the model to focus on any part of the input sequence regardless of its distance.

Click on the architecture elements to view their detailed descriptions, or choose from the buttons to highlight individual components.

Self-Attention Mechanism

Click on a word to see how the Transformer model attends to the rest of the sentence to understand context.

Click a word to display attention weights

Building Blocks of the Transformer

Fundamental concepts needed to understand the architecture.

Token Embedding

Converts the words of a text into numerical vectors that encode semantic meaning in a high-dimensional space.
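As a rough sketch of the lookup step (toy vocabulary and embedding dimension chosen for illustration, not taken from the paper):

```python
import numpy as np

# Toy embedding lookup: word -> index -> dense vector.
# The names (vocab, d_model) and sizes here are illustrative only.
rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model = 8                                   # embedding dimension
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Map a list of words to a (seq_len, d_model) matrix."""
    ids = [vocab[t] for t in tokens]
    return embedding_table[ids]

x = embed(["the", "cat", "sat"])
print(x.shape)  # (3, 8)
```

In a trained model the table entries are learned, so words with similar meanings end up with nearby vectors.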

Positional Encoding

Since the Transformer processes words in parallel, it encodes word order using sine and cosine functions.
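The sinusoidal scheme from the original paper can be sketched directly; the formula below follows "Attention Is All You Need", with dimensions chosen for illustration:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims: sine
    pe[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return pe

pe = positional_encoding(4, 8)
print(pe.shape)  # (4, 8)
print(pe[0])     # position 0: sine terms are 0, cosine terms are 1
```

Each position gets a unique pattern across frequencies, and the encoding is simply added to the token embeddings.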

Query, Key, Value

For each token, self-attention computes three vectors through learned projection matrices: Q, the query (what am I looking for?), K, the key (what do I contain?), and V, the value (what information do I carry?). Dot products between queries and keys determine the attention weights.
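A minimal NumPy sketch of scaled dot-product attention; the projection matrices are random here purely for illustration, whereas a real layer learns them:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- the core of self-attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 3, 4
X = rng.normal(size=(seq_len, d_k))                   # token representations
# In a trained model W_q, W_k, W_v are learned; random for this sketch.
W_q, W_k, W_v = (rng.normal(size=(d_k, d_k)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

The softmax rows are exactly the "attention weights" the interactive demo above visualizes when you click a word.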

Multi-Head Attention

Multiple parallel attention "heads" operate simultaneously, each capturing different aspects: syntactic, semantic, and positional relationships.
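The splitting into heads can be sketched as follows; the per-head input/output projections of a real layer are omitted for brevity, and all sizes are illustrative:

```python
import numpy as np

def multi_head_attention(X, num_heads):
    """Split d_model into num_heads subspaces, attend in each, concatenate.
    Learned projections (W_q, W_k, W_v, W_o) are omitted in this sketch."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        Q = K = V = X[:, h * d_head:(h + 1) * d_head]  # one subspace
        scores = Q @ K.T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)             # softmax per row
        heads.append(w @ V)
    return np.concatenate(heads, axis=-1)              # (seq_len, d_model)

X = np.random.default_rng(0).normal(size=(3, 8))
out = multi_head_attention(X, num_heads=2)
print(out.shape)  # (3, 8)
```

Because each head works in its own subspace, one head can track syntax while another tracks meaning, as described above.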

Feed-Forward Network

After each attention layer, a fully connected network enriches the representation with nonlinear transformations.
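A sketch of the position-wise feed-forward step, applied identically to every token; the paper uses an inner dimension of 4 × d_model, and the weights here are random placeholders:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: expand to d_ff, apply ReLU, project back."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU nonlinearity

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 3                 # d_ff = 4 * d_model
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(seq_len, d_model))
y = feed_forward(x, W1, b1, W2, b2)
print(y.shape)  # (3, 8)
```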

Residual Connection

Add & Norm layers ensure unobstructed gradient flow and preserve the original information in the deep network.
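The Add & Norm step can be sketched as a residual addition followed by layer normalization (learned scale/shift parameters omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's features to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    """Residual connection + layer norm: the input is added back,
    so gradients can flow around the sublayer unobstructed."""
    return layer_norm(x + sublayer_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
y = add_and_norm(x, rng.normal(size=(3, 8)))
print(np.allclose(y.mean(axis=-1), 0, atol=1e-6))  # True
```

The residual path is what lets dozens of these blocks be stacked without the vanishing gradients that plagued deep RNNs.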

Why Is It Revolutionary?

A comparison of the world before and after the Transformer.

Aspect | RNN / LSTM (before) | Transformer (after)
Processing | Sequential (word by word) | Parallel (all at once)
Long-range dependencies | Handled poorly | Captured with global attention
Training speed | Slow; hard to exploit GPUs | Fast; massively parallel
Scalability | Limited | Scales to billions of parameters
Transferability | Retrained for each task | General purpose; fine-tuned
Multimodality | Mostly text | Text, image, audio, video

"GPT was a general-purpose pattern learner. Before that there was very little transferability — you had to retrain for every task. GPT changed that."

— Connor Leahy, on AI development

Citations

1. Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS 2017. arxiv.org/abs/1706.03762

2. Connor Leahy — "AI is MUTATING: And We Don't Know What It is Doing." YouTube video

3. Polo Club of Data Science. "LLM Transformer Model Visually Explained." poloclub.github.io

4. Wikipedia. "Timeline of artificial intelligence." wikipedia.org