In 2017, Google researchers introduced a radically new neural network architecture that fundamentally changed the direction of artificial intelligence development. Explore interactively how it works.
Explore the architecture

"We don't understand intelligence. We don't know how the brain works… and we don't really understand these neural networks either. It's like looking into a Petri dish."
From deep learning to the Transformer revolution and today's generative AI.
Click on individual components to view detailed explanations. The architecture is draggable and zoomable.
The Transformer architecture is an encoder-decoder structure introduced by the Google Brain team in 2017 in the paper "Attention Is All You Need."
Its most important innovation is the self-attention mechanism, which lets the model attend to any part of the input sequence, no matter how far apart the related words are.
Click on the architecture elements to view their detailed descriptions, or choose from the buttons to highlight individual components.
Click on a word to see how the Transformer model attends to the rest of the sentence to understand context.
Click a word to display attention weights
Fundamental concepts needed to understand the architecture.
Converts the words of a text into numerical vectors that encode semantic meaning in a high-dimensional space.
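As a minimal illustration of this lookup, here is a toy sketch (the vocabulary, dimension, and random initialization are all placeholders; real models learn these vectors during training):

```python
import numpy as np

# Hypothetical toy vocabulary; real models use tens of thousands of tokens
# and embedding dimensions in the hundreds or thousands.
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model = 8

rng = np.random.default_rng(0)
# Embedding table: one d_model-dimensional vector per vocabulary entry.
# In a trained model these vectors encode semantic similarity.
embedding_matrix = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Map each token to its vector by table lookup."""
    return embedding_matrix[[vocab[t] for t in tokens]]

x = embed(["the", "cat", "sat"])  # shape: (3, 8)
```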
Because the Transformer processes all words in parallel, it has no built-in notion of order; positional encodings built from sine and cosine functions of different frequencies are added to the embeddings to restore it.
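The sinusoidal encoding from "Attention Is All You Need" can be sketched in a few lines of NumPy (sequence length and dimension below are arbitrary examples):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    Each position gets a unique pattern; nearby positions get similar ones."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)   # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=8)
```

The resulting matrix is simply added to the word embeddings before the first attention layer.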
Self-attention projects each word into three vectors via learned matrices: Q (what am I looking for?), K (what do I contain?), V (what value do I carry?). Comparing queries against keys produces the attention weights, which blend the values.
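A minimal single-head sketch of scaled dot-product attention (the dimensions and random weights are illustrative placeholders, not values from the paper):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for one head.
    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) learned projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    # Each row of `weights` sums to 1: how strongly that position
    # attends to every other position, regardless of distance.
    weights = softmax(q @ k.T / np.sqrt(d_k))
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                        # 4 tokens, d_model = 8
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)    # out: (4, 4)
```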
Multiple parallel attention "heads" operate simultaneously, each capturing different aspects: syntactic, semantic, and positional relationships.
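A toy sketch of the split-attend-concatenate pattern (the learned Q/K/V and output projections of the real architecture are deliberately omitted to keep the head-splitting mechanics visible):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, n_heads):
    """Split d_model into n_heads slices, run scaled dot-product attention
    independently in each slice, then concatenate the results.
    Each head can specialize in a different kind of relationship."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    heads = []
    for h in range(n_heads):
        xh = x[:, h * d_head:(h + 1) * d_head]      # this head's slice
        w = softmax(xh @ xh.T / np.sqrt(d_head))    # per-head attention weights
        heads.append(w @ xh)
    return np.concatenate(heads, axis=-1)           # back to (seq_len, d_model)

rng = np.random.default_rng(0)
out = multi_head_attention(rng.normal(size=(4, 8)), n_heads=2)
```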
After each attention layer, a position-wise fully connected network enriches each token's representation with nonlinear transformations.
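This block is just two linear layers with a nonlinearity in between, applied to every position independently; a sketch with illustrative dimensions (in the original paper d_model = 512 and the inner dimension is 2048):

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feed-forward block: expand to a wider inner
    dimension, apply ReLU, then project back to d_model."""
    return np.maximum(0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                      # toy sizes; d_ff is typically ~4x d_model
x = rng.normal(size=(4, d_model))          # 4 tokens
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
y = feed_forward(x, w1, b1, w2, b2)        # same shape as x: (4, 8)
```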
Add & Norm layers combine a residual connection (Add) with layer normalization (Norm), keeping gradients flowing through the deep network and preserving the original information at each layer.
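A minimal sketch of the pattern (the learned scale and shift parameters of layer normalization are omitted here):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_output):
    """Residual connection + layer normalization: the input is added back
    to the sublayer's output, so gradients can flow around the sublayer
    and the original representation is never fully overwritten."""
    return layer_norm(x + sublayer_output)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
y = add_and_norm(x, rng.normal(size=(4, 8)))   # same shape, normalized per position
```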
A comparison of the world before and after the Transformer.
| Aspect | RNN / LSTM (before) | Transformer (after) |
|---|---|---|
| Processing | ✗ Sequential (word by word) | ✓ Parallel (all at once) |
| Long-range dependencies | ✗ Handles poorly | ✓ With global attention |
| Training speed | ✗ Slow, hard to parallelize on GPUs | ✓ Fast, massively parallel |
| Scalability | ✗ Limited | ✓ Scales to billions of parameters |
| Transferability | ✗ Retrain per task | ✓ General purpose, fine-tuning |
| Multimodality | ✗ Mostly text | ✓ Text, image, audio, video |
"GPT was a general-purpose pattern learner. Before that there was very little transferability — you had to retrain for every task. GPT changed that."
1. Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS 2017. arxiv.org/abs/1706.03762
2. Connor Leahy. "AI is MUTATING: And We Don't Know What It is Doing." YouTube video
3. Polo Club of Data Science. "LLM Transformer Model Visually Explained." poloclub.github.io
4. Wikipedia. "Timeline of artificial intelligence." wikipedia.org