Transformers: The Architecture That Changed Everything
Introduced in 2017's landmark "Attention Is All You Need" paper, the Transformer architecture is the foundation of GPT, Claude, BERT, and virtually every modern language model. It did not just improve AI -- it redefined what was possible.
Why Transformers Matter
Before 2017, sequence models were slow, forgetful, and fundamentally limited. Transformers solved all three problems at once.
Before Transformers
- RNNs processed words one at a time -- painfully sequential and slow to train
- LSTMs/GRUs improved memory, but still struggled with long-range dependencies beyond a few hundred tokens
- No parallelism across time steps -- each step depended on the previous one, so GPUs could not be kept busy
- Training on large datasets took weeks or months
After Transformers
- Parallel processing -- all tokens processed simultaneously, fully utilizing modern GPUs
- Attention mechanism lets any token attend to any other token, regardless of distance
- Scales beautifully -- more data + more compute = predictably better performance
- Enabled models with billions and trillions of parameters
Transformers enabled the scaling revolution, and they now power virtually every state-of-the-art AI system -- language models, vision models, speech recognition, and beyond.
The Attention Mechanism: The Key Innovation
The core idea behind transformers is deceptively simple: when processing a word, the model should be able to focus on the most relevant parts of the entire input. This is called attention.
What is Attention?
In traditional models, each word only "sees" its immediate neighbors. Attention allows every word to directly look at every other word in the sequence, computing a relevance score. High-relevance words get more influence. Think of it as the model asking: "For this word I'm processing, which other words in the sentence should I pay the most attention to?"
Self-Attention
Self-attention is when a sequence attends to itself. Each token in the input computes three vectors -- a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I provide?). The attention score between two tokens is the dot product of the Query of one with the Key of the other, scaled by the square root of the dimension, then passed through a softmax to get weights. These weights determine how much each Value contributes to the output.
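The computation described above can be sketched in a few lines of NumPy. This is a toy illustration: real models use much larger learned projection matrices and batched tensor operations, and the matrix shapes here are arbitrary.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors.

    X: (seq_len, d_model) input embeddings.
    Wq, Wk, Wv: learned projection matrices (passed in here for clarity).
    """
    Q = X @ Wq            # queries: what is each token looking for?
    K = X @ Wk            # keys: what does each token contain?
    V = X @ Wv            # values: what information does each token provide?
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise relevance, scaled
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)     # softmax over each row
    return weights @ V    # each output is a relevance-weighted mix of values

# Toy example: 3 tokens, model dimension 4
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
Wq, Wk, Wv = [rng.normal(size=(4, 4)) for _ in range(3)]
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (3, 4): one output vector per token
```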
Attention in Action
Consider the classic example sentence "The animal didn't cross the street because it was tired." When the model processes the word "it", the attention mechanism assigns a high weight to "animal" and a low weight to "street." The model has learned that "tired" is a property of living things, so "it" most likely refers to "animal." This is exactly the kind of long-range dependency that RNNs struggled with and transformers handle effortlessly through attention.
Multi-Head Attention: Seeing Multiple Patterns at Once
Instead of computing attention once, transformers run multiple attention operations in parallel -- called "heads." Each head can learn to focus on different types of relationships. The outputs of all heads are concatenated and linearly projected to produce the final result.
Head 1: Syntax
Learns grammatical relationships -- subject-verb agreement, modifier connections
Head 2: Coreference
Tracks pronouns back to their referents -- "it" to "animal", "she" to "Dr. Smith"
Head 3: Semantics
Captures meaning-level relationships -- "tired" relates to living entities
Head 4: Position
Tracks proximity and word order -- "didn't" negates the immediately following verb
GPT-3 uses 96 attention heads per layer, each operating on a 128-dimensional subspace of the 12,288-dimensional model. This division of labor -- many heads, each free to specialize in a different kind of relationship -- is a fundamental reason these models are so capable.
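The split-and-concatenate structure of multi-head attention can be sketched as follows. This is a minimal NumPy toy: the per-head Q/K/V projections are omitted for brevity, so each head simply self-attends within its slice of the model dimension, and the sizes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads, Wo):
    """Run n_heads attention operations in parallel on subspaces of X,
    then concatenate the results and apply a final linear projection.
    X: (seq_len, d_model); d_model must divide evenly by n_heads."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Split the model dimension into per-head subspaces: (n_heads, seq_len, d_head)
    heads = X.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    outputs = []
    for H in heads:  # each head attends independently over its own subspace
        scores = H @ H.T / np.sqrt(d_head)
        outputs.append(softmax(scores) @ H)
    concat = np.concatenate(outputs, axis=-1)   # back to (seq_len, d_model)
    return concat @ Wo                          # final linear projection

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))          # 5 tokens, d_model = 8
Wo = rng.normal(size=(8, 8))
out = multi_head_attention(X, n_heads=4, Wo=Wo)
print(out.shape)  # (5, 8)
```

The division here mirrors the GPT-3 numbers quoted above: 12,288 split across 96 heads gives each head a 128-dimensional subspace.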
Transformer Architecture: Building Blocks
The original Transformer consists of an Encoder (processes input) and a Decoder (generates output). Here is how data flows through the architecture from bottom to top.
Input Embedding + Positional Encoding
Tokens are converted into dense vectors (embeddings). Since transformers process all tokens in parallel and have no inherent sense of order, sinusoidal positional encodings are added to inject position information. Each position gets a unique signature based on sine and cosine functions at different frequencies.
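The sinusoidal scheme from the original paper can be written directly from its defining formulas (dimensions below are illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from "Attention Is All You Need":
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)   # (seq_len, d_model / 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)    # cosine on odd dimensions
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)     # (50, 16)
print(pe[0, :4])    # position 0: sin(0) = 0 and cos(0) = 1 alternate -> [0. 1. 0. 1.]
```

Because each position's signature is a fixed pattern of frequencies, nearby positions get similar encodings while distant ones diverge, letting attention recover order information.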
Encoder Stack
Processes the full input sequence. Each layer applies self-attention (every token attends to every other token), then a position-wise feed-forward network. The original paper used N=6 identical layers stacked together.
Decoder Stack
Generates output one token at a time (during inference). Uses masked self-attention so each position can only attend to earlier positions -- preventing it from "seeing the future." Cross-attention layers attend to the encoder output, allowing the decoder to draw from the full input context.
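The "no seeing the future" rule is typically implemented as an additive mask applied to the attention scores before the softmax, as in this minimal sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """Mask for masked self-attention: position i may attend only to
    positions 0..i. Disallowed entries get -inf, so the softmax assigns
    them exactly zero weight. Usage: softmax(scores + causal_mask(n))."""
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

m = causal_mask(4)
print(m)  # row i has -inf everywhere to the right of column i
```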
Linear Layer + Softmax
The decoder output is projected through a linear layer to the vocabulary size, then softmax converts these logits into a probability distribution over all possible next tokens. The highest-probability token (or a sample from the distribution) is selected as the output.
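This final step can be sketched as follows (a NumPy toy: the hidden size and vocabulary size are illustrative, and a real model's projection matrix is learned):

```python
import numpy as np

def next_token_distribution(decoder_out, W_vocab):
    """Project the decoder's final hidden state to vocabulary logits,
    then softmax the logits into a probability distribution."""
    logits = decoder_out @ W_vocab                  # (vocab_size,)
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(2)
hidden = rng.normal(size=16)            # toy hidden state, d_model = 16
W_vocab = rng.normal(size=(16, 100))    # toy vocabulary of 100 tokens
probs = next_token_distribution(hidden, W_vocab)

greedy = int(np.argmax(probs))                   # highest-probability token...
sampled = int(rng.choice(len(probs), p=probs))   # ...or sample from the distribution
print(f"{probs.sum():.3f}")  # 1.000 -- a valid probability distribution
```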
Residual connections add the input of each sub-layer to its output (x + Sublayer(x)), preventing the vanishing gradient problem in deep networks. Layer normalization stabilizes training by normalizing activations. Together, these allow transformers to be stacked much deeper than previous architectures.
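The residual-plus-normalization wiring is compact enough to show directly. This sketch uses the original paper's post-norm ordering, LayerNorm(x + Sublayer(x)), and omits the learned gain and bias for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    """Post-norm residual wiring: LayerNorm(x + Sublayer(x)).
    The identity path (x + ...) keeps gradients flowing in deep stacks."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 8))             # 4 tokens, d_model = 8
W_ff = rng.normal(size=(8, 8))
ffn = lambda h: np.maximum(h @ W_ff, 0.0)   # toy feed-forward sub-layer (ReLU)
out = residual_block(x, ffn)
print(out.shape)  # (4, 8)
```

Many later models instead apply the normalization before the sub-layer (pre-norm), which tends to stabilize very deep stacks.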
Encoder-Only vs. Decoder-Only vs. Encoder-Decoder
The original Transformer used both an encoder and decoder. Since then, researchers found that using only one half -- or keeping both -- works better for specific tasks.
Encoder-Only (e.g., BERT): Understanding & Classification
Processes the entire input bidirectionally -- every token can attend to every other token in both directions. Produces rich contextual representations of the input. Excellent at understanding meaning and extracting information, but cannot generate text autoregressively.
Decoder-Only (e.g., GPT): Text Generation (The Dominant Paradigm)
Processes tokens left-to-right with causal (masked) attention -- each token can only attend to previous tokens. Trained to predict the next token. Despite seeing only leftward context, these models develop deep understanding through massive scale and data. This is the architecture behind virtually all modern LLMs.
Encoder-Decoder (e.g., T5): Sequence-to-Sequence
Uses the full original architecture: the encoder reads the complete input bidirectionally, and the decoder generates output autoregressively while cross-attending to the encoder's representations. Naturally suited for tasks where input and output are distinct sequences.
Why did decoder-only win for generative AI? Simplicity and scalability. Decoder-only models need only one stack, making them easier to scale. More importantly, next-token prediction turned out to be a remarkably powerful training objective -- when done at sufficient scale, it produces models that can reason, translate, code, and converse, all from a single architecture trained with a single objective.
Tokenization: How Transformers Read Text
Transformers do not read characters or whole words. They operate on tokens -- sub-word units that balance vocabulary size with representational coverage.
Example: How tokenization works
Five words can become seven tokens: common words stay whole, while rarer words are split into known subwords.
Byte-Pair Encoding (BPE)
The most common tokenization algorithm, used by GPT models. Starts with individual characters, then iteratively merges the most frequent pair of adjacent tokens into a new token. After thousands of merges, you get a vocabulary of sub-word units (typically 30K-100K tokens) that efficiently represents text. Common words become single tokens; rare words are split into familiar pieces.
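The merge loop at the heart of BPE fits in a short sketch. This is a toy illustration only: real tokenizers pre-tokenize the corpus, learn merges over billions of words, and store the merge rules for reuse; the training text here is the tiny three-word string in the example.

```python
from collections import Counter

def bpe_merges(text, num_merges):
    """Toy byte-pair encoding: start from individual characters and
    repeatedly merge the most frequent adjacent pair into a new token."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Replace every occurrence of the winning pair with the merged token
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens, merges

tokens, merges = bpe_merges("low lower lowest", num_merges=3)
print(merges)   # the first merges build up 'lo', then 'low'
print(tokens)   # shared stems are now single tokens; suffixes remain split
```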
SentencePiece
A language-agnostic tokenizer that treats the input as a raw byte stream, requiring no pre-tokenization or language-specific rules. Used by T5, Llama, and many multilingual models. Supports both BPE and Unigram algorithms. Its language-agnosticism makes it especially valuable for multilingual models.
Why Tokenization Matters
Tokenization directly impacts context windows and cost. A model with a 128K-token context window can process roughly 96,000 English words. API costs are measured per token. The choice of tokenizer affects how efficiently different languages are represented -- English typically gets ~4 characters per token, while some languages may only get 1-2, meaning they "use up" the context window faster and cost more per word.
The Context Window
The context window is the maximum number of tokens a transformer can process at once -- both input and output combined. It is fundamentally limited by the O(n²) computational cost of self-attention, where n is the sequence length. Early models had 512-2048 tokens. Modern models like Claude support 200K+ tokens, enabled by architectural innovations like Flash Attention and efficient KV-caching.
Scaling Laws and the Foundation Model Era
One of the most important discoveries in modern AI: transformer performance improves predictably as you increase model size, data, and compute. These are known as scaling laws.
(Figure: approximate parameter counts over time, not to scale. GPT-4's exact count is undisclosed.)
The Scaling Laws
Research from OpenAI (Kaplan et al., 2020) and DeepMind (Hoffmann et al., 2022 -- the "Chinchilla" paper) showed that loss decreases as a power law of model size, dataset size, and compute budget. The Chinchilla finding was especially impactful: models should be trained on roughly 20 tokens per parameter. A 70B model needs ~1.4 trillion training tokens. This shifted the field from "bigger models" to "more data at the right model size."
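The 20-tokens-per-parameter rule of thumb makes the arithmetic easy to check:

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rule of thumb from the Chinchilla paper: compute-optimal training
    uses roughly 20 training tokens per model parameter."""
    return n_params * tokens_per_param

for params in (7e9, 70e9, 175e9):
    tokens = chinchilla_optimal_tokens(params)
    print(f"{params / 1e9:.0f}B params -> {tokens / 1e12:.2f}T tokens")
# 7B   -> 0.14T tokens
# 70B  -> 1.40T tokens (the figure quoted above)
# 175B -> 3.50T tokens
```

By this measure, GPT-3 (175B parameters, ~300B training tokens) was substantially under-trained for its size.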
Emergent Capabilities
At certain scale thresholds, models suddenly develop capabilities they were never explicitly trained for: chain-of-thought reasoning, few-shot learning, code generation, multilingual translation, and even basic arithmetic. These "emergent" abilities appear discontinuously -- a model at 10B parameters cannot do a task, but a model at 100B parameters can. This phenomenon remains one of the most debated topics in AI research.
Foundation Models: Pre-train Once, Fine-tune for Everything
The scaling revolution led to the foundation model paradigm: train a single massive model on broad data (pre-training), then adapt it for specific tasks through fine-tuning, RLHF (Reinforcement Learning from Human Feedback), or simply prompt engineering. One base model can become a chatbot, a code assistant, a medical advisor, or a legal analyst -- without retraining from scratch.
Beyond Text: Transformers Everywhere
The Transformer architecture was designed for language. But its core mechanism -- attention over sequences -- turned out to be universal. Today, transformers dominate nearly every modality in AI.
Vision Transformers (ViT)
Instead of processing pixels with convolutions, ViT splits an image into fixed-size patches (e.g., 16x16 pixels), flattens each patch into a vector, and processes them as a sequence -- just like tokens in a sentence. This approach matches or exceeds CNNs on image classification when trained at scale.
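The patch-extraction step is a pure reshaping operation, sketched here with NumPy (learned patch embeddings and the class token are omitted):

```python
import numpy as np

def patchify(image, patch_size):
    """Split an image (H, W, C) into a sequence of flattened patches,
    mirroring how ViT turns an image into 'tokens'."""
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image dims must divide by patch size"
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4)     # (h/p, w/p, p, p, c)
    return patches.reshape(-1, p * p * c)          # one row per patch

img = np.zeros((224, 224, 3))        # a standard ViT input size
seq = patchify(img, patch_size=16)
print(seq.shape)  # (196, 768): a 14x14 grid of patches, each 16*16*3 = 768 values
```

From here, each 768-dimensional row is linearly projected into the model dimension and processed exactly like a word embedding.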
Examples: ViT, DeiT, DINO
Audio & Speech
Audio is converted into spectrograms (visual representations of sound frequencies over time), which are then processed as sequences. OpenAI's Whisper uses an encoder-decoder transformer trained on 680,000 hours of multilingual audio to achieve human-level speech recognition.
Examples: Whisper, wav2vec 2.0
Multimodal Models
The most advanced models now process text, images, audio, and video within a single transformer architecture. Images and audio are encoded into the same embedding space as text tokens, allowing the model to reason across modalities seamlessly.
Examples: GPT-4o, Gemini, Claude (Vision)
Protein Structure
DeepMind's AlphaFold2 uses a transformer-based architecture to predict 3D protein structures from amino acid sequences. Amino acids are treated as tokens, and attention captures which residues interact in 3D space. This solved a 50-year grand challenge in biology.
Examples: AlphaFold2, ESMFold
Robotics & RL
Transformers are replacing traditional architectures in reinforcement learning and robotics. Decision Transformer frames RL as sequence modeling -- predicting the next action given a sequence of states, actions, and rewards. This approach enables training robot policies on offline data without explicit reward engineering.
Examples: Decision Transformer, RT-2
Video Generation
Video generation models treat video as sequences of image frames (or latent patches across time). Transformer-based diffusion models can generate coherent video sequences by attending across both spatial and temporal dimensions.
Examples: Sora, VideoGPT
The Future of Transformers
Transformers have dominated AI for nearly a decade. But the architecture is not standing still, and potential successors are emerging.
Mixture of Experts (MoE)
Instead of activating all parameters for every token, MoE models route each token to a small subset of specialized "expert" sub-networks. This allows models with trillions of total parameters to run at the cost of a much smaller model. Mixtral, Grok, and likely GPT-4 use this approach. MoE is not a replacement for transformers -- it is a way to make them more efficient.
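Top-k expert routing can be sketched for a single token as follows. This is a toy NumPy illustration under simplifying assumptions: the experts here are plain linear maps, the router is a single learned matrix, and load-balancing losses used in real MoE training are omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, experts, router_W, k=2):
    """Toy mixture-of-experts routing for one token vector: the router
    scores every expert, only the top-k experts run, and their outputs
    are combined weighted by the renormalized router scores."""
    scores = softmax(x @ router_W)            # one score per expert
    top_k = np.argsort(scores)[-k:]           # indices of the k best experts
    weights = scores[top_k] / scores[top_k].sum()
    # Only the selected experts are evaluated -- the source of the savings
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

rng = np.random.default_rng(4)
d = 8
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(8)]
router_W = rng.normal(size=(d, 8))
x = rng.normal(size=d)
y = moe_layer(x, experts, router_W, k=2)
print(y.shape)  # (8,): same output dimension, but only 2 of 8 experts ran
```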
State Space Models (Mamba)
SSMs like Mamba process sequences in linear time O(n) instead of the quadratic O(n²) of standard attention. They use a recurrence-like mechanism that maintains a compressed state, enabling very long sequences efficiently. While promising for certain tasks, they have not yet matched transformers at the largest scales for language understanding. Hybrid architectures combining SSMs with attention layers are an active research area.
Efficient Attention
Flash Attention (Tri Dao, 2022) restructures the attention computation to be IO-aware, reducing memory reads/writes and enabling 2-4x speedups with no approximation. Sparse attention patterns (attending to only a subset of tokens) and linear attention variants further reduce costs. These innovations have pushed context windows from 2K to 200K+ tokens.
Longer Context Windows
The push toward million-token context windows continues. Techniques like RoPE (Rotary Position Embedding) with dynamic scaling, ring attention across multiple GPUs, and architectural innovations are steadily expanding the amount of information a model can reason over in a single pass. Gemini 1.5 Pro demonstrated 1M tokens, and research continues beyond that.
Will Transformers Be Replaced?
Perhaps eventually, but not soon. Every proposed successor -- RWKV, state space models, Hyena, and others -- has struggled to match the transformer at scale. The transformer benefits from massive infrastructure investment, deep theoretical understanding, and nearly a decade of optimization. The more likely near future is hybrid architectures that combine the strengths of attention with the efficiency of alternative approaches.