
Transformer Explainer: How LLMs Work

A visual, step-by-step walkthrough of the Transformer architecture — from tokenization to next-token prediction. See attention heads, embeddings, and MLPs in action.

By Visual Explainer · 25 min read · Intermediate · Interactive Demo

So, What Actually Is a Transformer?

You've probably used ChatGPT, Gemini, or Claude. You type something in, and out comes surprisingly human-sounding text. But have you ever wondered what's actually happening under the hood?

The answer is a Transformer — a neural network architecture introduced in the legendary 2017 paper "Attention Is All You Need." It's the engine behind every modern large language model (LLM). And despite the intimidating name, the core idea is surprisingly elegant: predict the next word.

That's it. Given a sequence of words, the Transformer figures out the most probable next word. Do that over and over, and you get paragraphs, essays, even code.

Let's walk through how it works — step by step, visually. No PhD required.

Transformer Pipeline

Watch data flow through each stage

Input Text (your prompt goes in) → Tokenization (split into tokens) → Embedding (768-dim vectors) → Self-Attention (12 heads attend) → MLP (transform & refine) → Next Token (softmax → predict)

Step 1: Embedding — Turning Words Into Numbers

Computers don't understand words. They understand numbers. So the very first thing a Transformer does is convert your text into a format it can work with.

This happens in four sub-steps:

Tokenization

Your input gets chopped into tokens — usually words, but sometimes parts of words. For example, the word "empowers" might become two tokens: emp and owers. GPT-2 has a vocabulary of 50,257 unique tokens. Every possible piece of text maps to one of these.
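
To make this concrete, here's a minimal sketch using Hugging Face's GPT-2 tokenizer (assuming the transformers package is installed; exact splits can vary slightly by tokenizer version):

```python
# A sketch of GPT-2 tokenization (assumes the Hugging Face `transformers` package).
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

tokens = tokenizer.tokenize("Data visualization empowers users to")
ids = tokenizer.encode("Data visualization empowers users to")

print(tokens)  # e.g. ['Data', 'Ġvisualization', ...] -- 'Ġ' marks a leading space
print(ids)     # each token maps to an integer ID in [0, 50257)
```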

Token Embedding

Each token gets looked up in a giant table and mapped to a 768-dimensional vector (for GPT-2 small). Think of it as coordinates in a 768-dimensional space. Words with similar meanings end up near each other. "King" is close to "Queen," "dog" is close to "puppy."

Positional Encoding

Here's a problem: unlike humans reading left to right, the Transformer sees all tokens at once. It has no built-in sense of order. So we add a positional encoding — a unique pattern for each position — so the model knows that "the cat sat" is different from "sat the cat."

Final Embedding

We simply add the token embedding and the positional encoding together. The result? A single vector per token that captures both what the word means and where it appears in the sentence.
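
Here's the whole embedding step as a PyTorch sketch. The dimensions match GPT-2 small, but the weights are random stand-ins for the learned tables, and the token IDs are hypothetical:

```python
# A sketch of GPT-2's embedding step (random weights, hypothetical token IDs).
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 50257, 768, 1024

token_emb = nn.Embedding(vocab_size, d_model)  # one 768-dim row per token ID
pos_emb = nn.Embedding(max_len, d_model)       # one 768-dim row per position (GPT-2 learns these)

ids = torch.tensor([[6601, 32704, 795, 12595, 2985, 284]])  # hypothetical IDs for 6 tokens
positions = torch.arange(ids.size(1)).unsqueeze(0)          # [[0, 1, 2, 3, 4, 5]]

x = token_emb(ids) + pos_emb(positions)  # final embedding: what + where
print(x.shape)  # torch.Size([1, 6, 768])
```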

Embedding Pipeline

See how text becomes numbers the model understands

"Data visualization empowers users to"

Start with the original prompt

Step 2: Self-Attention — The Secret Sauce

This is the core innovation. Self-attention is what makes Transformers so powerful compared to older architectures like RNNs.

Here's the intuition: when you read the sentence "The bank of the river was steep," you know "bank" means a riverbank, not a financial institution. How? Because you looked at the word "river." Self-attention does exactly this — it lets each token look at every other token to figure out context.

Query, Key, Value — A Search Engine Analogy

Each token produces three vectors from its embedding:

  • Query (Q) — "What am I looking for?" Like a search query.
  • Key (K) — "What do I contain?" Like a page title in search results.
  • Value (V) — "Here's my actual content." Like the page body.

The model computes how well each Query matches each Key (dot product), then uses those scores to decide how much of each Value to incorporate. High match? Pay lots of attention. Low match? Mostly ignore it.
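
Here's a minimal single-head sketch of that computation in PyTorch (random weights, not GPT-2's learned ones; GPT-2 actually scales by √64 per 64-dim head, while this full-width sketch scales by √768):

```python
# A sketch of single-head scaled dot-product attention.
import torch
import torch.nn.functional as F

d_model = 768
x = torch.randn(6, d_model)  # embeddings for 6 tokens

W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / (d_model ** 0.5)  # how well each Query matches each Key
weights = F.softmax(scores, dim=-1)  # rows sum to 1: "how much attention to pay"
output = weights @ V                 # blend the Values by those weights
print(output.shape)  # torch.Size([6, 768])
```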

Multi-Head: 12 Perspectives at Once

GPT-2 doesn't run attention once — it runs 12 heads in parallel. Each head works on its own 64-dim slice of the 768-dim vector (12 × 64 = 768) and learns different things. One head might focus on syntax (subject–verb), another on coreference ("it" refers to "the dog"), another on nearby words.
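
In code, the split is just a reshape. A shapes-only sketch:

```python
# How 768 dims split into 12 heads of 64 dims each (shapes only, no learned weights).
import torch

n_heads, d_head = 12, 64  # 12 * 64 = 768
Q = torch.randn(6, 768)   # queries for 6 tokens

# Give each head its own 64-dim slice, then move heads to the front:
Q_heads = Q.view(6, n_heads, d_head).transpose(0, 1)
print(Q_heads.shape)  # torch.Size([12, 6, 64]) -- 12 independent attention computations
```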

Causal Masking

There's a critical constraint: when generating text, a token can only attend to tokens before it — never to future tokens. This is enforced by a causal mask that sets future attention scores to negative infinity before softmax. The model has to predict without peeking ahead.
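
A sketch of the mask in PyTorch:

```python
# Causal mask: future positions get -inf before softmax, so they receive
# zero attention weight.
import torch
import torch.nn.functional as F

scores = torch.randn(6, 6)                                 # raw attention scores
mask = torch.triu(torch.ones(6, 6, dtype=torch.bool), 1)   # True above the diagonal (future)
scores = scores.masked_fill(mask, float("-inf"))

weights = F.softmax(scores, dim=-1)
print(weights[0])  # the first token can only attend to itself: [1., 0., 0., 0., 0., 0.]
```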

Self-Attention Heatmap

Click a token to see what it attends to

           Data   visual...  emp    owers  users  to
Data       0.25   -∞         -∞     -∞     -∞     -∞
visual...  0.14   0.33       -∞     -∞     -∞     -∞
emp        0.09   0.15       0.40   -∞     -∞     -∞
owers      0.10   0.12       0.11   0.25   -∞     -∞
users      0.17   0.05       0.04   0.16   0.42   -∞
to         0.02   0.07       0.18   0.09   0.49   0.37

Row = query token, Column = key token. The causal mask prevents tokens from attending to future positions (upper triangle set to -∞ before softmax).

Step 3: The MLP — Thinking It Over

After attention has routed information between tokens, each token passes through a Multi-Layer Perceptron (MLP). Think of attention as a group discussion and the MLP as individual reflection afterward.

The MLP does three things in sequence:

  1. Expand: A linear layer projects each token from 768 dimensions to 3,072 (4× expansion). This gives the model a much larger space to work in.
  2. GELU Activation: A non-linear function (GELU) is applied. This is what lets the model learn complex, non-linear patterns. Without it, stacking layers would be pointless — it'd just be one big linear transformation.
  3. Compress: Another linear layer squishes it back down to 768. The useful patterns survive; the noise doesn't.

Unlike attention, the MLP processes each token independently. It doesn't look at other tokens — it just refines each one's representation.
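
Here's the whole MLP as a PyTorch sketch (dimensions match GPT-2 small; weights are random stand-ins for the learned ones):

```python
# A sketch of one GPT-2 MLP block.
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(768, 3072),  # expand: 4x more room to work in
    nn.GELU(),             # non-linearity -- the reason stacking layers helps
    nn.Linear(3072, 768),  # compress back down to 768
)

x = torch.randn(6, 768)  # 6 tokens from the attention layer
out = mlp(x)             # each token is transformed independently
print(out.shape)         # torch.Size([6, 768])
```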

MLP: Expand → Activate → Compress

The feed-forward network that enriches each token

Input (768d) → Expand (3072d) → GELU (3072d) → Compress (768d)

The input is each token's representation from the attention layer (768 dimensions).

Step 4: Stack It 12 Times

Here's where depth matters. GPT-2 (small) has 12 Transformer blocks stacked on top of each other. Each block is the same architecture: attention → MLP. But each block has its own learned weights.

As tokens pass through successive blocks, their representations get richer. Early layers might capture basic syntax. Middle layers capture semantic relationships. Later layers capture complex reasoning and world knowledge.

Two helper mechanisms keep training stable:

  • Residual Connections: Each block's output is added to its input. This creates a shortcut that lets gradients flow backward easily, preventing the vanishing gradient problem.
  • Layer Normalization: Applied before attention and before the MLP, it normalizes the values to have consistent mean and variance, which stabilizes training dramatically.
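
Putting the pieces together, here's a sketch of one pre-norm block and a 12-block stack in PyTorch. It's a shape-faithful toy, not GPT-2 itself: nn.MultiheadAttention stands in for the attention described above, and the causal mask is omitted for brevity.

```python
# A sketch of one pre-norm Transformer block, GPT-2 style (causal mask omitted).
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=768):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)  # normalize before attention
        self.ln2 = nn.LayerNorm(d_model)  # normalize before the MLP
        self.attn = nn.MultiheadAttention(d_model, num_heads=12, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual: output added to input
        x = x + self.mlp(self.ln2(x))                      # residual again
        return x

blocks = nn.Sequential(*[Block() for _ in range(12)])  # GPT-2 small: 12 stacked blocks
x = torch.randn(1, 6, 768)
print(blocks(x).shape)  # torch.Size([1, 6, 768])
```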

Step 5: Predicting the Next Token

After all 12 blocks have processed the input, we need to actually pick the next word. This happens in two steps:

  1. Linear projection: The final 768-dim vector is projected to a 50,257-dimensional vector — one score (called a logit) per token in the vocabulary.
  2. Softmax: These logits are converted to probabilities that sum to 1. Now we have a probability distribution over every possible next word.
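
A sketch of those two steps (the projection matrix here is random; GPT-2 actually reuses the token embedding matrix for this projection, a trick called weight tying):

```python
# Final projection to vocabulary logits, then softmax.
import torch
import torch.nn.functional as F

d_model, vocab_size = 768, 50257
unembed = torch.randn(d_model, vocab_size)  # random stand-in for the projection

final_vec = torch.randn(d_model)   # the last token's vector after all 12 blocks
logits = final_vec @ unembed       # one raw score (logit) per vocabulary token
probs = F.softmax(logits, dim=-1)  # probabilities that sum to 1
print(probs.shape, probs.sum())    # torch.Size([50257]) tensor(1.)
```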

Temperature: Controlling Creativity

Before softmax, we can divide the logits by a temperature value:

  • Temperature = 1: Default behavior. The probabilities reflect the model's true confidence.
  • Temperature < 1: Makes the distribution sharper. The model becomes more confident and predictable — almost always picking the top token.
  • Temperature > 1: Flattens the distribution. The model becomes more "creative" and unpredictable, sometimes picking surprising words.
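
You can see the effect in a few lines of PyTorch:

```python
# Temperature: divide logits by T before softmax.
import torch
import torch.nn.functional as F

logits = torch.tensor([3.0, 2.0, 1.0])
for T in (0.5, 1.0, 2.0):
    print(T, F.softmax(logits / T, dim=-1))
# T=0.5 sharpens toward the top token; T=2.0 flattens toward uniform.
```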

Top-k and Top-p Sampling

Beyond temperature, you can further control randomness:

  • Top-k: Only consider the top k most likely tokens. Everything else gets zeroed out.
  • Top-p (nucleus): Keep the smallest set of tokens whose cumulative probability exceeds p. This adapts to the distribution shape.
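
Here's a sketch of both filters applied to the demo's probability distribution:

```python
# Top-k and top-p (nucleus) filtering over a probability distribution.
import torch

probs = torch.tensor([0.565, 0.188, 0.139, 0.057, 0.025, 0.014, 0.008, 0.005])

# Top-k: keep only the k most likely tokens, renormalize.
k = 3
vals, idx = probs.topk(k)
topk = torch.zeros_like(probs)
topk[idx] = vals / vals.sum()

# Top-p: keep the smallest prefix whose cumulative probability exceeds p.
p = 0.9
sorted_probs, sorted_idx = probs.sort(descending=True)
keep = sorted_probs.cumsum(0) - sorted_probs < p  # include the token that crosses p
nucleus = torch.zeros_like(probs)
nucleus[sorted_idx[keep]] = sorted_probs[keep] / sorted_probs[keep].sum()

print(topk)     # mass concentrated on the top 3 tokens
print(nucleus)  # adapts: more tokens survive when the distribution is flat
```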

Temperature & Sampling

Drag the slider to see how temperature reshapes the probability distribution

The demo's temperature slider runs from 0.1 (deterministic) through 1.0 to 2.0 (creative), and its top-k slider from 1 (greedy) to 8 (all).

Next-token probabilities for "Data visualization empowers users to ___" at T = 1.00 (neutral, default behavior):

create       56.5%
make         18.8%
understand   13.9%
explore       5.7%
build         2.5%
see           1.4%
find          0.8%
transform     0.5%

Putting It All Together

Let's recap the full journey of a single prediction:

  1. You type: "Data visualization empowers users to"
  2. Tokenize: Split into 6 tokens
  3. Embed: Each token → 768-dim vector + position
  4. 12 Transformer blocks: Each runs attention (12 heads) → MLP, with residual connections
  5. Project: Final vector → 50,257 logits
  6. Sample: Apply temperature, top-k, softmax → pick "create"
  7. Repeat: Append "create" to the prompt, run the whole thing again for the next token
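
Here's the full loop as a runnable sketch using Hugging Face's GPT-2 (assuming the transformers and torch packages are installed; greedy decoding for simplicity):

```python
# The full pipeline: tokenize, run the model, pick a token, append, repeat.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer.encode("Data visualization empowers users to", return_tensors="pt")

for _ in range(5):                     # generate 5 tokens
    with torch.no_grad():
        logits = model(ids).logits     # [1, seq_len, 50257]
    next_id = logits[0, -1].argmax()   # greedy: pick the top logit
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # append and run again

print(tokenizer.decode(ids[0]))
```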

That's it. Every time you chat with an LLM, this entire pipeline runs once for each word it generates. GPT-2 small does it with 124 million parameters. GPT-4 reportedly uses over a trillion. The architecture is the same — just bigger.

Why This Matters

Understanding Transformers isn't just academic curiosity. If you're building with AI — writing prompts, fine-tuning models, or designing AI-powered products — knowing how the model thinks helps you work with it, not against it.

Temperature too high? Your chatbot rambles. Attention heads struggling with long context? Your summarizer misses key details. The architecture explains the behavior.


Inspired by Transformer Explainer by the Polo Club of Data Science at Georgia Tech — an incredible interactive tool that runs a live GPT-2 model in your browser.