KV Cache Explained

Understand how KV caching makes LLM inference fast and cheap — transformer attention mechanics, prompt prefix caching, Anthropic vs OpenAI implementations, PagedAttention, and GPU memory optimization with interactive demos.

By Visual Explainer · 35 min read · Intermediate · Interactive Demo

Why Your Second Message Is Faster

When you send a second message in a conversation, it often arrives much faster than the first. This isn't magic — it's KV caching. The model skips recomputing attention for tokens it has already processed, saving both time and money.

The Race: Cold vs Warm Request

Watch both requests process simultaneously. The KV cache makes the second request skip recomputation.

First message (cold): 2.1s + 0.8s + 1.1s ≈ 4.0s total
Second message (warm): 0.0s + 0.8s + 1.1s ≈ 1.9s total (the shared prefix is read from the cache instead of recomputed)

Cold request cost: $0.042 (full computation)
Warm request cost: $0.005 (88% cheaper with cache)

Scale Impact Calculator

See how much KV caching saves at scale.

Inputs: system prompt 4,000 tokens, 10,000 daily active users, $3.00 per 1M input tokens

Monthly cost without caching: $75,600 (6.0M requests × 4,200 tokens)
Monthly cost with caching: $3,960 (cache reads at 10% cost)
You save: $71,640/month ($859,680/year)
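The arithmetic behind this calculator can be sketched in a few lines of Python. The request volume (about 20 requests per user per day), the 10% cache-read rate, and the practice of charging every request's prefix at that rate are simplifying assumptions, so the cached figure will not match the calculator exactly; the uncached figure does.

def monthly_prompt_cost(system_tokens, user_tokens, requests_per_month,
                        price_per_mtok=3.00, cache_read_discount=0.10,
                        cached=False):
    """Monthly input-token cost, with or without the system prompt read from cache."""
    prefix_rate = price_per_mtok * (cache_read_discount if cached else 1.0)
    per_request = (system_tokens * prefix_rate + user_tokens * price_per_mtok) / 1e6
    return per_request * requests_per_month

requests = 10_000 * 20 * 30      # 10k DAU x ~20 requests/day x 30 days = 6.0M requests
cold = monthly_prompt_cost(4_000, 200, requests)
warm = monthly_prompt_cost(4_000, 200, requests, cached=True)
print(f"without caching: ${cold:,.0f}/month")   # $75,600
print(f"with caching:    ${warm:,.0f}/month")
print(f"monthly savings: ${cold - warm:,.0f}")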

How Transformers Compute Attention

At the heart of every LLM is the attention mechanism. Each new token must attend to all previous tokens, which requires Query, Key, and Value vectors. The Query is needed only for the new token, but without caching the Keys and Values of every previous token are recomputed on every generation step. The KV cache stores those vectors so each token's K and V are computed exactly once.

Transformer Attention with KV Cache

Step through token generation. See how the KV cache avoids recomputing previous tokens.

Transformer Block, step 1 of 6: the first input token ("The") arrives. Q is computed for the new token only, while K0 and V0 are computed and written into the KV cache buffer. Attention compares Q["The"] against every cached Key, and the output is a weighted sum of Values that feeds the next-token prediction. At this step only K0/V0 are computed and 0 pairs are reused from the cache; from step 2 onward, every earlier token's K/V comes straight out of the cache instead of being recomputed.
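To make the bookkeeping concrete, here is a toy single-head attention step in NumPy that mirrors the step-through demo above. The sizes and random weight matrices are placeholders, not a real model configuration; a real transformer runs this per head, per layer.

import numpy as np

d_model = d_head = 64                     # toy sizes (assumptions, not a real model config)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

k_cache, v_cache = [], []                 # per-layer, per-head KV cache for one sequence

def generate_step(x_new):
    """Process one new token embedding, reusing cached K/V for all earlier tokens."""
    q = x_new @ W_q                       # Query: computed for the new token only
    k_cache.append(x_new @ W_k)           # Key for the new token, appended to the cache
    v_cache.append(x_new @ W_v)           # Value for the new token, appended to the cache
    K, V = np.stack(k_cache), np.stack(v_cache)   # earlier rows come straight from the cache
    scores = K @ q / np.sqrt(d_head)      # new token attends to every cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                    # weighted sum of Values feeds the next-token prediction

for step in range(6):                     # mirrors the 6-step demo above
    _ = generate_step(rng.normal(size=d_model))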

KV Cache Size Calculator

Calculate how much GPU memory the KV cache consumes for a given model architecture.

Example configuration: 32 layers, 32 attention heads, head dimension 128, sequence length 4,096, fp16 precision (2 bytes per value)

Formula: 2 (K and V) × 32 layers × 32 heads × 128 dim × 4,096 seq × 2 bytes ≈ 2.15 GB per request

On an A100 with 80 GB, a single request's KV cache at this length occupies roughly 2.7% of GPU memory.
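The same formula as a small Python helper, with defaults matching the calculator's example configuration (the shapes themselves are just an illustrative assumption):

def kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=4096, bytes_per_value=2):
    """Per-request KV cache size: 2 (K and V) x layers x heads x head_dim x seq_len x bytes."""
    return 2 * layers * heads * head_dim * seq_len * bytes_per_value

size = kv_cache_bytes()
print(f"{size / 1e9:.2f} GB per request")        # 2.15 GB
print(f"{size / 80e9:.1%} of an 80 GB A100")     # ~2.7%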

Prompt Prefix Caching — How It Works

Prompt prefix caching takes KV caching to the API level. If multiple requests share the same system prompt or context prefix, the provider caches those KV vectors server-side. Subsequent requests skip recomputing the shared prefix entirely and pay only a fraction of the normal price for the cached tokens (10% on Anthropic, 50% on OpenAI).

Two-Request Sequence

Watch how the second request skips recomputing the cached prefix — only the user message is processed.

Request 1: System Prompt (2,000 tokens) + Retrieved Docs (5,000 tokens) + User Message (80 tokens), all computed and written to the cache.
Request 2: System Prompt + Retrieved Docs served from the cache; only the User Message is computed.

Cache Breakpoint Optimizer

Drag the breakpoint to optimize cache placement. Static content goes left, dynamic content stays right.

Segments, left to right: System Prompt, Few-Shot Examples, Retrieved Context, User Message. At the ✂️ breakpoint position shown, the cached prefix is 4,956 tokens and the dynamic suffix is 2,124 tokens computed fresh on every request, for a cache savings of 63%.

API Implementation

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful coding assistant...",
            "cache_control": {"type": "ephemeral"}  # ← Cache this
        }
    ],
    messages=[
        {"role": "user", "content": "Explain KV caching"}
    ]
)

# Check cache usage in response
print(response.usage)
# cache_creation_input_tokens: 2048  (first call)
# cache_read_input_tokens: 2048     (subsequent calls)
# input_tokens: 15                   (uncached tokens)

Usage Breakdown

cache_creation_input_tokens: 2,048
cache_read_input_tokens: 2,048
input_tokens: 15

On a cache hit, the 2,048 cached tokens are read at 10% of the normal price: roughly $0.0006 instead of $0.006 (at $3 per 1M input tokens).
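The breakpoint optimizer above maps directly onto multiple cache_control markers: Anthropic allows up to four per request, so each static segment can end at its own breakpoint. A sketch with hypothetical placeholder strings for the segments:

import anthropic

client = anthropic.Anthropic()

# Hypothetical segment contents; the layout mirrors the breakpoint optimizer above.
SYSTEM_PROMPT = "You are a helpful coding assistant..."
FEW_SHOT_EXAMPLES = "Example 1: ...\nExample 2: ..."
RETRIEVED_DOCS = "[long retrieved context that is stable across requests]"
user_message = "Explain KV caching"

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {"type": "text", "text": SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},      # breakpoint 1: system prompt
        {"type": "text", "text": FEW_SHOT_EXAMPLES,
         "cache_control": {"type": "ephemeral"}},      # breakpoint 2: few-shot examples
    ],
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": RETRIEVED_DOCS,
             "cache_control": {"type": "ephemeral"}},  # breakpoint 3: retrieved context
            {"type": "text", "text": user_message},    # dynamic suffix, computed fresh
        ]},
    ],
)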

Anthropic vs OpenAI Cache Implementation

Both major providers offer prompt caching, but with fundamentally different approaches. Anthropic gives you explicit control with cache_control markers and 90% savings. OpenAI automatically detects repeated prefixes with 50% savings. The right choice depends on your usage pattern.

Anthropic vs OpenAI — Cache Implementation

A feature-by-feature comparison of how each provider implements prompt caching.

Anthropic: Explicit Cache Control · OpenAI: Automatic Prefix Caching

Activation Method
Anthropic 🏷️: explicit cache_control marker; you choose exactly what to cache.
OpenAI 🤖: automatic prefix detection; the system detects repeated prefixes on its own.

Minimum Cacheable Length
Anthropic: 1,024-token minimum (2,048 for Haiku models); shorter prompts are not cached.
OpenAI ⚠️: 1,024-token minimum; short prompts cannot be cached.

Cache Duration
Anthropic ⏱️: 5 minutes, refreshed on every hit; a predictable TTL that resets with each use.
OpenAI 🔮: minutes to around an hour, variable; no guaranteed TTL.

Read Cost Discount
Anthropic 💰: 90% savings (reads cost 10% of normal); pay 10¢ per $1 of cached tokens.
OpenAI 💵: 50% savings (reads cost 50% of normal); pay 50¢ per $1 of cached tokens.

Cache Visibility
Anthropic: explicit token counts in the response (cache_creation_input_tokens and cache_read_input_tokens).
OpenAI: cached_tokens reported under prompt_tokens_details in the usage object.

Max Breakpoints
Anthropic 4️⃣: up to 4 breakpoints per request; fine-grained control over cache segments.
OpenAI 1️⃣: 1 implicit prefix; only the longest common prefix is cached.

Write Cost
Anthropic 📝: 25% premium on the first write, amortized over subsequent reads.
OpenAI 🆓: no write premium; cache writes are free.

🔑 Key Takeaway: Anthropic gives you explicit control with 90% savings and full visibility. OpenAI is zero-effort with 50% savings but less control. For high-volume applications with stable system prompts, Anthropic's explicit caching saves significantly more.

Which Strategy Saves More?

Example inputs: static prefix 5,000 tokens, dynamic suffix 500 tokens, 10,000 requests/day, selectable model tier

Anthropic (explicit cache): $900/month
OpenAI (auto cache): $900/month
Difference at these settings: $0/month; adjust the model tier and traffic to see which provider comes out cheaper.
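A rough version of this comparison in Python, under simplifying assumptions: both providers are priced at the same $3 per 1M input-token base rate (the calculator lets each use a different model tier), every request after the first hits the cache, and Anthropic's figures use 10% reads plus a 25% write premium while OpenAI's use 50% reads with no premium.

def monthly_cost(static_tok, dynamic_tok, req_per_day, price_per_mtok,
                 read_discount, write_premium=0.0, days=30):
    """Simplified monthly input cost with a cached static prefix (not exact provider billing)."""
    requests = req_per_day * days
    first = (static_tok * (1 + write_premium) + dynamic_tok) * price_per_mtok / 1e6
    rest = (static_tok * read_discount + dynamic_tok) * price_per_mtok / 1e6
    return first + (requests - 1) * rest

anthropic_cost = monthly_cost(5_000, 500, 10_000, 3.00, read_discount=0.10, write_premium=0.25)
openai_cost    = monthly_cost(5_000, 500, 10_000, 3.00, read_discount=0.50)
print(f"Anthropic (explicit cache): ${anthropic_cost:,.0f}/month")   # ~ $900
print(f"OpenAI (automatic cache):   ${openai_cost:,.0f}/month")      # higher at the same base price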

KV Cache in Inference Infrastructure

At the infrastructure level, the KV cache is the #1 memory bottleneck for serving LLMs. A single long-context request can consume gigabytes of GPU memory. PagedAttention (used by vLLM) solves this with virtual-memory-style paging — eliminating fragmentation and serving 40% more concurrent requests.

GPU Memory Partition (A100 80GB)

As sequence length grows, KV cache consumes more memory — reducing concurrency.

Sequence length: 4,096 tokens

Model weights: 35.0 GB
KV cache (one 4,096-token request): 20.5 GB
Activations: 2.0 GB
Overhead: 1.0 GB
Free: 21.5 GB

KV cache per request: 20.5 GB
Max concurrent requests: 1 (21.5 GB free ÷ 20.5 GB per request)
⚠️ The KV cache alone consumes 26% of GPU memory, a severe concurrency bottleneck.

PagedAttention (vLLM)

Traditional KV cache uses contiguous blocks with fragmentation waste. PagedAttention uses fixed-size pages — like virtual memory for attention.

Requests: 4

❌ Without PagedAttention (contiguous): each request reserves a contiguous KV region, with wasted gaps between Req 1, Req 2, Req 3, and Req 4; fragmentation wastes roughly 35% of the memory.
✅ With PagedAttention (paged): Req 1 uses 4 pages, Req 2 uses 5, Req 3 uses 3, Req 4 uses 8, all packed into fixed-size pages with no fragmentation.

Result: roughly 40% more requests served with the same memory, 4 → ~5 concurrent requests in this example.
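A minimal sketch of the paging idea, not vLLM's actual data structures: KV blocks are fixed-size pages handed out from a shared free pool, and each request keeps a block table mapping its logical token positions to physical pages. Page size, pool size, and request lengths below are made-up numbers.

PAGE_TOKENS = 16                          # tokens per page (vLLM calls these "blocks")

class PagedKVCache:
    """Toy paged KV allocator: fixed-size pages from a shared pool plus per-request block tables."""
    def __init__(self, total_pages):
        self.free_pages = list(range(total_pages))
        self.block_tables = {}            # request id -> list of physical page ids
        self.lengths = {}                 # request id -> number of tokens stored

    def append_token(self, req_id):
        """Reserve KV space for one new token; allocate a page only when the last one is full."""
        n = self.lengths.get(req_id, 0)
        table = self.block_tables.setdefault(req_id, [])
        if n % PAGE_TOKENS == 0:          # last page is full (or request is new)
            if not self.free_pages:
                raise MemoryError("KV pool exhausted; request must wait or be preempted")
            table.append(self.free_pages.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        """Return a finished request's pages to the pool; no contiguous-allocation holes remain."""
        self.free_pages.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

cache = PagedKVCache(total_pages=512)
for req_id, prompt_len in enumerate([900, 1300, 700, 2000]):   # four concurrent requests (illustrative)
    for _ in range(prompt_len):
        cache.append_token(req_id)
print({r: len(t) for r, t in cache.block_tables.items()})       # pages allocated per request

When a request finishes, its pages go straight back to the shared pool, so any other request can reuse that memory immediately instead of leaving fragmented gaps.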

Inference Server Capacity Planner

Example inputs: 80 GB GPU, 68 GB model, average sequence length 4,096, target concurrency 32

Memory partition: 68 GB model weights, plus KV cache, activations, and free space.

Without PagedAttention: 1 max concurrent request
With PagedAttention: 1 max concurrent request
⚠️ Target concurrency (32) exceeds capacity (1). Consider shorter sequences, a smaller model, or more GPU memory.
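The planner's arithmetic can be approximated with a small function. The per-request KV size reuses the earlier formula with the same assumed shapes (which do not necessarily correspond to a 68 GB model), and the fragmentation and utilization factors are illustrative guesses rather than measured values.

def max_concurrency(gpu_gb=80, model_gb=68, activations_gb=2, overhead_gb=1,
                    layers=32, heads=32, head_dim=128, seq_len=4096,
                    bytes_per_value=2, kv_utilization=1.0):
    """Rough concurrency estimate: free memory divided by effective per-request KV size."""
    kv_gb = 2 * layers * heads * head_dim * seq_len * bytes_per_value / 1e9
    free_gb = gpu_gb - model_gb - activations_gb - overhead_gb
    return int(free_gb // (kv_gb / kv_utilization)), kv_gb

contiguous, kv = max_concurrency(kv_utilization=0.65)   # ~35% lost to fragmentation
paged, _       = max_concurrency(kv_utilization=0.95)   # pages keep memory nearly fully used
print(f"KV cache per request: {kv:.2f} GB")
print(f"contiguous allocation: {contiguous} concurrent requests")
print(f"PagedAttention:        {paged} concurrent requests")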