KV Cache Explained
Understand how KV caching makes LLM inference fast and cheap — transformer attention mechanics, prompt prefix caching, Anthropic vs OpenAI implementations, PagedAttention, and GPU memory optimization with interactive demos.
Why Your Second Message Is Faster
When you send a second message in a conversation, it often arrives much faster than the first. This isn't magic — it's KV caching. The model skips recomputing attention for tokens it has already processed, saving both time and money.
The Race: Cold vs Warm Request
Watch both requests process simultaneously. The KV cache makes the second request skip recomputation.
Cold Request Cost
$0.042
Full computation
Warm Request Cost
$0.005
88% cheaper with cache
Scale Impact Calculator
See how much KV caching saves at scale.
Monthly Cost Without Caching
$75,600
6.0M requests × 4,200 tokens
Monthly Cost With Caching
$3,960
Cache read at 10% cost
You Save
$71,640/month
$859,680/year
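To sanity-check figures like these for your own workload, here is a minimal sketch of the input-token arithmetic. The $3-per-million-token base price, the 10% cache-read rate, and the cached-fraction value are assumptions for illustration; the calculator above may use different ones.

def monthly_prompt_cost(requests, prompt_tokens, cached_fraction,
                        price_per_mtok=3.00, cache_read_mult=0.10):
    # Input-token cost only; output tokens are unaffected by prompt caching.
    cached = prompt_tokens * cached_fraction        # tokens served from the cache
    fresh = prompt_tokens - cached                  # tokens computed from scratch
    per_request = (cached * cache_read_mult + fresh) * price_per_mtok / 1e6
    return requests * per_request

baseline = monthly_prompt_cost(6_000_000, 4_200, cached_fraction=0.0)   # ≈ $75,600
with_cache = monthly_prompt_cost(6_000_000, 4_200, cached_fraction=0.9)
print(f"monthly savings: ${baseline - with_cache:,.0f}")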
How Transformers Compute Attention
At the heart of every LLM is the attention mechanism. Each new token must attend to ALL previous tokens, which requires a Query vector for the new token plus Key and Value vectors for every token that came before it. Without caching, this means recomputing K and V for every previous token on every generation step. The KV cache stores these vectors so they're only computed once.
Transformer Attention with KV Cache
Step through token generation. See how the KV cache avoids recomputing previous tokens.
Transformer Block — Step 1/6
Input Tokens
Q (Query) — new token only
K (Keys) — all tokens
V (Values) — all tokens
Attention: Q[The] attends to all Keys
Output: weighted sum of Values → next token prediction
KV Cache Buffer
✅ Only K0/V0 computed. 0 pairs reused from cache.
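The same bookkeeping in code: a toy single-head attention decode loop (NumPy, with made-up dimensions and random weights) that computes K and V once per token and reuses them from the cache on every later step. This is an illustrative sketch, not a real transformer layer.

import numpy as np

d_model, n_tokens = 64, 6
rng = np.random.default_rng(0)
Wq = rng.standard_normal((d_model, d_model))   # projection weights (random stand-ins)
Wk = rng.standard_normal((d_model, d_model))
Wv = rng.standard_normal((d_model, d_model))

k_cache, v_cache = [], []                      # the KV cache: one entry per processed token
for step in range(n_tokens):
    x = rng.standard_normal(d_model)           # embedding of the newest token (stand-in)
    q = x @ Wq                                 # Query: computed for the new token only
    k_cache.append(x @ Wk)                     # Key:   computed once, then cached
    v_cache.append(x @ Wv)                     # Value: computed once, then cached
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d_model)          # new Query attends to ALL cached Keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over previous tokens
    out = weights @ V                          # weighted sum of cached Values
    print(f"step {step}: computed 1 new K/V pair, reused {step} from cache")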
KV Cache Size Calculator
Calculate how much GPU memory the KV cache consumes for a given model architecture.
Formula
2 × 32 layers × 32 heads × 128 dim × 4,096 seq × 2 bytes = 2.15 GB per request
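The same formula as a small helper, if you want to plug in other architectures (the leading 2 accounts for the K and V tensors, and 2 bytes per value assumes fp16/bf16 precision):

def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_value=2):
    # K and V tensors (hence the factor of 2), one vector per head per position per layer
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_value

size = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 1e9:.2f} GB per request")   # -> 2.15 GB, matching the figure above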
Prompt Prefix Caching — How It Works
Prompt prefix caching takes KV caching to the API level. If multiple requests share the same system prompt or context prefix, the provider caches those KV vectors server-side. Subsequent requests skip recomputing the shared prefix entirely, paying only a fraction of the normal price for cached tokens (10% on Anthropic, 50% on OpenAI).
Two-Request Sequence
Watch how the second request skips recomputing the cached prefix — only the user message is processed.
Cache Breakpoint Optimizer
Drag the breakpoint to optimize cache placement. Static content goes left, dynamic content stays right.
Cache Savings
63%
Dynamic (per request)
2,124
tokens computed fresh
API Implementation
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful coding assistant...",
            "cache_control": {"type": "ephemeral"}  # ← Cache this
        }
    ],
    messages=[
        {"role": "user", "content": "Explain KV caching"}
    ]
)
# Check cache usage in response
print(response.usage)
# cache_creation_input_tokens: 2048 (first call)
# cache_read_input_tokens: 2048 (subsequent calls)
# input_tokens: 15 (uncached tokens)
Usage Breakdown (Pie Chart)
On cache hit: 2048 tokens read at 10% cost = $0.0006 instead of $0.006
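Building on the response above, a rough sketch of the per-call cost math. The prices are assumptions (roughly Sonnet-class input pricing of $3 per million tokens, cache reads at 10%, cache writes at 125%); check current pricing before relying on the numbers.

BASE = 3.00 / 1_000_000                                        # assumed $ per regular input token
u = response.usage
cost = ((u.input_tokens or 0) * BASE
        + (u.cache_read_input_tokens or 0) * BASE * 0.10       # cached prefix, 90% off
        + (u.cache_creation_input_tokens or 0) * BASE * 1.25)  # one-time write premium
print(f"estimated input cost this call: ${cost:.4f}")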
Anthropic vs OpenAI Cache Implementation
Both major providers offer prompt caching, but with fundamentally different approaches. Anthropic gives you explicit control with cache_control markers and 90% savings. OpenAI automatically detects repeated prefixes with 50% savings. The right choice depends on your usage pattern.
Anthropic vs OpenAI — Cache Implementation
Not a simple table — a visual feature-by-feature comparison.
Anthropic
Explicit Cache Control
OpenAI
Automatic Prefix Caching
Explicit cache_control marker
You choose exactly what to cache
Automatic prefix detection
System detects repeated prefixes
1,024 token minimum (higher for Haiku models)
Shorter prompts aren't cached even when marked with cache_control
1,024 token minimum
Short prompts cannot be cached
5 minutes, refreshes on hit
Predictable TTL, resets each use
Minutes to hours (variable)
No guaranteed TTL, opaque behavior
90% savings (10% of normal)
Pay only 10¢ per $1 of cached tokens
50% savings (50% of normal)
Pay 50¢ per $1 of cached tokens
Explicit token counts in response
cache_read_input_tokens in API response
cached_tokens count in response
usage.prompt_tokens_details.cached_tokens; cache writes aren't broken out
4 breakpoints per request
Fine-grained control over cache segments
1 implicit prefix
Only the longest common prefix is cached
25% premium on first write
Amortized over subsequent reads
No write premium
Cache writes are free
🔑 Key Takeaway: Anthropic gives you explicit control with 90% savings and full visibility. OpenAI is zero-effort with 50% savings but less control. For high-volume applications with stable system prompts, Anthropic's explicit caching saves significantly more.
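As a concrete example of that fine-grained control, the up-to-four breakpoints can cache nested prefixes independently, so a stable system prompt stays cached even when a large reference document in the next block changes. A sketch reusing the client from the example above; the document text and question are placeholders:

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {"type": "text",
         "text": "You are a helpful coding assistant...",
         "cache_control": {"type": "ephemeral"}},          # breakpoint 1: stable system prompt
    ],
    messages=[
        {"role": "user", "content": [
            {"type": "text",
             "text": "<several thousand tokens of reference docs>",
             "cache_control": {"type": "ephemeral"}},       # breakpoint 2: large, less stable prefix
            {"type": "text",
             "text": "Explain KV caching using the docs above."},
        ]}
    ],
)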
Which Strategy Saves More?
Anthropic (Explicit Cache)
$900
/month
OpenAI (Auto Cache)
$900
/month
Difference
$0/month saved with OpenAI
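To run this comparison for your own traffic, here is a minimal sketch of the math. Every number below is an assumption: both providers are given the same $3-per-million-token base input price so that only the cache discount and write premium differ, and the shared prefix is assumed to stay warm in the cache between requests.

def monthly_cost(requests, cached_tokens, fresh_tokens, price_per_mtok,
                 cache_read_mult, cache_write_mult=1.0, cache_writes=1):
    cached = cached_tokens * price_per_mtok / 1e6   # $ per request for the shared prefix
    fresh = fresh_tokens * price_per_mtok / 1e6     # $ per request for per-user content
    return (cache_writes * cached * cache_write_mult                # writing the prefix to cache
            + (requests - cache_writes) * cached * cache_read_mult  # reading it back
            + requests * fresh)                                     # always computed fresh

workload = dict(requests=1_000_000, cached_tokens=3_000, fresh_tokens=500,
                price_per_mtok=3.00)
anthropic_style = monthly_cost(**workload, cache_read_mult=0.10, cache_write_mult=1.25)
openai_style = monthly_cost(**workload, cache_read_mult=0.50)
print(f"explicit 90% cache:  ${anthropic_style:,.0f}/month")   # ≈ $2,400
print(f"automatic 50% cache: ${openai_style:,.0f}/month")      # ≈ $6,000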
KV Cache in Inference Infrastructure
At the infrastructure level, the KV cache is the #1 memory bottleneck for serving LLMs. A single long-context request can consume gigabytes of GPU memory. PagedAttention (used by vLLM) solves this with virtual-memory-style paging — eliminating fragmentation and serving 40% more concurrent requests.
GPU Memory Partition (A100 80GB)
As sequence length grows, KV cache consumes more memory — reducing concurrency.
KV Cache per Request
20.5 GB
Max Concurrent Requests
1
(21.5GB free ÷ 20.5GB per req)
⚠️ KV cache consuming 26% of GPU memory — severe concurrency bottleneck!
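The concurrency math behind those numbers, as a quick sketch (the 21.5 GB of free memory implies roughly 58.5 GB of weights and runtime overhead on the 80 GB card, which is an assumption of this example):

gpu_gb, free_gb, kv_per_request_gb = 80.0, 21.5, 20.5
max_concurrent = int(free_gb // kv_per_request_gb)                         # -> 1
print(f"KV cache share of GPU memory: {kv_per_request_gb / gpu_gb:.0%}")   # -> 26%
print(f"max concurrent requests: {max_concurrent}")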
PagedAttention (vLLM)
Traditional KV cache uses contiguous blocks with fragmentation waste. PagedAttention uses fixed-size pages — like virtual memory for attention.
❌ Without PagedAttention (Contiguous)
✅ With PagedAttention (Paged)
Result
~40% more requests served with same memory
4 → ~5 concurrent requests
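A toy sketch of the paged bookkeeping, not vLLM's actual data structures: each request owns a block table mapping its token positions to fixed-size physical pages, so pages never need to be contiguous and at most one partially filled page is wasted per request.

PAGE_TOKENS = 16                                   # tokens per fixed-size page (assumed)

class PagedKVCache:
    def __init__(self, total_pages):
        self.free = list(range(total_pages))       # pool of physical pages on the GPU
        self.tables = {}                           # request id -> (token count, page ids)

    def append_token(self, req):
        count, pages = self.tables.get(req, (0, []))
        if count % PAGE_TOKENS == 0:               # last page is full -> allocate a new one
            pages = pages + [self.free.pop()]
        self.tables[req] = (count + 1, pages)

cache = PagedKVCache(total_pages=8)
for _ in range(40):                                # a 40-token request needs ceil(40/16) = 3 pages
    cache.append_token("req-1")
print(cache.tables["req-1"][1])                    # 3 pages, not necessarily contiguous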
Inference Server Capacity Planner
Without PagedAttention
1
max concurrent requests
With PagedAttention
1
max concurrent requests
⚠️ Target concurrency (32) exceeds capacity (1). Consider: shorter sequences, smaller model, or more GPU memory.