KV Cache Explained

Understand how KV caching makes LLM inference fast and cheap — transformer attention mechanics, prompt prefix caching, Anthropic vs OpenAI implementations, PagedAttention, and GPU memory optimization with interactive demos.

By Visual Explainer · 35 min read · Intermediate · Interactive Demo
Why Your Second Message Is Faster

You've probably noticed this already: the first message in a conversation with Claude or ChatGPT takes a beat longer than the ones that follow. Your second, third, and fourth messages feel noticeably snappier. Most people chalk this up to network conditions or server load. But it's actually something far more interesting — a technique called KV caching that fundamentally changes the economics of running large language models.

Here's the short version: when a transformer processes your first message, it computes mathematical representations (called Key and Value vectors) for every single token in the conversation: your system prompt, any context, and your message. This is expensive work. But those vectors don't change once computed, so the serving system can store them. When your second message arrives and the cache is still warm, the model doesn't redo all that math. It looks up the cached vectors and computes only the new tokens. The result can be responses that are 2-10x faster and up to 90% cheaper on the cached portion.

This isn't a minor optimization. At Anthropic's and OpenAI's scale — serving millions of requests per minute — KV caching is the difference between viable and unviable economics. And for developers building on these APIs, understanding it unlocks dramatic cost savings. Let's start by watching it in action.

The Race: Cold vs Warm Request

Watch both requests process simultaneously. The KV cache makes the second request skip recomputation.

First message (cold), stage timings: 2.1s, 0.8s, 1.1s
Second message (warm), stage timings: 0.0s, 0.8s, 1.1s

Cold Request Cost

$0.042

Full computation

Warm Request Cost

$0.005

88% cheaper with cache

Scale Impact Calculator

See how much KV caching saves at scale.

System Prompt (tokens): 4,000
Daily Active Users: 10,000
Price ($/1M input tokens): $3.0

Monthly Cost Without Caching

$75,600

6.0M requests × 4200 tokens

Monthly Cost With Caching

$10,800

6.0M requests × (4,000 cached tokens at 10% cost + 200 uncached tokens)

You Save

$64,800/month

$777,600/year
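The calculator's arithmetic can be reproduced with a few lines of Python. This is a minimal sketch, not the page's actual widget code: the function name `monthly_input_cost` is hypothetical, the 200 uncached tokens per request are inferred from "4200 tokens" minus the 4,000-token system prompt, and the model ignores cache-write premiums and cache misses.

```python
def monthly_input_cost(
    system_tokens: int,
    dynamic_tokens: int,
    requests_per_month: int,
    price_per_million: float,
    cache_read_multiplier: float = 1.0,
) -> float:
    """Input-token cost in dollars for one month.

    cache_read_multiplier is the fraction of the normal price charged for
    the cached system prompt (1.0 = no caching, 0.10 = a 90% discount).
    """
    billable_tokens = system_tokens * cache_read_multiplier + dynamic_tokens
    return requests_per_month * billable_tokens * price_per_million / 1_000_000


REQUESTS = 6_000_000  # "6.0M requests" from the calculator
without_cache = monthly_input_cost(4_000, 200, REQUESTS, 3.0)
with_cache = monthly_input_cost(4_000, 200, REQUESTS, 3.0, cache_read_multiplier=0.10)

print(f"without caching: ${without_cache:,.0f}")
print(f"with caching:    ${with_cache:,.0f}")
print(f"saved per month: ${without_cache - with_cache:,.0f}")
```

Because the cached prefix dominates the prompt, almost all of the saving comes from the read multiplier; changing `dynamic_tokens` barely moves the result.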

How Transformers Compute Attention

To truly understand why KV caching works — and why it's so effective — you need to understand what transformers are actually doing under the hood. Don't worry, we'll build this up from first principles and you'll see why the “Key” and “Value” in “KV cache” refer to specific mathematical objects in the attention computation.

Every transformer layer performs the same fundamental operation: for each new token being generated, it asks “how much should I pay attention to each previous token?” The answer comes from computing three vectors per token — Query (what am I looking for?), Key (what do I contain?), and Value (what information do I carry?). The Query of the new token is compared against the Keys of all previous tokens to produce attention weights. Those weights are then used to create a weighted sum of the Values.

Here's the critical insight: in a causal (decoder-only) model, the Keys and Values of previous tokens never change, because each token attends only to the tokens before it. Token #3's Key vector is identical whether we're generating token #4 or token #400. So why would we recompute it every single time? We wouldn't, and that's exactly what the KV cache prevents. Each token's K and V vectors are computed once and stored. Only the new token needs fresh computation. Step through the interactive diagram below to see this in real time.
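To make the claim concrete, here is a toy single-head attention step written in plain Python, run once with full recomputation and once with a cache. It is a sketch under simplifying assumptions: one layer, one head, random weights, a head dimension of 4, and hypothetical helper names (`attend`, `step_with_cache`, and so on). Real models do this per layer and per head on a GPU.

```python
import math
import random

random.seed(0)
D = 4  # toy head dimension


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def attend(query, keys, values):
    """Softmax(q.k / sqrt(D)) weights, then a weighted sum of the values."""
    scores = [sum(a * b for a, b in zip(query, k)) / math.sqrt(D) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(D)]


W_q, W_k, W_v = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)
tokens = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(6)]  # toy embeddings


def step_without_cache(t):
    # Recompute K and V for every token seen so far.
    keys = [matvec(W_k, x) for x in tokens[: t + 1]]
    values = [matvec(W_v, x) for x in tokens[: t + 1]]
    return attend(matvec(W_q, tokens[t]), keys, values)


k_cache, v_cache = [], []


def step_with_cache(t):
    # Compute K and V for the new token only, then append them to the cache.
    k_cache.append(matvec(W_k, tokens[t]))
    v_cache.append(matvec(W_v, tokens[t]))
    return attend(matvec(W_q, tokens[t]), k_cache, v_cache)


for t in range(len(tokens)):
    full = step_without_cache(t)
    cached = step_with_cache(t)
    assert all(math.isclose(a, b) for a, b in zip(full, cached))
print("cached and uncached attention outputs match at every step")
```

The two paths produce the same output at every step; the only difference is how many K and V projections each one performs.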

Transformer Attention with KV Cache

Step through token generation. See how the KV cache avoids recomputing previous tokens.

[Interactive diagram, step 1 of 6. Input token: "The". Q (Query) is computed for the new token only; K (Keys) and V (Values) cover all tokens so far, here K0 and V0. Q[The] attends to all Keys, and the output is a weighted sum of Values that feeds the next-token prediction. KV Cache Buffer: K0, V0 (The). ✅ Only K0/V0 computed. 0 pairs reused from cache.]

KV Cache Size Calculator

Calculate how much GPU memory the KV cache consumes for a given model architecture.

Layers: 32
Attention Heads: 32
Head Dimension: 128
Sequence Length: 4,096
Precision: 16-bit (2 bytes per value)

Formula

2 (K and V) × 32 layers × 32 heads × 128 dim × 4,096 seq × 2 bytes
= 2,147,483,648 bytes ≈ 2.15 GB (2.00 GiB) per request
GPU Memory (A100 80GB): 2.00 GiB / 80 GB
KV Cache: 2.5% of GPU memory
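The same formula as a small Python function, using the calculator's inputs. The function name `kv_cache_bytes` is hypothetical, and the sketch assumes every attention head stores its own K and V (no grouped-query sharing).

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to hold K and V for one sequence (the leading 2 is K plus V)."""
    return 2 * layers * heads * head_dim * seq_len * bytes_per_value


size = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=4096)
print(size)          # 2147483648 bytes
print(size / 1e9)    # decimal gigabytes (GB)
print(size / 2**30)  # binary gibibytes (GiB)
print(size // 4096)  # bytes of cache added per token
```

The per-token figure is often the more useful number: it tells you how fast memory grows as a conversation gets longer.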

Now you can see the math: without a cache, generating token N means recomputing Keys and Values for all N tokens, which is 2×N×L×H×D vector elements (where L is layers, H is heads, D is head dimension). With a cache, it's just 2×1×L×H×D: only the new token. At the 100th token of a generation, that's a 100x reduction in redundant K/V computation for that step, and roughly 50x averaged over all 100 steps. But this local, per-request caching is just the beginning. What if multiple API requests share the same prefix?
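A quick count shows how the gap grows over a whole generation. This sketch counts K/V token projections per layer and head, not exact FLOPs, and the function names are hypothetical.

```python
def kv_projections_without_cache(num_tokens: int) -> int:
    # Step n recomputes K and V for all n tokens seen so far.
    return sum(range(1, num_tokens + 1))


def kv_projections_with_cache(num_tokens: int) -> int:
    # Each step computes K and V for the new token only.
    return num_tokens


n = 100
uncached = kv_projections_without_cache(n)
cached = kv_projections_with_cache(n)
print(uncached, cached, uncached / cached)
```

At the final step the uncached path does N projections where the cached path does 1; summed over the whole generation the ratio is (N + 1) / 2.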

Prompt Prefix Caching — The API-Level Game Changer

Here's where things get really interesting — and where your API bills start dropping dramatically. Think about a typical production application: every request to your AI customer service agent starts with the same 4,000-token system prompt. Every RAG query includes the same few-shot examples. Every code assistant embeds the same repository context. That's thousands of tokens being reprocessed from scratch on every single request, costing you the same amount each time.

Prompt prefix caching takes the per-request KV cache concept and elevates it to the infrastructure level. When your request shares a prefix with a recent previous request (same system prompt, same context), the provider can skip recomputing those KV vectors entirely. The cached portion is read from server-side storage at a fraction of the cost: 90% cheaper with Anthropic, 50% cheaper with OpenAI.

The implications are massive. If your system prompt is 4,000 tokens and your user message is 100 tokens, prefix caching means you only pay full price for those 100 new tokens. The 4,000-token prefix? Cached and served at 10% cost. For a business making 100,000 API calls per day at $3 per million input tokens, this is the difference between a bill of roughly $37,000/month and one of roughly $4,500/month, before any cache-write premium. Let's see exactly how this works.

Two-Request Sequence

Watch how the second request skips recomputing the cached prefix — only the user message is processed.

Request 1: System Prompt (2,000 tokens) | Retrieved Docs (5,000 tokens) | User Message (80 tokens)
Request 2: System Prompt (cached) | Retrieved Docs (cached) | User Message (processed)

Cache Breakpoint Optimizer

Drag the breakpoint to optimize cache placement. Static content goes left, dynamic content stays right.

Prompt segments: System Prompt, Few-Shot Examples, Retrieved Context, User Message (✂️ draggable breakpoint)

Cached: 4,956 tokens

Dynamic (per request): 2,124 tokens computed fresh

Cache Savings: 63%

API Implementation

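import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            # In practice this must be a long, stable prompt; very short
            # prompts fall below the minimum cacheable length.
            "text": "You are a helpful coding assistant...",
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }
    ],
    messages=[
        {"role": "user", "content": "Explain KV caching"}
    ],
)

# Check cache usage in the response
usage = response.usage
print(usage.cache_creation_input_tokens)  # e.g. 2048 on the first call (cache write)
print(usage.cache_read_input_tokens)      # e.g. 2048 on subsequent calls (cache hit)
print(usage.input_tokens)                 # e.g. 15 uncached tokens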

Usage Breakdown (Pie Chart)

cache_creation_input_tokens: 2,048
cache_read_input_tokens: 2,048
input_tokens: 15

On cache hit: 2048 tokens read at 10% cost = $0.0006 instead of $0.006
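The three usage fields can be turned into a per-request cost estimate. This is a sketch, not an official billing formula: `input_cost_usd` is a hypothetical helper, the usage objects are plain dictionaries rather than SDK objects, the $3 per million price is the one used in the calculator earlier, and the 1.25x write and 0.10x read multipliers are the figures quoted in this article.

```python
def input_cost_usd(
    usage: dict,
    price_per_million: float,
    write_multiplier: float = 1.25,  # 25% premium on cache writes
    read_multiplier: float = 0.10,   # cache reads at 10% of normal price
) -> float:
    """Estimate the input-side cost of one request from its usage fields."""
    weighted_tokens = (
        usage.get("input_tokens", 0)
        + usage.get("cache_creation_input_tokens", 0) * write_multiplier
        + usage.get("cache_read_input_tokens", 0) * read_multiplier
    )
    return weighted_tokens * price_per_million / 1_000_000


first_call = {"cache_creation_input_tokens": 2048, "cache_read_input_tokens": 0, "input_tokens": 15}
later_call = {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 2048, "input_tokens": 15}

print(f"first call: ${input_cost_usd(first_call, 3.0):.6f}")
print(f"later call: ${input_cost_usd(later_call, 3.0):.6f}")
```

Logging this number per request is a cheap way to confirm that cache hits are actually happening in production.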

The key architectural insight here is that caching only works for prefixes: the static portion at the beginning of your prompt. The moment content diverges from the cached version (your dynamic user message), caching stops. This means prompt engineering for cache efficiency is about putting stable content first and variable content last. System prompt → few-shot examples → retrieved context → user message. Get this ordering right and you're saving 80-90% on the bulk of your tokens. But Anthropic and OpenAI implement this very differently, so let's compare.
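The ordering rule is easy to check with a toy model in which a prompt is a list of chunks and the cacheable part is whatever two requests share from the start. The chunk labels and function names below are hypothetical stand-ins; real providers match on tokens, not strings.

```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading chunks the two prompts share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


SYSTEM = ["sys-1", "sys-2", "sys-3"]
EXAMPLES = ["shot-1", "shot-2"]


def stable_first(user_message: str) -> list[str]:
    return SYSTEM + EXAMPLES + [user_message]


def dynamic_first(user_message: str) -> list[str]:
    return [user_message] + SYSTEM + EXAMPLES


print(common_prefix_len(stable_first("question A"), stable_first("question B")))    # 5
print(common_prefix_len(dynamic_first("question A"), dynamic_first("question B")))  # 0
```

With the user message first, the two prompts diverge at position zero and nothing after it can be reused, even though most of the content is identical.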

Anthropic vs OpenAI — Two Philosophies of Caching

Both major providers offer prompt caching, but their implementations reflect fundamentally different design philosophies. Anthropic says: “You know your application best — tell us exactly what to cache, and we'll give you 90% savings and full visibility.” OpenAI says: “Don't worry about it — we'll automatically detect repeated prefixes and give you 50% savings with zero configuration.”

Neither approach is universally better — it depends on your use case. If you have a stable, predictable system prompt and high request volume, Anthropic's explicit caching saves dramatically more money. If your prompts vary frequently or you want zero engineering overhead, OpenAI's automatic approach is simpler. The comparison below breaks down every dimension — cost, control, TTL, visibility — so you can make an informed choice for your specific architecture.

Anthropic vs OpenAI — Cache Implementation

Not a simple table — a visual feature-by-feature comparison.

Anthropic

Explicit Cache Control

OpenAI

Automatic Prefix Caching

Activation Method
🏷️

Explicit cache_control marker

You choose exactly what to cache

🤖

Automatic prefix detection

System detects repeated prefixes

Minimum Cacheable Length

1,024+ token minimum

Minimum varies by model; shorter prompts are not cached

⚠️

1,024 token minimum

Short prompts cannot be cached

Cache Duration
⏱️

5 minutes, refreshes on hit

Predictable TTL, resets each use

🔮

Minutes to hours (variable)

No guaranteed TTL, opaque behavior

Read Cost Discount
💰

90% savings (10% of normal)

Pay only 10¢ per $1 of cached tokens

💵

50% savings (50% of normal)

Pay 50¢ per $1 of cached tokens

Cache Visibility

Explicit token counts in response

cache_read_input_tokens in API response

Cached token count in response usage

usage.prompt_tokens_details.cached_tokens in API response

Max Breakpoints
4️⃣

4 breakpoints per request

Fine-grained control over cache segments

1️⃣

1 implicit prefix

Only the longest common prefix is cached

Write Cost
📝

25% premium on first write

Amortized over subsequent reads

🆓

No write premium

Cache writes are free

🔑 Key Takeaway: Anthropic gives you explicit control with 90% savings and full visibility. OpenAI is zero-effort with 50% savings but less control. For high-volume applications with stable system prompts, Anthropic's explicit caching saves significantly more.
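The write premium changes the picture only for prefixes that are sent once. A small sketch, using the multipliers from the comparison above and a hypothetical `prefix_cost_units` helper, expresses each strategy's cost in units of one full-price pass over the prefix. It assumes every repeat request arrives while the cache is still alive.

```python
def prefix_cost_units(num_requests: int, first_multiplier: float, repeat_multiplier: float) -> float:
    """Cost of sending the same prefix num_requests times, in units of one
    full-price pass over that prefix."""
    if num_requests <= 0:
        return 0.0
    return first_multiplier + (num_requests - 1) * repeat_multiplier


for k in (1, 2, 5, 20):
    no_cache = prefix_cost_units(k, 1.0, 1.0)
    explicit = prefix_cost_units(k, 1.25, 0.10)  # 25% write premium, reads at 10%
    automatic = prefix_cost_units(k, 1.0, 0.50)  # free write, reads at 50%
    print(k, no_cache, round(explicit, 2), round(automatic, 2))
```

Under these assumptions the explicit strategy is the most expensive for a single request, and becomes the cheapest from the second request onward.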

Which Strategy Saves More?

Static Prefix (tokens): 5,000
Dynamic Suffix (tokens): 500
Requests/Day: 10,000
Model Tier

Anthropic (Explicit Cache)

$900/month

OpenAI (Auto Cache)

$900/month

Difference

$0/month

One pattern I've seen work well in production: use Anthropic's explicit caching for your high-volume, latency-sensitive endpoints (customer-facing chat, code completion) where the savings compound dramatically. Use OpenAI for exploratory, variable workloads (internal tools, one-off analysis) where the zero-configuration ease matters more than maximum savings. But regardless of which provider you choose, both face the same fundamental infrastructure challenge: the KV cache is a memory monster. Let's look at what happens inside the GPU.

GPU Infrastructure — The Memory Bottleneck

Here's a fact that surprises most developers: for long contexts and many concurrent users, the KV cache, not the model weights, becomes the #1 memory bottleneck when serving LLMs in production. A 70B parameter model takes ~140GB in FP16 (spread across multiple GPUs). But the KV cache for a single 128K-token request on that model? Over 40GB of GPU memory, even assuming a Llama-2-70B-style layout with 80 layers, 8 grouped-query KV heads, and head dimension 128 in FP16. For one request. Now imagine serving 32 concurrent users. The math doesn't work without clever engineering.

This is why frameworks like vLLM and TensorRT-LLM exist. vLLM tackles the KV cache memory problem with a technique called PagedAttention, inspired by how operating systems manage virtual memory, and TensorRT-LLM offers a similar paged KV cache. Instead of allocating one giant contiguous block per request (which wastes memory through fragmentation and over-reservation), PagedAttention breaks the cache into fixed-size pages that can be allocated on demand and shared across requests. The result in the example below: about 40% more concurrent users on the same hardware, with zero quality loss. If you're self-hosting models or evaluating inference providers, understanding this layer is essential.

GPU Memory Partition (A100 80GB)

As sequence length grows, KV cache consumes more memory — reducing concurrency.

Sequence Length: 4,096 tokens
Model Weights: 35.0GB
KV Cache (4,096 tokens): 20.5GB
Activations: 2.0GB
Overhead: 1.0GB
Free: 21.5GB

KV Cache per Request

20.5 GB

Max Concurrent Requests

1

(21.5GB free ÷ 20.5GB per req)

⚠️ KV cache consuming 26% of GPU memory — severe concurrency bottleneck!
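The concurrency figure is a floor division, and it responds sharply to sequence length. The sketch below reuses the demo's numbers (20.5GB per request at 4,096 tokens, 21.5GB free). It assumes the KV cache scales linearly with sequence length and, as a simplification, holds the free-memory budget fixed. Both function names are hypothetical.

```python
import math


def kv_gb_at(seq_len: int, ref_gb: float = 20.5, ref_len: int = 4096) -> float:
    """Scale the per-request KV cache linearly from a reference measurement."""
    return ref_gb * seq_len / ref_len


def max_concurrent_requests(kv_budget_gb: float, kv_gb_per_request: float) -> int:
    """How many whole per-request caches fit in the available memory."""
    return math.floor(kv_budget_gb / kv_gb_per_request)


for seq_len in (4096, 2048, 1024):
    per_request = kv_gb_at(seq_len)
    print(seq_len, per_request, max_concurrent_requests(21.5, per_request))
```

Halving the sequence length roughly doubles how many requests fit, which is why context limits are one of the first levers operators reach for.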

PagedAttention (vLLM)

Traditional KV cache uses contiguous blocks with fragmentation waste. PagedAttention uses fixed-size pages — like virtual memory for attention.

Requests: 4

❌ Without PagedAttention (Contiguous)

Req 1 | waste | Req 2 | waste | Req 3 | waste | Req 4 | waste
Fragmentation: ~35% wasted

✅ With PagedAttention (Paged)

Req 1 (4 pages) | Req 2 (5 pages) | Req 3 (3 pages) | Req 4 (8 pages)
No external fragmentation; at most one partly filled page per request

Result

~40% more requests served with same memory

4 → ~5 concurrent requests
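The paging idea can be sketched as a tiny allocator that hands out fixed-size pages on demand and keeps a per-request page table. This is an illustration of the concept only, not vLLM's implementation: the class `PagedKVAllocator`, the 16-token page size, the 128-token maximum length, and the request lengths are all assumptions chosen to match the page counts in the demo above.

```python
import math

PAGE_TOKENS = 16  # tokens per page (assumed)
MAX_LEN = 128     # max sequence length a contiguous scheme would reserve (assumed)


class PagedKVAllocator:
    def __init__(self, total_pages: int):
        self.free_pages = list(range(total_pages))
        self.page_table: dict[str, list[int]] = {}  # request id -> physical page ids
        self.lengths: dict[str, int] = {}           # request id -> tokens stored

    def append_tokens(self, request_id: str, n_tokens: int) -> None:
        """Grow a request's cache, allocating new pages only when needed."""
        new_len = self.lengths.get(request_id, 0) + n_tokens
        table = self.page_table.get(request_id, [])
        extra = math.ceil(new_len / PAGE_TOKENS) - len(table)
        if extra > len(self.free_pages):
            raise MemoryError("out of KV cache pages")
        for _ in range(extra):
            table.append(self.free_pages.pop())
        self.page_table[request_id] = table
        self.lengths[request_id] = new_len

    def release(self, request_id: str) -> None:
        """Return a finished request's pages to the free pool."""
        self.free_pages.extend(self.page_table.pop(request_id, []))
        self.lengths.pop(request_id, None)


requests = {"req1": 60, "req2": 75, "req3": 40, "req4": 120}  # tokens per request
allocator = PagedKVAllocator(total_pages=32)
for request_id, n_tokens in requests.items():
    allocator.append_tokens(request_id, n_tokens)

paged_pages = sum(len(pages) for pages in allocator.page_table.values())
contiguous_pages = len(requests) * math.ceil(MAX_LEN / PAGE_TOKENS)
print(paged_pages, contiguous_pages)
```

In this toy setup the paged allocator uses 20 pages where reserving the maximum length up front would use 32, and the pages a request receives do not need to be adjacent in memory.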

Inference Server Capacity Planner

GPU Memory: 80 GB
Model Size (B params)
Avg Seq Length: 4,096
Target Concurrency: 32
Memory bar segments: Model (68GB), KV, Act, Free

Without PagedAttention

1

max concurrent requests

With PagedAttention

1

max concurrent requests

⚠️ Target concurrency (32) exceeds capacity (1). Consider: shorter sequences, smaller model, or more GPU memory.

Putting It All Together

Let's zoom out and connect everything we've covered. At the math level, KV caching avoids redundant computation by storing Key and Value vectors that never change. At the API level, prefix caching extends this across requests — shared system prompts are computed once and reused thousands of times. At the infrastructure level, PagedAttention solves the memory fragmentation problem that would otherwise limit concurrency.

Together, these three layers make large language models economically viable to serve. Without per-request KV caching, every generation step would recompute Keys and Values for the entire sequence, so the cost per token would grow with context length (100x more K/V work at the 100th token). Without prefix caching, API costs would be 5-10x higher for production workloads with long shared prefixes. Without PagedAttention, fragmentation would cut the number of concurrent requests each GPU can hold, forcing providers to run more GPUs for the same traffic. Understanding this stack, from attention math to GPU memory management, makes you a fundamentally stronger AI engineer. You now know not just what to cache, but why it works, how to optimize it, and where the real bottlenecks live.