Love K. Tyagi

Prompt Caching: The $648/Month Leak in Your LLM Pipeline (And How to Plug It)

You wouldn't re-cook Thanksgiving dinner for every second helping. Stop reprocessing the same system prompt on every API call.

Technical · March 2025 · 16 min read

You wouldn't re-cook an entire Thanksgiving dinner every time someone asks for a second helping of mashed potatoes. Yet that's exactly what most LLM pipelines do — reprocessing the same 50,000-token system prompt from scratch on every single API call.

This is the problem prompt caching solves. And if you're building anything production-grade with LLMs — clinical trial analyzers, agentic document processors, or even a smart chatbot — ignoring it is like running a V8 engine with the parking brake on.

Let me show you how it works, how OpenAI and Anthropic do it differently, and why your monthly API bill might be 10x higher than it needs to be.

The Analogy Everyone Gets Wrong

Most explanations compare prompt caching to "browser caching" or "a library index." Those are fine, but they miss the real mechanic at play. Here's one that's closer to the metal:

The Courtroom Stenographer

Imagine a courtroom. Every day, the judge reads the same 30-page set of procedural rules aloud before the trial begins. The lawyers sit through it. The jury sits through it. The stenographer types it all out. Every. Single. Day.

Now imagine the stenographer says: "Your Honor, I already have pages 1-30 from yesterday. They're identical. I'll start fresh from page 31."

That's prompt caching. The "procedural rules" are your system prompt, your few-shot examples, your tool definitions. The "fresh testimony" is the new user message. The stenographer doesn't need to re-transcribe what hasn't changed.

But here's the part most people miss — the savings aren't in storage. The stenographer isn't saving paper. She's saving time and cognitive effort. In LLM terms, the model isn't saving disk space; it's saving the expensive computation of converting tokens into internal representations (called KV cache entries). That computation — which involves billions of matrix multiplications across the transformer's attention layers — is what you're paying for, and it's what gets skipped on a cache hit.

The Prep Chef Analogy

Here's another one for the foodies:

A prep chef in a Michelin-star kitchen doesn't re-make the mother sauce from scratch for every plate. They make a large batch at the start of service — the mise en place — and every dish that needs it just ladles from the pre-made batch. Only the garnish and plating (the unique, per-order part) gets done fresh.

Your system prompt is the mother sauce. Your user query is the garnish. Prompt caching is the mise en place.

Without it, you're asking your chef to reduce stock from raw bones for every single order. The customer waits longer. Your gas bill explodes. Your chef burns out. Everyone loses.

How It Actually Works Under the Hood

When an LLM processes your prompt, each token passes through the model's transformer layers, generating intermediate values called Key-Value (KV) pairs. These pairs represent the model's "understanding" of each token in context. For a 100K-token prompt, this is an enormous amount of computation.

Prompt caching stores these pre-computed KV pairs so they don't need to be recalculated. When a subsequent request starts with the same prefix of tokens, the model says: "I've already computed the KV pairs for these 95,000 tokens. Let me just compute the new 500 tokens at the end."
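The prefix-matching mechanic can be sketched with a toy model. This is purely illustrative: real KV caches hold attention tensors on the inference server, not strings, and here we only count how many tokens would need fresh computation.

```python
def tokens_to_compute(prompt_tokens: list[str], cache: dict) -> int:
    """Return how many tokens need fresh processing, updating the cache."""
    # Find the longest already-cached prefix of this prompt.
    longest = 0
    for n in range(len(prompt_tokens), 0, -1):
        if tuple(prompt_tokens[:n]) in cache:
            longest = n
            break
    # "Compute" the remaining suffix, caching every new prefix along the way.
    for n in range(longest + 1, len(prompt_tokens) + 1):
        cache[tuple(prompt_tokens[:n])] = True
    return len(prompt_tokens) - longest

cache = {}
system = ["rule"] * 400  # stand-in for a long, stable system prompt
first  = tokens_to_compute(system + ["query-A"], cache)  # cold start
second = tokens_to_compute(system + ["query-B"], cache)  # warm start
print(first, second)  # 401 1
```

The second request reuses the entire shared prefix and only pays for the one token that changed, which is exactly the shape of the savings on a real cache hit.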

The result:

  • Up to 90% cheaper input token costs
  • Up to 85% lower latency (time to first token)
  • Identical output quality — caching doesn't change what the model generates

Think of it as the difference between a cold start and a warm start on a car engine. Same destination, dramatically different time to get moving.
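The billing side of that warm start is simple arithmetic: fresh tokens at the base price plus cached tokens at the discounted price. A minimal helper, using the 95,000-cached-token example above and an illustrative $3/MTok base rate with a 90% cache discount:

```python
def input_cost(total_tokens: int, cached_tokens: int,
               base_per_mtok: float, cached_per_mtok: float) -> float:
    """Billed input cost in dollars for one request."""
    fresh = total_tokens - cached_tokens
    return (fresh * base_per_mtok + cached_tokens * cached_per_mtok) / 1_000_000

# 95,500-token prompt: cold start vs 95,000 tokens served from cache.
cold = input_cost(95_500, 0,      3.00, 0.30)
warm = input_cost(95_500, 95_000, 3.00, 0.30)
print(cold, warm)  # cold ≈ $0.2865, warm ≈ $0.03
```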

OpenAI vs. Anthropic: Two Philosophies of Caching

This is where it gets interesting. Both providers offer prompt caching, but their approaches reflect fundamentally different product philosophies — like the difference between automatic and manual transmission.

OpenAI: The Automatic Transmission

OpenAI's prompt caching is fully automatic. No code changes. No special headers. No configuration. If your prompt is ≥1,024 tokens and you send the same prefix again, OpenAI tries to route your request to a server that has the cached version.

Key characteristics:

  • Automatically enabled on GPT-4o, GPT-4o mini, o1, GPT-5, and newer
  • Minimum 1,024 tokens, cache matched in 128-token increments
  • No extra cost for cache writes
  • Cached reads are 50% off regular input pricing (some models up to 90% off)
  • Cache persists 5–10 minutes, up to 1 hour during off-peak
  • Best-effort routing — cache hits are not guaranteed (hit rates of roughly 50% are commonly reported in practice)
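Those two thresholds, the 1,024-token minimum and the 128-token matching increments, determine how much of a prompt can actually be served from cache. A quick sketch of the rule as described in OpenAI's docs:

```python
def openai_cacheable_tokens(prompt_tokens: int) -> int:
    """Tokens eligible for an OpenAI cache hit: zero below the 1,024-token
    minimum, otherwise the prompt length rounded down to a 128-token boundary."""
    if prompt_tokens < 1024:
        return 0
    return (prompt_tokens // 128) * 128

print(openai_cacheable_tokens(500))   # 0: below the minimum, never cached
print(openai_cacheable_tokens(4000))  # 3968
```

This is why a ~4,000-token system prompt shows up as 3,968 cached tokens in the usage metadata rather than the full count.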

Anthropic: The Manual Transmission with Turbo

Anthropic gives you explicit control over what gets cached and when. You pay a small premium to write to the cache, but in return you get near-deterministic cache hits — 100% in controlled experiments.

Key characteristics:

  • You opt in by adding a cache_control marker to the content blocks you want cached
  • Minimum token thresholds vary by model (1,024 for Sonnet/Opus; 2,048–4,096 for Haiku models)
  • Cache writes cost 1.25x base input price (5-minute TTL) or 2x (1-hour TTL)
  • Cached reads are 0.1x base input price — that's a 90% discount
  • 5-minute default TTL, refreshed on every hit
  • Predictable, deterministic cache behavior

The Trade-Off Matrix

| Feature | OpenAI | Anthropic |
|---|---|---|
| Setup effort | Zero (automatic) | Minimal (add cache_control) |
| Cache write cost | Free | 1.25x base input |
| Cache read discount | 50% (up to 90%) | 90% |
| Cache hit reliability | ~50% (best-effort) | ~100% (deterministic) |
| Min tokens | 1,024 | 1,024–4,096 (model-dependent) |
| TTL | 5–60 min (uncontrollable) | 5 min or 1 hour (your choice) |
| Best for | General apps, low effort | High-volume, latency-critical apps |

Bottom line: OpenAI is "set it and forget it." Anthropic is "invest a little, save a lot, and get predictable performance." If you're running a production pipeline processing thousands of requests against the same long context — like analyzing clinical trial protocols or running batch document extraction — Anthropic's deterministic caching is a significant advantage.
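Anthropic's write premium also pays for itself almost immediately. With a 1.25x write and 0.1x reads, the cached path is already cheaper by the second request sharing the prefix; a sketch of the arithmetic (base price cancels out, so only the multipliers matter):

```python
def anthropic_breakeven_requests(write_mult: float = 1.25,
                                 read_mult: float = 0.10) -> int:
    """Smallest number of requests sharing a prefix at which caching
    (one write, then reads) beats paying the base price every time."""
    k = 1
    while write_mult + (k - 1) * read_mult >= k * 1.0:
        k += 1
    return k

print(anthropic_breakeven_requests())  # 2
```

In other words, unless you expect the prefix to be used exactly once within the TTL, the 1.25x write is worth it.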

Let's See It in Action: A Real Example

Let's build a realistic scenario. You're running an Adverse Event Classification System that takes clinical narrative texts and classifies them according to MedDRA terminology. Your system prompt includes coding guidelines, MedDRA hierarchy definitions, and few-shot examples — easily 4,000+ tokens.

The System Prompt (Shared Across All Requests)

You are an expert pharmacovigilance AI assistant specialized in adverse event
(AE) classification using MedDRA terminology (v27.0).

CLASSIFICATION RULES:
1. Map each reported adverse event to the most specific MedDRA Preferred Term (PT)
2. Identify the corresponding System Organ Class (SOC)
3. Assess seriousness criteria per ICH E2D guidelines
4. Flag any events qualifying as SUSAR (Suspected Unexpected Serious Adverse Reaction)
5. Provide a confidence score (0.0-1.0) for each classification

SERIOUSNESS CRITERIA (per ICH E2D):
- Results in death
- Is life-threatening
- Requires inpatient hospitalization or prolongation
- Results in persistent or significant disability/incapacity
- Is a congenital anomaly/birth defect
- Is a medically important event

FEW-SHOT EXAMPLES:

Input: "Patient experienced severe headache and blurred vision 2 days after dose"
Output:
{
  "events": [
    {
      "verbatim": "severe headache",
      "preferred_term": "Headache",
      "soc": "Nervous system disorders",
      "serious": false,
      "confidence": 0.95
    },
    {
      "verbatim": "blurred vision",
      "preferred_term": "Vision blurred",
      "soc": "Eye disorders",
      "serious": false,
      "confidence": 0.92
    }
  ]
}

Input: "Subject hospitalized due to anaphylactic shock following infusion"
Output:
{
  "events": [
    {
      "verbatim": "anaphylactic shock",
      "preferred_term": "Anaphylactic shock",
      "soc": "Immune system disorders",
      "serious": true,
      "seriousness_criteria": ["requires_hospitalization", "life_threatening"],
      "susar_flag": true,
      "confidence": 0.98
    }
  ]
}

[... 8 more few-shot examples covering edge cases ...]

Always respond in valid JSON. Do not include explanations outside the JSON structure.

This system prompt is ~4,000 tokens. Now imagine you're classifying 500 adverse event narratives from a Phase III trial. Without caching, you're reprocessing 4,000 tokens × 500 requests = 2,000,000 redundant input tokens.

OpenAI Implementation (Automatic — No Changes Needed)

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """You are an expert pharmacovigilance AI assistant..."""  # ~4,000 tokens

ae_narratives = [
    "Patient reported persistent nausea and Grade 2 diarrhea starting Day 3 of Cycle 2",
    "Subject developed maculopapular rash on trunk, resolved with topical steroids",
    # ... 498 more narratives
]

for narrative in ae_narratives:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Classify: {narrative}"}
        ]
    )

    # Check cache status in the response
    usage = response.usage
    print(f"Total tokens: {usage.prompt_tokens}")
    print(f"Cached tokens: {usage.prompt_tokens_details.cached_tokens}")
    # First request:  cached_tokens = 0     (cold start)
    # Later requests: cached_tokens ≈ 3,968 (nearest 128-token boundary)

What happens behind the scenes:

  • Request 1: Full processing of all ~4,000 system tokens. Cache miss.
  • Request 2+: OpenAI attempts to route to the same server. If successful, ~3,968 tokens are served from cache, and only the new user message tokens are computed fresh.
  • Hit rate in practice: ~50% due to best-effort routing.
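Because OpenAI's routing is best-effort, it's worth measuring your actual hit rate instead of assuming one. A sketch that tallies hits from usage metadata; the dicts here are simplified, flattened stand-ins for `response.usage` and its `prompt_tokens_details.cached_tokens` field:

```python
def cache_hit_rate(usages: list[dict]) -> float:
    """Fraction of requests that received any cached tokens."""
    if not usages:
        return 0.0
    hits = sum(1 for u in usages if u["cached_tokens"] > 0)
    return hits / len(usages)

# Synthetic batch: one cold start, two hits, one routing miss.
batch = [
    {"prompt_tokens": 4200, "cached_tokens": 0},     # request 1: cold start
    {"prompt_tokens": 4200, "cached_tokens": 3968},  # hit
    {"prompt_tokens": 4200, "cached_tokens": 0},     # routed to a cold server
    {"prompt_tokens": 4200, "cached_tokens": 3968},  # hit
]
print(cache_hit_rate(batch))  # 0.5
```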

Anthropic Implementation (Explicit Control)

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are an expert pharmacovigilance AI assistant..."""  # ~4,000 tokens

ae_narratives = [
    "Patient reported persistent nausea and Grade 2 diarrhea starting Day 3 of Cycle 2",
    "Subject developed maculopapular rash on trunk, resolved with topical steroids",
    # ... 498 more narratives
]

for narrative in ae_narratives:
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # cache_control goes on the content block; everything up to
                # and including this block becomes the cached prefix
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            {"role": "user", "content": f"Classify: {narrative}"}
        ]
    )

    # Anthropic gives you explicit cache metrics
    print(f"Input tokens: {response.usage.input_tokens}")
    print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
    print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
    # Request 1: cache_creation = 4,000 | cache_read = 0       (writing to cache)
    # Request 2: cache_creation = 0     | cache_read = 4,000   (reading from cache)

What happens behind the scenes:

  • Request 1: Full processing + cache write. You pay 1.25x on the system prompt tokens.
  • Request 2+: System prompt served from cache at 0.1x cost. Deterministic. Every time.
  • Hit rate: ~100% within the 5-minute TTL window.
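The three usage fields above are all you need to reconstruct the input bill for a request. A helper using this post's multipliers (1.25x writes, 0.1x reads, 5-minute TTL) and the $3/MTok Sonnet base price:

```python
def anthropic_input_cost(cache_write: int, cache_read: int, fresh: int,
                         base_per_mtok: float = 3.00) -> float:
    """Input cost in dollars: writes at 1.25x base, reads at 0.1x, fresh at 1x."""
    return (cache_write * 1.25 * base_per_mtok
            + cache_read * 0.10 * base_per_mtok
            + fresh * base_per_mtok) / 1_000_000

print(anthropic_input_cost(4000, 0, 200))  # request 1 (cache write): 0.0156
print(anthropic_input_cost(0, 4000, 200))  # request 2+ (cache read): 0.0018
```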

The Numbers Don't Lie: Cost & Latency Comparison

Let's do the math for our 500-request AE classification batch using Claude Sonnet 4.5 pricing as a reference.

Assumptions

  • System prompt: 4,000 tokens (stable across all requests)
  • User message: 200 tokens per request (unique per AE narrative)
  • Output: 300 tokens per response
  • 500 total requests, processed within the 5-minute cache window

Without Prompt Caching

Input cost per request:  4,200 tokens × $3.00/1M = $0.0126
Output cost per request: 300 tokens × $15.00/1M  = $0.0045
Total per request:       $0.0171
───────────────────────────────────────────────
Total for 500 requests:  $8.55
Total input tokens billed: 2,100,000

With Anthropic Prompt Caching

Request 1 (cache write):
  Cache write:  4,000 tokens × $3.75/1M  = $0.0150  (1.25x base)
  Fresh input:  200 tokens × $3.00/1M    = $0.0006
  Output:       300 tokens × $15.00/1M   = $0.0045
  Subtotal:     $0.0201

Requests 2–500 (cache read, × 499 requests):
  Cache read:   4,000 tokens × $0.30/1M  = $0.0012  (0.1x base — 90% savings)
  Fresh input:  200 tokens × $3.00/1M    = $0.0006
  Output:       300 tokens × $15.00/1M   = $0.0045
  Subtotal:     $0.0063 × 499 = $3.1437

───────────────────────────────────────────────
Total for 500 requests:  $3.16
Effective input tokens billed: ~305,000 (equivalent)

The Savings

| Metric | Without Caching | With Caching | Savings |
|---|---|---|---|
| Total cost | $8.55 | $3.16 | 63% reduction |
| Input token cost | $6.30 | $0.91 | 86% reduction |
| Latency (est.) | ~2.5s TTFT | ~0.4s TTFT | ~85% faster |

At scale — say 10,000 requests per day — this becomes $171/day vs. $63/day. That's $3,240 saved per month on input costs alone.
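The whole comparison can be packed into one function, so you can rerun it with your own token counts and prices. A sketch using this post's assumptions (one shared system prompt, Anthropic-style 1.25x write / 0.1x read multipliers):

```python
def batch_cost(requests: int, sys_tokens: int, user_tokens: int, out_tokens: int,
               in_price: float, out_price: float, cached: bool,
               write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Total dollar cost for a batch of requests sharing one system prompt."""
    per_m = 1 / 1_000_000
    out_cost = requests * out_tokens * out_price * per_m
    if not cached:
        return requests * (sys_tokens + user_tokens) * in_price * per_m + out_cost
    write = sys_tokens * write_mult * in_price * per_m                  # request 1
    reads = (requests - 1) * sys_tokens * read_mult * in_price * per_m  # requests 2+
    fresh = requests * user_tokens * in_price * per_m                   # every request
    return write + reads + fresh + out_cost

print(round(batch_cost(500, 4000, 200, 300, 3.00, 15.00, cached=False), 2))  # 8.55
print(round(batch_cost(500, 4000, 200, 300, 3.00, 15.00, cached=True), 2))   # 3.16
```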

OpenAI Comparison (Same Scenario)

With OpenAI's automatic caching (GPT-4o, assuming ~50% cache hit rate):

~250 cache hits:   250 × (4,000 × $1.25/1M + 200 × $2.50/1M + 300 × $10.00/1M) ≈ $2.13
~250 cache misses: 250 × (4,200 × $2.50/1M + 300 × $10.00/1M)                  ≈ $3.38
───────────────────────────────────────────────
Total: ~$5.50 (estimates; actual pricing varies by model version)

OpenAI's savings are real but less predictable. The inconsistent hit rate makes it harder to forecast costs — a meaningful consideration in regulated environments where budget predictability matters.

Five Rules for Maximizing Cache Hits

Regardless of which provider you use, these principles apply:

1. Front-load the stable stuff. System prompts, tool definitions, few-shot examples, and reference documents go at the beginning of your prompt. Dynamic user input goes at the end. Caching works on prefixes — if you put the user message before the system prompt, nothing caches.

2. Don't personalize the prefix. If you inject the user's name or account ID into the system prompt, every user gets a unique prefix and no one hits the cache. Push personalization into the user message instead.

3. Respect the minimum thresholds. OpenAI needs 1,024 tokens. Anthropic's Claude Haiku 4.5 needs 4,096. If your system prompt is 500 tokens, caching won't activate. Consider padding with useful few-shot examples to cross the threshold.

4. Keep the TTL warm. Anthropic's 5-minute default means you need at least one request every 5 minutes to keep the cache alive. For batch processing, fire requests in a tight loop rather than spreading them out over an hour.
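Whether the cache is still warm is just a clock comparison, and with Anthropic every hit resets it. A sketch with the 5-minute (300-second) default:

```python
def cache_warm(last_hit_s: float, now_s: float, ttl_s: float = 300.0) -> bool:
    """True if a request now would still hit the cache. Anthropic refreshes
    the TTL on every hit, so last_hit_s is the most recent hit, not the
    original cache write."""
    return (now_s - last_hit_s) < ttl_s

print(cache_warm(last_hit_s=0.0, now_s=240.0))  # True: 4 min since last hit
print(cache_warm(last_hit_s=0.0, now_s=360.0))  # False: cache expired at 5 min
```

For a batch job, this means the maximum allowable gap between consecutive requests is the TTL, not the total wall-clock time of the run.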

5. Version your prompts carefully. One changed character in your system prompt invalidates the entire cache. Treat prompt updates like code deployments — deliberate, versioned, and tested.
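Since a single changed character rebuilds the whole cache, it helps to give each prompt revision a fingerprint you can log and diff in CI. A sketch using a content hash (the fingerprinting scheme is my own illustration, not any provider's API):

```python
import hashlib

def prompt_fingerprint(prompt: str) -> str:
    """Short, stable fingerprint of a prompt. Any edit changes it, which is
    also exactly when the provider-side cache will be invalidated."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]

v1 = prompt_fingerprint("You are an expert pharmacovigilance AI assistant...")
v2 = prompt_fingerprint("You are an expert pharmacovigilance AI assistant....")
print(v1 != v2)  # True: one extra character bumps the version
```

Logging the fingerprint alongside each deployment makes unexpected cache-miss spikes easy to trace back to a prompt change.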

When Caching Won't Help

Be honest about the limitations:

  • Unique prompts every time: If every request has a completely different system prompt, there's nothing to cache.
  • Very short prompts: Under the minimum token threshold, caching doesn't activate.
  • Infrequent requests: If you're sending 1 request per hour, the cache will expire between calls.
  • Rapidly evolving context: If your system prompt changes every few minutes (like a live dashboard summary), the cache churn costs more than it saves.

The Bottom Line

Prompt caching isn't a nice-to-have optimization anymore. It's table stakes for anyone running LLMs at production scale. The providers have made it almost embarrassingly easy:

  • OpenAI: Do literally nothing. It's automatic.
  • Anthropic: Mark one content block with cache_control and enjoy 90% cheaper reads with near-100% hit reliability.

If you're currently sending the same system prompt on every request without caching — or worse, if you're building agentic systems that make dozens of API calls per task, each carrying the same tool definitions and context — you're leaving real money and real performance on the table.

Stop re-cooking the Thanksgiving turkey. Mise en place your prompts. Your API bill (and your users' patience) will thank you.