Prompt Caching Cookbook: Every Provider's Syntax & Gotchas (2026)

Prompt caching is the single biggest cost lever in LLM applications, and almost every team we audit is leaving 30–60% of the savings on the table. The idea is simple: if you send the same static prefix (a system prompt, a document, a tool schema) on many calls, the provider stores it and bills repeat reads at roughly 10% of the input rate. The execution is where teams trip. Here's the working syntax for each provider, the minimums that actually save money, and the gotchas. To see the effect on your own prompt, use the calculator.

How much it saves

Cached input is billed at ~10% of the standard input rate. For a chat app with a 2,000-token system prompt hit on every turn, caching that prefix cuts input cost by 80–90% — which for most chat products is a 40–60% cut in the total bill. The longer and more-reused your prefix, the bigger the win.

Anthropic (Claude)

Anthropic uses explicit cache breakpoints. You mark the end of the cacheable prefix with cache_control:

"messages": [
  { "role": "system",
    "content": [
      { "type": "text", "text": "<your long static system prompt>",
        "cache_control": { "type": "ephemeral" } }
    ]
  }
]

Everything before the breakpoint is cached.
Minimum cacheable prefix is ~1,024 tokens (smaller models a bit higher).
Default cache lifetime (TTL) is ~5 minutes, refreshed on each hit; a longer TTL tier exists.
Put the most static content first (system prompt, then documents, then the variable user turn last) so the cacheable prefix is as long as possible.

See live rates for Claude models.

OpenAI (GPT)

OpenAI auto-caches prefixes — no special parameter:

Prefixes over 1,024 tokens are cached automatically; cache hits bill at a reduced rate.
Caching keys on the exact prefix, so keep the static part byte-identical and place variable content at the end.
No manual breakpoints, but also less control — structure your prompt static-first to maximize hits.

Google (Gemini)

Google requires an explicit context-caching API call: you create a cached content object, get a handle, and reference it on subsequent requests.

Best for large, reused contexts (long documents, big system instructions).
You manage the cache lifecycle (create, TTL, delete) yourself.
There's a small storage cost for the cached content — worth it only when reuse is high.

The gotchas that waste the discount

Variable content in the prefix. A timestamp, a session ID, or a reordered tool list at the top breaks the cache every call. Keep the prefix byte-identical.
TTL expiry. On Anthropic, a 5-minute idle gap drops the cache; the next call pays full price and re-caches. High-traffic endpoints benefit most.
Too-short prefixes. Below ~1,024 tokens nothing caches. Consolidate small system messages into one substantial prefix.
Cache misses on cold start. The first call always pays full price + a small write cost. Caching pays off on reuse, not one-offs.
Putting the user message first. Always: static system prompt → documents → variable user turn, in that order.

Stack it with the other levers

Caching combines with the rest of the cost playbook: cap output (max_tokens), batch deferrable jobs (50% off), and route easy work to a cheaper model. Caching + batching + routing together routinely cut a production bill by 70%+.

FAQ

How much does prompt caching save?

Cached input is billed at ~10% of the standard rate — typically 40–60% off the total bill for chat apps with a reused system prompt.

Do OpenAI, Anthropic and Google all support caching?

Yes. OpenAI auto-caches prefixes over ~1,024 tokens, Anthropic uses explicit cache_control breakpoints, and Google uses an explicit context-caching API.

What's the minimum prefix length that caches?

Around 1,024 tokens for OpenAI and Anthropic. Below that, nothing is cached.

Why isn't my cache hitting?

Usually variable content in the prefix (timestamps, IDs), TTL expiry between calls, or a prefix that's too short. Keep the static part byte-identical and put variable content last.

Provider behavior current as of June 2026. Check each provider's docs for exact rates and TTLs.

Prompt caching cookbook: every provider's syntax, every gotcha.