Prompt caching cookbook: every provider's syntax, every gotcha.

Updated June 2026. Caching cuts LLM costs 40–90%. The working syntax for Anthropic, OpenAI and Google — plus the minimums and gotchas that quietly waste the discount.

Prompt caching is the single biggest cost lever in LLM applications, and almost every team we audit is leaving 30–60% of the savings on the table. The idea is simple: if you send the same static prefix (a system prompt, a document, a tool schema) on many calls, the provider stores it and bills repeat reads at roughly 10% of the input rate. The execution is where teams trip. Here's the working syntax for each provider, the minimums that actually save money, and the gotchas. To see the effect on your own prompt, use the calculator.

How much it saves

Cached input is billed at ~10% of the standard input rate. For a chat app with a 2,000-token system prompt hit on every turn, caching that prefix cuts input cost by 80–90% — which for most chat products is a 40–60% cut in the total bill. The longer and more-reused your prefix, the bigger the win.

Anthropic (Claude)

Anthropic uses explicit cache breakpoints. You mark the end of the cacheable prefix with cache_control:

"messages": [
  { "role": "system",
    "content": [
      { "type": "text", "text": "<your long static system prompt>",
        "cache_control": { "type": "ephemeral" } }
    ]
  }
]

See live rates for Claude models.

OpenAI (GPT)

OpenAI auto-caches prefixes — no special parameter:

Google (Gemini)

Google requires an explicit context-caching API call: you create a cached content object, get a handle, and reference it on subsequent requests.

The gotchas that waste the discount

  1. Variable content in the prefix. A timestamp, a session ID, or a reordered tool list at the top breaks the cache every call. Keep the prefix byte-identical.
  2. TTL expiry. On Anthropic, a 5-minute idle gap drops the cache; the next call pays full price and re-caches. High-traffic endpoints benefit most.
  3. Too-short prefixes. Below ~1,024 tokens nothing caches. Consolidate small system messages into one substantial prefix.
  4. Cache misses on cold start. The first call always pays full price + a small write cost. Caching pays off on reuse, not one-offs.
  5. Putting the user message first. Always: static system prompt → documents → variable user turn, in that order.

Stack it with the other levers

Caching combines with the rest of the cost playbook: cap output (max_tokens), batch deferrable jobs (50% off), and route easy work to a cheaper model. Caching + batching + routing together routinely cut a production bill by 70%+.

FAQ

How much does prompt caching save?

Cached input is billed at ~10% of the standard rate — typically 40–60% off the total bill for chat apps with a reused system prompt.

Do OpenAI, Anthropic and Google all support caching?

Yes. OpenAI auto-caches prefixes over ~1,024 tokens, Anthropic uses explicit cache_control breakpoints, and Google uses an explicit context-caching API.

What's the minimum prefix length that caches?

Around 1,024 tokens for OpenAI and Anthropic. Below that, nothing is cached.

Why isn't my cache hitting?

Usually variable content in the prefix (timestamps, IDs), TTL expiry between calls, or a prefix that's too short. Keep the static part byte-identical and put variable content last.

Provider behavior current as of June 2026. Check each provider's docs for exact rates and TTLs.

Advertisement