Updated June 2026. Caching cuts LLM costs 40–90%. The working syntax for Anthropic, OpenAI and Google — plus the minimums and gotchas that quietly waste the discount.
Prompt caching is the single biggest cost lever in LLM applications, and almost every team we audit is leaving 30–60% of the savings on the table. The idea is simple: if you send the same static prefix (a system prompt, a document, a tool schema) on many calls, the provider stores it and bills repeat reads at roughly 10% of the input rate. The execution is where teams trip. Here's the working syntax for each provider, the minimums that actually save money, and the gotchas. To see the effect on your own prompt, use the calculator.
Cached input is billed at ~10% of the standard input rate. For a chat app with a 2,000-token system prompt sent on every turn, caching that prefix cuts input cost by 80–90% when the prefix makes up most of the input tokens, which for most chat products is a 40–60% cut in the total bill. The longer and more heavily reused your prefix, the bigger the win.
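To make the arithmetic concrete, here is a rough sketch of how the input-side saving depends on how much of each request the cached prefix covers. The 10% cached rate is the figure quoted above. The function name and the 200-token user turn are illustrative assumptions, and the sketch assumes every call is a cache hit with no extra charge for writing the cache.

```python
def input_cost_reduction(prefix_tokens: int, other_tokens: int,
                         cached_rate: float = 0.10) -> float:
    """Fraction of input cost saved when the prefix is read from cache.

    Assumes every call is a cache hit and ignores any cache-write charge.
    Costs are measured in units of "full-price input tokens".
    """
    uncached_cost = prefix_tokens + other_tokens
    cached_cost = prefix_tokens * cached_rate + other_tokens
    return 1 - cached_cost / uncached_cost


# A 2,000-token system prompt plus a 200-token user turn (hypothetical sizes).
saving = input_cost_reduction(2000, 200)
print(f"{saving:.0%}")  # prints 82%
```

As the uncached part of the request grows (long chat history, large user messages), the saving shrinks, which is why the headline figure only holds when the static prefix dominates the input.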
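Anthropic uses explicit cache breakpoints. The system prompt goes in the top-level system field (the messages array only takes user and assistant roles), and you mark the end of the cacheable prefix with cache_control:
"system": [
{ "type": "text",
"text": "<your long static system prompt>",
"cache_control": { "type": "ephemeral" } }
],
"messages": [
{ "role": "user", "content": "<the variable part of the request>" }
]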
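For a fuller picture, the sketch below assembles a complete request body in Python, assuming the static prompt is sent in the top-level system field of the Messages API. The helper name build_cached_request, the model placeholder, and the max_tokens value are illustrative assumptions. The code only builds and prints the JSON and does not call the API.

```python
import json


def build_cached_request(system_prompt: str, user_message: str,
                         model: str = "<claude-model-id>") -> dict:
    """Build a request body with a cache breakpoint on the system prompt."""
    return {
        "model": model,       # placeholder: substitute a real model ID
        "max_tokens": 512,    # arbitrary output cap chosen for this example
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Marks the end of the cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Variable content comes after the breakpoint so the prefix stays identical.
        "messages": [{"role": "user", "content": user_message}],
    }


body = build_cached_request("<your long static system prompt>", "What changed in v2?")
print(json.dumps(body, indent=2))
```

Keeping the builder in one place makes it harder for a stray timestamp or request ID to slip into the cached block.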
See live rates for Claude models.
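OpenAI caches prompt prefixes automatically, with no special parameter, once the prompt is over ~1,024 tokens. Your only job is to keep the static content at the start of the prompt and the variable content at the end.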
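Since there is nothing to switch on, the practical work is ordering the prompt and checking that hits actually happen. The sketch below shows both, assuming the Chat Completions response shape, where the usage block reports cached tokens under prompt_tokens_details.cached_tokens. The helper names and the sample numbers are hypothetical.

```python
def build_messages(static_prompt: str, user_message: str) -> list[dict]:
    """Static prefix first, variable content last, so repeated calls share a prefix."""
    return [
        {"role": "system", "content": static_prompt},
        {"role": "user", "content": user_message},
    ]


def cached_fraction(usage: dict) -> float:
    """Share of prompt tokens served from cache, read from a response's usage block."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt_tokens if prompt_tokens else 0.0


# Hypothetical usage block, shaped like the one in a Chat Completions response.
sample_usage = {"prompt_tokens": 2200, "prompt_tokens_details": {"cached_tokens": 1920}}
print(round(cached_fraction(sample_usage), 2))  # prints 0.87
```

Logging this fraction per request is a cheap way to notice when a prompt change has silently broken the shared prefix.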
Google requires an explicit context-caching API call: you create a cached content object, get a handle, and reference it on subsequent requests.
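The two-step flow can be sketched as a pair of request bodies: one that creates the cached content object, and one that references the returned handle. The field names below (model, contents, ttl, cachedContent) follow our understanding of the Gemini REST API and should be checked against the current docs. The helper names, the model placeholder, and the handle value are hypothetical, and no network call is made.

```python
def build_cache_create_body(model: str, document_text: str,
                            ttl_seconds: int = 300) -> dict:
    """Step 1: body for creating the cached content object."""
    return {
        "model": model,
        "contents": [{"role": "user", "parts": [{"text": document_text}]}],
        "ttl": f"{ttl_seconds}s",  # how long the provider keeps the cache
    }


def build_generate_body(cache_handle: str, question: str) -> dict:
    """Step 2: body for a later request that points at the cache by its handle."""
    return {
        "cachedContent": cache_handle,
        "contents": [{"role": "user", "parts": [{"text": question}]}],
    }


create_body = build_cache_create_body("models/<gemini-model-id>", "<long static document>")
# The handle comes back from the create call; this value is a placeholder.
generate_body = build_generate_body("cachedContents/<id>", "Summarize section 3.")
print(create_body["ttl"], generate_body["cachedContent"])
```

Because the cache is a separate object with its own lifetime, the TTL is something you choose up front rather than something that resets on each read.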
Caching combines with the rest of the cost playbook: cap output (max_tokens), batch deferrable jobs (50% off), and route easy work to a cheaper model. Caching + batching + routing together routinely cut a production bill by 70%+.
Cached input is billed at ~10% of the standard rate — typically 40–60% off the total bill for chat apps with a reused system prompt.
Yes. OpenAI auto-caches prefixes over ~1,024 tokens, Anthropic uses explicit cache_control breakpoints, and Google uses an explicit context-caching API.
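Around 1,024 tokens for OpenAI. Anthropic's minimum starts at about 1,024 tokens and is higher for some models, so check the minimum for the model you use. Below the minimum, nothing is cached.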
Usually variable content in the prefix (timestamps, IDs), TTL expiry between calls, or a prefix that's too short. Keep the static part byte-identical and put variable content last.
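A quick way to find the culprit is to compare two consecutive prompts and see where they stop being byte-identical. The helper below is a minimal sketch of that check. The function name and the sample prompts are made up for illustration.

```python
def first_divergence(prompt_a: str, prompt_b: str) -> int:
    """Return the byte offset at which two prompts stop being identical."""
    a, b = prompt_a.encode("utf-8"), prompt_b.encode("utf-8")
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            return i
    return limit


static = "You are a support assistant for Acme.\n"

# Timestamp first: the prompts diverge almost immediately, so nothing is reusable.
bad_1 = "Time: 2026-06-01T10:00:00Z\n" + static + "User: hi"
bad_2 = "Time: 2026-06-01T10:00:05Z\n" + static + "User: hello"

# Timestamp last: the whole static part is shared between calls.
good_1 = static + "Time: 2026-06-01T10:00:00Z\nUser: hi"
good_2 = static + "Time: 2026-06-01T10:00:05Z\nUser: hello"

print(first_divergence(bad_1, bad_2), first_divergence(good_1, good_2))
```

If the offset is smaller than the length of what you believe is the static part, something variable has leaked into the prefix.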
Provider behavior current as of June 2026. Check each provider's docs for exact rates and TTLs.