Prompt Caching for LLM Apps: What It Is and When It Pays Off
How providers cache your prompt prefix, the real discounts, and the prompt structure that decides whether you save anything
Prompt caching is one of the few LLM optimizations that is close to free: you can cut the cost of your input tokens by up to 90% and shave time off every request, often with a single line of code. The catch is that it only works if your prompt is shaped so the cache actually hits, and a lot of apps accidentally bust the cache on every call and pay extra for nothing. This is what caching does, what the real numbers are, and how to structure prompts so it pays off.
What it actually caches#
When you send a prompt, the model has to process the whole thing before it generates a token. If a big chunk of that prompt is identical across many requests, a system prompt, a long instruction block, a few-shot set, or a fixed document, reprocessing it every time is wasted work. Prompt caching lets the provider store the processed prefix and reuse it, so you pay full price once and a steep discount on every reuse.
The key word is prefix. Caching matches from the start of the prompt up to the point where requests start to differ. Everything before the first difference can be cached. Everything from the first difference onward cannot. That one fact decides whether caching helps you or does nothing, and it is the thing most people get wrong.
The real numbers#
The discounts and the way you turn it on differ by provider, and the difference matters.
Anthropic is explicit and aggressive. You mark what to cache with cache_control, and reads come back at 0.1x the standard input rate, a 90% discount. Writes cost a premium: 1.25x the standard rate for a 5-minute time-to-live, or 2x for a 1-hour TTL. So the first request that writes the cache costs a bit more, and every read within the TTL is 90% off.
OpenAI does it automatically. For prompts over 1,024 tokens on its current models, caching kicks in with no code changes and roughly a 50% discount on the cached portion. You do not opt in; you just check your usage dashboard to confirm it is happening.
So the shape of the decision is: Anthropic saves more (90% vs 50%) but you have to mark the prefix and you pay a small write premium, while OpenAI saves less but costs you zero effort. For a high-volume app with a large stable prefix, Anthropic's 90% read discount is the bigger prize, which is why the "Anthropic wins on cache pricing" line gets repeated everywhere.
A short TTL (5 minutes on Anthropic) means caching pays off when the same prefix is reused soon. A burst of requests sharing a system prompt benefits hugely. One request every ten minutes will keep paying the write premium and rarely hit a live cache.
Structure your prompt so the cache hits#
Here is the rule that makes or breaks it: put the stable content first and the volatile content last. The cache matches an exact prefix, so anything that changes per request has to come after everything you want cached.
# GOOD: stable prefix first, volatile content last.
client.messages.create(
model="claude-sonnet-4-6",
system=[
{
"type": "text",
"text": LONG_SYSTEM_PROMPT + FEW_SHOT_EXAMPLES, # identical every call
"cache_control": {"type": "ephemeral"},
}
],
messages=[{"role": "user", "content": user_question}], # the only thing that varies
)Now compare the version that quietly breaks:
# BAD: a timestamp at the top changes the prefix on every request,
# so nothing after it can ever be cached.
system_prompt = f"Current time: {datetime.now()}\n\n" + LONG_SYSTEM_PROMPTThat single dynamic line at the top means the prefix is never identical twice, so the cache misses every time and you pay the write premium for no benefit. Move the timestamp into the user message at the end, and the big static block above it caches cleanly. The fix is almost always "reorder the prompt," not "add more caching."
Where it pays off most#
Prompt caching shines in a few specific patterns:
- Long system prompts and few-shot blocks. If every request carries 2,000 tokens of instructions and examples, caching that prefix is a near-permanent discount on the bulk of your input.
- RAG with a stable context. When the same retrieved documents serve a burst of follow-up questions, cache the documents once and pay the discount on each question. This pairs naturally with how RAG pipelines retrieve and reuse context, which I covered in what a vector database is and how RAG uses it.
- Multi-turn agents. The conversation history is a growing prefix that is identical up to the latest turn. Caching it means each new turn only pays full price for the new message, not the whole transcript again. That is a big deal for the long agent runs I described in LangGraph state, checkpointing, and persistence.
When it does not help#
Be honest about the cases where caching is the wrong lever. If every request is short and unique, there is no shared prefix to cache. If your stable content is small (a one-line system prompt), the savings are trivial. And on Anthropic, if requests are spread far apart in time, the prefix expires between calls and you keep paying write premiums without ever banking a read. Caching rewards volume and repetition. Low-traffic, all-unique workloads get little from it.
Measure the hit rate#
Do not assume it is working. Both providers report cache reads and writes in the response usage, so log them and watch the ratio. A healthy setup shows mostly cache reads after the first request in a burst. If you see mostly writes, your prefix is changing when you think it is stable, and that BAD example above is usually the culprit.
resp = client.messages.create(...)
print(resp.usage.cache_creation_input_tokens) # writes
print(resp.usage.cache_read_input_tokens) # reads (the cheap ones)The bottom line#
Prompt caching is the rare optimization with almost no downside and a large upside, but only if you earn it. Put the stable stuff first, keep the volatile stuff last, make sure your traffic actually reuses the prefix within the TTL, and watch the read-to-write ratio to confirm. Get that right and you cut the cost of your biggest, most repetitive input by most of its price. It is one of the highest-return moves in the broader playbook I laid out in cutting LLM cost and latency, and usually the first one I reach for.

Folarin Akinloye is an AI Engineer based in London, UK. He builds production-ready agentic AI systems, multi-agent architectures, and sophisticated RAG implementations, and writes about the engineering decisions behind them.