Prompt caching
Advanced AI Workflows & the Claude API
Every time you call the Claude API, Anthropic's servers process your entire input from scratch — system prompt, conversation history, documents you've loaded, all of it. If that input is long and doesn't change between calls, you're paying to re-encode the same tokens repeatedly. Prompt caching lets the API store a prefix of your input server-side and reuse it across calls, reducing both latency and cost significantly.
How Caching Works
You mark a position in your input with a cache breakpoint. Everything up to that point is eligible to be stored in a cache on Anthropic's servers. If the next request starts with the same prefix up to that breakpoint, the cached version is used instead of re-processing those tokens.
The cache has a 5-minute TTL (time to live). If you don't make another request within 5 minutes, the cached prefix expires and will be re-processed on the next call. For active sessions this isn't usually a problem — users are typically sending messages more frequently than that.
Cost and Latency Impact
| Token type | Billing rate | Latency |
|---|---|---|
| Standard input tokens | 100% (baseline) | Full processing time |
| Cache write (first call) | 125% — slightly more than standard | Slightly slower — writing to cache |
| Cache read (subsequent calls) | 10% — 90% cheaper than standard | Up to 85% lower latency |
The break-even point is low: if you send the same prefix twice, you've already saved money on the second call despite the write premium on the first. For apps where users have long conversations against a fixed system prompt, the savings compound quickly.
Adding a Cache Breakpoint
You add caching by including "cache_control": {"type": "ephemeral"} in the content block you want to cache up to. The most common pattern is caching the system prompt:
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
# System prompt as a list of content blocks so we can add cache_control
system=[
{
"type": "text",
"text": LONG_SYSTEM_PROMPT,
# Mark this block as the cache boundary
"cache_control": {"type": "ephemeral"}
}
],
messages=[{"role": "user", "content": "Summarise the key points."}]
)
Caching Conversation History
For chat applications, you can also cache the conversation history up to the previous turn — so each new message only processes the new user input, not the entire accumulated history. Place a breakpoint at the last assistant message:
"""Add a cache breakpoint at the end of the existing history."""
messages = []
for i, msg in enumerate(history):
if i == len(history) - 1: # last historical message
# Wrap content as a block so we can add cache_control
messages.append({
"role": msg["role"],
"content": [{
"type": "text",
"text": msg["content"],
"cache_control": {"type": "ephemeral"}
}]
})
else:
messages.append(msg) # earlier turns — no marker needed
# New user message goes after the cached history
messages.append({"role": "user", "content": new_message})
return messages
Reading Cache Stats from the Response
Every response includes a usage object that shows how many tokens came from cache vs were freshly processed. Use this to verify caching is working and to track your savings:
"input_tokens": 28← only the new user message (not cached)
"cache_creation_input_tokens": 2048← tokens written to cache (call 1 only)
"cache_read_input_tokens": 2048← tokens read from cache (calls 2+) — 90% cheaper
"output_tokens": 312
}
On the first call, cache_creation_input_tokens will be non-zero and cache_read_input_tokens will be zero — the prefix is being written. On subsequent calls within the 5-minute window, those numbers reverse.
Caching a Large Document
One of the most valuable caching patterns is loading a large document (a code file, a policy document, a product catalogue) once and asking many questions about it. The document is cached after the first call; subsequent questions only bill for the short question text:
doc = f.read()
def ask(question: str) -> str:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=512,
system=[{
"type": "text",
"text": f"Answer questions about the following document:\n\n{doc}",
"cache_control": {"type": "ephemeral"} # cache the whole doc
}],
messages=[{"role": "user", "content": question}]
)
return response.content[0].text
# First call writes the document to cache (~slow + write cost)
ask("What is the main topic?")
# Subsequent calls read from cache — fast and cheap
ask("List the key recommendations.")
ask("What are the risks mentioned?")
When Caching Helps Most
When Caching Doesn't Help
- Short system prompts — below the 1,024-token minimum, caching is silently skipped.
- Highly dynamic prefixes — if you inject a different user name, timestamp, or per-request data into the system prompt, the prefix changes every time and won't hit the cache. Move dynamic content after the cache breakpoint.
- Infrequent requests — if calls are spaced more than 5 minutes apart, the cache expires and you pay full price each time. The TTL resets on each cache hit, so active sessions benefit; low-traffic endpoints may not.
- One-off batch jobs — if you process each item independently with no shared prefix, there's nothing to cache.