Advanced AI Workflows & the Claude API

Chapter 6  ·  Prompt Caching

Every time you call the Claude API, Anthropic's servers process your entire input from scratch: system prompt, conversation history, any documents you've loaded, all of it. If that input is long and doesn't change between calls, you're paying to re-process the same tokens repeatedly. Prompt caching lets the API store a prefix of your input server-side and reuse it across calls. Tokens read from the cache are billed at about 10% of the standard input rate, and latency can drop by up to 85%.

How Caching Works

You mark a position in your input with a cache breakpoint. Everything up to that point is eligible to be stored in a cache on Anthropic's servers. If the next request starts with the same prefix up to that breakpoint, the cached version is used instead of re-processing those tokens.

Figure: request structure with a cache breakpoint after the system prompt

Call 1: system prompt (2,000 tokens) | cache_control: ephemeral | user message
Call 2: system prompt (CACHED) | cache_control: ephemeral | user message
Call 3: system prompt (CACHED) | cache_control: ephemeral | turns 1 and 2 (CACHED) | cache_control: ephemeral | new message

Legend
CACHED: tokens read from cache; not re-processed, billed at the cache read rate
FRESH: tokens processed and written to cache; billed at the cache write rate (slightly higher)
MARKER: cache_control breakpoint; marks where the cached prefix ends

The cache has a 5-minute TTL (time to live). If you don't make another request within 5 minutes, the cached prefix expires and will be re-processed on the next call. For active sessions this isn't usually a problem — users are typically sending messages more frequently than that.
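
To make the prefix-matching and expiry rules concrete, here is a toy model in Python. It is not how Anthropic's servers are implemented: the class `ToyPrefixCache`, its `lookup` method, and the explicit `now` argument are all hypothetical. The sketch only mimics the behaviour described in this chapter, where an exact-prefix match within the TTL is a read, anything else is a write, and each hit restarts the TTL window.

```python
import hashlib

TTL_SECONDS = 5 * 60  # the 5-minute TTL described above


class ToyPrefixCache:
    """Toy illustration only: maps a hash of the prefix to its expiry time."""

    def __init__(self, ttl=TTL_SECONDS):
        self.ttl = ttl
        self._expires_at = {}

    def lookup(self, prefix, now):
        """Return "read" on a live exact-prefix match, otherwise "write"."""
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        expiry = self._expires_at.get(key)
        hit = expiry is not None and now < expiry
        # Both a write and a hit (re)start the TTL window.
        self._expires_at[key] = now + self.ttl
        return "read" if hit else "write"


cache = ToyPrefixCache()
prefix = "You are a helpful assistant. " * 100

results = [
    cache.lookup(prefix, now=0),          # first call: nothing cached yet
    cache.lookup(prefix, now=120),        # 2 minutes later: still cached
    cache.lookup(prefix + "!", now=130),  # prefix changed: no match
    cache.lookup(prefix, now=120 + 301),  # over 5 minutes since the last hit
]
print(results)
```

The third lookup misses even though only one character was appended, which is why anything that varies per request needs to sit after the breakpoint rather than inside the cached prefix.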

Cost and Latency Impact

Token type                       Billing rate                          Latency
Standard input tokens            100% (baseline)                       Full processing time
Cache write (first call)         125% (slightly more than standard)    Slightly slower (writing to cache)
Cache read (subsequent calls)    10% (90% cheaper than standard)       Up to 85% lower latency

The break-even point is low: if you send the same prefix twice, you've already saved money on the second call despite the write premium on the first. For apps where users have long conversations against a fixed system prompt, the savings compound quickly.
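
The break-even arithmetic can be checked with a few lines of Python. This is a sketch that uses the multipliers from the table above (125% for a cache write, 10% for a cache read). The function names are illustrative, costs are expressed in "standard input token" units rather than currency, and it assumes every call after the first hits the cache.

```python
WRITE_MULTIPLIER = 1.25  # cache write rate from the table above
READ_MULTIPLIER = 0.10   # cache read rate from the table above


def cost_without_cache(prefix_tokens: int, calls: int) -> float:
    """Every call pays the standard rate for the whole prefix."""
    return float(prefix_tokens * calls)


def cost_with_cache(prefix_tokens: int, calls: int) -> float:
    """First call writes the prefix; later calls read it (assumes no expiry)."""
    if calls == 0:
        return 0.0
    return prefix_tokens * (WRITE_MULTIPLIER + READ_MULTIPLIER * (calls - 1))


for calls in (1, 2, 10):
    plain = cost_without_cache(2000, calls)
    cached = cost_with_cache(2000, calls)
    print(f"{calls:>2} calls: {plain:>7.0f} uncached vs {cached:>7.0f} cached")
```

For a 2,000-token prefix, a single call costs more with caching (2,500 units against 2,000), but two calls already cost less (2,700 against 4,000), and the gap widens with every further cache read.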

Adding a Cache Breakpoint

You add caching by including "cache_control": {"type": "ephemeral"} in the content block you want to cache up to. The most common pattern is caching the system prompt:

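cache_system_prompt.py: caching a fixed system prompt (Python)
import anthropic

# Placeholder: substitute your full instructions here. The text must meet the
# minimum cacheable length for caching to take effect.
LONG_SYSTEM_PROMPT = "..."

client = anthropic.Anthropic()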

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,

    # System prompt as a list of content blocks so we can add cache_control
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Mark this block as the cache boundary
            "cache_control": {"type": "ephemeral"}
        }
    ],

    messages=[{"role": "user", "content": "Summarise the key points."}]
)

Minimum cacheable length
The minimum cacheable prefix length depends on the model: 1,024 tokens for many Sonnet and Opus models, and 2,048 tokens or more for others, including several Haiku models. Check the current documentation for the model you use. If your prefix is shorter than the minimum, the cache write is silently skipped and you're billed at the standard rate. Caching pays off most with long, stable prefixes: detailed instructions, injected documents, reference data.
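
Rather than relying on the silent skip, you can decide up front whether a prompt is long enough to be worth marking. The sketch below is hypothetical: `system_blocks` and its `min_cacheable_tokens` parameter are illustrative names, and the token count is supplied by the caller (for example from a token-counting call or from the `usage` of an earlier response). The threshold is a parameter because it varies by model.

```python
def system_blocks(prompt: str, token_count: int,
                  min_cacheable_tokens: int = 1024) -> list:
    """Build the `system` list, adding cache_control only when it can take effect."""
    block = {"type": "text", "text": prompt}
    if token_count >= min_cacheable_tokens:
        # Long enough: mark this block as the cache boundary.
        block["cache_control"] = {"type": "ephemeral"}
    return [block]


short = system_blocks("Be concise.", token_count=3)
long_ = system_blocks("(long instructions go here)", token_count=2000)
print("cache_control" in short[0], "cache_control" in long_[0])
```

The returned list can be passed directly as the `system` argument in the listing above.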

Caching Conversation History

For chat applications, you can also cache the conversation history up to the previous turn — so each new message only processes the new user input, not the entire accumulated history. Place a breakpoint at the last assistant message:

cache_conversation.py: caching history + system prompt (Python)
def build_messages_with_cache(history: list, new_message: str) -> list:
    """Return a new message list with a cache breakpoint on the last historical turn."""
    messages = list(history[:-1])  # earlier turns: no marker needed

    if history:
        last = history[-1]
        content = last["content"]
        if isinstance(content, str):
            # Wrap plain-string content as a block so we can add cache_control
            blocks = [{"type": "text", "text": content}]
        else:
            # Already a list of block dicts: copy so the caller's history isn't mutated
            blocks = [dict(block) for block in content]
        if blocks:
            # The marker goes on the final block of the last historical message
            blocks[-1]["cache_control"] = {"type": "ephemeral"}
        messages.append({"role": last["role"], "content": blocks})

    # New user message goes after the cached history
    messages.append({"role": "user", "content": new_message})
    return messages

Reading Cache Stats from the Response

Every response includes a usage object that shows how many tokens came from cache vs were freshly processed. Use this to verify caching is working and to track your savings:

response.usage = {
  "input_tokens": 28,                     # only the new user message (not cached)
  "cache_creation_input_tokens": 2048,    # tokens written to cache (call 1 only)
  "cache_read_input_tokens": 2048,        # tokens read from cache (calls 2+), 90% cheaper
  "output_tokens": 312
}

On the first call, cache_creation_input_tokens will be non-zero and cache_read_input_tokens will be zero — the prefix is being written. On subsequent calls within the 5-minute window, those numbers reverse.
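
A small helper makes these checks easy to log. This is a sketch, not part of the SDK: `cache_summary` is a hypothetical name, and the `SimpleNamespace` objects stand in for `response.usage` using the example numbers from above, so the block runs without an API call.

```python
from types import SimpleNamespace


def cache_summary(usage) -> dict:
    """Summarise how much of the input was served from cache.

    `usage` is any object exposing the three input-token fields shown above.
    """
    read = usage.cache_read_input_tokens or 0
    written = usage.cache_creation_input_tokens or 0
    fresh = usage.input_tokens or 0
    total = read + written + fresh
    return {
        "total_input_tokens": total,
        "cache_hit": read > 0,
        "cached_fraction": read / total if total else 0.0,
    }


# Stand-ins for response.usage on a first and a second call.
first = SimpleNamespace(input_tokens=28, cache_creation_input_tokens=2048,
                        cache_read_input_tokens=0, output_tokens=312)
second = SimpleNamespace(input_tokens=28, cache_creation_input_tokens=0,
                         cache_read_input_tokens=2048, output_tokens=312)

print(cache_summary(first))
print(cache_summary(second))
```

If `cache_hit` stays false on calls you expected to hit, the usual suspects are a prefix that changes between requests, a prefix below the minimum length, or a gap longer than the TTL.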

Caching a Large Document

One of the most valuable caching patterns is loading a large document (a code file, a policy document, a product catalogue) once and asking many questions about it. The document is cached after the first call; subsequent questions only bill for the short question text:

cache_document.py: Q&A over a cached document (Python)
import anthropic

client = anthropic.Anthropic()

with open("large_document.txt", encoding="utf-8") as f:
    doc = f.read()

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[{
            "type": "text",
            "text": f"Answer questions about the following document:\n\n{doc}",
            "cache_control": {"type": "ephemeral"} # cache the whole doc
        }],
        messages=[{"role": "user", "content": question}]
    )
    return response.content[0].text

# First call writes the document to cache (slower, billed at the cache write rate)
ask("What is the main topic?")

# Subsequent calls read from cache — fast and cheap
ask("List the key recommendations.")
ask("What are the risks mentioned?")

When Caching Helps Most

Long system prompts
Detailed persona definitions, output format instructions, brand voice guides. If your system prompt exceeds the minimum cacheable length (1,024 tokens or more, depending on the model) and doesn't change between users, cache it.
Document Q&A
Load a codebase, manual, or dataset once and ask many questions. The document is cached after the first query; all subsequent questions only pay for the question itself.
Long conversations
As history grows, caching earlier turns means each new message only bills for the fresh content. Essential for applications where users have extended sessions.
Shared context across users
If all users of your app share a common context (e.g. a product catalogue), that prefix can be cached once and reused across all requests — not just per-session.

When Caching Doesn't Help

  • Short system prompts — below the 1,024-token minimum, caching is silently skipped.
  • Highly dynamic prefixes — if you inject a different user name, timestamp, or per-request data into the system prompt, the prefix changes every time and won't hit the cache. Move dynamic content after the cache breakpoint.
  • Infrequent requests — if calls are spaced more than 5 minutes apart, the cache expires and you pay full price each time. The TTL resets on each cache hit, so active sessions benefit; low-traffic endpoints may not.
  • One-off batch jobs — if you process each item independently with no shared prefix, there's nothing to cache.

Design tip: separate static from dynamic
Structure your system prompt so everything stable (persona, rules, documents) comes first, followed by your cache breakpoint. Dynamic content — the specific user's name, the current date, per-request injections — comes after the breakpoint in the user message. This maximises what can be cached without sacrificing flexibility.
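
The tip above can be sketched as a request builder. Everything here is illustrative: `build_request` is a hypothetical helper, the prompt text and user names are made up, and the way the dynamic fields are formatted into the user message is just one possible layout. The model name is the one used in this chapter's other listings.

```python
from datetime import date


def build_request(static_prompt: str, user_name: str, question: str,
                  today: date) -> dict:
    """Return keyword arguments for client.messages.create (hypothetical helper)."""
    return {
        "model": "claude-sonnet-4-6",  # model name taken from the listings above
        "max_tokens": 1024,
        # Stable content only: identical for every user and every request.
        "system": [{
            "type": "text",
            "text": static_prompt,
            "cache_control": {"type": "ephemeral"},
        }],
        # Per-request details live after the breakpoint, in the user message.
        "messages": [{
            "role": "user",
            "content": f"User: {user_name}\nDate: {today.isoformat()}\n\n{question}",
        }],
    }


STATIC_PROMPT = "You are a support agent for Example Co."  # made-up prompt

req_a = build_request(STATIC_PROMPT, "Ana", "Where is my order?", date(2025, 1, 1))
req_b = build_request(STATIC_PROMPT, "Ben", "How do I get a refund?", date(2025, 1, 2))
print(req_a["system"] == req_b["system"])
```

Because the `system` blocks are identical across the two requests, both can share one cached prefix; each would be sent with `client.messages.create(**req_a)` or the equivalent for `req_b`.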

Next: Chapter 7, Multi-Agent Patterns
A single Claude call has limits — context window, response length, the kinds of reasoning it can do in one pass. Multi-agent patterns chain Claude instances together: an orchestrator breaks a task into subtasks, subagents execute them in parallel or sequence, and results are combined. Chapter 7 covers the core patterns, when to use them, and how to implement a simple orchestrator–worker pipeline.