Prompt caching

Advanced AI Workflows & the Claude API

Chapter 6 · Prompt Caching

Every time you call the Claude API, Anthropic's servers process your entire input from scratch — system prompt, conversation history, documents you've loaded, all of it. If that input is long and doesn't change between calls, you're paying to re-encode the same tokens repeatedly. Prompt caching lets the API store a prefix of your input server-side and reuse it across calls, reducing both latency and cost significantly.

How Caching Works

You mark a position in your input with a cache breakpoint. Everything up to that point is eligible to be stored in a cache on Anthropic's servers. If the next request starts with the same prefix up to that breakpoint, the cached version is used instead of re-processing those tokens.

Request structure — with a cache breakpoint after the system prompt

Call 1

system prompt (2,000 tokens)

cache_control: ephemeral

user message

Call 2

system prompt (CACHED)

cache_control: ephemeral

user message

Call 3

system prompt (CACHED)

cache_control: ephemeral

turn 1+2 (CACHED)

cache_control: ephemeral

new message

CACHED

tokens read from cache — not re-processed, billed at cache read rate

FRESH

tokens processed and written to cache — billed at cache write rate (slightly higher)

MARKER

cache_control breakpoint — marks where the cached prefix ends

The cache has a 5-minute TTL (time to live). If you don't make another request within 5 minutes, the cached prefix expires and will be re-processed on the next call. For active sessions this isn't usually a problem — users are typically sending messages more frequently than that.

Cost and Latency Impact

Token type	Billing rate	Latency
Standard input tokens	100% (baseline)	Full processing time
Cache write (first call)	125% — slightly more than standard	Slightly slower — writing to cache
Cache read (subsequent calls)	10% — 90% cheaper than standard	Up to 85% lower latency

The break-even point is low: if you send the same prefix twice, you've already saved money on the second call despite the write premium on the first. For apps where users have long conversations against a fixed system prompt, the savings compound quickly.

Adding a Cache Breakpoint

You add caching by including "cache_control": {"type": "ephemeral"} in the content block you want to cache up to. The most common pattern is caching the system prompt:

cache_system_prompt.py — caching a fixed system promptpython

      import anthropic

      client = anthropic.Anthropic()

      response = client.messages.create(

          model="claude-sonnet-4-6",

          max_tokens=1024,

          # System prompt as a list of content blocks so we can add cache_control

          system=[

              {

                  "type": "text",

                  "text": LONG_SYSTEM_PROMPT,

                  # Mark this block as the cache boundary

                  "cache_control": {"type": "ephemeral"}

              }

          ],

          messages=[{"role": "user", "content": "Summarise the key points."}]

      )

Minimum cacheable length

The cached prefix must be at least 1,024 tokens for Claude Sonnet and Haiku models, and 2,048 tokens for Opus. If your system prompt is shorter than this, the cache write is silently skipped and you're billed at the standard rate. Caching is most impactful with long, stable prefixes — detailed instructions, injected documents, reference data.

Caching Conversation History

For chat applications, you can also cache the conversation history up to the previous turn — so each new message only processes the new user input, not the entire accumulated history. Place a breakpoint at the last assistant message:

cache_conversation.py — caching history + system promptpython

      def build_messages_with_cache(history: list, new_message: str) -> list:

          """Add a cache breakpoint at the end of the existing history."""

          messages = []

          for i, msg in enumerate(history):

              if i == len(history) - 1:  # last historical message

                  # Wrap content as a block so we can add cache_control

                  messages.append({

                      "role": msg["role"],

                      "content": [{

                          "type": "text",

                          "text": msg["content"],

                          "cache_control": {"type": "ephemeral"}

                      }]

                  })

              else:

                  messages.append(msg)  # earlier turns — no marker needed

          # New user message goes after the cached history

          messages.append({"role": "user", "content": new_message})

          return messages

Reading Cache Stats from the Response

Every response includes a usage object that shows how many tokens came from cache vs were freshly processed. Use this to verify caching is working and to track your savings:

response.usage = {
  "input_tokens": 28← only the new user message (not cached)
  "cache_creation_input_tokens": 2048← tokens written to cache (call 1 only)
  "cache_read_input_tokens": 2048← tokens read from cache (calls 2+) — 90% cheaper
  "output_tokens": 312
}

On the first call, cache_creation_input_tokens will be non-zero and cache_read_input_tokens will be zero — the prefix is being written. On subsequent calls within the 5-minute window, those numbers reverse.

Caching a Large Document

One of the most valuable caching patterns is loading a large document (a code file, a policy document, a product catalogue) once and asking many questions about it. The document is cached after the first call; subsequent questions only bill for the short question text:

cache_document.py — Q&A over a cached documentpython

      with open("large_document.txt") as f:

          doc = f.read()

      def ask(question: str) -> str:

          response = client.messages.create(

              model="claude-sonnet-4-6",

              max_tokens=512,

              system=[{

                  "type": "text",

                  "text": f"Answer questions about the following document:\n\n{doc}",

                  "cache_control": {"type": "ephemeral"}  # cache the whole doc

              }],

              messages=[{"role": "user", "content": question}]

          )

          return response.content[0].text

      # First call writes the document to cache (~slow + write cost)

      ask("What is the main topic?")

      # Subsequent calls read from cache — fast and cheap

      ask("List the key recommendations.")

      ask("What are the risks mentioned?")

When Caching Helps Most

Long system prompts

Detailed persona definitions, output format instructions, brand voice guides. If your system prompt is over 1,000 tokens and doesn't change between users, cache it.

Document Q&A

Load a codebase, manual, or dataset once and ask many questions. The document is cached after the first query; all subsequent questions only pay for the question itself.

Long conversations

As history grows, caching earlier turns means each new message only bills for the fresh content. Essential for applications where users have extended sessions.

Shared context across users

If all users of your app share a common context (e.g. a product catalogue), that prefix can be cached once and reused across all requests — not just per-session.

When Caching Doesn't Help

Short system prompts — below the 1,024-token minimum, caching is silently skipped.
Highly dynamic prefixes — if you inject a different user name, timestamp, or per-request data into the system prompt, the prefix changes every time and won't hit the cache. Move dynamic content after the cache breakpoint.
Infrequent requests — if calls are spaced more than 5 minutes apart, the cache expires and you pay full price each time. The TTL resets on each cache hit, so active sessions benefit; low-traffic endpoints may not.
One-off batch jobs — if you process each item independently with no shared prefix, there's nothing to cache.

Design tip — separate static from dynamic

Structure your system prompt so everything stable (persona, rules, documents) comes first, followed by your cache breakpoint. Dynamic content — the specific user's name, the current date, per-request injections — comes after the breakpoint in the user message. This maximises what can be cached without sacrificing flexibility.

Next — Chapter 7: Multi-Agent Patterns
A single Claude call has limits — context window, response length, the kinds of reasoning it can do in one pass. Multi-agent patterns chain Claude instances together: an orchestrator breaks a task into subtasks, subagents execute them in parallel or sequence, and results are combined. Chapter 7 covers the core patterns, when to use them, and how to implement a simple orchestrator–worker pipeline.