Production considerations

Advanced AI Workflows & the Claude API

Chapter 10  ·  Production Considerations

A Claude integration that works in development can fail in production in ways that aren't obvious until real users hit them — rate limits under load, runaway costs from verbose prompts, safety gaps you didn't test for. This chapter covers the operational concerns that separate a reliable AI feature from a fragile prototype: handling API limits, managing cost, logging what matters, and a pre-launch checklist to work through before you go live.

Rate Limits — What They Are and How to Handle Them

Anthropic enforces three rate limits simultaneously. Hitting any one of them returns a 429 Too Many Requests error:

Requests per minute (RPM)
Maximum number of API calls in a 60-second window. Most relevant for apps that make many short parallel requests — batch jobs, multi-agent pipelines.
Input tokens per minute (ITPM)
Maximum input tokens processed per minute. Apps with long system prompts or large injected documents hit this before RPM — 10 calls with 50k tokens each saturates a 500k ITPM limit.
Output tokens per minute (OTPM)
Maximum output tokens generated per minute. Relevant for apps generating long documents or code. Streaming doesn't change this limit — tokens are counted as they're generated.
Limits are per-model and per-tier
Limits differ between Opus, Sonnet, and Haiku, and increase as your account tier increases (based on spend history). Check your current limits in the Anthropic console.

Retry with exponential backoff

When a 429 arrives, wait before retrying — and increase the wait time with each attempt. Retrying immediately just generates more 429s.

retry.py — exponential backoff for 429 errorspython
import time, anthropic
from anthropic import RateLimitError, APIStatusError

def call_with_retry(client, max_retries=5, **kwargs) -> str:
    for attempt in range(max_retries):
        try:
            response = client.messages.create(**kwargs)
            return response.content[0].text
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.random() # jitter
            time.sleep(wait)
        except APIStatusError as e:
            if e.status_code >= 500: # server errors — retry
                wait = 2 ** attempt
                time.sleep(wait)
            else:
                raise # 4xx errors (except 429) — don't retry
Add jitter to prevent thundering herd
If many requests fail at once and all retry after exactly the same backoff interval, they'll all hit the API at the same time — causing another wave of 429s. Adding a small random value (random.random()) to the wait time staggers retries and prevents this.

Cost Management

Claude API cost is driven by token volume: input tokens + output tokens × the per-token rate for your chosen model. The biggest lever you have is controlling what goes into each request.

StrategyHow it helpsTrade-off
Right-size your model Haiku costs a fraction of Opus. Use Haiku for classification, extraction, and simple Q&A. Reserve Sonnet/Opus for reasoning-heavy tasks. Weaker models fail on complex tasks — measure quality, don't assume.
Prune conversation history Keep only the last N turns rather than the full history. Each turn you drop removes its tokens from every future request. Claude loses earlier context — fine for most chats, problematic for long reference conversations.
Prompt caching Cache stable prefixes (system prompt, reference documents) — 90% cheaper on cache reads. Covered in Chapter 6. 5-minute TTL; write cost is 25% higher than standard on first call.
Set max_tokens tightly Set max_tokens to the actual maximum you expect — don't leave it at 4096 for answers that are always under 200 tokens. Truncates legitimate long responses if set too low.
Per-user spend limits Track token usage per user/session and enforce a soft cap — warn or rate-limit users who are consuming disproportionately. Requires token tracking infrastructure; adds complexity.

Logging — What to Record for Every Claude Call

Good logging lets you debug failures, audit outputs, track cost, and build your eval dataset from production traffic. Capture the following for every API call:

Structured log entry — one Claude API call
{
  "timestamp": "2026-06-19T09:14:33Z",
  "request_id": "req_01XkPm...", # from response headers
  "session_id": "user-abc-session-7",
  "model": "claude-sonnet-4-6",
  "system_prompt_hash": "a3f9c1...", # hash not content — for cost/version tracking
  "user_message": "Can I get a refund?",
  "response_text": "Refunds are available within 30 days...",
  "stop_reason": "end_turn",
  "input_tokens": 412,
  "output_tokens": 87,
  "cache_read_tokens": 380,
  "latency_ms": 1243,
  "cost_usd": 0.000284
}
logging_wrapper.py — log every Claude call automaticallypython
import time, hashlib, json, logging

# Sonnet pricing — update if using a different model
INPUT_COST_PER_TOKEN = 3e-6
OUTPUT_COST_PER_TOKEN = 15e-6
CACHE_READ_COST = 0.3e-6 # 90% cheaper

def logged_call(client, session_id: str, **kwargs) -> str:
    system_hash = hashlib.md5(
        str(kwargs.get("system", "")).encode()
    ).hexdigest()[:8]

    t0 = time.time()
    response = client.messages.create(**kwargs)
    latency = int((time.time() - t0) * 1000)

    u = response.usage
    cost = (
        u.input_tokens * INPUT_COST_PER_TOKEN
      + u.output_tokens * OUTPUT_COST_PER_TOKEN
      + getattr(u, "cache_read_input_tokens", 0) * CACHE_READ_COST
    )

    logging.info(json.dumps({
        "session_id": session_id,
        "model": kwargs["model"],
        "system_hash": system_hash,
        "input_tokens": u.input_tokens,
        "output_tokens": u.output_tokens,
        "latency_ms": latency,
        "cost_usd": round(cost, 6),
        "stop_reason": response.stop_reason,
    }))
    return response.content[0].text

Safety and Content Filtering

Claude has built-in safety guardrails, but they are not a substitute for application-level controls. You are responsible for what your app does with Claude's outputs.

  • Scope your system prompt — tell Claude explicitly what topics and actions are in and out of scope. "You are a customer support assistant for our software product. Do not discuss competitor products, pricing negotiations, or topics unrelated to software support."
  • Validate outputs before acting — if Claude's output drives an action (sending an email, running a database query, executing code), validate the output before execution. Never pass Claude's raw output directly to a shell or SQL interpreter.
  • Check stop_reason — a response that ends with max_tokens is truncated. A truncated SQL query or JSON object is worse than no response — always check that stop_reason == "end_turn" before using structured outputs.
  • Log inputs, not just outputs — problematic outputs often require the input to diagnose. Log both, but scrub PII before writing to persistent storage.
  • Prompt injection awareness — if users can supply content that Claude will read (documents, messages, URLs), malicious users can embed instructions in that content. Separate user-controlled content from trusted system instructions clearly in your prompt structure.

Pre-Launch Checklist

API & Credentials
API key stored in environment variable, not hardcoded
API key has appropriate spend limit set in Anthropic console
Key is not exposed in frontend code, logs, or git history
Reliability
Retry logic with exponential backoff + jitter for 429 and 5xx errors
Timeout set on all API calls — don't let requests hang indefinitely
Graceful error messages to users when Claude is unavailable
stop_reason checked — truncated responses handled correctly
Cost
Token usage logged per request with calculated cost
Model choice justified — not defaulting to Opus for simple tasks
Prompt caching enabled for stable prefixes over 1,024 tokens
History pruning in place — conversation length is bounded
Safety & Quality
Eval suite passing — no regressions on known edge cases
System prompt explicitly scopes allowed topics and actions
Structured outputs validated before being acted on
User-supplied content separated from system instructions in prompts
PII scrubbed from logs before writing to persistent storage
Observability
Structured JSON logs for every Claude call
Latency, token count, and cost tracked over time
Alert on sustained error rate spike or abnormal cost increase
🎓
Course Complete
Advanced AI Workflows & the Claude API  ·  10 Chapters

You've covered the full stack of building with Claude — from raw API calls to streaming, tool use, multi-agent systems, RAG, evaluation, and production operations. You have everything you need to build reliable, cost-efficient, AI-powered features.

Claude API System Prompts Tool Use Streaming FastAPI App Prompt Caching Multi-Agent RAG Evaluation Production