Production considerations

Advanced AI Workflows & the Claude API

Chapter 10 · Production Considerations

A Claude integration that works in development can fail in production in ways that aren't obvious until real users hit them — rate limits under load, runaway costs from verbose prompts, safety gaps you didn't test for. This chapter covers the operational concerns that separate a reliable AI feature from a fragile prototype: handling API limits, managing cost, logging what matters, and a pre-launch checklist to work through before you go live.

Rate Limits — What They Are and How to Handle Them

Anthropic enforces three rate limits simultaneously. Hitting any one of them returns a 429 Too Many Requests error:

Requests per minute (RPM)

Maximum number of API calls in a 60-second window. Most relevant for apps that make many short parallel requests — batch jobs, multi-agent pipelines.

Input tokens per minute (ITPM)

Maximum input tokens processed per minute. Apps with long system prompts or large injected documents hit this before RPM — 10 calls with 50k tokens each saturates a 500k ITPM limit.

Output tokens per minute (OTPM)

Maximum output tokens generated per minute. Relevant for apps generating long documents or code. Streaming doesn't change this limit — tokens are counted as they're generated.

Limits are per-model and per-tier

Limits differ between Opus, Sonnet, and Haiku, and increase as your account tier increases (based on spend history). Check your current limits in the Anthropic console.

Retry with exponential backoff

When a 429 arrives, wait before retrying — and increase the wait time with each attempt. Retrying immediately just generates more 429s.

retry.py — exponential backoff for 429 errorspython

      import time, anthropic

      from anthropic import RateLimitError, APIStatusError

      def call_with_retry(client, max_retries=5, **kwargs) -> str:

          for attempt in range(max_retries):

              try:

                  response = client.messages.create(**kwargs)

                  return response.content[0].text

              except RateLimitError:

                  if attempt == max_retries - 1:

                      raise

                  wait = (2 ** attempt) + random.random()  # jitter

                  time.sleep(wait)

              except APIStatusError as e:

                  if e.status_code >= 500:  # server errors — retry

                      wait = 2 ** attempt

                      time.sleep(wait)

                  else:

                      raise  # 4xx errors (except 429) — don't retry

Add jitter to prevent thundering herd

If many requests fail at once and all retry after exactly the same backoff interval, they'll all hit the API at the same time — causing another wave of 429s. Adding a small random value (random.random()) to the wait time staggers retries and prevents this.

Cost Management

Claude API cost is driven by token volume: input tokens + output tokens × the per-token rate for your chosen model. The biggest lever you have is controlling what goes into each request.

Strategy	How it helps	Trade-off
Right-size your model	Haiku costs a fraction of Opus. Use Haiku for classification, extraction, and simple Q&A. Reserve Sonnet/Opus for reasoning-heavy tasks.	Weaker models fail on complex tasks — measure quality, don't assume.
Prune conversation history	Keep only the last N turns rather than the full history. Each turn you drop removes its tokens from every future request.	Claude loses earlier context — fine for most chats, problematic for long reference conversations.
Prompt caching	Cache stable prefixes (system prompt, reference documents) — 90% cheaper on cache reads. Covered in Chapter 6.	5-minute TTL; write cost is 25% higher than standard on first call.
Set max_tokens tightly	Set `max_tokens` to the actual maximum you expect — don't leave it at 4096 for answers that are always under 200 tokens.	Truncates legitimate long responses if set too low.
Per-user spend limits	Track token usage per user/session and enforce a soft cap — warn or rate-limit users who are consuming disproportionately.	Requires token tracking infrastructure; adds complexity.

Logging — What to Record for Every Claude Call

Good logging lets you debug failures, audit outputs, track cost, and build your eval dataset from production traffic. Capture the following for every API call:

Structured log entry — one Claude API call

{
  "timestamp": "2026-06-19T09:14:33Z",
  "request_id": "req_01XkPm...", # from response headers
  "session_id": "user-abc-session-7",
  "model": "claude-sonnet-4-6",
  "system_prompt_hash": "a3f9c1...", # hash not content — for cost/version tracking
  "user_message": "Can I get a refund?",
  "response_text": "Refunds are available within 30 days...",
  "stop_reason": "end_turn",
  "input_tokens": 412,
  "output_tokens": 87,
  "cache_read_tokens": 380,
  "latency_ms": 1243,
  "cost_usd": 0.000284
}

logging_wrapper.py — log every Claude call automaticallypython

      import time, hashlib, json, logging

      # Sonnet pricing — update if using a different model

      INPUT_COST_PER_TOKEN  = 3e-6

      OUTPUT_COST_PER_TOKEN = 15e-6

      CACHE_READ_COST       = 0.3e-6  # 90% cheaper

      def logged_call(client, session_id: str, **kwargs) -> str:

          system_hash = hashlib.md5(

              str(kwargs.get("system", "")).encode()

          ).hexdigest()[:8]

          t0 = time.time()

          response = client.messages.create(**kwargs)

          latency = int((time.time() - t0) * 1000)

          u = response.usage

          cost = (

              u.input_tokens * INPUT_COST_PER_TOKEN

            + u.output_tokens * OUTPUT_COST_PER_TOKEN

            + getattr(u, "cache_read_input_tokens", 0) * CACHE_READ_COST

          )

          logging.info(json.dumps({

              "session_id":        session_id,

              "model":             kwargs["model"],

              "system_hash":      system_hash,

              "input_tokens":     u.input_tokens,

              "output_tokens":    u.output_tokens,

              "latency_ms":       latency,

              "cost_usd":         round(cost, 6),

              "stop_reason":      response.stop_reason,

          }))

          return response.content[0].text

Safety and Content Filtering

Claude has built-in safety guardrails, but they are not a substitute for application-level controls. You are responsible for what your app does with Claude's outputs.

Scope your system prompt — tell Claude explicitly what topics and actions are in and out of scope. "You are a customer support assistant for our software product. Do not discuss competitor products, pricing negotiations, or topics unrelated to software support."
Validate outputs before acting — if Claude's output drives an action (sending an email, running a database query, executing code), validate the output before execution. Never pass Claude's raw output directly to a shell or SQL interpreter.
Check stop_reason — a response that ends with max_tokens is truncated. A truncated SQL query or JSON object is worse than no response — always check that stop_reason == "end_turn" before using structured outputs.
Log inputs, not just outputs — problematic outputs often require the input to diagnose. Log both, but scrub PII before writing to persistent storage.
Prompt injection awareness — if users can supply content that Claude will read (documents, messages, URLs), malicious users can embed instructions in that content. Separate user-controlled content from trusted system instructions clearly in your prompt structure.

Pre-Launch Checklist

API & Credentials

API key stored in environment variable, not hardcoded

API key has appropriate spend limit set in Anthropic console

Key is not exposed in frontend code, logs, or git history

Reliability

Retry logic with exponential backoff + jitter for 429 and 5xx errors

Timeout set on all API calls — don't let requests hang indefinitely

Graceful error messages to users when Claude is unavailable

stop_reason checked — truncated responses handled correctly

Cost

Token usage logged per request with calculated cost

Model choice justified — not defaulting to Opus for simple tasks

Prompt caching enabled for stable prefixes over 1,024 tokens

History pruning in place — conversation length is bounded

Safety & Quality

Eval suite passing — no regressions on known edge cases

System prompt explicitly scopes allowed topics and actions

Structured outputs validated before being acted on

User-supplied content separated from system instructions in prompts

PII scrubbed from logs before writing to persistent storage

Observability

Structured JSON logs for every Claude call

Latency, token count, and cost tracked over time

Alert on sustained error rate spike or abnormal cost increase

🎓

Course Complete

Advanced AI Workflows & the Claude API · 10 Chapters

You've covered the full stack of building with Claude — from raw API calls to streaming, tool use, multi-agent systems, RAG, evaluation, and production operations. You have everything you need to build reliable, cost-efficient, AI-powered features.

Claude API System Prompts Tool Use Streaming FastAPI App Prompt Caching Multi-Agent RAG Evaluation Production