Production considerations
Advanced AI Workflows & the Claude API
A Claude integration that works in development can fail in production in ways that aren't obvious until real users hit them — rate limits under load, runaway costs from verbose prompts, safety gaps you didn't test for. This chapter covers the operational concerns that separate a reliable AI feature from a fragile prototype: handling API limits, managing cost, logging what matters, and a pre-launch checklist to work through before you go live.
Rate Limits — What They Are and How to Handle Them
Anthropic enforces three rate limits simultaneously: requests per minute, input tokens per minute, and output tokens per minute. Exceeding any one of them returns a 429 Too Many Requests error.
Retry with exponential backoff
When a 429 arrives, wait before retrying — and increase the wait time with each attempt. Retrying immediately just generates more 429s.
    import random
    import time

    from anthropic import APIStatusError, RateLimitError


    def call_with_retry(client, max_retries=5, **kwargs) -> str:
        """Call client.messages.create, retrying 429s and 5xx errors with backoff."""
        for attempt in range(max_retries):
            try:
                response = client.messages.create(**kwargs)
                return response.content[0].text
            except RateLimitError:
                if attempt == max_retries - 1:
                    raise  # out of retries: surface the 429 to the caller
            except APIStatusError as e:
                # Retry server errors (5xx) only; other 4xx errors will not succeed on retry
                if e.status_code < 500 or attempt == max_retries - 1:
                    raise
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of random jitter
            time.sleep((2 ** attempt) + random.random())
        raise RuntimeError("max_retries must be at least 1")
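If many clients hit the limit at the same moment, identical backoff schedules make them all retry together and trigger another wave of 429s. Adding a small random delay (random.random()) to the wait time staggers retries and prevents this.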
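To see what the backoff schedule looks like on its own, the sketch below computes the wait times separately from the API call. The helper name `backoff_delays` and its `base` and `cap` parameters are illustrative assumptions, not part of the Anthropic SDK; the cap is an addition so that late attempts do not wait unboundedly.

```python
import random
from typing import List, Optional


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return the wait (in seconds) before each retry.

    Hypothetical helper: exponential growth (base * 2**attempt), capped at
    `cap`, plus up to one second of random jitter.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        exponential = min(cap, base * (2 ** attempt))
        delays.append(exponential + rng.random())  # jitter in [0, 1)
    return delays


if __name__ == "__main__":
    # Two clients that fail at the same instant get different schedules.
    print([round(d, 2) for d in backoff_delays(5)])
    print([round(d, 2) for d in backoff_delays(5)])
```

Running it twice prints two schedules that grow at the same rate but differ in their fractional part, which is what keeps simultaneous clients from retrying in lockstep.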
Cost Management
Claude API cost is driven by token volume: (input tokens × the input rate) + (output tokens × the output rate) for your chosen model, with output tokens priced higher than input tokens. The biggest lever you have is controlling what goes into each request.
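As a rough worked example of that arithmetic, the sketch below compares a verbose prompt with a trimmed one. The rates are the Sonnet figures used in the logging code later in this chapter; the token counts and the `estimate_cost` helper are made up for illustration.

```python
# Assumed example rates in USD per token (Sonnet figures from this chapter).
INPUT_RATE = 3e-6
OUTPUT_RATE = 15e-6


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate: float = INPUT_RATE,
                  output_rate: float = OUTPUT_RATE) -> float:
    """Input and output tokens are billed at different rates, then summed."""
    return input_tokens * input_rate + output_tokens * output_rate


if __name__ == "__main__":
    # Hypothetical token counts: same answer length, different prompt sizes.
    verbose = estimate_cost(input_tokens=6000, output_tokens=300)
    trimmed = estimate_cost(input_tokens=1500, output_tokens=300)
    print(f"verbose prompt: ${verbose:.4f} per request")
    print(f"trimmed prompt: ${trimmed:.4f} per request")
    print(f"saving over 100,000 requests: ${(verbose - trimmed) * 100_000:,.2f}")
```

The per-request difference looks negligible, which is why it helps to multiply by expected request volume before deciding whether a prompt is worth trimming.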
| Strategy | How it helps | Trade-off |
|---|---|---|
| Right-size your model | Haiku costs a fraction of Opus. Use Haiku for classification, extraction, and simple Q&A. Reserve Sonnet/Opus for reasoning-heavy tasks. | Weaker models fail on complex tasks — measure quality, don't assume. |
| Prune conversation history | Keep only the last N turns rather than the full history. Each turn you drop removes its tokens from every future request. | Claude loses earlier context — fine for most chats, problematic for long reference conversations. |
| Prompt caching | Cache stable prefixes (system prompt, reference documents) — 90% cheaper on cache reads. Covered in Chapter 6. | 5-minute TTL; write cost is 25% higher than standard on first call. |
| Set max_tokens tightly | Set max_tokens to the actual maximum you expect rather than leaving it at 4096 for answers that are always under 200 tokens. | Truncates legitimate long responses if set too low. |
| Per-user spend limits | Track token usage per user/session and enforce a soft cap — warn or rate-limit users who are consuming disproportionately. | Requires token tracking infrastructure; adds complexity. |
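Two of the strategies in the table, pruning history and per-user soft caps, can be sketched in a few lines. Everything here is an assumption for illustration: the names `prune_history` and `SpendTracker`, the in-memory storage, and the premise that messages alternate user and assistant roles. A production version would likely persist usage and reset it on a schedule.

```python
from collections import defaultdict
from typing import Dict, List


def prune_history(messages: List[dict], max_turns: int) -> List[dict]:
    """Keep roughly the last `max_turns` user/assistant exchanges.

    Assumes alternating roles; leading non-user messages are dropped so the
    pruned list still starts with a user message.
    """
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    kept = messages[-2 * max_turns:]
    while kept and kept[0]["role"] != "user":
        kept = kept[1:]
    return kept


class SpendTracker:
    """Hypothetical in-memory soft cap on tokens consumed per user."""

    def __init__(self, soft_cap_tokens: int) -> None:
        self.soft_cap_tokens = soft_cap_tokens
        self.used: Dict[str, int] = defaultdict(int)

    def record(self, user_id: str, input_tokens: int, output_tokens: int) -> None:
        # Call this after each response with the token counts it reports.
        self.used[user_id] += input_tokens + output_tokens

    def over_cap(self, user_id: str) -> bool:
        # A soft cap: the caller decides whether to warn or rate-limit.
        return self.used[user_id] >= self.soft_cap_tokens
```

Checking `over_cap` before each request, rather than after, keeps a heavy user from starting a call the application already knows it wants to throttle.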
Logging — What to Record for Every Claude Call
Good logging lets you debug failures, audit outputs, track cost, and build your eval dataset from production traffic. Capture the following for every API call:
    {
        "timestamp": "2026-06-19T09:14:33Z",
        "request_id": "req_01XkPm...",  # from response headers
        "session_id": "user-abc-session-7",
        "model": "claude-sonnet-4-6",
        "system_prompt_hash": "a3f9c1...",  # hash, not content: for cost/version tracking
        "user_message": "Can I get a refund?",
        "response_text": "Refunds are available within 30 days...",
        "stop_reason": "end_turn",
        "input_tokens": 412,
        "output_tokens": 87,
        "cache_read_tokens": 380,
        "latency_ms": 1243,
        "cost_usd": 0.002655
    }
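    import hashlib
    import json
    import logging
    import time
    from datetime import datetime, timezone

    # Sonnet pricing in USD per token; update if using a different model
    INPUT_COST_PER_TOKEN = 3e-6
    OUTPUT_COST_PER_TOKEN = 15e-6
    CACHE_READ_COST = 0.3e-6  # 90% cheaper than standard input


    def logged_call(client, session_id: str, **kwargs) -> str:
        # Assumes logging is configured at INFO level, e.g. logging.basicConfig(level=logging.INFO)
        # Hash the system prompt so prompt versions can be tracked without storing the text
        system_hash = hashlib.md5(
            str(kwargs.get("system", "")).encode()
        ).hexdigest()[:8]

        t0 = time.perf_counter()
        response = client.messages.create(**kwargs)
        latency = int((time.perf_counter() - t0) * 1000)

        u = response.usage
        # The cache field can be missing or None when caching is not in use
        cache_read = getattr(u, "cache_read_input_tokens", None) or 0
        cost = (
            u.input_tokens * INPUT_COST_PER_TOKEN
            + u.output_tokens * OUTPUT_COST_PER_TOKEN
            + cache_read * CACHE_READ_COST
        )
        logging.info(json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "session_id": session_id,
            "model": kwargs["model"],
            "system_prompt_hash": system_hash,
            "input_tokens": u.input_tokens,
            "output_tokens": u.output_tokens,
            "cache_read_tokens": cache_read,
            "latency_ms": latency,
            "cost_usd": round(cost, 6),
            "stop_reason": response.stop_reason,
        }))
        return response.content[0].text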
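Once records like these are being written, tracking cost becomes a matter of aggregating them. The sketch below sums `cost_usd` per `session_id`; it assumes each log line is a bare JSON object (for example, a logging format of `%(message)s`), and the function name `cost_by_session` is hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def cost_by_session(log_lines: Iterable[str]) -> Dict[str, float]:
    """Sum cost_usd per session_id from JSON log lines, skipping other lines."""
    totals: Dict[str, float] = defaultdict(float)
    for line in log_lines:
        try:
            record = json.loads(line)
            totals[record["session_id"]] += float(record["cost_usd"])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue  # not one of our records (or a malformed one)
    return dict(totals)


if __name__ == "__main__":
    sample = [
        '{"session_id": "user-abc-session-7", "cost_usd": 0.002655}',
        "unrelated log line",
        '{"session_id": "user-abc-session-7", "cost_usd": 0.001}',
    ]
    for session, total in cost_by_session(sample).items():
        print(f"{session}: ${total:.6f}")
```

The same loop could feed the per-user soft cap described in the cost table, provided the records also carry a user identifier.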
Safety and Content Filtering
Claude has built-in safety guardrails, but they are not a substitute for application-level controls. You are responsible for what your app does with Claude's outputs.
- Scope your system prompt: tell Claude explicitly what topics and actions are in and out of scope. For example: "You are a customer support assistant for our software product. Do not discuss competitor products, pricing negotiations, or topics unrelated to software support."
- Validate outputs before acting: if Claude's output drives an action (sending an email, running a database query, executing code), validate the output before execution. Never pass Claude's raw output directly to a shell or SQL interpreter.
- Check stop_reason: a response that ends with max_tokens is truncated. A truncated SQL query or JSON object is worse than no response, so always check that stop_reason == "end_turn" before using structured outputs.
- Log inputs, not just outputs: problematic outputs often require the input to diagnose. Log both, but scrub PII before writing to persistent storage.
- Prompt injection awareness: if users can supply content that Claude will read (documents, messages, URLs), malicious users can embed instructions in that content. Separate user-controlled content from trusted system instructions clearly in your prompt structure.
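The output-validation and stop_reason checks from the list can be combined into one gate that runs before the application acts on a response. This is a minimal sketch under assumptions: the response is expected to be a JSON object with an `action` field, and the action names, `UnsafeOutputError`, and `parse_action` are all hypothetical.

```python
import json
from typing import Any, Dict

ALLOWED_ACTIONS = {"send_reply", "escalate"}  # hypothetical action names


class UnsafeOutputError(ValueError):
    """Raised when a model response should not be acted on."""


def parse_action(stop_reason: str, text: str) -> Dict[str, Any]:
    """Validate a model response before the application acts on it."""
    # 1. A response cut off by max_tokens may be a partial JSON object.
    if stop_reason != "end_turn":
        raise UnsafeOutputError(f"response incomplete: stop_reason={stop_reason!r}")
    # 2. The text must parse as JSON at all.
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise UnsafeOutputError("response is not valid JSON") from exc
    # 3. Only actions on an explicit allowlist are accepted.
    action = data.get("action") if isinstance(data, dict) else None
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise UnsafeOutputError("action missing or not on the allowlist")
    return data
```

An allowlist is used rather than a blocklist so that an unexpected or injected action name fails closed instead of reaching the code that executes it.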
Pre-Launch Checklist
stop_reason checked — truncated responses handled correctly