Evaluating outputs
Advanced AI Workflows & the Claude API
Traditional software has deterministic outputs — the same input always produces the same result, and tests either pass or fail. AI features don't work that way. Claude's outputs are probabilistic: they vary across runs, they can be correct in spirit but wrong in detail, and "good" is often a matter of degree rather than a binary. Evaluating AI outputs requires a different toolkit — one built around sampling, scoring, and tracking distributions rather than exact matches.
Why Evaluation Matters
- Prompt changes break things silently — a small wording tweak that improves one case often degrades another. Without an eval suite, you won't know until users complain.
- Model updates change behaviour — when Anthropic releases a new Claude version, your prompts may perform differently. Evals let you measure this before switching.
- Hallucination is invisible without checking — Claude sounds confident whether it's right or wrong. Evals surface the cases where confidence isn't warranted.
- "It looks good to me" doesn't scale — manual review of 10 outputs is feasible; 500 is not. Automated scoring lets you evaluate at the scale your feature operates at.
The Evaluation Loop
Eval set (inputs + expected outputs) → run the prompt (against all inputs) → score (rubric / LLM judge / exact match) → compare (pass / fail / regression?)
Building an Eval Set
An eval set is a collection of test cases: each case has an input (the prompt or user message) and either an expected output or criteria for what a good output looks like. Start small — 20–50 cases is enough to catch most regressions — and grow it over time by adding cases whenever you find an output that surprised you.
What makes a good eval set
- Golden path cases — the most common, typical inputs your feature handles.
- Edge cases — boundary conditions, ambiguous inputs, the cases that have broken things before.
- Adversarial cases — inputs designed to produce failure: off-topic questions, jailbreak attempts, inputs with missing information.
- Regression cases — every time you find a bug in production, add it to the eval set so it can't come back silently.
```python
eval_set = [
    {
        "id": "refund_policy_basic",
        "input": "Can I get a refund after 30 days?",
        "expected": "no",  # exact match where possible
        "criteria": "Must state refunds are only available within 30 days. Polite tone.",
    },
    {
        "id": "off_topic",
        "input": "What is the capital of France?",
        "expected": None,  # no single correct string; scored on criteria only
        "criteria": "Must decline to answer and redirect to support topics.",
    },
    {
        "id": "ambiguous_input",
        "input": "I want to cancel.",
        "expected": None,
        "criteria": "Must ask a clarifying question: cancel what?",
    },
]
```
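For cases that carry an `expected` value, exact-match scoring can be sketched as below. This is a minimal illustration, not part of the original lesson: the helper names `normalize` and `exact_match` are hypothetical, and it assumes the prompt under test constrains Claude to a short answer such as "yes" or "no". Free-form answers need criteria-based scoring instead.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, trim whitespace, and strip surrounding punctuation."""
    return text.strip().lower().strip(string.punctuation + " ")


def exact_match(expected, answer: str):
    """Return True/False when a reference answer exists, otherwise None."""
    if expected is None:
        return None  # no reference answer; defer to criteria-based scoring
    return normalize(answer) == normalize(expected)
```

Returning `None` rather than `False` for cases without a reference answer keeps "not applicable" distinct from "failed", so those cases can be routed to a rubric or an LLM judge.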
Scoring Methods
LLM-as-Judge — Implementation
```python
import json

import anthropic

judge_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_SYSTEM = """You are a strict evaluator. Given a user question, a set of \
criteria, and an AI response, score the response as PASS or FAIL \
for each criterion. Return JSON only, no explanation outside JSON. \
Format: {"scores": [{"criterion": "...", "result": "PASS"|"FAIL", \
"reason": "..."}], "overall": "PASS"|"FAIL"}"""


def judge(question: str, response: str, criteria: list[str]) -> dict:
    """Ask the judge model to score a response against each criterion."""
    criteria_text = "\n".join(f"- {c}" for c in criteria)
    prompt = f"""Question: {question}
Criteria:
{criteria_text}
Response to evaluate:
{response}"""
    result = judge_client.messages.create(
        model="claude-opus-4-8",  # use strongest model for judging
        max_tokens=512,
        system=JUDGE_SYSTEM,
        messages=[{"role": "user", "content": prompt}],
    )
    # Raises json.JSONDecodeError if the judge returns anything but bare JSON
    return json.loads(result.content[0].text)


def run_evals(eval_set: list, system_prompt: str) -> dict:
    """Run every case through the prompt under test and score it with the judge."""
    results = []
    for case in eval_set:
        # Get Claude's answer (call_claude is assumed to be defined elsewhere)
        answer = call_claude(system_prompt, case["input"])
        # Score with criteria from the eval case, one criterion per sentence
        score = judge(case["input"], answer, case["criteria"].split(". "))
        results.append({"id": case["id"], "answer": answer, "score": score})
    passed = sum(1 for r in results if r["score"]["overall"] == "PASS")
    # Guard against an empty eval set to avoid dividing by zero
    pass_rate = passed / len(results) if results else 0.0
    return {"pass_rate": pass_rate, "cases": results}
```
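Because Claude's outputs vary across runs, a single run per case can give a noisy verdict. One way to handle that is to sample each case several times and record the fraction of passes. The sketch below assumes a hypothetical helper, `sampled_pass_rate`, that takes any zero-argument callable returning `True` for a pass. It is an illustration, not part of the code above.

```python
from typing import Callable


def sampled_pass_rate(run_case: Callable[[], bool], n_samples: int = 5) -> float:
    """Run one eval case n_samples times and return the fraction that passed."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    passes = sum(1 for _ in range(n_samples) if run_case())
    return passes / n_samples
```

With the functions above, `run_case` could be a lambda that calls `call_claude`, passes the answer to `judge`, and checks whether `"overall"` equals `"PASS"`. A rate strictly between 0 and 1 flags a flaky case that a single run might score either way.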
Rubric Scoring — Example
For a customer support bot, you might score each response on four criteria. Using a numeric scale (1–4) instead of pass/fail gives you a finer signal for tracking gradual drift:
| Criterion | 1 (Poor) | 2 (Fair) | 3 (Good) | 4 (Excellent) |
|---|---|---|---|---|
| Accuracy | Factually wrong | Partially correct | Correct but incomplete | Fully correct |
| Tone | Rude or dismissive | Neutral but cold | Friendly | Warm, empathetic |
| Conciseness | Rambling / off-topic | Too long | Appropriate length | Crisp and complete |
| Scope | Answered wrong question | Partially addressed | Answered the question | Answered + anticipated follow-up |
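To track drift with a rubric like this, the per-response scores need to be aggregated. The sketch below averages each criterion across a batch of scored responses. The function name `average_rubric`, the lowercase criterion keys, and the dict-per-response layout are assumptions made for illustration, and the scores themselves would come from a human reviewer or a judge model.

```python
from statistics import mean

CRITERIA = ("accuracy", "tone", "conciseness", "scope")


def average_rubric(scores: list[dict[str, int]]) -> dict[str, float]:
    """Average each criterion's 1-4 score across all scored responses."""
    for row in scores:
        for criterion in CRITERIA:
            if not 1 <= row[criterion] <= 4:
                raise ValueError(f"{criterion} score out of range: {row[criterion]}")
    return {c: mean(row[c] for row in scores) for c in CRITERIA}
```

Keeping the averages per criterion, rather than collapsing them into one number, shows which dimension moved when a prompt change shifts the scores.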
Catching Regressions
Run your eval suite before and after every prompt change, model upgrade, or system change. Compare the scores — any drop in pass rate or average score is a regression signal. Even a small drop across many cases is worth investigating before deploying.
In this example, v2 improves overall accuracy and ambiguous input handling — but off-topic rejection got worse. You can see exactly which cases regressed and investigate those prompts specifically, rather than rolling back the whole change.
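A per-case comparison of the kind described above can be sketched as follows. The function name `compare_runs` and the input shape (a dict mapping case id to pass/fail for each run) are assumptions for illustration. It reports the pass rate of each run along with the case ids that regressed or were fixed.

```python
def compare_runs(before: dict[str, bool], after: dict[str, bool]) -> dict:
    """Compare per-case pass/fail results from two runs of the same eval set."""
    shared = sorted(before.keys() & after.keys())
    if not shared:
        raise ValueError("the two runs share no case ids")

    def pass_rate(run: dict[str, bool]) -> float:
        return sum(run[case_id] for case_id in shared) / len(shared)

    return {
        "pass_rate_before": pass_rate(before),
        "pass_rate_after": pass_rate(after),
        # Passed before the change, fails after it
        "regressed": [c for c in shared if before[c] and not after[c]],
        # Failed before the change, passes after it
        "fixed": [c for c in shared if not before[c] and after[c]],
    }
```

Given a report from `run_evals`, the input dicts could be built as `{r["id"]: r["score"]["overall"] == "PASS" for r in report["cases"]}`. Listing regressed ids separately matters because an overall pass rate can rise while individual cases get worse.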