Evaluating outputs

Advanced AI Workflows & the Claude API

Chapter 9 · Evaluating Outputs

Traditional software has deterministic outputs — the same input always produces the same result, and tests either pass or fail. AI features don't work that way. Claude's outputs are probabilistic: they vary across runs, they can be correct in spirit but wrong in detail, and "good" is often a matter of degree rather than a binary. Evaluating AI outputs requires a different toolkit — one built around sampling, scoring, and tracking distributions rather than exact matches.

Why Evaluation Matters

Prompt changes break things silently — a small wording tweak that improves one case often degrades another. Without an eval suite, you won't know until users complain.
Model updates change behaviour — when Anthropic releases a new Claude version, your prompts may perform differently. Evals let you measure this before switching.
Hallucination is invisible without checking — Claude sounds confident whether it's right or wrong. Evals surface the cases where confidence isn't warranted.
"It looks good to me" doesn't scale — manual review of 10 outputs is feasible; 500 is not. Automated scoring lets you evaluate at the scale your feature operates at.

The Evaluation Loop

Core eval workflow: build once, run on every prompt change

Build eval set (inputs + expected outputs) → Run prompt (against all inputs) → Score outputs (rubric / LLM judge / exact match) → Review results (pass / fail / regression?)

On regression → Revise prompt → loop back to Run prompt

Building an Eval Set

An eval set is a collection of test cases: each case has an input (the prompt or user message) and either an expected output or criteria for what a good output looks like. Start small — 20–50 cases is enough to catch most regressions — and grow it over time by adding cases whenever you find an output that surprised you.

What makes a good eval set

Golden path cases — the most common, typical inputs your feature handles.
Edge cases — boundary conditions, ambiguous inputs, the cases that have broken things before.
Adversarial cases — inputs designed to produce failure: off-topic questions, jailbreak attempts, inputs with missing information.
Regression cases — every time you find a bug in production, add it to the eval set so it can't come back silently.

eval_set.py (Python): structure of a simple eval dataset

      EVAL_SET = [
          {
              "id": "refund_policy_basic",
              "input": "Can I get a refund after 30 days?",
              "expected": "no",  # exact match where possible
              "criteria": "Must state refunds are only available within 30 days. Polite tone."
          },
          {
              "id": "off_topic",
              "input": "What is the capital of France?",
              "expected": None,
              "criteria": "Must decline to answer and redirect to support topics."
          },
          {
              "id": "ambiguous_input",
              "input": "I want to cancel.",
              "expected": None,
              "criteria": "Must ask a clarifying question: cancel what?"
          },
      ]
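An eval set is easier to trust when its structure is checked mechanically. The following is a minimal sketch of one possible validator, not part of eval_set.py itself. It assumes each case carries a hypothetical `category` field (golden, edge, adversarial, regression) that the dataset above does not include, and it reports duplicate ids, cases with nothing to score against, and categories with no cases.

```python
from collections import Counter

# Hypothetical category labels mirroring the four kinds of case listed earlier.
REQUIRED_CATEGORIES = {"golden", "edge", "adversarial", "regression"}

def validate_eval_set(eval_set: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    id_counts = Counter(case.get("id") for case in eval_set)
    for case_id, count in id_counts.items():
        if count > 1:
            problems.append(f"duplicate id: {case_id}")
    for case in eval_set:
        # A case needs something to score against: an expected value or criteria.
        if case.get("expected") is None and not case.get("criteria"):
            problems.append(f"{case.get('id')}: no expected output or criteria")
    # "category" is an assumed extra field, not part of the dataset shown above.
    present = {case.get("category") for case in eval_set}
    for missing in sorted(REQUIRED_CATEGORIES - present):
        problems.append(f"no cases in category: {missing}")
    return problems

sample = [
    {"id": "refund_policy_basic", "category": "golden",
     "input": "Can I get a refund after 30 days?", "expected": "no", "criteria": ""},
    {"id": "off_topic", "category": "adversarial",
     "input": "What is the capital of France?", "expected": None,
     "criteria": "Must decline to answer and redirect to support topics."},
    {"id": "off_topic", "category": "edge",
     "input": "I want to cancel.", "expected": None, "criteria": ""},
]
for problem in validate_eval_set(sample):
    print(problem)
```

Running a check like this before each eval run keeps a malformed case from silently skewing the pass rate.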

Scoring Methods

Exact match

When: classification, yes/no, structured fields

Compare Claude's output to an expected string. Simple and fast. Only works when correct answers are unambiguous and outputs can be normalised to a consistent form.
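A minimal sketch of exact-match scoring with normalisation follows. The function names `normalise` and `exact_match` are illustrative, and the normalisation rules (lowercase, strip punctuation, collapse whitespace) are one reasonable choice rather than a standard.

```python
import re
import string

def normalise(text: str) -> str:
    """Lowercase, trim, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def exact_match(output: str, expected: str) -> bool:
    """True when the two strings are identical after normalisation."""
    return normalise(output) == normalise(expected)

print(exact_match("  No. ", "no"))                            # True
print(exact_match("No, refunds close after 30 days.", "no"))  # False
```

The second call fails even though the answer is right in spirit, which is why exact match suits prompts that ask for a bare label and nothing else.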

Rubric scoring

When: open-ended responses, tone, completeness

Define what a good answer looks like as a checklist. Score each item independently. More flexible than exact match; more interpretable than LLM-as-judge alone.

LLM-as-judge

When: nuanced quality, multiple valid answers

Send Claude's output to a second Claude call (with a strict judging prompt) and ask it to rate the answer. Scales well, but a judge tends to favour answers written in its own style, so use Opus or a different model from the one being evaluated.

Reference comparison

When: you have a known-good answer

Ask the judge: "Does this answer contain the same information as the reference answer, even if worded differently?" Handles paraphrase better than exact match without full open-ended judgement.
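Reference comparison can be sketched as a small wrapper around any judge call. In the sketch below, `matches_reference`, `REFERENCE_PROMPT`, and the YES/NO reply format are assumptions made for illustration. The judge is passed in as a plain function so that the example runs offline with a stand-in; in practice it would wrap a model call.

```python
from typing import Callable

REFERENCE_PROMPT = (
    "Reference answer:\n{reference}\n\n"
    "Candidate answer:\n{candidate}\n\n"
    "Does the candidate contain the same information as the reference, "
    "even if worded differently? Reply with YES or NO only."
)

def matches_reference(candidate: str, reference: str,
                      ask_judge: Callable[[str], str]) -> bool:
    """Build the comparison prompt, send it to the judge, and read the verdict."""
    prompt = REFERENCE_PROMPT.format(reference=reference, candidate=candidate)
    verdict = ask_judge(prompt).strip().upper()
    if verdict.startswith("YES"):
        return True
    if verdict.startswith("NO"):
        return False
    raise ValueError(f"Unparseable judge verdict: {verdict!r}")

def fake_judge(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs without network access.
    candidate_part = prompt.split("Candidate answer:")[1]
    return "YES" if "30 days" in candidate_part else "NO"

print(matches_reference(
    "Refunds are available for 30 days after purchase.",
    "You can request a refund within 30 days.",
    fake_judge,
))
```

Raising on an unparseable verdict, rather than defaulting to a fail, keeps judge formatting problems from being counted as answer failures.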

LLM-as-Judge — Implementation

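llm_judge.py (Python): using Claude to score Claude's outputs

      import json

      import anthropic

      judge_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

      # Model IDs change over time; substitute ones available to your account.
      JUDGE_MODEL = "claude-opus-4-8"     # use the strongest model for judging
      ANSWER_MODEL = "claude-sonnet-4-5"  # assumed ID for the model under test

      JUDGE_SYSTEM = (
          "You are a strict evaluator. Given a user question, a set of criteria, "
          "and an AI response, score the response as PASS or FAIL for each criterion. "
          "Return JSON only, no explanation outside JSON. "
          'Format: {"scores": [{"criterion": "...", "result": "PASS"|"FAIL", '
          '"reason": "..."}], "overall": "PASS"|"FAIL"}'
      )

      def call_claude(system_prompt: str, user_input: str) -> str:
          """Get the answer under test from the prompt being evaluated."""
          result = judge_client.messages.create(
              model=ANSWER_MODEL,
              max_tokens=1024,
              system=system_prompt,
              messages=[{"role": "user", "content": user_input}],
          )
          return result.content[0].text

      def judge(question: str, response: str, criteria: list[str]) -> dict:
          """Ask the judge model for a PASS/FAIL verdict on each criterion."""
          criteria_text = "\n".join(f"- {c}" for c in criteria)
          prompt = (
              f"Question: {question}\n"
              f"Criteria:\n{criteria_text}\n"
              f"Response to evaluate:\n{response}"
          )
          result = judge_client.messages.create(
              model=JUDGE_MODEL,
              max_tokens=512,
              system=JUDGE_SYSTEM,
              messages=[{"role": "user", "content": prompt}],
          )
          # Raises json.JSONDecodeError if the judge returns anything but bare JSON.
          return json.loads(result.content[0].text)

      def run_evals(eval_set: list, system_prompt: str) -> dict:
          """Run every case through the prompt, judge it, and report the pass rate."""
          results = []
          for case in eval_set:
              # Get Claude's answer
              answer = call_claude(system_prompt, case["input"])
              # Split the criteria string into one criterion per sentence
              criteria = [c.strip().rstrip(".") for c in case["criteria"].split(". ") if c.strip()]
              score = judge(case["input"], answer, criteria)
              results.append({"id": case["id"], "answer": answer, "score": score})
          passed = sum(1 for r in results if r["score"]["overall"] == "PASS")
          # Guard against an empty eval set to avoid dividing by zero
          pass_rate = passed / len(results) if results else 0.0
          return {"pass_rate": pass_rate, "cases": results}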
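Calling `json.loads` directly on the judge's reply fails if the model adds any text around the JSON, despite the instruction to return JSON only. The sketch below shows one defensive way to handle that, under the assumption that the reply contains a single JSON object. The name `parse_judge_output` is illustrative.

```python
import json

def parse_judge_output(raw: str) -> dict:
    """Extract and validate the judge's JSON verdict from raw model text."""
    # Take the span from the first "{" to the last "}" and ignore anything around it.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge output")
    verdict = json.loads(raw[start:end + 1])
    if verdict.get("overall") not in ("PASS", "FAIL"):
        raise ValueError(f"unexpected overall value: {verdict.get('overall')!r}")
    for item in verdict.get("scores", []):
        if item.get("result") not in ("PASS", "FAIL"):
            raise ValueError(f"unexpected result for {item.get('criterion')!r}")
    return verdict

raw = (
    "Here is my evaluation:\n"
    '{"scores": [{"criterion": "Polite tone", "result": "PASS", '
    '"reason": "courteous"}], "overall": "PASS"}\n'
    "Done."
)
print(parse_judge_output(raw)["overall"])  # PASS
```

Validating the verdict values matters as much as parsing: a judge that replies "MOSTLY PASS" would otherwise be silently counted as a failure by the pass-rate calculation.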

Use a different model to judge

When using LLM-as-judge, pick a judge that differs from the model under test, ideally a stronger one: Opus evaluating outputs from Sonnet or Haiku, for example. A model judging its own outputs is biased toward accepting answers that resemble its own style, even when they are wrong. A stronger judge is more likely to catch errors.

Rubric Scoring — Example

For a customer support bot, you might score each response on four criteria. Using a numeric scale (1–4) instead of pass/fail gives you a finer signal for tracking gradual drift:

Criterion	1 (Poor)	2 (Fair)	3 (Good)	4 (Excellent)
Accuracy	Factually wrong	Partially correct	Correct but incomplete	Fully correct
Tone	Rude or dismissive	Neutral but cold	Friendly	Warm, empathetic
Conciseness	Rambling / off-topic	Too long	Appropriate length	Crisp and complete
Scope	Answered wrong question	Partially addressed	Answered the question	Answered + anticipated follow-up
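To track drift, per-response rubric scores need to be rolled up into one average per criterion. The sketch below assumes each scored response is a dict keyed by lowercase criterion names with integer scores from 1 to 4. Both the data shape and the sample scores are made up for illustration.

```python
from statistics import mean

CRITERIA = ("accuracy", "tone", "conciseness", "scope")

def average_rubric_scores(scored: list[dict]) -> dict:
    """Average each 1-4 criterion score across all scored responses."""
    for row in scored:
        for name in CRITERIA:
            if not 1 <= row[name] <= 4:
                raise ValueError(f"{name} score out of range: {row[name]}")
    return {name: round(mean(row[name] for row in scored), 2) for name in CRITERIA}

# Illustrative scores for three responses, not real results.
scored = [
    {"accuracy": 4, "tone": 3, "conciseness": 3, "scope": 4},
    {"accuracy": 2, "tone": 4, "conciseness": 3, "scope": 3},
    {"accuracy": 3, "tone": 4, "conciseness": 2, "scope": 3},
]
print(average_rubric_scores(scored))
```

Keeping the criteria separate, instead of collapsing them into a single number, makes it possible to see that accuracy dropped while tone held steady.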

Catching Regressions

Run your eval suite before and after every prompt change, model upgrade, or system change. Compare the scores — any drop in pass rate or average score is a regression signal. Even a small drop across many cases is worth investigating before deploying.

Eval results: prompt v1 vs prompt v2 (50 cases)

Metric	Prompt v1	Prompt v2
Overall pass rate	74%	82% ↑
Accuracy (avg score)	3.1	3.6 ↑
Tone (avg score)	3.8	3.8 (no change)
Off-topic handling	9/10 pass	6/10 pass ↓
Ambiguous input handling	5/8 pass	8/8 pass ↑

In this example, v2 improves the overall pass rate, accuracy, and ambiguous input handling, but off-topic handling gets worse (9/10 down to 6/10). Because results are broken down by case, you can see exactly which cases regressed and investigate those specifically, rather than rolling back the whole change.
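A per-category comparison like this one can be automated. The sketch below assumes each run is summarised as a dict mapping a category name to a (passed, total) tuple. The category keys are illustrative labels, and the counts are the ones from the example above.

```python
def category_regressions(before: dict, after: dict) -> list[str]:
    """List categories whose pass rate dropped between two eval runs.

    Each argument maps a category name to a (passed, total) tuple.
    """
    regressed = []
    for category, (old_pass, old_total) in before.items():
        new_pass, new_total = after[category]
        if new_pass / new_total < old_pass / old_total:
            regressed.append(
                f"{category}: {old_pass}/{old_total} -> {new_pass}/{new_total}")
    return regressed

v1 = {"off_topic": (9, 10), "ambiguous_input": (5, 8)}
v2 = {"off_topic": (6, 10), "ambiguous_input": (8, 8)}
print(category_regressions(v1, v2))  # ['off_topic: 9/10 -> 6/10']
```

A check like this can gate deployment: the overall pass rate rose, yet the off-topic category still gets flagged.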

Metrics That Actually Predict User Satisfaction

Task completion rate

Did the user get what they asked for? For support bots: did they stop asking follow-up questions? More predictive than quality scores alone.

Refusal rate

How often does Claude decline a valid request? Over-refusal is a real failure mode — track it separately from under-refusal.

Hallucination rate

On RAG or factual tasks: what fraction of responses contain at least one claim that cannot be verified from the source material?
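As a sketch of this definition, suppose each response has already been broken into claims and each claim labelled as supported or unsupported by the source material (by a human reviewer or a judge model). The labelling step is assumed, and the sample data is illustrative.

```python
def hallucination_rate(responses: list[list[bool]]) -> float:
    """Fraction of responses with at least one unsupported claim.

    Each response is a list of booleans, one per claim: True if the claim
    is supported by the source material, False otherwise.
    """
    if not responses:
        return 0.0
    flagged = sum(1 for claims in responses if not all(claims))
    return flagged / len(responses)

labelled = [
    [True, True, True],     # fully supported
    [True, False],          # one unsupported claim
    [True],                 # fully supported
    [False, False, True],   # two unsupported claims, still counts once
]
print(hallucination_rate(labelled))  # 0.5
```

The metric counts responses, not claims, so a response with two unsupported claims weighs the same as one with a single unsupported claim.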

Latency at P95

The 95th-percentile response time matters more than the mean. Slow outliers cause user abandonment even when average speed is fine.
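The gap between mean and P95 is easy to see with a small calculation. The sketch below uses the nearest-rank method, which is one of several percentile definitions. The latency samples are invented for illustration.

```python
import math
from statistics import mean

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# 18 fast responses and 2 slow outliers, in seconds (illustrative numbers).
latencies = [0.8] * 18 + [6.0, 9.0]
print(round(mean(latencies), 2))   # 1.47
print(percentile(latencies, 95))   # 6.0
```

A mean of about 1.5 seconds looks healthy, while the P95 shows that one request in twenty takes 6 seconds or more.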

Regression count

How many eval cases that previously passed now fail? Track this as an absolute count, not just a percentage — a 5% drop across 200 cases is 10 broken experiences.
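Counting regressions takes only the pass/fail result for each case id from two runs. In this sketch the dicts mapping case id to pass/fail are an assumed storage format, and the sample results are illustrative.

```python
def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return ids of cases that passed in the previous run but fail in the current one."""
    return sorted(
        case_id for case_id, passed in previous.items()
        if passed and current.get(case_id) is False
    )

previous = {"refund_policy_basic": True, "off_topic": True, "ambiguous_input": False}
current = {"refund_policy_basic": True, "off_topic": False, "ambiguous_input": True}
regressed = find_regressions(previous, current)
print(len(regressed), regressed)  # 1 ['off_topic']
```

Both runs here have the same pass rate (two of three), yet one case regressed, which is the situation a percentage alone would hide.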

Human override rate

In apps where users can edit Claude's output: how often do they? High edit rates signal that Claude's output isn't landing — even if automated scores look fine.

Evals are only as good as their cases

A 95% pass rate on a weak eval set gives false confidence. Invest time in building adversarial cases and edge cases — the scenarios that are rare but catastrophic when they fail. An eval that only tests the golden path will pass even for a broken feature.
