Multi-agent patterns

Advanced AI Workflows & the Claude API

Chapter 7  ·  Multi-Agent Patterns

A single Claude call is powerful, but it has limits. The context window is finite, long responses can be inconsistent, and some tasks are genuinely too complex to solve in one pass. Multi-agent patterns solve this by splitting work across multiple Claude calls — each call focused on a specific subtask, the results combined by an orchestrator. This chapter covers the core patterns, how to implement them, and when they're worth the added complexity.

Why Use Multiple Agents?

  • Tasks too large for one context window — analysing a large codebase, processing hundreds of documents, generating a long structured report section by section.
  • Parallel work — subtasks that don't depend on each other can run simultaneously, cutting total wall-clock time.
  • Specialisation — different agents can have different system prompts, making one a careful fact-checker, another a creative writer, another a data extractor.
  • Verification — one agent generates an answer; another independently checks it. Disagreements surface errors that a single pass would miss.

Core Patterns

1 — Orchestrator + Workers

Diagram (Orchestrator / Worker): Input task → Orchestrator (decomposes task) → Worker A (subtask 1), Worker B (subtask 2), Worker C (subtask 3) → Orchestrator (combines results) → Final output

The orchestrator receives the full task, breaks it into subtasks, dispatches them to worker agents (in parallel or sequence), and synthesises the results. Workers are stateless — they don't know about each other or the broader task.

2 — Sequential Pipeline

Diagram (Sequential Pipeline; each agent sees the previous agent's output): Raw input → Agent 1 (extract) → Agent 2 (analyse) → Agent 3 (format) → Output

Each agent takes the previous agent's output as input. Good for multi-stage transformations where each step builds on the last — e.g. extract → analyse → summarise → format.
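
The pipeline shape can be sketched in a few lines. This is a minimal illustration, not a production implementation: `run_pipeline`, `fake_model`, and the stage prompts are hypothetical names, and `fake_model` is an offline stub standing in for a real Claude call so the example runs without an API key.

```python
from typing import Callable, List

# Hypothetical stand-in for a model call: (system prompt, user prompt) -> reply.
ModelFn = Callable[[str, str], str]


def run_pipeline(raw_input: str, stage_prompts: List[str], call_model: ModelFn) -> str:
    """Feed each stage's output into the next stage, in order."""
    current = raw_input
    for system_prompt in stage_prompts:
        current = call_model(system_prompt, current)
    return current


def fake_model(system: str, prompt: str) -> str:
    """Offline stub: tags the text with the stage name instead of calling an API."""
    return f"[{system}] {prompt}"


# Illustrative stage prompts; real ones would be full system prompts.
STAGES = ["extract", "analyse", "format"]
```

Calling `run_pipeline("raw text", STAGES, fake_model)` shows the nesting: every stage wraps the output of the one before it, which is also why an error in an early stage carries through to the end.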

3 — Generator + Critic

Diagram (Generator / Critic, self-verification): Generator (produces answer) → Critic (checks answer) → Accept / Revise? If revise, loop back to Generator with critic feedback.

Two separate Claude instances with different system prompts. The generator writes freely; the critic is instructed to be sceptical and look for errors. Disagreement triggers a revision loop. Effective for factual accuracy, code correctness, or argument quality.

4 — Map / Reduce

Diagram (Map / Reduce; parallel processing of many items): 100 documents → Split → Extract(doc 1), Extract(doc 2), … (parallel) → Reduce (merge / aggregate) → Summary report

Each item is processed independently in parallel (map), then results are merged by a final agent (reduce). Ideal for tasks like summarising many documents, classifying a batch of items, or extracting structured data from a large file set.
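
A rough sketch of the map and reduce steps, assuming a thread pool is an acceptable way to run the map calls in parallel. The names `map_reduce`, `fake_extract`, and `fake_merge` are hypothetical, and the two stubs stand in for real "extract" and "merge" model calls so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def map_reduce(
    items: List[str],
    map_fn: Callable[[str], str],
    reduce_fn: Callable[[List[str]], str],
    max_workers: int = 4,
) -> str:
    """Apply map_fn to every item in parallel, then merge with reduce_fn."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, even if calls finish out of order.
        mapped = list(pool.map(map_fn, items))
    return reduce_fn(mapped)


def fake_extract(doc: str) -> str:
    """Offline stub for a per-document model call: keeps the first word."""
    return doc.split()[0].upper()


def fake_merge(parts: List[str]) -> str:
    """Offline stub for the reduce call: joins the mapped outputs."""
    return ", ".join(parts)
```

Because the map calls never see each other's output, `max_workers` can be tuned freely; only the reduce step needs the full set of results.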

Implementing an Orchestrator–Worker Pipeline

multi_agent.py: orchestrator dispatches parallel workers (Python)
import asyncio

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def call_claude(system: str, prompt: str,
                model: str = "claude-haiku-4-5-20251001",
                max_tokens: int = 512) -> str:
    """Single synchronous Claude call, used by workers and the orchestrator."""
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


async def worker(section: str) -> str:
    """Run a synchronous Claude call in a worker thread (Python 3.9+)."""
    return await asyncio.to_thread(
        call_claude,
        "You are a precise summariser. Return 3 bullet points only.",
        f"Summarise this section:\n\n{section}",
    )


async def orchestrate(document: str) -> str:
    # Step 1: the orchestrator splits the document into sections.
    # The reply echoes the whole document, so it needs a larger output budget
    # than the 512-token default (4096 is an illustrative value; size it to your input).
    split_result = await asyncio.to_thread(
        call_claude,
        "Split the text into logical sections. Return each section separated by '---'.",
        document,
        model="claude-sonnet-4-6",
        max_tokens=4096,
    )
    sections = [s.strip() for s in split_result.split("---") if s.strip()]

    # Step 2: workers summarise all sections in parallel
    summaries = await asyncio.gather(*(worker(s) for s in sections))

    # Step 3: the orchestrator merges all summaries into a final report
    combined = "\n\n".join(summaries)
    return await asyncio.to_thread(
        call_claude,
        "You are a report writer. Combine these section summaries into a cohesive executive summary.",
        combined,
        model="claude-sonnet-4-6",
    )


if __name__ == "__main__":
    with open("report.txt") as f:
        doc = f.read()
    result = asyncio.run(orchestrate(doc))
    print(result)
Use Haiku for workers, Sonnet/Opus for the orchestrator
Workers do focused, repetitive subtasks: extract, classify, or summarise a single item. They rarely need the most capable model, and Haiku is fast and cheap here. Save Sonnet or Opus for the orchestrator, which must reason about the overall task and synthesise results. Because worker calls usually make up most of the calls in a run, this mix can cut cost substantially with little loss in output quality. Check that claim against your own task by comparing a sample of outputs.

Generator + Critic — Self-Verification Loop

generator_critic.py: a critic agent checks the generator's work (Python)
from multi_agent import call_claude  # helper defined in multi_agent.py above

GENERATOR_SYSTEM = "You are a helpful assistant. Answer the question clearly and thoroughly."

CRITIC_SYSTEM = """You are a strict fact-checker.
Review the answer below. Begin your reply with exactly one of:
PASS: <brief reason>
FAIL: <specific issue> and <corrected answer>"""


def generate_and_verify(question: str, max_rounds: int = 3) -> str:
    answer = call_claude(GENERATOR_SYSTEM, question)

    for _ in range(max_rounds):
        verdict = call_claude(
            CRITIC_SYSTEM,
            f"Question: {question}\n\nAnswer: {answer}"
        )
        # Tolerate leading whitespace or lowercase in the critic's reply
        if verdict.strip().upper().startswith("PASS"):
            break  # critic is satisfied
        # FAIL: the generator is stateless, so pass it the previous answer
        # together with the critic's full feedback
        answer = call_claude(
            GENERATOR_SYSTEM,
            f"Question: {question}\n\n"
            f"Your previous answer:\n{answer}\n\n"
            f"A reviewer criticised it:\n{verdict}\n\n"
            "Write a revised answer to the question."
        )

    # If max_rounds is exhausted, the last revision is returned unverified
    return answer
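
The loop above decides between accepting and revising from the first word of the critic's reply, and models do not always follow an output format exactly. One possible way to make that decision more robust is a small parsing helper. `Verdict` and `parse_verdict` are hypothetical names introduced for this sketch, and treating any unrecognised reply as a failure is a design assumption, not something the listing requires.

```python
from typing import NamedTuple


class Verdict(NamedTuple):
    passed: bool
    detail: str


def parse_verdict(raw: str) -> Verdict:
    """Interpret a critic reply that should begin with 'PASS:' or 'FAIL:'.

    Anything that does not clearly start with PASS is treated as a failure,
    so a malformed reply triggers another revision instead of being accepted.
    """
    text = raw.strip()
    head, _, rest = text.partition(":")
    # Ignore case and stray markdown emphasis around the label.
    label = head.strip().strip("*").upper()
    if label == "PASS":
        return Verdict(True, rest.strip())
    if label == "FAIL":
        return Verdict(False, rest.strip())
    return Verdict(False, text)
```

Failing closed costs an extra round when the critic is merely sloppy about format, but it avoids accepting an answer on the strength of a reply such as "Passing over the first claim, the date is wrong".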

Choosing a Pattern

| Pattern | Best for | Main trade-off |
| --- | --- | --- |
| Orchestrator + Workers | Tasks that decompose cleanly into parallel subtasks: research, batch processing, report generation | Orchestrator adds latency; decomposition quality determines output quality |
| Sequential Pipeline | Multi-stage transformations where each step builds on the last: extract → clean → analyse → format | Simple to implement; errors propagate downstream and compound |
| Generator + Critic | High-stakes answers, factual claims, code correctness: anything worth double-checking | 2–3× the API calls and cost; revision loops add latency |
| Map / Reduce | Large collections of similar items: classify 500 emails, extract data from 100 PDFs | Parallel calls hit rate limits; reduce step must handle inconsistent map outputs |

Tradeoffs vs a Single Call

When multi-agent adds value
  • Task genuinely exceeds one context window
  • Subtasks can run in parallel — total time matters
  • Quality improves with a separate verification step
  • You need specialised behaviour per subtask
  • You're processing many items of the same type
When a single call is better
  • Task fits comfortably in one context window
  • Time to first output matters: a multi-agent pipeline makes at least two sequential calls before the final result is ready
  • Costs are a concern — every agent call is billed separately
  • Errors in decomposition or combination can make outputs worse than a single focused call
  • The task requires deep reasoning across all parts simultaneously
Rate limits compound quickly
Running 20 workers in parallel means 20 simultaneous API calls. Anthropic's rate limits apply to your whole organisation, not to individual requests, so parallel agents can exhaust them quickly. Size your worker pool to stay within your tier's requests-per-minute and tokens-per-minute limits, and add retry logic with exponential backoff for 429 errors.
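
A minimal sketch of both ideas, a capped worker pool and exponential backoff, written against plain asyncio so it runs without an API key. `RateLimited` is a hypothetical stand-in for whatever exception your client raises on a 429, and `with_backoff`, `run_bounded`, and the default delays are illustrative assumptions. The official Python SDK also exposes a `max_retries` client option, so check what your client already retries before layering your own logic on top.

```python
import asyncio
import random
from typing import Awaitable, Callable, List


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 error raised by an API client."""


async def with_backoff(
    call: Callable[[], Awaitable[str]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
) -> str:
    """Retry `call` on RateLimited, doubling the wait each time, with jitter."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            # Random jitter keeps parallel workers from retrying in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay))
    raise ValueError("max_attempts must be at least 1")


async def run_bounded(
    calls: List[Callable[[], Awaitable[str]]],
    max_concurrent: int = 5,
) -> List[str]:
    """Run all calls, but never more than `max_concurrent` at the same time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(call: Callable[[], Awaitable[str]]) -> str:
        async with semaphore:
            return await with_backoff(call)

    return await asyncio.gather(*(guarded(c) for c in calls))
```

The semaphore bounds how many requests are in flight at once; it does not by itself enforce a per-minute budget, so pick `max_concurrent` with your tier's limits and typical call duration in mind.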
Keep agents stateless
Workers should receive everything they need in their prompt and return everything the orchestrator needs in their response. Don't have agents share mutable state — it creates ordering dependencies that break parallelism and make debugging much harder. Treat each agent call as a pure function: input in, output out.
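
One way to picture the "pure function" framing is to give every worker call an explicit, immutable input and output. The sketch below is illustrative only: `Task`, `Result`, `run_worker`, and `collect` are hypothetical names, and `call_model` is any function that takes a system prompt and a user prompt and returns text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Task:
    """Everything a worker needs, bundled into one immutable input."""
    task_id: str
    instructions: str
    payload: str


@dataclass(frozen=True)
class Result:
    """Everything the orchestrator needs back, tagged with the task it answers."""
    task_id: str
    output: str


def run_worker(task: Task, call_model: Callable[[str, str], str]) -> Result:
    """Depends only on its arguments: no globals read, no shared state written."""
    return Result(task.task_id, call_model(task.instructions, task.payload))


def collect(results: List[Result]) -> Dict[str, str]:
    """Key outputs by task_id so completion order does not matter."""
    return {r.task_id: r.output for r in results}
```

Tagging each result with its `task_id` means the orchestrator can reassemble outputs in any order, which keeps parallel execution and retries from changing the final result.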
Next — Chapter 8: Retrieval-Augmented Generation (RAG)
Claude's knowledge has a training cutoff and no access to your own data. RAG bridges that gap — you retrieve the most relevant chunks from your documents or database and inject them into Claude's context. Chapter 8 covers how RAG works, how to build a simple vector search pipeline, and the design decisions that determine whether it actually improves answers.