Multi-agent patterns

Advanced AI Workflows & the Claude API

Chapter 7 · Multi-Agent Patterns

A single Claude call is powerful, but it has limits. The context window is finite, long responses can be inconsistent, and some tasks are too complex to solve in one pass. Multi-agent patterns address these limits by splitting work across multiple Claude calls: each call focuses on a specific subtask, and an orchestrator combines the results. This chapter covers the core patterns, how to implement them, and when they're worth the added complexity.

Why Use Multiple Agents?

Tasks too large for one context window — analysing a large codebase, processing hundreds of documents, generating a long structured report section by section.
Parallel work — subtasks that don't depend on each other can run simultaneously, cutting total wall-clock time.
Specialisation — different agents can have different system prompts, making one a careful fact-checker, another a creative writer, another a data extractor.
Verification — one agent generates an answer; another independently checks it. Disagreements surface errors that a single pass would miss.

Core Patterns

1 — Orchestrator + Workers

Orchestrator / Worker: Input task → Orchestrator (decomposes task) → Worker A (subtask 1) | Worker B (subtask 2) | Worker C (subtask 3) → Orchestrator (combines results) → Final output

The orchestrator receives the full task, breaks it into subtasks, dispatches them to worker agents (in parallel or sequence), and synthesises the results. Workers are stateless — they don't know about each other or the broader task.

2 — Sequential Pipeline

Sequential Pipeline (each agent sees the previous agent's output): Raw input → Agent 1 (extract) → Agent 2 (analyse) → Agent 3 (format) → Output

Each agent takes the previous agent's output as input. Good for multi-stage transformations where each step builds on the last — e.g. extract → analyse → summarise → format.
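A minimal sketch of this chaining, assuming each stage is just a system prompt applied to the previous stage's output. The names `run_pipeline`, `STAGES`, and `fake_call` are hypothetical and exist only for illustration; in real use the `call` argument would wrap an actual Claude request.

```python
from typing import Callable, Sequence

# A call function takes (system_prompt, user_prompt) and returns the model's text.
# It is injected so the pipeline logic can run without network access.
CallFn = Callable[[str, str], str]


def run_pipeline(raw_input: str, stage_prompts: Sequence[str], call: CallFn) -> str:
    """Feed raw_input through each stage in order.

    Each stage sees only the previous stage's output, never the original input.
    """
    current = raw_input
    for system_prompt in stage_prompts:
        current = call(system_prompt, current)
        if not current.strip():
            # Fail fast: an empty intermediate result would otherwise
            # propagate silently through every later stage.
            raise ValueError(f"Stage returned empty output: {system_prompt!r}")
    return current


# Hypothetical stage prompts for an extract -> analyse -> format pipeline.
STAGES = [
    "Extract the key facts from the text.",
    "Analyse the facts and note any trends.",
    "Format the analysis as a short report.",
]


def fake_call(system: str, prompt: str) -> str:
    """Stand-in for a model call: wraps the text in the stage's first word."""
    return f"{system.split()[0].lower()}({prompt})"


if __name__ == "__main__":
    print(run_pipeline("raw text", STAGES, fake_call))
```

The empty-output check is one cheap guard against the pattern's main weakness: a bad intermediate result is passed along unchanged unless a stage validates its input.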

3 — Generator + Critic

Generator / Critic (self-verification): Generator (produces answer) → Critic (checks answer) → Accept / Revise? If revise, loop back to Generator with critic feedback.

Two separate Claude instances with different system prompts. The generator writes freely; the critic is instructed to be sceptical and look for errors. Disagreement triggers a revision loop. Effective for factual accuracy, code correctness, or argument quality.

4 — Map / Reduce

Map / Reduce (parallel processing of many items): 100 documents → Split → Extract(doc 1) | Extract(doc 2) | … (parallel) → Reduce (merge / aggregate) → Summary report

Each item is processed independently in parallel (map), then results are merged by a final agent (reduce). Ideal for tasks like summarising many documents, classifying a batch of items, or extracting structured data from a large file set.
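The map and reduce steps can be sketched as below, assuming the per-item and merge operations are async functions supplied by the caller. `map_reduce`, `fake_extract`, and `fake_merge` are hypothetical names; the stubs stand in for real Claude calls. The `max_concurrency` cap is an added assumption, included because unbounded parallel calls can run into API rate limits.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

MapFn = Callable[[str], Awaitable[str]]
ReduceFn = Callable[[Sequence[str]], Awaitable[str]]


async def map_reduce(
    items: Sequence[str],
    map_fn: MapFn,
    reduce_fn: ReduceFn,
    max_concurrency: int = 5,
) -> str:
    """Apply map_fn to every item with bounded parallelism, then merge once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_map(item: str) -> str:
        # At most max_concurrency map calls are in flight at any time.
        async with semaphore:
            return await map_fn(item)

    # gather preserves input order, so mapped[i] corresponds to items[i].
    mapped = await asyncio.gather(*(bounded_map(item) for item in items))
    return await reduce_fn(mapped)


async def fake_extract(doc: str) -> str:
    """Stand-in for a per-document model call."""
    await asyncio.sleep(0)
    return doc.upper()


async def fake_merge(results: Sequence[str]) -> str:
    """Stand-in for the final aggregation call."""
    return " | ".join(results)


def demo() -> str:
    return asyncio.run(
        map_reduce(["a", "b", "c"], fake_extract, fake_merge, max_concurrency=2)
    )


if __name__ == "__main__":
    print(demo())
```

Because the reduce step sees every map output at once, it is also the natural place to normalise or discard malformed results before merging.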

Implementing an Orchestrator–Worker Pipeline

multi_agent.py: orchestrator dispatches parallel workers (Python)

      import asyncio

      import anthropic

      client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

      def call_claude(system: str, prompt: str,
                      model: str = "claude-haiku-4-5-20251001",
                      max_tokens: int = 512) -> str:
          """Single synchronous Claude call, used by workers and the orchestrator."""
          response = client.messages.create(
              model=model,
              max_tokens=max_tokens,
              system=system,
              messages=[{"role": "user", "content": prompt}],
          )
          return response.content[0].text

      async def worker(section: str) -> str:
          """Run the blocking Claude call in a thread so sections are processed concurrently."""
          return await asyncio.to_thread(
              call_claude,
              "You are a precise summariser. Return 3 bullet points only.",
              f"Summarise this section:\n\n{section}",
          )

      async def orchestrate(document: str) -> str:
          # Step 1: the orchestrator splits the document into sections.
          # The reply repeats the whole document, so it needs a much larger
          # output budget than the 512-token default used by workers.
          split_result = await asyncio.to_thread(
              call_claude,
              "Split the text into logical sections. Return each section separated by '---'.",
              document,
              model="claude-sonnet-4-6",
              max_tokens=8192,
          )
          sections = [s.strip() for s in split_result.split("---") if s.strip()]

          # Step 2: workers summarise all sections in parallel
          summaries = await asyncio.gather(*(worker(s) for s in sections))

          # Step 3: the orchestrator merges all summaries into a final report
          combined = "\n\n".join(summaries)
          return await asyncio.to_thread(
              call_claude,
              "You are a report writer. Combine these section summaries into a cohesive executive summary.",
              combined,
              model="claude-sonnet-4-6",
          )

      if __name__ == "__main__":
          with open("report.txt") as f:
              doc = f.read()
          result = asyncio.run(orchestrate(doc))
          print(result)

Use Haiku for workers, Sonnet/Opus for the orchestrator

Workers do focused, repetitive subtasks: extract, classify, or summarise a single item. They rarely need the most capable model, and Haiku is fast and cheap here. Save Sonnet or Opus for the orchestrator, which must reason about the overall task and synthesise results. Because most calls in these patterns are worker calls, this mix can cut cost substantially, usually with little loss in output quality on simple subtasks.

Generator + Critic — Self-Verification Loop

generator_critic.py: a critic agent checks the generator's work (Python)

      from multi_agent import call_claude  # helper defined in multi_agent.py above

      GENERATOR_SYSTEM = "You are a helpful assistant. Answer the question clearly and thoroughly."

      CRITIC_SYSTEM = """You are a strict fact-checker.
      Review the answer below and respond with exactly one of:
      PASS: <brief reason>
      FAIL: <specific issue> and <corrected answer>"""

      def generate_and_verify(question: str, max_rounds: int = 3) -> str:
          answer = call_claude(GENERATOR_SYSTEM, question)
          for _ in range(max_rounds):
              verdict = call_claude(
                  CRITIC_SYSTEM,
                  f"Question: {question}\n\nAnswer: {answer}",
              )
              if verdict.strip().upper().startswith("PASS"):
                  break  # critic is satisfied
              # FAIL: send the critic's feedback and the previous answer back to the
              # generator, which is stateless and would not otherwise see either
              answer = call_claude(
                  GENERATOR_SYSTEM,
                  f"Question: {question}\n\n"
                  f"Your previous answer: {answer}\n\n"
                  f"A reviewer criticised it: {verdict}\n\n"
                  "Revise your answer to the question.",
              )
          # If max_rounds runs out, the last revision is returned without a final check
          return answer
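The loop above decides pass or fail from the first characters of the critic's reply, which is fragile if the model adds a preamble or varies the casing. One possible hardening, sketched here under the assumption that the critic follows the `PASS:` / `FAIL:` format from its system prompt, is a small parser that treats anything unrecognised as a failure. `Verdict` and `parse_verdict` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    passed: bool
    detail: str


def parse_verdict(raw: str) -> Verdict:
    """Parse a critic reply of the form 'PASS: ...' or 'FAIL: ...'.

    Anything that does not match is treated as a failure, so a malformed
    reply never counts as approval.
    """
    text = raw.strip()
    label, sep, detail = text.partition(":")
    label = label.strip().upper()
    if sep and label == "PASS":
        return Verdict(True, detail.strip())
    if sep and label == "FAIL":
        return Verdict(False, detail.strip())
    return Verdict(False, f"Unparseable critic reply: {text}")


if __name__ == "__main__":
    print(parse_verdict("PASS: dates and figures check out"))
    print(parse_verdict("The answer looks fine to me."))
```

Defaulting to failure costs an extra revision round when the critic misformats its reply, but it avoids accepting an answer the critic never actually approved.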

Choosing a Pattern

Pattern	Best for	Main trade-off
Orchestrator + Workers	Tasks that decompose cleanly into parallel subtasks — research, batch processing, report generation	Orchestrator adds latency; decomposition quality determines output quality
Sequential Pipeline	Multi-stage transformations where each step builds on the last — extract → clean → analyse → format	Simple to implement; errors propagate downstream and compound
Generator + Critic	High-stakes answers, factual claims, code correctness: anything worth double-checking	At least 2× the API calls and cost; each revision round adds two more calls and more latency
Map / Reduce	Large collections of similar items — classify 500 emails, extract data from 100 PDFs	Parallel calls hit rate limits; reduce step must handle inconsistent map outputs

Tradeoffs vs a Single Call

When multi-agent adds value

Task genuinely exceeds one context window
Subtasks can run in parallel — total time matters
Quality improves with a separate verification step
You need specialised behaviour per subtask
You're processing many items of the same type

When a single call is better

Task fits comfortably in one context window
Latency matters: the decomposition step adds at least one extra round trip before any worker can start
Costs are a concern — every agent call is billed separately
Errors in decomposition or combination can make outputs worse than a single focused call
The task requires deep reasoning across all parts simultaneously

Rate limits compound quickly

Running 20 workers in parallel means 20 simultaneous API calls. Anthropic's rate limits apply across your whole organisation, not per request, and they cover tokens per minute as well as requests per minute, so parallel agents can exhaust them quickly. Size your worker pool to stay within your tier's limits, and add retry logic with exponential backoff for 429 errors.
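A minimal sketch of retry with exponential backoff, written against a hypothetical `RateLimited` exception so it runs without the SDK. In the Anthropic Python SDK a 429 is raised as `anthropic.RateLimitError`, and the client also performs a small number of retries itself (configurable via `max_retries`), so check the SDK documentation before layering your own retries on top. The function name `with_backoff` and its default values are assumptions for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for the client's HTTP 429 error."""


def with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimited with exponentially growing waits."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait 1s, 2s, 4s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")
```

A worker would wrap its request in a zero-argument function, for example `with_backoff(lambda: call_claude(system, prompt))`, where `call_claude` is whatever helper issues the request. The injectable `sleep` parameter exists only so the waiting behaviour can be tested without real delays.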

Keep agents stateless

Workers should receive everything they need in their prompt and return everything the orchestrator needs in their response. Don't have agents share mutable state — it creates ordering dependencies that break parallelism and make debugging much harder. Treat each agent call as a pure function: input in, output out.
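One way to make the pure-function idea concrete is to give each worker an immutable input and output type, as in the sketch below. `WorkerTask`, `WorkerResult`, `run_worker`, and `echo_call` are hypothetical names for illustration; `echo_call` stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class WorkerTask:
    """Everything a worker needs, carried in the task itself."""
    task_id: int
    system: str
    prompt: str


@dataclass(frozen=True)
class WorkerResult:
    """Everything the orchestrator needs back, keyed by task_id."""
    task_id: int
    output: str


def run_worker(task: WorkerTask, call: Callable[[str, str], str]) -> WorkerResult:
    """Pure function of its inputs: reads no globals and mutates nothing shared."""
    return WorkerResult(task_id=task.task_id, output=call(task.system, task.prompt))


def echo_call(system: str, prompt: str) -> str:
    """Stand-in for a model call."""
    return f"[{system}] {prompt}"


if __name__ == "__main__":
    tasks = [WorkerTask(i, "Summarise", f"section {i}") for i in range(3)]
    # Execution order does not matter: results are matched up by task_id.
    results = sorted(
        (run_worker(t, echo_call) for t in reversed(tasks)),
        key=lambda r: r.task_id,
    )
    for r in results:
        print(r.task_id, r.output)
```

Because each result carries its own `task_id`, the orchestrator can reassemble outputs correctly regardless of the order in which workers finish.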

Next — Chapter 8: Retrieval-Augmented Generation (RAG)
Claude's knowledge has a training cutoff and no access to your own data. RAG bridges that gap — you retrieve the most relevant chunks from your documents or database and inject them into Claude's context. Chapter 8 covers how RAG works, how to build a simple vector search pipeline, and the design decisions that determine whether it actually improves answers.