Retrieval-augmented generation (RAG)

Advanced AI Workflows & the Claude API

Chapter 8 · Retrieval-Augmented Generation (RAG)

Claude has a training knowledge cutoff and no awareness of your private data: your company's documentation, your database records, your product catalogue. RAG addresses this by fetching relevant content at query time and injecting it into Claude's context. Rather than training a custom model, you retrieve the right information and let Claude reason over it. This chapter covers how RAG works, how to build a simple pipeline, and the design decisions that separate working RAG from broken RAG.

How RAG Works — The Two Phases

RAG pipeline: indexing phase (run once) + query phase (run per request)

Phase 1: Indexing (offline)

Raw documents → Chunk text → Embed chunks (→ vectors) → Vector store (FAISS / pgvector)

Phase 2: Query (per request)

User question → Embed question → Search vector store (top-K similar chunks) → Build prompt (chunks + question) → Claude → Answer

The key insight: you never stuff all your documents into Claude's context. You use embeddings — numerical representations of meaning — to find the small subset of content most relevant to the current question, and inject only that. Claude sees the question plus the retrieved context and answers from it.
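To make "find the most relevant subset" concrete, here is a minimal sketch of similarity search over embeddings. It assumes cosine similarity as the measure (one common choice, not the only one) and uses hand-written 3-dimensional vectors in place of real model output. The names `cosine` and `top_k` are illustrative, not part of any library.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    scores = [cosine(query_vec, v) for v in chunk_vecs]
    ranked = sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy "embeddings" (assumption: real models emit hundreds of dimensions).
chunk_vecs = [
    [1.0, 0.0, 0.0],  # chunk 0
    [0.9, 0.1, 0.0],  # chunk 1
    [0.0, 1.0, 0.0],  # chunk 2
    [0.0, 0.0, 1.0],  # chunk 3
]
query_vec = [1.0, 0.2, 0.0]

best = top_k(query_vec, chunk_vecs, k=2)
print(best)
```

Only the chunks at the returned indices would be placed in the prompt; the rest of the corpus never reaches Claude.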

Chunking — Getting It Right

How you split documents into chunks is one of the most important decisions in a RAG system. Chunks that are too small lose context; chunks that are too large dilute relevance and waste context window space. A common approach is fixed-size chunks with overlap:

Document split into 200-token chunks with 50-token overlap

Chunk 1: tokens 0–200
Chunk 2: tokens 150–350 (overlaps chunk 1 on tokens 150–200)
Chunk 3 ★: tokens 300–500 (overlaps chunk 2 on tokens 300–350)
Chunk 4: tokens 450–650 (overlaps chunk 3 on tokens 450–500)

Chunk 3 (★) matched the query. Overlap means that a sentence straddling a boundary usually appears whole in at least one chunk, provided it is shorter than the overlap. This reduces the chance that relevant content is split across a boundary and missed.
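The window arithmetic in the diagram can be sketched as follows. This is an illustrative helper, not library code: `chunk_windows` is a hypothetical name, and it works on token positions only, so it assumes you have already tokenised the document.

```python
def chunk_windows(n_tokens: int, size: int = 200, overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) token ranges for fixed-size chunks with overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each new chunk starts this many tokens after the previous one
    windows = []
    start = 0
    while start < n_tokens:
        end = min(start + size, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break  # the document is fully covered
        start += step
    return windows

windows = chunk_windows(650, size=200, overlap=50)
print(windows)
```

With a 650-token document this reproduces the four ranges shown above: each chunk starts 150 tokens after the previous one.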

Chunking strategies

Fixed size with overlap — simple, predictable. Good default for prose documents. Typical: 200–500 tokens per chunk, 10–20% overlap.
Paragraph / section boundaries — split at natural breaks (double newlines, headings). Preserves semantic coherence better than fixed size.
Recursive character splitting — try paragraph breaks first; fall back to sentence breaks, then fixed size. Used by LangChain's RecursiveCharacterTextSplitter.
Semantic chunking — embed sentences, group those whose embeddings are similar. Higher quality; more expensive to compute.
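As a rough illustration of the recursive idea (paragraphs first, then sentences, then fixed size), here is a simplified sketch. It is not LangChain's implementation: `split_recursive` is a hypothetical function, sizes are counted in words rather than tokens, and it neither merges small pieces back together nor adds overlap.

```python
import re

def split_recursive(text: str, max_words: int = 50) -> list[str]:
    """Split text into pieces of at most max_words words, preferring natural breaks."""
    text = text.strip()
    if not text:
        return []
    if len(text.split()) <= max_words:
        return [text]
    # Try paragraph breaks first, then sentence endings.
    for pattern in (r"\n\s*\n", r"(?<=[.!?])\s+"):
        parts = [p for p in re.split(pattern, text) if p.strip()]
        if len(parts) > 1:
            pieces = []
            for part in parts:
                pieces.extend(split_recursive(part, max_words))
            return pieces
    # Last resort: fixed-size windows of words.
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

sample = (
    "First paragraph is short.\n\n"
    "Second paragraph has two sentences. Here is the other one."
)
pieces = split_recursive(sample, max_words=6)
print(pieces)
```

The first paragraph fits the limit and is kept whole; the second is too long, so it falls back to sentence boundaries.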

Building a Simple RAG Pipeline

rag_pipeline.py: indexing + querying with sentence-transformers + FAISS (Python)

      from pathlib import Path

      import anthropic
      import faiss
      from sentence_transformers import SentenceTransformer

      embed_model = SentenceTransformer("all-MiniLM-L6-v2")
      claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

      # ── Indexing phase ────────────────────────────────────────

      def chunk_text(text: str, size: int = 300, overlap: int = 50) -> list[str]:
          # Sizes are counted in whitespace-separated words, a rough proxy for tokens.
          words = text.split()
          chunks = []
          for i in range(0, len(words), size - overlap):
              chunks.append(" ".join(words[i:i + size]))
              if i + size >= len(words):
                  break  # reached the end; another chunk would only repeat the overlap
          return chunks

      def build_index(documents: list[str]):
          chunks = []
          for doc in documents:
              chunks.extend(chunk_text(doc))
          # One embedding row per chunk; FAISS expects float32.
          embeddings = embed_model.encode(chunks, convert_to_numpy=True).astype("float32")
          index = faiss.IndexFlatL2(embeddings.shape[1])  # exact L2 (Euclidean) search
          index.add(embeddings)
          return index, chunks

      # ── Query phase ───────────────────────────────────────────

      def retrieve(question: str, index, chunks: list[str], k: int = 4) -> list[str]:
          q_embed = embed_model.encode([question], convert_to_numpy=True).astype("float32")
          k = min(k, len(chunks))  # FAISS pads missing results with index -1
          _, indices = index.search(q_embed, k)
          return [chunks[i] for i in indices[0]]

      def ask(question: str, index, chunks: list[str]) -> str:
          context_chunks = retrieve(question, index, chunks)
          context = "\n\n---\n\n".join(context_chunks)
          prompt = (
              "Use the following context to answer the question. "
              "If the answer is not in the context, say so; do not make up information.\n\n"
              f"Context:\n{context}\n\n"
              f"Question: {question}"
          )
          response = claude.messages.create(
              model="claude-sonnet-4-6",
              max_tokens=1024,
              messages=[{"role": "user", "content": prompt}],
          )
          return response.content[0].text

      # ── Example usage ─────────────────────────────────────────

      docs = [Path(p).read_text(encoding="utf-8") for p in ["policy.txt", "manual.txt"]]
      idx, ch = build_index(docs)
      print(ask("What is the refund policy for digital products?", idx, ch))

The RAG Prompt — Grounding Claude in Retrieved Context

The prompt structure matters. Explicitly tell Claude to answer only from the provided context and to admit when the answer isn't there. This reduces the risk of the model blending retrieved facts with training knowledge in ways you can't verify:

Effective RAG prompt structure (text)

      You are a helpful assistant that answers questions using only the
      provided context. If the context does not contain the answer,
      respond with: "I don't have information about that in the documents
      provided." Do not use your general knowledge.

      Context:
      [CHUNK 1]
      ---
      [CHUNK 2]
      ---
      [CHUNK 3]

      Question: [user question]
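One way to assemble this structure in code is sketched below. The split into a system prompt and a user message is an assumption of this example rather than something the template requires, and `build_user_message` is a hypothetical helper name.

```python
# Instruction text taken from the template above.
SYSTEM_PROMPT = (
    "You are a helpful assistant that answers questions using only the "
    "provided context. If the context does not contain the answer, respond with: "
    "\"I don't have information about that in the documents provided.\" "
    "Do not use your general knowledge."
)

def build_user_message(chunks: list[str], question: str) -> str:
    """Join retrieved chunks with '---' separators and append the question."""
    context = "\n---\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

message = build_user_message(
    ["Refunds within 14 days.", "Digital goods are final sale."],
    "Can I refund an e-book?",
)
print(message)
```

With the Anthropic SDK, `SYSTEM_PROMPT` could be passed as the `system` argument of `messages.create` and `message` as the content of the user turn.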

Ask Claude to cite its sources

Add "cite which part of the context supports your answer" to the prompt. This encourages Claude to stay grounded, because every claim has to point at text it actually received, and it lets you display sources to the user to build trust. If Claude can't find a citation, it almost certainly shouldn't be making the claim.

Choosing a Vector Store

Option	Best for	Persistent?	Setup
FAISS (in-memory)	Prototyping, small datasets (<100k chunks), single-process apps	Save/load	`pip install faiss-cpu`
ChromaDB	Local development, moderate scale, easy API	Yes	`pip install chromadb`
pgvector	Production apps already using PostgreSQL — vectors live alongside your data	Yes	Postgres extension + `psycopg2`
Pinecone / Weaviate	Large scale, managed cloud, multi-tenant SaaS	Yes	Managed service — API key

Common RAG Failure Modes

Wrong chunks retrieved

The embedding model doesn't capture the question's intent, so irrelevant chunks rank higher than relevant ones. Fix: try a better embedding model, combine vector search with keyword search (hybrid search), or rephrase the query before embedding.
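Hybrid search needs some way to merge a keyword ranking with a vector ranking. The sketch below uses reciprocal rank fusion, which is one common merging technique and an assumption of this example, not something this chapter prescribes. The chunk IDs and the constant `k=60` are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into a single ranking.

    Each list contributes 1 / (k + rank) to a chunk's score, so chunks that
    rank well in more than one list rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)

# Hypothetical results from two retrievers, best match first.
vector_ranking = ["c2", "c7", "c1"]
keyword_ranking = ["c7", "c4", "c2"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

Chunks `c7` and `c2` appear in both lists, so they outrank chunks that only one retriever found.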

Answer split across chunks

The relevant content spans a chunk boundary and neither chunk alone is sufficient. Fix: increase chunk size, increase overlap, or retrieve more chunks (higher k).

Claude ignores the context

Claude falls back on training knowledge instead of the retrieved text, especially when the context contradicts something it "knows". Fix: state the instruction to use only the provided context more explicitly, ideally in the system prompt.

Stale index

Documents have changed but the index hasn't been rebuilt. Fix: trigger re-indexing when source documents are updated — either on a schedule or via a document change event/webhook.
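A simple way to decide what needs re-indexing is to compare content hashes against those recorded at indexing time. This is a minimal sketch under that assumption; `find_stale` and the dictionary layout are hypothetical, and a real system would persist the hashes next to the index.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale(documents: dict[str, str], indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare current documents with the hashes stored when the index was built."""
    # New or modified documents: no stored hash, or the hash no longer matches.
    changed = sorted(
        doc_id for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    # Documents that were indexed but no longer exist.
    removed = sorted(doc_id for doc_id in indexed_hashes if doc_id not in documents)
    return {"reindex": changed, "delete": removed}

indexed = {"policy.txt": content_hash("v1"), "old.txt": content_hash("x")}
current = {"policy.txt": "v2", "manual.txt": "new"}

result = find_stale(current, indexed)
print(result)
```

Running this check on a schedule, or from a document change event, means only the affected documents need to be re-chunked and re-embedded.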

Context window overflow

Retrieving too many chunks fills the context window, leaving no room for the answer. Fix: limit the number of retrieved chunks or compress them with a summarisation step. Prompt caching (Chapter 6) lowers the cost of resending a stable document corpus, but it does not free up context space.
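Limiting retrieved chunks can be done against an explicit token budget. The sketch below is illustrative: `fit_to_budget` is a hypothetical helper, and the 1.3 tokens-per-word ratio is a rough assumption standing in for a real tokenizer count.

```python
def fit_to_budget(chunks: list[str], max_tokens: int, tokens_per_word: float = 1.3) -> list[str]:
    """Keep the highest-ranked chunks whose estimated token total fits the budget.

    Assumes chunks are ordered best match first.
    """
    selected = []
    used = 0.0
    for chunk in chunks:
        cost = len(chunk.split()) * tokens_per_word  # crude token estimate
        if used + cost > max_tokens:
            break  # stop at the first chunk that would exceed the budget
        selected.append(chunk)
        used += cost
    return selected

ranked_chunks = [
    "a b c d e f g h i j",  # 10 words, about 13 tokens
    "k l m n o p q r s t",  # 10 words, about 13 tokens
    "u v",
]
kept = fit_to_budget(ranked_chunks, max_tokens=20)
print(kept)
```

Stopping at the first chunk that does not fit keeps the selection in rank order; skipping it and trying smaller chunks further down is a possible alternative.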

Hallucinated citations

Claude invents source text that was never retrieved. Fix: ask Claude to quote directly from the context rather than paraphrase, and validate that quotes exist in the retrieved chunks.
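The validation step can be as simple as checking that every quoted span in the answer occurs in a retrieved chunk. This sketch assumes quotes are delimited by straight double quotes and compares text case-insensitively with collapsed whitespace; `unsupported_quotes` is a hypothetical helper.

```python
import re

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so minor formatting differences don't matter."""
    return " ".join(text.lower().split())

def unsupported_quotes(answer: str, chunks: list[str]) -> list[str]:
    """Return quoted spans from the answer that do not appear in any retrieved chunk."""
    quotes = re.findall(r'"([^"]+)"', answer)
    haystack = [normalise(c) for c in chunks]
    return [q for q in quotes if not any(normalise(q) in h for h in haystack)]

retrieved = ["Refunds are available within 14 days of purchase."]
answer = (
    'The policy says "refunds are available within 14 days" '
    'and "digital items are excluded".'
)

missing = unsupported_quotes(answer, retrieved)
print(missing)
```

A non-empty result signals that the answer should be rejected, regenerated, or flagged before it is shown to the user.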

RAG vs Fine-tuning vs Prompt Caching

Approach	Knowledge source	Updatable?	Cost	Best for
RAG	External documents, retrieved at query time	Yes — re-index	Embedding + Claude API calls	Large, changing document sets; factual Q&A
Prompt caching	Documents injected into every prompt, cached	Yes, but edits invalidate the cache	Cache write premium; cached reads billed at roughly 10% of the base input price	Stable documents, <200k tokens total
Fine-tuning	Baked into model weights during training	Expensive to update	High — training compute	Style/format adaptation; not recommended for factual knowledge

Fine-tuning is rarely the right answer for knowledge

A common misconception is that fine-tuning "teaches" Claude facts from your documents. In practice, fine-tuning mainly adjusts style and behaviour; it is unreliable for factual recall, and fine-tuned models hallucinate just as confidently. For knowledge from your own data, RAG or placing the documents directly in the prompt (with prompt caching) is almost always the better choice.

Next — Chapter 9: Evaluating Outputs
How do you know if your AI feature is actually working? Testing Claude-powered apps requires a different approach than traditional software testing — outputs are probabilistic, not deterministic. Chapter 9 covers building eval sets, scoring strategies (LLM-as-judge, rubrics, reference comparisons), catching regressions when you change your prompt, and the metrics that actually predict user satisfaction.