Retrieval-augmented generation (RAG)

Advanced AI Workflows & the Claude API

Chapter 8  ·  Retrieval-Augmented Generation (RAG)

Claude's training has a knowledge cutoff and no awareness of your private data — your company's documentation, your database records, your product catalogue. RAG solves this by fetching relevant content at query time and injecting it into Claude's context. Rather than training a custom model, you retrieve the right information and let Claude reason over it. This chapter covers how RAG works, how to build a simple pipeline, and the design decisions that separate working RAG from broken RAG.

How RAG Works — The Two Phases

RAG pipeline: indexing phase (run once) + query phase (run per request)
Phase 1, Indexing (offline): Raw documents → Chunk text → Embed chunks (→ vectors) → Vector store (FAISS / pgvector)
Phase 2, Query (per request): User question → Embed question → Search vector store (top-K similar chunks) → Build prompt (chunks + question) → Claude → Answer

The key insight: you never stuff all your documents into Claude's context. You use embeddings — numerical representations of meaning — to find the small subset of content most relevant to the current question, and inject only that. Claude sees the question plus the retrieved context and answers from it.
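To make the retrieval step concrete, here is a minimal sketch of similarity search over embeddings. The three-dimensional vectors and chunk texts are invented toy values (real embedding models produce hundreds of dimensions), and `cosine_similarity` and `top_k` are hypothetical helpers written for this illustration, not part of any library.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return indices of the k chunks most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy data: made-up 3-dimensional "embeddings" for three chunks.
toy_chunks = [
    "Refunds are issued within 14 days.",
    "Our office is closed on public holidays.",
    "Digital products can be refunded before download.",
]
toy_vectors = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.8, 0.3, 0.1],
]
toy_query = [1.0, 0.2, 0.0]  # stands in for the embedded user question

best = top_k(toy_query, toy_vectors, k=2)
for i in best:
    print(toy_chunks[i])
```

Only the two refund-related chunks are selected; the unrelated chunk never reaches the prompt. A vector store performs the same ranking, but with indexing structures that scale beyond a linear scan.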

Chunking — Getting It Right

How you split documents into chunks is one of the most important decisions in a RAG system. Chunks that are too small lose context; chunks that are too large dilute relevance and waste context window space. A common approach is fixed-size chunks with overlap:

Document split into 200-token chunks with 50-token overlap
Chunk 1: tokens 0–200
Chunk 2: tokens 150–350 (overlaps Chunk 1 on tokens 150–200)
Chunk 3 ★: tokens 300–500 (overlaps Chunk 2 on tokens 300–350)
Chunk 4: tokens 450–650 (overlaps Chunk 3 on tokens 450–500)
Chunk 3 (★) matched the query. Because adjacent chunks share 50 tokens, text near a boundary appears in both neighbouring chunks, which reduces the chance that relevant content is split across a boundary and missed.
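The offsets in the diagram follow from a simple rule: each chunk starts `size - overlap` tokens after the previous one. The sketch below reproduces them; `chunk_windows` is a hypothetical helper, and the 650-token document length is taken from the diagram.

```python
def chunk_windows(total_tokens: int, size: int = 200, overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) token offsets for fixed-size chunks with overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far each new chunk advances
    windows = []
    start = 0
    while start < total_tokens:
        end = min(start + size, total_tokens)
        windows.append((start, end))
        if end == total_tokens:
            break  # document fully covered; avoid a redundant tail chunk
        start += step
    return windows

windows = chunk_windows(650)
print(windows)

# A token inside an overlap region belongs to two chunks.
containing_320 = [w for w in windows if w[0] <= 320 < w[1]]
print(containing_320)
```

With the defaults, the step is 150 tokens, so a 650-token document yields the four chunks shown above, and token 320 falls in both Chunk 2 and Chunk 3.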

Chunking strategies

  • Fixed size with overlap — simple, predictable. Good default for prose documents. Typical: 200–500 tokens per chunk, 10–20% overlap.
  • Paragraph / section boundaries — split at natural breaks (double newlines, headings). Preserves semantic coherence better than fixed size.
  • Recursive character splitting — try paragraph breaks first; fall back to sentence breaks, then fixed size. Used by LangChain's RecursiveCharacterTextSplitter.
  • Semantic chunking — embed sentences, group those whose embeddings are similar. Higher quality; more expensive to compute.
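As an illustration of the boundary-aware strategies above, here is a simplified sketch of the recursive idea: keep whole paragraphs when they fit, fall back to sentences, then to fixed-size word windows. It is not LangChain's implementation; `split_recursive` and its `max_words` parameter are assumptions made for this example, it counts words rather than tokens, and it does not merge small pieces back together.

```python
import re

def split_recursive(text: str, max_words: int = 50) -> list[str]:
    """Split at paragraph breaks, then sentences, then fixed word windows."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph.split()) <= max_words:
            chunks.append(paragraph)  # paragraph fits: keep it whole
            continue
        # Paragraph too long: fall back to sentence boundaries.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            words = sentence.split()
            if not words:
                continue
            if len(words) <= max_words:
                chunks.append(sentence)
            else:
                # Sentence still too long: fall back to fixed-size windows.
                for i in range(0, len(words), max_words):
                    chunks.append(" ".join(words[i:i + max_words]))
    return chunks

sample = (
    "Refunds are available for 14 days.\n\n"
    "Digital products are different. They can be refunded only before download. "
    "Contact support to start a claim."
)
pieces = split_recursive(sample, max_words=8)
for piece in pieces:
    print(repr(piece))
```

The first paragraph fits in one chunk; the second exceeds the limit and is split into its three sentences.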

Building a Simple RAG Pipeline

rag_pipeline.py: indexing + querying with sentence-transformers + FAISS (Python)
import anthropic
import faiss
from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer("all-MiniLM-L6-v2")
claude = anthropic.Anthropic()

# ── Indexing phase ────────────────────────────────────────
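def chunk_text(text: str, size=300, overlap=50) -> list[str]:
    # Splits on whitespace, so `size` and `overlap` count words, not model tokens.
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap  # each chunk starts this many words after the previous one
    return [
        " ".join(words[i:i + size])
        for i in range(0, len(words), step)
    ]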

def build_index(documents: list[str]):
    chunks = []
    for doc in documents:
        chunks.extend(chunk_text(doc))

    embeddings = embed_model.encode(chunks, convert_to_numpy=True)
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings.astype("float32"))
    return index, chunks

# ── Query phase ───────────────────────────────────────────
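def retrieve(question: str, index, chunks: list[str], k=4) -> list[str]:
    q_embed = embed_model.encode([question], convert_to_numpy=True).astype("float32")
    _, indices = index.search(q_embed, k)
    # FAISS pads the result with -1 when the index holds fewer than k vectors;
    # skip those so chunks[-1] is not returned by mistake.
    return [chunks[i] for i in indices[0] if i >= 0]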

def ask(question: str, index, chunks: list[str]) -> str:
    context_chunks = retrieve(question, index, chunks)
    context = "\n\n---\n\n".join(context_chunks)

    prompt = f"""Use the following context to answer the question. \
If the answer is not in the context, say so — do not make up information.

Context:
{context}

Question: {question}"""


    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# ── Example usage ─────────────────────────────────────────
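docs = []
for path in ["policy.txt", "manual.txt"]:
    # Use a context manager so each file is closed after reading.
    with open(path, encoding="utf-8") as f:
        docs.append(f.read())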
idx, ch = build_index(docs)
print(ask("What is the refund policy for digital products?", idx, ch))

The RAG Prompt — Grounding Claude in Retrieved Context

The prompt structure matters. Explicitly tell Claude to answer only from the provided context and to admit when the answer isn't there. This reduces the risk of the model blending retrieved facts with training knowledge in ways you can't verify:

Effective RAG prompt structure (text)
You are a helpful assistant that answers questions using only the
provided context. If the context does not contain the answer,
respond with: "I don't have information about that in the documents
provided." Do not use your general knowledge.

Context:
[CHUNK 1]
---
[CHUNK 2]
---
[CHUNK 3]

Question: [user question]
Ask Claude to cite its sources
Add "cite which part of the context supports your answer" to the prompt. This encourages Claude to stay grounded in the text it actually received and lets you display sources to the user to build trust. If Claude can't find a citation, it almost certainly shouldn't be making the claim. Citations can still be invented (see "Hallucinated citations" under Common RAG Failure Modes), so check them against the retrieved chunks before displaying them.
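One way to make citations easy to request and to display is to number the chunks when assembling the prompt. The sketch below builds the prompt structure shown above; `build_rag_prompt` is a hypothetical helper, and the `[n]` numbering scheme and the wording of the citation instruction are assumptions rather than a required format.

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt with numbered chunks the model can cite."""
    numbered = [f"[{n}] {chunk}" for n, chunk in enumerate(chunks, start=1)]
    context = "\n---\n".join(numbered)
    return (
        "You are a helpful assistant that answers questions using only the\n"
        "provided context. If the context does not contain the answer,\n"
        'respond with: "I don\'t have information about that in the documents\n'
        'provided." Do not use your general knowledge.\n'
        "Cite which part of the context supports your answer by its number, e.g. [2].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

example_prompt = build_rag_prompt(
    "What is the refund policy for digital products?",
    ["Digital products can be refunded before download.",
     "Refunds are issued within 14 days."],
)
print(example_prompt)
```

The resulting string can be passed as the user message content in the pipeline's `ask` function in place of the inline f-string.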

Choosing a Vector Store

Option | Best for | Persistent? | Setup
FAISS (in-memory) | Prototyping, small datasets (<100k chunks), single-process apps | Save/load | pip install faiss-cpu
ChromaDB | Local development, moderate scale, easy API | Yes | pip install chromadb
pgvector | Production apps already using PostgreSQL; vectors live alongside your data | Yes | Postgres extension + psycopg2
Pinecone / Weaviate | Large scale, managed cloud, multi-tenant SaaS | Yes | Managed service (API key)

Common RAG Failure Modes

Wrong chunks retrieved
The embedding model doesn't capture the question's intent — irrelevant chunks ranked higher than relevant ones. Fix: try a better embedding model, add keyword search as a fallback (hybrid search), or rephrase the query before embedding.
Answer split across chunks
The relevant content spans a chunk boundary and neither chunk alone is sufficient. Fix: increase chunk size, increase overlap, or retrieve more chunks (higher k).
Claude ignores the context
Claude falls back on training knowledge instead of the retrieved text, especially when the context contradicts something it "knows". Fix: strengthen the instruction to use only the provided context, for example by moving it into a system prompt.
Stale index
Documents have changed but the index hasn't been rebuilt. Fix: trigger re-indexing when source documents are updated — either on a schedule or via a document change event/webhook.
Context window overflow
Retrieving too many chunks fills the context window, leaving no room for the answer. Fix: limit retrieved chunks (a lower k or a token budget), or compress them with a summarisation step. Prompt caching (Chapter 6) lowers the cost of resending a stable document corpus, but it does not free up context space.
Hallucinated citations
Claude invents source text that was never retrieved. Fix: ask Claude to quote directly from the context rather than paraphrase, and validate that quotes exist in the retrieved chunks.
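The quote-validation fix for hallucinated citations can be sketched as a post-processing check. This is a minimal illustration, assuming the answer marks quotations with straight double quotes; `normalise` and `find_unsupported_quotes` are hypothetical helpers, and the sample chunk and answer are invented.

```python
import re

def normalise(text: str) -> str:
    """Collapse whitespace and lowercase so line wraps do not break matching."""
    return " ".join(text.split()).lower()

def find_unsupported_quotes(answer: str, chunks: list[str]) -> list[str]:
    """Return quoted passages in the answer that appear in no retrieved chunk."""
    quotes = re.findall(r'"([^"]+)"', answer)
    haystack = [normalise(c) for c in chunks]
    return [
        q for q in quotes
        if not any(normalise(q) in chunk for chunk in haystack)
    ]

retrieved = ["Digital products can be refunded\nwithin 14 days of purchase."]
answer = (
    'The policy says "refunded within 14 days of purchase" '
    'and that "shipping is always free".'
)
unsupported = find_unsupported_quotes(answer, retrieved)
print(unsupported)  # → ['shipping is always free']
```

An empty result means every quoted passage was found verbatim (ignoring case and whitespace) in the retrieved chunks; anything returned can be flagged or stripped before the answer is shown.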

RAG vs Fine-tuning vs Prompt Caching

Approach | Knowledge source | Updatable? | Cost | Best for
RAG | External documents, retrieved at query time | Yes (re-index) | Embedding + Claude API calls | Large, changing document sets; factual Q&A
Prompt caching | Documents injected into every prompt, cached | Yes (a change invalidates the cache) | Cache-write surcharge; cached reads billed at about 10% of the normal input price | Stable documents, <200k tokens total
Fine-tuning | Baked into model weights during training | Expensive to update | High (training compute) | Style/format adaptation; not recommended for factual knowledge
Fine-tuning is rarely the right answer for knowledge
A common misconception is that fine-tuning "teaches" Claude facts from your documents. In practice, fine-tuning mainly adjusts style and behaviour rather than providing reliable factual recall, and fine-tuned models hallucinate just as confidently. For knowledge from your own data, RAG or placing the documents directly in the prompt (with prompt caching) is almost always the better choice.
Next — Chapter 9: Evaluating Outputs
How do you know if your AI feature is actually working? Testing Claude-powered apps requires a different approach than traditional software testing — outputs are probabilistic, not deterministic. Chapter 9 covers building eval sets, scoring strategies (LLM-as-judge, rubrics, reference comparisons), catching regressions when you change your prompt, and the metrics that actually predict user satisfaction.