Retrieval-augmented generation (RAG)
Advanced AI Workflows & the Claude API
Claude's knowledge stops at its training cutoff, and it has no awareness of your private data: your company's documentation, your database records, your product catalogue. RAG solves this by fetching relevant content at query time and injecting it into Claude's context. Rather than training a custom model, you retrieve the right information and let Claude reason over it. This chapter covers how RAG works, how to build a simple pipeline, and the design decisions that separate working RAG from broken RAG.
How RAG Works — The Two Phases
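[Diagram: the two phases. Indexing: documents → chunks → vectors, stored in a vector index (FAISS / pgvector). Query: question → top-K similar chunks → chunks + question sent to Claude.]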
The key insight: you never stuff all your documents into Claude's context. You use embeddings — numerical representations of meaning — to find the small subset of content most relevant to the current question, and inject only that. Claude sees the question plus the retrieved context and answers from it.
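The "find the most relevant subset" step can be illustrated without any embedding model. The sketch below uses hand-written three-dimensional vectors as stand-ins for real embeddings (an assumption for illustration; real models produce hundreds of dimensions) and ranks chunks by cosine similarity to the question vector. The names `cosine` and `top_k` and the sample texts are hypothetical.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunks most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy vectors standing in for real embeddings (illustrative values only).
chunk_texts = [
    "Refunds are issued within 14 days.",
    "Our office is in Leeds.",
    "Digital goods are refundable.",
]
chunk_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
question_vec = [1.0, 0.2, 0.0]

best = top_k(question_vec, chunk_vecs, k=2)
print([chunk_texts[i] for i in best])
```

Only the two refund-related chunks would be passed to Claude here; the unrelated office chunk is never injected.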
Chunking — Getting It Right
How you split documents into chunks is one of the most important decisions in a RAG system. Chunks that are too small lose context; chunks that are too large dilute relevance and waste context window space. A common approach is fixed-size chunks with overlap:
[Figure: overlapping fixed-size chunks covering tokens 0–200, 150–350, 300–500, and 450–650, i.e. 200-token chunks with a 50-token overlap.]
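The window boundaries in the illustration follow from simple arithmetic: each chunk starts `size - overlap` tokens after the previous one. A minimal sketch, assuming a 650-token document and a hypothetical helper named `chunk_spans`:

```python
def chunk_spans(total_tokens: int, size: int = 200, overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) token offsets for fixed-size chunks with overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far each new chunk advances
    spans = []
    start = 0
    while start < total_tokens:
        spans.append((start, min(start + size, total_tokens)))
        if start + size >= total_tokens:
            break  # this chunk already reaches the end of the document
        start += step
    return spans

spans = chunk_spans(650)
print(spans)
```

With these defaults, consecutive chunks share 50 tokens, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.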
Chunking strategies
- Fixed size with overlap — simple, predictable. Good default for prose documents. Typical: 200–500 tokens per chunk, 10–20% overlap.
- Paragraph / section boundaries — split at natural breaks (double newlines, headings). Preserves semantic coherence better than fixed size.
- Recursive character splitting — try paragraph breaks first; fall back to sentence breaks, then fixed size. Used by LangChain's RecursiveCharacterTextSplitter.
- Semantic chunking — embed sentences, group those whose embeddings are similar. Higher quality; more expensive to compute.
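The recursive strategy in the list above can be sketched in a few lines of standard-library Python. This is a simplified illustration, not LangChain's implementation: it measures size in characters, and it does not merge small neighbouring pieces back together as a production splitter typically would. The function name `recursive_split` and the `max_chars` parameter are hypothetical.

```python
import re

def recursive_split(text: str, max_chars: int = 200) -> list[str]:
    """Split at paragraph breaks, then sentence ends, then fixed size."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    # Try separators from coarsest (blank line) to finest (sentence end).
    for pattern in (r"\n\s*\n", r"(?<=[.!?])\s+"):
        parts = [p for p in re.split(pattern, text) if p.strip()]
        if len(parts) > 1:
            chunks = []
            for part in parts:
                chunks.extend(recursive_split(part, max_chars))
            return chunks
    # No natural boundary left: fall back to fixed-size slices.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

sample = "First paragraph. It has two sentences.\n\nSecond paragraph is short."
pieces = recursive_split(sample, max_chars=30)
print(pieces)
```

The second paragraph fits within the limit and stays whole, while the first is too long and is split at its sentence boundary.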
Building a Simple RAG Pipeline
from pathlib import Path

import anthropic
import faiss
from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer("all-MiniLM-L6-v2")
claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# ── Indexing phase ────────────────────────────────────────
def chunk_text(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into chunks of `size` words that share `overlap` words.

    Sizes are counted in whitespace-separated words, not model tokens.
    """
    words = text.split()
    return [
        " ".join(words[i:i + size])
        for i in range(0, len(words), size - overlap)
    ]

def build_index(documents: list[str]):
    """Chunk every document, embed the chunks, and store them in FAISS."""
    chunks = []
    for doc in documents:
        chunks.extend(chunk_text(doc))
    embeddings = embed_model.encode(chunks, convert_to_numpy=True)
    index = faiss.IndexFlatL2(embeddings.shape[1])  # exact L2 search
    index.add(embeddings.astype("float32"))
    return index, chunks

# ── Query phase ───────────────────────────────────────────
def retrieve(question: str, index, chunks: list[str], k: int = 4) -> list[str]:
    """Return the k chunks whose embeddings are closest to the question."""
    q_embed = embed_model.encode([question], convert_to_numpy=True).astype("float32")
    _, indices = index.search(q_embed, k)
    # FAISS pads with -1 when the index holds fewer than k vectors;
    # skip those so chunks[-1] is not returned by accident.
    return [chunks[i] for i in indices[0] if i != -1]

def ask(question: str, index, chunks: list[str]) -> str:
    """Retrieve context for the question and ask Claude to answer from it."""
    context_chunks = retrieve(question, index, chunks)
    context = "\n\n---\n\n".join(context_chunks)
    prompt = (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say so; do not make up information.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# ── Example usage ─────────────────────────────────────────
docs = [Path(p).read_text(encoding="utf-8") for p in ["policy.txt", "manual.txt"]]
idx, ch = build_index(docs)
print(ask("What is the refund policy for digital products?", idx, ch))
The RAG Prompt — Grounding Claude in Retrieved Context
The prompt structure matters. Explicitly tell Claude to answer only from the provided context and to admit when the answer isn't there. This prevents the model from blending retrieved facts with training knowledge in ways you can't verify:
Answer the question using only the information in the
provided context. If the context does not contain the answer,
respond with: "I don't have information about that in the documents
provided." Do not use your general knowledge.
Context:
[CHUNK 1]
---
[CHUNK 2]
---
[CHUNK 3]
Question: [user question]
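A template like this is usually filled in by a small helper. The sketch below is one way to do it, assuming the retrieved chunks arrive as a list of strings. The function name `build_rag_prompt` is hypothetical, and the opening instruction is one possible wording of the "answer only from the provided context" rule described above.

```python
def build_rag_prompt(chunks: list[str], question: str) -> str:
    """Assemble a grounded RAG prompt from retrieved chunks and a question."""
    # Separate chunks with a visible delimiter so boundaries stay clear.
    context = "\n---\n".join(chunk.strip() for chunk in chunks)
    return (
        "Answer the question using only the provided context. "
        "If the context does not contain the answer, respond with: "
        "\"I don't have information about that in the documents provided.\" "
        "Do not use your general knowledge.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    ["Refunds are issued within 14 days.", "Digital goods are refundable."],
    "What is the refund policy for digital products?",
)
print(prompt)
```

Keeping prompt assembly in one function makes it easy to change the grounding instruction in a single place and to test it without calling the API.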
Choosing a Vector Store
| Option | Best for | Persistent? | Setup |
|---|---|---|---|
| FAISS (in-memory) | Prototyping, small datasets (<100k chunks), single-process apps | Save/load | pip install faiss-cpu |
| ChromaDB | Local development, moderate scale, easy API | Yes | pip install chromadb |
| pgvector | Production apps already using PostgreSQL — vectors live alongside your data | Yes | Postgres extension + psycopg2 |
| Pinecone / Weaviate | Large scale, managed cloud, multi-tenant SaaS | Yes | Managed service — API key |
Common RAG Failure Modes
RAG vs Fine-tuning vs Prompt Caching
| Approach | Knowledge source | Updatable? | Cost | Best for |
|---|---|---|---|---|
| RAG | External documents, retrieved at query time | Yes — re-index | Embedding + Claude API calls | Large, changing document sets; factual Q&A |
| Prompt caching | Documents injected into every prompt, cached | Yes, but any change invalidates the cache | Cache-write surcharge, then cached reads at about 10% of the base input price | Stable documents, <200k tokens total |
| Fine-tuning | Baked into model weights during training | Expensive to update | High — training compute | Style/format adaptation; not recommended for factual knowledge |
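The choice between RAG and prompt caching in the table largely comes down to corpus size and how often the documents change. The sketch below turns that rule of thumb into code. It relies on two loud assumptions: token counts are estimated at roughly four characters per token (a coarse heuristic for English text, not a real tokenizer), and the 200k-token threshold is taken from the table above. The names `estimate_tokens` and `suggest_approach` are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; an assumption, not a real tokenizer."""
    return int(len(text) / chars_per_token)

def suggest_approach(
    documents: list[str],
    changes_often: bool,
    cache_limit_tokens: int = 200_000,  # threshold from the comparison table
) -> str:
    """Suggest 'prompt caching' for small, stable corpora, else 'RAG'."""
    total = sum(estimate_tokens(doc) for doc in documents)
    if total < cache_limit_tokens and not changes_often:
        return "prompt caching"
    return "RAG"

print(suggest_approach(["x" * 4_000], changes_often=False))
print(suggest_approach(["x" * 1_000_000], changes_often=False))
```

Fine-tuning is deliberately left out of the helper, since the table does not recommend it for factual knowledge.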