How to Implement RAG (Retrieval-Augmented Generation)
Build production RAG pipelines. Covers chunking strategies, embedding models, vector stores, retrieval techniques, evaluation, and common failure modes.
RAG is the highest-ROI pattern for enterprise AI right now. It lets you ground LLM responses in your own data without fine-tuning. But naive implementations fail badly — here’s how to build one that actually works.
Architecture Overview
```
Documents → Chunk → Embed → Store (Vector DB)
                                      ↓
User Query → Embed → Search → Top-K Results
                                      ↓
       Prompt (Context + Query) → LLM → Response
```
Step 1: Document Ingestion and Chunking
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunking strategy matters more than model choice
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # Characters with length_function=len; pass a token counter for token-based sizing
    chunk_overlap=50,  # Overlap prevents context loss at boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)

# Process documents
chunks = splitter.split_documents(documents)

# Add metadata for filtering
for i, chunk in enumerate(chunks):
    chunk.metadata.update({
        "source": chunk.metadata.get("source", "unknown"),  # Carried over from the parent document
        "chunk_index": i,
        "doc_type": "policy",  # Enables filtered search
        "department": "engineering",
    })
```
Chunking Decision Matrix
| Document Type | Chunk Size | Overlap | Strategy |
|---|---|---|---|
| Technical docs | 512 tokens | 50 | Recursive (headers → paragraphs) |
| Legal / policy | 1024 tokens | 100 | Paragraph-level |
| Code files | Function-level | 0 | AST-based splitting |
| Q&A / FAQ | 256 tokens | 0 | One chunk per question |
| Emails / chat | 512 tokens | 25 | Message-level |
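The "AST-based splitting" row for code files can be sketched for Python sources with the standard `ast` module. This is a minimal illustration, not a production splitter (tools like tree-sitter handle more languages and nested definitions):

```python
import ast

def chunk_python_source(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment recovers the exact original text of the node
            chunks.append({"name": node.name, "text": ast.get_source_segment(source, node)})
    return chunks

code = '''
def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''
print([c["name"] for c in chunk_python_source(code)])  # → ['add', 'Greeter']
```

Each chunk keeps a complete, self-contained unit of code, which embeds far better than an arbitrary 512-character window cut mid-function.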
Step 2: Embedding and Vector Storage
```python
from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

# Embedding model comparison
models = {
    "text-embedding-3-small": {"dim": 1536, "cost": "$0.02/1M tokens", "quality": "Good"},
    "text-embedding-3-large": {"dim": 3072, "cost": "$0.13/1M tokens", "quality": "Best"},
    "text-embedding-ada-002": {"dim": 1536, "cost": "$0.10/1M tokens", "quality": "Legacy"},
}
```
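Vector search ranks chunks by embedding similarity. OpenAI embeddings are unit-normalized, so cosine similarity reduces to a dot product; the general formula is sketched below with toy vectors standing in for real embeddings:

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0 (orthogonal)
```

The vector store computes this at scale with approximate nearest-neighbor indexes; you rarely call it yourself, but it is what the `score` field in query results means when the index metric is cosine.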
Pinecone Integration
```python
from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")
index = pc.Index("knowledge-base")

# Embed all chunks in one batched call instead of one API call per chunk
embeddings = embed_texts([chunk.page_content for chunk in chunks])

# Upsert embeddings with metadata
vectors = []
for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
    vectors.append({
        "id": f"doc-{chunk.metadata['source']}-{i}",
        "values": embedding,
        "metadata": {
            "text": chunk.page_content,
            "source": chunk.metadata["source"],
            "doc_type": chunk.metadata["doc_type"],
        },
    })

index.upsert(vectors=vectors, namespace="docs")
```
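For large corpora, a single upsert call with every vector will exceed request limits; Pinecone's guidance is to upsert in batches (commonly at most a few hundred vectors per request). A small stdlib helper for that, which also works for batching the embedding calls:

```python
def batched(items: list, size: int) -> list[list]:
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Usage sketch: for batch in batched(vectors, 100): index.upsert(vectors=batch, namespace="docs")
print(batched(list(range(7)), 3))  # → [[0, 1, 2], [3, 4, 5], [6]]
```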
Step 3: Retrieval
```python
def retrieve(query: str, top_k: int = 5, filters: dict | None = None) -> list[dict]:
    query_embedding = embed_texts([query])[0]
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True,
        filter=filters,  # e.g., {"doc_type": "policy"}
        namespace="docs",
    )
    return [
        {
            "text": match.metadata["text"],
            "source": match.metadata["source"],
            "score": match.score,
        }
        for match in results.matches
    ]
```
Hybrid Search (Dense + Sparse)
```python
# Combine semantic search with keyword matching
from pinecone_text.sparse import BM25Encoder

bm25 = BM25Encoder()
bm25.fit(corpus)  # corpus: list of chunk texts

def hybrid_search(query: str, alpha: float = 0.7):
    """alpha=1.0 is pure semantic, alpha=0.0 is pure keyword."""
    dense = embed_texts([query])[0]
    sparse = bm25.encode_queries(query)
    # Convex-combination weighting; without this scaling, alpha has no effect
    weighted_dense = [v * alpha for v in dense]
    weighted_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    results = index.query(
        vector=weighted_dense,
        sparse_vector=weighted_sparse,
        top_k=10,
        include_metadata=True,
        namespace="docs",
    )
    return results
```
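An alternative to weighting inside a single query is to run dense and sparse searches separately and merge their ranked result lists with reciprocal rank fusion (RRF), which needs no score normalization. A minimal sketch with hypothetical document IDs:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each doc scores sum(1 / (k + rank)). k=60 is the
    conventional constant from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["a", "b", "c"]   # IDs ranked by semantic search
sparse_hits = ["b", "d", "a"]  # IDs ranked by keyword search
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))  # → ['b', 'a', 'd', 'c']
```

Documents ranked highly by both retrievers float to the top, which makes RRF a robust default when the two score distributions are not comparable.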
Step 4: Generation with Context
```python
def rag_query(user_question: str) -> str:
    # Retrieve relevant chunks
    context_chunks = retrieve(user_question, top_k=5)
    context = "\n\n---\n\n".join(c["text"] for c in context_chunks)

    # Build a grounded prompt
    prompt = f"""Answer the question based ONLY on the provided context.
If the context doesn't contain enough information, say "I don't have enough information to answer that."

Context:
{context}

Question: {user_question}

Answer:"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,  # Low temperature for factual responses
    )
    return response.choices[0].message.content
```
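Stuffing every retrieved chunk into the prompt inflates latency and cost. One simple guard is to trim the (already score-sorted) chunks to a budget before building the prompt; the character budget below is a rough stand-in, and a token counter would be more precise:

```python
def fit_context(chunks: list[str], max_chars: int = 8000) -> str:
    """Keep highest-ranked chunks until the character budget is exhausted."""
    selected, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break  # Chunks are sorted by score, so we drop the weakest ones
        selected.append(chunk)
        used += len(chunk)
    return "\n\n---\n\n".join(selected)

print(fit_context(["aaaa", "bbbb", "cccc"], max_chars=9))  # keeps the first two chunks
```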
Step 5: Evaluation
| Metric | What It Measures | How to Calculate |
|---|---|---|
| Retrieval Precision | Are retrieved chunks relevant? | Manual review of top-K |
| Retrieval Recall | Are all relevant chunks found? | Compare to ground truth set |
| Answer Faithfulness | Does answer match the context? | LLM-as-judge evaluation |
| Answer Relevance | Does answer address the question? | LLM-as-judge evaluation |
| Hallucination Rate | Claims not in context? | Manual + LLM check |
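Of the metrics above, retrieval recall is the easiest to automate once you have a ground-truth set of relevant chunk IDs per query. A minimal recall@k computation:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of ground-truth relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

# 2 of the 3 relevant chunks were retrieved in the top 3
print(recall_at_k(["c1", "c2", "c3"], {"c1", "c3", "c9"}, k=3))
```

Averaging this over a query set gives a single number you can track as you change chunk sizes or embedding models; the LLM-as-judge metrics then cover what recall cannot (faithfulness and relevance of the generated answer).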
Common Failure Modes
| Failure | Cause | Fix |
|---|---|---|
| Irrelevant retrieval | Chunks too large / wrong embedding | Reduce chunk size, try different model |
| Hallucinations | No context match, model fills gaps | Add “I don’t know” instruction, lower temp |
| Missing context | Important info split across chunks | Increase overlap, use parent-child chunks |
| Stale answers | Source docs not updated | Automated re-indexing pipeline |
| Slow response | Large context window | Reduce top-K, use reranking |
RAG Checklist
- Chunking strategy defined and tested
- Embedding model selected and benchmarked
- Vector store deployed with metadata filtering
- Retrieval tested (precision > 80% on sample queries)
- Prompt template includes grounding instructions
- Hallucination mitigation in place
- Re-indexing pipeline for source document updates
- Evaluation framework with ground truth dataset
- Monitoring: track retrieval scores and user feedback
- Cost projection: embedding + LLM costs per query
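The last checklist item can be approximated from list prices. The embedding price below comes from the model table earlier in this guide; the gpt-4o input/output prices are assumptions for illustration, so check current pricing before relying on the numbers:

```python
def per_query_cost(
    query_tokens: int = 50,
    context_tokens: int = 2500,
    output_tokens: int = 300,
    embed_price_per_m: float = 0.02,    # text-embedding-3-small, $/1M tokens (table above)
    input_price_per_m: float = 2.50,    # assumed gpt-4o input price, $/1M tokens
    output_price_per_m: float = 10.00,  # assumed gpt-4o output price, $/1M tokens
) -> float:
    """Rough per-query cost: embed the query, send query + context, generate the answer."""
    embed = query_tokens * embed_price_per_m / 1e6
    llm_in = (query_tokens + context_tokens) * input_price_per_m / 1e6
    llm_out = output_tokens * output_price_per_m / 1e6
    return embed + llm_in + llm_out

print(f"${per_query_cost():.4f} per query")  # → $0.0094 per query
```

Note that generation input cost dominates: the retrieved context is by far the largest token count, which is another reason to tune top-K and trim chunks aggressively.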
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For AI engineering consulting, visit garnetgrid.com. :::