
How to Implement RAG (Retrieval-Augmented Generation)

Build production RAG pipelines. Covers chunking strategies, embedding models, vector stores, retrieval techniques, evaluation, and common failure modes.

RAG is the highest-ROI pattern for enterprise AI right now. It lets you ground LLM responses in your own data without fine-tuning. But naive implementations fail badly — here’s how to build one that actually works.


Architecture Overview

Documents → Chunk → Embed → Store (Vector DB)

User Query → Embed → Search → Top-K Results

Top-K Results + Query → Prompt (Context + Query) → LLM → Response

Step 1: Document Ingestion and Chunking

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunking strategy matters more than model choice
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,           # Measured by length_function — len() counts characters, not tokens
    chunk_overlap=50,         # Overlap prevents context loss at boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,      # Swap in a tokenizer-based counter for true token budgets
)

# Process documents
chunks = splitter.split_documents(documents)

# Add metadata for filtering (split_documents copies each source
# document's metadata onto its chunks, so "source" is read from the chunk)
for i, chunk in enumerate(chunks):
    chunk.metadata.update({
        "source": chunk.metadata.get("source", "unknown"),
        "chunk_index": i,
        "doc_type": "policy",      # Enables filtered search
        "department": "engineering",
    })

Chunking Decision Matrix

| Document Type | Chunk Size | Overlap | Strategy |
| --- | --- | --- | --- |
| Technical docs | 512 tokens | 50 | Recursive (headers → paragraphs) |
| Legal / policy | 1024 tokens | 100 | Paragraph-level |
| Code files | Function-level | 0 | AST-based splitting |
| Q&A / FAQ | 256 tokens | 0 | One chunk per question |
| Emails / chat | 512 tokens | 25 | Message-level |
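To make chunk overlap concrete, here is a minimal character-based sliding-window chunker, a deliberately simplified sketch of what RecursiveCharacterTextSplitter does (sizes here are characters, and the function name is illustrative):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows; each window repeats the last
    `overlap` characters of the previous one, so content cut at a boundary
    still appears intact in at least one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Unlike the recursive splitter, this ignores separators entirely, which is exactly why boundary overlap matters more here.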

Step 2: Embedding and Vector Storage

from openai import OpenAI
import numpy as np

client = OpenAI()

def embed_texts(texts: list[str], model="text-embedding-3-small") -> list:
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

# Embedding model comparison
models = {
    "text-embedding-3-small": {"dim": 1536, "cost": "$0.02/1M tokens", "quality": "Good"},
    "text-embedding-3-large": {"dim": 3072, "cost": "$0.13/1M tokens", "quality": "Best"},
    "text-embedding-ada-002":  {"dim": 1536, "cost": "$0.10/1M tokens", "quality": "Legacy"},
}
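Under the hood, retrieval ranks by vector similarity, and cosine similarity is the usual metric for these embedding models. A toy sketch with numpy, using 4-dimensional stand-ins for real 1536- or 3072-dimensional vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" — illustrative values only
query = np.array([0.1, 0.9, 0.0, 0.2])
doc_a = np.array([0.1, 0.8, 0.1, 0.2])   # points the same way: semantically close
doc_b = np.array([0.9, 0.0, 0.7, 0.0])   # nearly orthogonal: unrelated
```

The vector store performs this comparison (approximately, at scale) against every stored chunk.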

Pinecone Integration

from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")
index = pc.Index("knowledge-base")

# Embed all chunks in one batched call instead of one API request per chunk
embeddings = embed_texts([chunk.page_content for chunk in chunks])

# Upsert embeddings with metadata
vectors = []
for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
    vectors.append({
        "id": f"doc-{chunk.metadata['source']}-{i}",
        "values": embedding,
        "metadata": {
            "text": chunk.page_content,
            "source": chunk.metadata["source"],
            "doc_type": chunk.metadata["doc_type"],
        }
    })

index.upsert(vectors=vectors, namespace="docs")
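For large corpora, upserts should be split into batches rather than sent as one payload (Pinecone's docs suggest batches of around 100 vectors). A small stdlib batching helper, with the batch size as a conventional choice rather than an API constant:

```python
from itertools import islice

def batched(iterable, size: int = 100):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

# Usage sketch (assumes `index` and `vectors` from above):
# for batch in batched(vectors, 100):
#     index.upsert(vectors=batch, namespace="docs")
```

Python 3.12 ships `itertools.batched` with the same behavior; this version works on earlier interpreters.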

Step 3: Retrieval

def retrieve(query: str, top_k: int = 5, filters: dict | None = None) -> list:
    query_embedding = embed_texts([query])[0]

    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True,
        filter=filters,          # e.g., {"doc_type": "policy"}
        namespace="docs",
    )

    return [
        {
            "text": match.metadata["text"],
            "source": match.metadata["source"],
            "score": match.score,
        }
        for match in results.matches
    ]

Hybrid Search (Dense + Sparse)

# Combine semantic search with keyword matching
from pinecone_text.sparse import BM25Encoder

bm25 = BM25Encoder()
bm25.fit(corpus)   # corpus = the raw chunk texts, i.e. the same documents you indexed

def hybrid_search(query, alpha=0.7):
    """alpha=1.0 is pure semantic, alpha=0.0 is pure keyword"""
    dense = embed_texts([query])[0]
    sparse = bm25.encode_queries(query)

    # Pinecone scores dense and sparse contributions additively, so apply
    # the alpha weighting by scaling each side — without this, alpha is a no-op
    results = index.query(
        vector=[v * alpha for v in dense],
        sparse_vector={
            "indices": sparse["indices"],
            "values": [v * (1 - alpha) for v in sparse["values"]],
        },
        top_k=10,
        include_metadata=True,
    )
    return results
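An alternative to score-level weighting is rank-level fusion: run the dense and sparse searches separately, then merge by position. A sketch of Reciprocal Rank Fusion (RRF), which uses only ranks and therefore sidesteps score-scale mismatches between the two retrievers (k=60 is the constant commonly used in the literature):

```python
def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear high in both lists win; a document seen by only one retriever can still surface, just with a lower fused score.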

Step 4: Generation with Context

def rag_query(user_question: str) -> str:
    # Retrieve relevant chunks
    context_chunks = retrieve(user_question, top_k=5)
    context = "\n\n---\n\n".join([c["text"] for c in context_chunks])

    # Build prompt
    prompt = f"""Answer the question based ONLY on the provided context.
If the context doesn't contain enough information, say "I don't have enough information to answer that."

Context:
{context}

Question: {user_question}

Answer:"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,  # Low temperature for factual responses
    )

    return response.choices[0].message.content
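For user trust, it helps to surface which documents grounded the answer. A hypothetical helper (the name is illustrative; the dict keys match the retrieve() results above) that appends deduplicated sources in retrieval order:

```python
def format_answer_with_sources(answer: str, chunks: list[dict]) -> str:
    """Append a numbered, deduplicated source list to the generated answer."""
    seen: set[str] = set()
    sources: list[str] = []
    for c in chunks:
        if c["source"] not in seen:
            seen.add(c["source"])
            sources.append(c["source"])
    citations = "\n".join(f"[{i}] {s}" for i, s in enumerate(sources, start=1))
    return f"{answer}\n\nSources:\n{citations}"
```

Keeping retrieval order means the highest-scoring document is cited first, which is usually what users expect.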

Step 5: Evaluation

| Metric | What It Measures | How to Calculate |
| --- | --- | --- |
| Retrieval Precision | Are retrieved chunks relevant? | Manual review of top-K |
| Retrieval Recall | Are all relevant chunks found? | Compare to a ground-truth set |
| Answer Faithfulness | Does the answer match the context? | LLM-as-judge evaluation |
| Answer Relevance | Does the answer address the question? | LLM-as-judge evaluation |
| Hallucination Rate | Claims not supported by the context | Manual + LLM check |
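Given a small ground-truth set mapping queries to relevant chunk IDs, retrieval precision and recall at K reduce to a few lines; a minimal sketch:

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision@K: fraction of the top-K that is relevant.
    Recall@K: fraction of all relevant chunks found in the top-K."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Run this over every query in the ground-truth set and average to get the corpus-level numbers the checklist below refers to.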

Common Failure Modes

| Failure | Cause | Fix |
| --- | --- | --- |
| Irrelevant retrieval | Chunks too large / wrong embedding model | Reduce chunk size, try a different model |
| Hallucinations | No context match, model fills gaps | Add an "I don't know" instruction, lower temperature |
| Missing context | Important info split across chunks | Increase overlap, use parent-child chunks |
| Stale answers | Source docs not updated | Automated re-indexing pipeline |
| Slow responses | Large context window | Reduce top-K, use reranking |

RAG Checklist

  • Chunking strategy defined and tested
  • Embedding model selected and benchmarked
  • Vector store deployed with metadata filtering
  • Retrieval tested (precision > 80% on sample queries)
  • Prompt template includes grounding instructions
  • Hallucination mitigation in place
  • Re-indexing pipeline for source document updates
  • Evaluation framework with ground truth dataset
  • Monitoring: track retrieval scores and user feedback
  • Cost projection: embedding + LLM costs per query

:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For AI engineering consulting, visit garnetgrid.com. :::