311: The RAG Retriever Component

Chapter Overview

The Retriever is the heart of a RAG system. Its sole function is to search a large corpus of documents and find the specific pieces of text most relevant to a user's query.

The process of building a retriever involves two phases: an offline Indexing phase and an online Querying phase.


Phase 1: Indexing (Offline Process)

Indexing is the preparatory step where you process your knowledge base so that it can be searched efficiently.

```mermaid
flowchart TD
    A["Source Documents<br/>(PDFs, HTML, etc.)"] --> B["1. Document Loading"]
    B --> C["2. Chunking<br/>Split into smaller pieces"]
    C --> D["3. Embedding<br/>Convert chunks to vectors"]
    D --> E["4. Storing<br/>Load vectors into a Vector Store"]

    style A fill:#fce4ec,stroke:#c2185b
    style C fill:#fff3e0,stroke:#f57c00
    style D fill:#e3f2fd,stroke:#1976d2
    style E fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
```

Step 1: Document Loading

The first step is to load your source documents into the system. These could be:

  • PDF files
  • HTML web pages
  • Word documents
  • Plain text files
  • Database records
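For concreteness, here is a minimal loading sketch that assumes a folder of plain-text files; the knowledge_base/ path and the record layout are purely illustrative, and real pipelines typically use dedicated loaders for PDFs, HTML, and database records.

```python
from pathlib import Path

def load_documents(docs_dir: str) -> list[dict]:
    """Load every .txt file in a folder as one document record."""
    documents = []
    for path in Path(docs_dir).glob("*.txt"):
        documents.append({
            "source": str(path),  # keep provenance for later metadata filtering
            "text": path.read_text(encoding="utf-8"),
        })
    return documents

docs = load_documents("knowledge_base/")  # hypothetical folder name
```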

Step 2: Chunking

Large documents are split into smaller, manageable pieces called "chunks". This is crucial because:

  • Embedding models have token limits
  • Smaller chunks provide more precise retrieval
  • Each chunk retains better semantic coherence

Common chunking strategies:

  • Fixed-size chunking: Split by character count (e.g., 1000 characters)
  • Semantic chunking: Split by paragraphs, sentences, or sections
  • Recursive chunking: Hierarchical splitting with overlap
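As a sketch of the simplest strategy, the function below performs fixed-size chunking by character count with a configurable overlap between neighbouring chunks; the 1000/200 defaults are illustrative, not recommendations.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size chunking by character count, with overlap between chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step back by `overlap` so context carries over
    return chunks

chunks = [piece for doc in docs for piece in chunk_text(doc["text"])]
```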

Step 3: Embedding

Each chunk is converted into a high-dimensional vector (embedding) that captures its semantic meaning. Popular embedding models include:

  • OpenAI's text-embedding-ada-002
  • Sentence-BERT
  • BGE (from the Beijing Academy of Artificial Intelligence, BAAI)
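A minimal embedding sketch, assuming the sentence-transformers package and its all-MiniLM-L6-v2 model; any embedding model with a similar encode interface would work the same way.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose model

chunks = [
    "Employees accrue 1.5 vacation days per month.",
    "Remote work requests go through your manager.",
]
embeddings = model.encode(chunks, normalize_embeddings=True)  # shape: (n_chunks, 384)
```

Normalizing the embeddings up front makes the inner product equal to cosine similarity, which the storage sketch in the next step relies on.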

Step 4: Storing

The embeddings are stored in a vector database for efficient similarity search:

  • Pinecone: Managed vector database
  • Weaviate: Open-source vector database
  • Chroma: Simple, lightweight option
  • FAISS: Facebook's similarity search library
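Continuing the sketch with FAISS (the hosted stores have their own client APIs), the normalized embeddings from the previous step go into a flat inner-product index:

```python
import faiss
import numpy as np

embeddings = np.asarray(embeddings, dtype="float32")  # FAISS expects float32
index = faiss.IndexFlatIP(embeddings.shape[1])        # inner product == cosine on normalized vectors
index.add(embeddings)
faiss.write_index(index, "chunks.faiss")              # persist alongside the chunk texts
```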


Phase 2: Querying (Online Process)

When a user asks a question, the retriever searches through the indexed knowledge base to find relevant information.

```mermaid
flowchart TD
    A[User Query:<br/>'What is the company vacation policy?'] --> B(1. Query Embedding<br/>Convert query to vector)
    B --> C(2. Similarity Search<br/>Find most similar chunks)
    C --> D(3. Ranking & Filtering<br/>Score and select top results)
    D --> E[4. Context Assembly<br/>Combine retrieved chunks]
    E --> F[Retrieved Context<br/>Sent to Generator]

    style A fill:#e1f5fe,stroke:#0277bd
    style C fill:#fff3e0,stroke:#f57c00
    style D fill:#f3e5f5,stroke:#7b1fa2
    style F fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
```

Step 1: Query Embedding

The user's query is converted into a vector using the same embedding model used during indexing.

Step 2: Similarity Search

The query vector is compared against all stored chunk vectors using similarity metrics such as:

  • Cosine similarity: Most common for text
  • Euclidean distance: Geometric distance between vectors
  • Dot product: Fast, but sensitive to vector magnitude unless embeddings are normalized
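Continuing the FAISS sketch from the indexing phase, the query is embedded with the same model and searched against the index; the top-5 cutoff is illustrative.

```python
query = "What is the company vacation policy?"
query_vec = model.encode([query], normalize_embeddings=True).astype("float32")

k = min(5, index.ntotal)              # don't ask for more results than stored chunks
scores, ids = index.search(query_vec, k)  # inner-product scores and chunk indices
for score, idx in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {chunks[idx][:60]}")
```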

Step 3: Ranking & Filtering

Retrieved chunks are:

  • Ranked by similarity score
  • Filtered by a minimum score threshold
  • Limited to the top-k results (typically 3-10)
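A small sketch of threshold-plus-top-k selection over the scores returned above; the 0.3 threshold is an illustrative value that must be tuned per embedding model.

```python
MIN_SCORE = 0.3   # illustrative threshold; tune per embedding model
TOP_K = 3

ranked = sorted(zip(scores[0], ids[0]), key=lambda pair: pair[0], reverse=True)
selected = [(s, i) for s, i in ranked if s >= MIN_SCORE][:TOP_K]
```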

Step 4: Context Assembly

The selected chunks are combined into a coherent context that will be passed to the Generator.
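A sketch of a simple assembly step, joining the selected chunks with separators and placing them in a prompt template; the template wording is illustrative.

```python
context = "\n\n---\n\n".join(chunks[i] for _, i in selected)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)
```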


Advanced Retrieval Techniques

Hybrid Search

Hybrid search combines multiple search methods:

  • Dense retrieval: Vector similarity (semantic)
  • Sparse retrieval: Keyword matching (e.g., BM25)
  • Reranking: A secondary model re-scores the initial results
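A rough sketch of score fusion, assuming the rank_bm25 package for the sparse side and reusing the normalized embeddings from the indexing sketch; the 0.5 weight is an illustrative tuning knob, and real systems often use reciprocal rank fusion or a cross-encoder reranker instead.

```python
from rank_bm25 import BM25Okapi  # assumed: the rank_bm25 package

bm25 = BM25Okapi([c.lower().split() for c in chunks])
sparse = bm25.get_scores(query.lower().split())

dense = embeddings @ query_vec[0]  # cosine scores (vectors are normalized)

# Simple weighted fusion of normalized sparse scores and dense scores
alpha = 0.5
hybrid = alpha * dense + (1 - alpha) * (sparse / (sparse.max() + 1e-9))
best = hybrid.argsort()[::-1][:3]
```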

Query Enhancement

Query enhancement improves retrieval by modifying the query:

  • Query expansion: Add related terms
  • Query rewriting: Rephrase the query for better matching
  • Multi-query: Generate multiple query variants
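A minimal multi-query sketch that reuses the FAISS index from earlier; the query variants are hand-written here for illustration, whereas production systems usually have an LLM generate them.

```python
variants = [
    "What is the company vacation policy?",
    "How many vacation days do employees get?",  # hand-written variant for illustration
    "paid time off policy",
]

seen, merged = set(), []
for v in variants:
    v_vec = model.encode([v], normalize_embeddings=True).astype("float32")
    _, ids_v = index.search(v_vec, min(5, index.ntotal))
    for idx in ids_v[0]:
        if idx not in seen:  # deduplicate chunk ids across variants
            seen.add(idx)
            merged.append(idx)
```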

Metadata Filtering

Add structured filters to narrow the search:

  • Document type
  • Date ranges
  • Author or source
  • Topic categories
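Many vector databases expose metadata filters natively; the sketch below just shows the idea in plain Python over hypothetical per-chunk metadata attached at indexing time.

```python
# Hypothetical metadata, one record per chunk, stored at indexing time
metadata = [
    {"source": "hr_handbook.pdf", "type": "policy", "year": 2024},
    {"source": "eng_blog.html",   "type": "blog",   "year": 2023},
]

def passes(meta: dict) -> bool:
    return meta["type"] == "policy" and meta["year"] >= 2024

allowed = {i for i, m in enumerate(metadata) if passes(m)}
filtered = [i for i in merged if i in allowed]  # keep only retrieved ids that pass the filter
```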


Key Design Considerations

Chunk Size vs. Granularity

  • Larger chunks: More context but less precise
  • Smaller chunks: More precise but may lack context
  • Optimal size: Usually 200-1000 tokens

Embedding Model Selection

Consider:

  • Domain specificity: General vs. specialized models
  • Language support: Multilingual capabilities
  • Performance: Speed vs. accuracy trade-offs

Vector Database Choice

Factors to evaluate:

  • Scale: How many documents?
  • Performance: Query latency requirements
  • Cost: Hosted vs. self-managed
  • Features: Filtering, analytics, etc.


Common Challenges & Solutions

Challenge 1: Poor Retrieval Quality

Symptoms: Irrelevant or low-quality chunks are retrieved.

Solutions:

  • Improve the chunking strategy
  • Use domain-specific embeddings
  • Implement reranking
  • Add metadata filtering

Challenge 2: Slow Query Performance

Symptoms: High latency during search.

Solutions:

  • Optimize the vector database configuration
  • Use approximate nearest-neighbor search algorithms
  • Implement caching
  • Reduce embedding dimensions

Challenge 3: Missing Context

Symptoms: Retrieved chunks lack necessary context.

Solutions:

  • Increase chunk size
  • Add chunk overlap
  • Implement hierarchical retrieval
  • Store parent-child relationships between chunks


Next Steps

The retrieved context from this component is passed to the RAG Generator, which uses it to produce the final answer for the user.

Understanding retrieval is crucial because the quality of your RAG system is fundamentally limited by the quality of retrieved information - "garbage in, garbage out."