311: The RAG Retriever Component

Chapter Overview

The Retriever is the heart of a RAG system. Its sole function is to search a large corpus of documents and find the specific pieces of text most relevant to a user's query.

The process of building a retriever involves two phases: an offline Indexing phase and an online Querying phase.


Phase 1: Indexing (Offline Process)

Indexing is the preparatory step where you process your knowledge base so that it can be searched efficiently.

```mermaid
flowchart TD
    A["Source Documents<br/>(PDFs, HTML, etc.)"] --> B["1. Document Loading"]
    B --> C["2. Chunking<br/>Split into smaller pieces"]
    C --> D["3. Embedding<br/>Convert chunks to vectors"]
    D --> E["4. Storing<br/>Load vectors into a Vector Store"]

    style A fill:#fce4ec,stroke:#c2185b
    style C fill:#fff3e0,stroke:#f57c00
    style D fill:#e3f2fd,stroke:#1976d2
    style E fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
```

Step 1: Document Loading

The first step is to load your source documents into the system. These could be:

  • PDF files
  • HTML web pages
  • Word documents
  • Plain text files
  • Database records
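For concreteness, here is a minimal loading sketch that assumes a folder of plain-text files; the knowledge_base/ path and the record layout are purely illustrative, and real pipelines typically use dedicated loaders for PDFs, HTML, and database records.

```python
from pathlib import Path

def load_documents(docs_dir: str) -> list[dict]:
    """Load every .txt file in a folder as one document record."""
    documents = []
    for path in Path(docs_dir).glob("*.txt"):
        documents.append({
            "source": str(path),  # keep provenance for later metadata filtering
            "text": path.read_text(encoding="utf-8"),
        })
    return documents

docs = load_documents("knowledge_base/")  # hypothetical folder name
```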

Step 2: Chunking

Large documents are split into smaller, manageable pieces called "chunks". This is crucial because:

  • Embedding models have token limits
  • Smaller chunks provide more precise retrieval
  • Each chunk retains better semantic coherence

Common chunking strategies:

  • Fixed-size chunking: Split by character count (e.g., 1000 characters)
  • Semantic chunking: Split by paragraphs, sentences, or sections
  • Recursive chunking: Hierarchical splitting with overlap
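As a sketch of the simplest strategy, the function below performs fixed-size chunking by character count with a configurable overlap between neighbouring chunks; the 1000/200 defaults are illustrative, not recommendations.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size chunking by character count, with overlap between chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step back by `overlap` so context carries over
    return chunks

chunks = [piece for doc in docs for piece in chunk_text(doc["text"])]
```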

Step 3: Embedding

Each chunk is converted into a high-dimensional vector (embedding) that captures its semantic meaning. Popular embedding models include:

  • OpenAI's text-embedding-ada-002
  • Sentence-BERT
  • BGE (from the Beijing Academy of Artificial Intelligence, BAAI)
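A minimal embedding sketch, assuming the sentence-transformers package and its all-MiniLM-L6-v2 model; any embedding model with a similar encode interface would work the same way.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose model

chunks = [
    "Employees accrue 1.5 vacation days per month.",
    "Remote work requests go through your manager.",
]
embeddings = model.encode(chunks, normalize_embeddings=True)  # shape: (n_chunks, 384)
```

Normalizing the embeddings up front makes the inner product equal to cosine similarity, which the storage sketch in the next step relies on.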

Step 4: Storing

The embeddings are stored in a vector database for efficient similarity search:

  • Pinecone: Managed vector database
  • Weaviate: Open-source vector database
  • Chroma: Simple, lightweight option
  • FAISS: Facebook's similarity search library
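Continuing the sketch with FAISS (the hosted stores have their own client APIs), the normalized embeddings from the previous step go into a flat inner-product index:

```python
import faiss
import numpy as np

embeddings = np.asarray(embeddings, dtype="float32")  # FAISS expects float32
index = faiss.IndexFlatIP(embeddings.shape[1])        # inner product == cosine on normalized vectors
index.add(embeddings)
faiss.write_index(index, "chunks.faiss")              # persist alongside the chunk texts
```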


Phase 2: Querying (Online Process)

When a user asks a question, the retriever searches through the indexed knowledge base to find relevant information.

```mermaid
flowchart TD
    A[User Query:<br/>'What is the company vacation policy?'] --> B(1. Query Embedding<br/>Convert query to vector)
    B --> C(2. Similarity Search<br/>Find most similar chunks)
    C --> D(3. Ranking & Filtering<br/>Score and select top results)
    D --> E[4. Context Assembly<br/>Combine retrieved chunks]
    E --> F[Retrieved Context<br/>Sent to Generator]

    style A fill:#e1f5fe,stroke:#0277bd
    style C fill:#fff3e0,stroke:#f57c00
    style D fill:#f3e5f5,stroke:#7b1fa2
    style F fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
```

Step 1: Query Embedding

The user's query is converted into a vector using the same embedding model used during indexing.

Step 2: Similarity Search

The query vector is compared against all stored chunk vectors using similarity metrics such as:

  • Cosine similarity: Most common for text
  • Euclidean distance: Geometric distance between vectors
  • Dot product: Fast, but sensitive to vector magnitude unless embeddings are normalized
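Continuing the FAISS sketch from the indexing phase, the query is embedded with the same model and searched against the index; the top-5 cutoff is illustrative.

```python
query = "What is the company vacation policy?"
query_vec = model.encode([query], normalize_embeddings=True).astype("float32")

k = min(5, index.ntotal)              # don't ask for more results than stored chunks
scores, ids = index.search(query_vec, k)  # inner-product scores and chunk indices
for score, idx in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {chunks[idx][:60]}")
```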

Step 3: Ranking & Filtering

Retrieved chunks are:

  • Ranked by similarity score
  • Filtered by a minimum score threshold
  • Limited to the top-k results (typically 3-10)
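A small sketch of threshold-plus-top-k selection over the scores returned above; the 0.3 threshold is an illustrative value that must be tuned per embedding model.

```python
MIN_SCORE = 0.3   # illustrative threshold; tune per embedding model
TOP_K = 3

ranked = sorted(zip(scores[0], ids[0]), key=lambda pair: pair[0], reverse=True)
selected = [(s, i) for s, i in ranked if s >= MIN_SCORE][:TOP_K]
```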

Step 4: Context Assembly

The selected chunks are combined into a coherent context that will be passed to the Generator.
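A sketch of a simple assembly step, joining the selected chunks with separators and placing them in a prompt template; the template wording is illustrative.

```python
context = "\n\n---\n\n".join(chunks[i] for _, i in selected)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)
```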


Advanced Retrieval Techniques

Hybrid Search

Hybrid search combines multiple search methods:

  • Dense retrieval: Vector similarity (semantic)
  • Sparse retrieval: Keyword matching (e.g., BM25)
  • Reranking: A secondary model re-scores the initial results
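A rough sketch of score fusion, assuming the rank_bm25 package for the sparse side and reusing the normalized embeddings from the indexing sketch; the 0.5 weight is an illustrative tuning knob, and real systems often use reciprocal rank fusion or a cross-encoder reranker instead.

```python
from rank_bm25 import BM25Okapi  # assumed: the rank_bm25 package

bm25 = BM25Okapi([c.lower().split() for c in chunks])
sparse = bm25.get_scores(query.lower().split())

dense = embeddings @ query_vec[0]  # cosine scores (vectors are normalized)

# Simple weighted fusion of normalized sparse scores and dense scores
alpha = 0.5
hybrid = alpha * dense + (1 - alpha) * (sparse / (sparse.max() + 1e-9))
best = hybrid.argsort()[::-1][:3]
```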

Query Enhancement

Query enhancement improves retrieval by modifying the query:

  • Query expansion: Add related terms
  • Query rewriting: Rephrase the query for better matching
  • Multi-query: Generate multiple query variants
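A minimal multi-query sketch that reuses the FAISS index from earlier; the query variants are hand-written here for illustration, whereas production systems usually have an LLM generate them.

```python
variants = [
    "What is the company vacation policy?",
    "How many vacation days do employees get?",  # hand-written variant for illustration
    "paid time off policy",
]

seen, merged = set(), []
for v in variants:
    v_vec = model.encode([v], normalize_embeddings=True).astype("float32")
    _, ids_v = index.search(v_vec, min(5, index.ntotal))
    for idx in ids_v[0]:
        if idx not in seen:  # deduplicate chunk ids across variants
            seen.add(idx)
            merged.append(idx)
```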

Metadata Filtering

Add structured filters to narrow the search:

  • Document type
  • Date ranges
  • Author or source
  • Topic categories
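Many vector databases expose metadata filters natively; the sketch below just shows the idea in plain Python over hypothetical per-chunk metadata attached at indexing time.

```python
# Hypothetical metadata, one record per chunk, stored at indexing time
metadata = [
    {"source": "hr_handbook.pdf", "type": "policy", "year": 2024},
    {"source": "eng_blog.html",   "type": "blog",   "year": 2023},
]

def passes(meta: dict) -> bool:
    return meta["type"] == "policy" and meta["year"] >= 2024

allowed = {i for i, m in enumerate(metadata) if passes(m)}
filtered = [i for i in merged if i in allowed]  # keep only retrieved ids that pass the filter
```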


Key Design Considerations

Chunk Size vs. Granularity

  • Larger chunks: More context but less precise
  • Smaller chunks: More precise but may lack context
  • Optimal size: Usually 200-1000 tokens

Embedding Model Selection

Consider:

  • Domain specificity: General vs. specialized models
  • Language support: Multilingual capabilities
  • Performance: Speed vs. accuracy trade-offs

Vector Database Choice

Factors to evaluate:

  • Scale: How many documents?
  • Performance: Query latency requirements
  • Cost: Hosted vs. self-managed
  • Features: Filtering, analytics, etc.


Common Challenges & Solutions

Challenge 1: Poor Retrieval Quality

Symptoms: Irrelevant or low-quality chunks are retrieved.

Solutions:

  • Improve the chunking strategy
  • Use domain-specific embeddings
  • Implement reranking
  • Add metadata filtering

Challenge 2: Slow Query Performance

Symptoms: High latency during search.

Solutions:

  • Optimize the vector database configuration
  • Use approximate nearest-neighbor search algorithms
  • Implement caching
  • Reduce embedding dimensions

Challenge 3: Missing Context

Symptoms: Retrieved chunks lack necessary context.

Solutions:

  • Increase chunk size
  • Add chunk overlap
  • Implement hierarchical retrieval
  • Store parent-child relationships between chunks


Next Steps

The retrieved context from this component is passed to the RAG Generator, which uses it to produce the final answer for the user.

Understanding retrieval is crucial because the quality of your RAG system is fundamentally limited by the quality of retrieved information - "garbage in, garbage out."