313: Vector Stores & Dense Passage Retrieval (DPR)¶
Chapter Overview
The Retriever's ability to find relevant information quickly and accurately depends on two key technologies:
1. A high-quality embedding model to convert text into meaningful numerical vectors.
2. A high-performance vector store to store and search through those vectors.
Dense Passage Retrieval (DPR) and FAISS are cornerstone examples of these technologies.
Dense Passage Retrieval (DPR): The Embedding Model¶
DPR is a specialized model architecture, usually based on BERT, designed specifically for the task of retrieval. It's not a single model, but a pair of models trained together:
- Context Encoder: This model is an expert at creating vector representations (embeddings) for passages of text (the potential answers).
- Question Encoder: This model is an expert at creating embeddings for questions.
How DPR is Trained¶
The two encoders are trained using a technique called contrastive learning. The goal is to train the models so that the vector for a question is mathematically very close (high dot-product similarity) to the vector of its correct answer passage, while simultaneously being very far from the vectors of all other "negative" (incorrect) passages.
This specialized training makes DPR highly effective for building the semantic search capability at the heart of RAG.
FAISS: The Vector Search Library¶
A Vector Store (or Vector Database) is a system designed for the efficient storage and searching of high-dimensional vectors. While a simple retriever could calculate the distance between the query vector and every document vector one by one, this brute-force approach becomes prohibitively slow for millions of documents.
FAISS (Facebook AI Similarity Search) is a highly optimized open-source library that provides the algorithms for this efficient search. It is the engine inside many popular vector stores.
How FAISS Works (High-Level)¶
Instead of a brute-force search, FAISS uses indexing techniques to speed up the process dramatically. A common approach is Approximate Nearest Neighbor (ANN) search.
flowchart TD
subgraph BruteForce ["Brute-Force Search (Slow)"]
A["Query Vector"] --> B["Compare with<br/>Doc Vector 1"]
A --> C["Compare with<br/>Doc Vector 2"]
A --> D["Compare with<br/>Doc Vector ...n"]
end
subgraph FAISS ["FAISS ANN Search (Fast)"]
E["Query Vector"] --> F["Find a few<br/>'candidate' clusters"]
F --> G["Search only within<br/>those clusters"]
end
G --> H["⚡️ Fast, Approximate Results"]
style A fill:#e3f2fd,stroke:#1976d2
style E fill:#e3f2fd,stroke:#1976d2
style D fill:#ffcdd2,stroke:#B71C1C
style H fill:#c8e6c9,stroke:#1B5E20
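A minimal sketch of this cluster-then-search idea, using the `faiss` Python package with random vectors purely as stand-in data: an IVF index partitions the corpus into clusters at training time and then searches only `nprobe` of them per query.

```python
import faiss
import numpy as np

d = 128                                                          # embedding dimension
doc_vectors = np.random.random((10_000, d)).astype("float32")    # stand-in corpus

quantizer = faiss.IndexFlatL2(d)                 # used to assign vectors to clusters
index = faiss.IndexIVFFlat(quantizer, d, 100)    # partition into 100 clusters
index.train(doc_vectors)                         # learn the cluster centroids
index.add(doc_vectors)

index.nprobe = 8                                 # search only 8 candidate clusters per query
query = np.random.random((1, d)).astype("float32")
distances, ids = index.search(query, 5)          # approximate top-5 neighbours
```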
FAISS Index Types¶
FAISS offers different index types optimized for different use cases:
1. Flat Index (Exact Search)¶
- Use case: Small datasets (< 10K vectors)
- Pros: Perfect accuracy, simple
- Cons: Slow for large datasets
- Code: `IndexFlatIP` (Inner Product), `IndexFlatL2` (L2 distance)
2. IVF (Inverted File)¶
- Use case: Medium datasets (10K - 1M vectors)
- Pros: Good balance of speed and accuracy
- Cons: Requires training step
- Code: `IndexIVFFlat`, `IndexIVFPQ`
3. HNSW (Hierarchical Navigable Small World)¶
- Use case: Large datasets (1M+ vectors)
- Pros: Very fast, good accuracy
- Cons: Higher memory usage
- Code: `IndexHNSWFlat`
4. PQ (Product Quantization)¶
- Use case: Memory-constrained environments
- Pros: Very low memory usage
- Cons: Reduced accuracy
- Code: `IndexPQ`, `IndexIVFPQ`
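A minimal sketch of constructing these index types with FAISS's `index_factory` helper (the dimension and parameter values below are illustrative, not recommendations):

```python
import faiss

d = 64  # embedding dimension (illustrative)

flat = faiss.index_factory(d, "Flat")         # exact search, perfect accuracy
ivf  = faiss.index_factory(d, "IVF100,Flat")  # inverted file with 100 clusters (requires .train())
hnsw = faiss.index_factory(d, "HNSW32")       # graph-based ANN with 32 links per node
pq   = faiss.index_factory(d, "PQ8")          # product quantization into 8 sub-vectors (requires .train())
```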
DPR Training Process¶
Step 1: Data Preparation¶
DPR requires training triplets:
- Question: User query
- Positive Passage: Correct answer passage
- Negative Passages: Incorrect but plausible passages
flowchart TD
A["Training Data"] --> B["Question:<br/>'What is the capital of France?'"]
A --> C["Positive Passage:<br/>'Paris is the capital and largest city of France...'"]
A --> D["Negative Passages:<br/>'London is the capital of England...'<br/>'Madrid is the capital of Spain...'"]
style B fill:#e3f2fd,stroke:#1976d2
style C fill:#c8e6c9,stroke:#1B5E20
style D fill:#ffcdd2,stroke:#B71C1C
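In code, one training example might be represented like this (the field names are illustrative, not a fixed DPR format):

```python
# One illustrative DPR training triplet (field names are hypothetical)
training_example = {
    "question": "What is the capital of France?",
    "positive_passage": "Paris is the capital and largest city of France...",
    "negative_passages": [
        "London is the capital of England...",
        "Madrid is the capital of Spain...",
    ],
}
```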
Step 2: Dual Encoder Architecture¶
flowchart TD
subgraph QuestionEncoder ["Question Encoder (BERT)"]
A["Question Text"] --> B["BERT Layers"] --> C["Question Vector"]
end
subgraph ContextEncoder ["Context Encoder (BERT)"]
D["Passage Text"] --> E["BERT Layers"] --> F["Passage Vector"]
end
subgraph SimilarityComputation ["Similarity Computation"]
C --> G["Dot Product<br/>Similarity Score"]
F --> G
end
G --> H["Loss Function<br/>(Contrastive Learning)"]
style C fill:#e3f2fd,stroke:#1976d2
style F fill:#fff3e0,stroke:#f57c00
style G fill:#f3e5f5,stroke:#7b1fa2
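A minimal sketch of the dual-encoder forward pass, assuming the Hugging Face `transformers` library and the public `facebook/dpr-*` checkpoints:

```python
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

q_name = "facebook/dpr-question_encoder-single-nq-base"
ctx_name = "facebook/dpr-ctx_encoder-single-nq-base"
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(q_name)
q_encoder = DPRQuestionEncoder.from_pretrained(q_name)
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained(ctx_name)
ctx_encoder = DPRContextEncoder.from_pretrained(ctx_name)

with torch.no_grad():
    # Each encoder produces a single pooled vector per input text
    q_vec = q_encoder(**q_tokenizer("What is the capital of France?", return_tensors="pt")).pooler_output
    p_vec = ctx_encoder(**ctx_tokenizer("Paris is the capital and largest city of France.", return_tensors="pt")).pooler_output

similarity = (q_vec @ p_vec.T).item()  # dot-product similarity score
```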
Step 3: Contrastive Learning Loss¶
The loss function ensures:
- High similarity between question and positive passage
- Low similarity between question and negative passages

The standard DPR objective is the negative log-likelihood of the positive passage under a softmax over dot-product similarities:

\[
L(q, p^+, p_1, \ldots, p_n) = -\log \frac{\exp\left(\mathrm{sim}(q, p^+)\right)}{\sum_{i} \exp\left(\mathrm{sim}(q, p_i)\right)}
\]

Where:
- \(q\) = question vector
- \(p^+\) = positive passage vector
- \(p_i\) = all passage vectors (positive + negatives)
- \(\mathrm{sim}(q, p)\) = dot-product similarity \(q \cdot p\)
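A minimal PyTorch sketch of this objective in its common in-batch negatives form, where each question's positive passage also serves as a negative for every other question in the batch:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(question_vecs, passage_vecs):
    # question_vecs, passage_vecs: tensors of shape (batch_size, dim);
    # passage i is the positive for question i, all other rows act as negatives.
    scores = question_vecs @ passage_vecs.T   # (batch, batch) dot-product similarities
    targets = torch.arange(scores.size(0))    # the diagonal holds the positive pairs
    return F.cross_entropy(scores, targets)   # equivalent to -log softmax of the positive
```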
Vector Store Implementation Patterns¶
1. Embedding-First Approach¶
# Pseudo-code for embedding-first pattern: embed every document up front, then build the index
import faiss
import numpy as np

def build_index(documents, embedding_model):
    # Encode all documents into a float32 matrix of shape (num_docs, dim)
    embeddings = np.array(
        [embedding_model.encode(doc) for doc in documents], dtype="float32"
    )
    # Build FAISS index sized to the embedding dimension
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)
    return index

def search(query, index, embedding_model, k=5):
    # FAISS expects a 2D batch of query vectors, so wrap the single query
    query_embedding = np.array([embedding_model.encode(query)], dtype="float32")
    scores, indices = index.search(query_embedding, k)
    return indices[0], scores[0]
2. Streaming Approach¶
# Pseudo-code for streaming pattern: add document batches to the index as they arrive
import faiss
import numpy as np

def build_streaming_index(document_stream, embedding_model, embedding_dimension):
    index = faiss.IndexFlatIP(embedding_dimension)
    for batch in document_stream:
        # Encode each incoming batch and append it to the existing index
        embeddings = np.array(embedding_model.encode(batch), dtype="float32")
        index.add(embeddings)
    return index
3. Hybrid Search Pattern¶
Combines dense (vector) and sparse (keyword) search:
flowchart TD
A["User Query"] --> B["Dense Search<br/>(Vector Similarity)"]
A --> C["Sparse Search<br/>(BM25/TF-IDF)"]
B --> D["Dense Results<br/>(Semantic matches)"]
C --> E["Sparse Results<br/>(Keyword matches)"]
D --> F["Result Fusion<br/>(Weighted combination)"]
E --> F
F --> G["Final Ranked Results"]
style D fill:#e3f2fd,stroke:#1976d2
style E fill:#fff3e0,stroke:#f57c00
style G fill:#c8e6c9,stroke:#1B5E20
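A minimal sketch of one common fusion scheme, a weighted combination of min-max-normalized scores (the weighting and normalization choices are illustrative; reciprocal rank fusion is another popular option):

```python
def hybrid_fuse(dense_scores, sparse_scores, alpha=0.5):
    """Fuse two {doc_id: score} dicts into a single ranked list."""
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc_id: (s - lo) / span for doc_id, s in scores.items()}

    dense_n, sparse_n = normalize(dense_scores), normalize(sparse_scores)
    fused = {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```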
Popular Vector Store Solutions¶
1. Open Source Options¶
Chroma¶
- Best for: Prototyping, small-scale applications
- Pros: Simple API, lightweight, good for development
- Cons: Limited scalability
- Use case: Local development, proof of concepts
Weaviate¶
- Best for: Production applications with complex filtering
- Pros: GraphQL API, built-in ML models, schema support
- Cons: More complex setup
- Use case: Enterprise applications with structured data
Milvus¶
- Best for: Large-scale production deployments
- Pros: High performance, horizontal scaling, multiple language support
- Cons: Complex deployment and management
- Use case: High-throughput production systems
2. Managed/Cloud Options¶
Pinecone¶
- Best for: Managed vector search without infrastructure concerns
- Pros: Fully managed, easy integration, good performance
- Cons: Vendor lock-in, cost at scale
- Use case: Startups, rapid prototyping
Qdrant¶
- Best for: High-performance applications with filtering requirements
- Pros: Fast, good filtering capabilities, self-hosted option
- Cons: Relatively new ecosystem
- Use case: Performance-critical applications
AWS OpenSearch¶
- Best for: Organizations already using AWS
- Pros: Integrated with AWS ecosystem, managed service
- Cons: AWS-specific, potentially higher costs
- Use case: AWS-native applications
Performance Optimization Strategies¶
1. Index Optimization¶
flowchart TD
A["Choose Index Type"] --> B["Flat: < 10K vectors"]
A --> C["IVF: 10K-1M vectors"]
A --> D["HNSW: > 1M vectors"]
B --> E["Perfect Accuracy<br/>Slower Search"]
C --> F["Good Balance<br/>Requires Training"]
D --> G["Fast Search<br/>Higher Memory"]
style B fill:#c8e6c9,stroke:#1B5E20
style C fill:#fff3e0,stroke:#f57c00
style D fill:#e3f2fd,stroke:#1976d2
2. Embedding Optimization¶
- Dimension Reduction: Use techniques like PCA to reduce vector dimensions
- Quantization: Trade accuracy for memory/speed using product quantization
- Batch Processing: Process multiple queries simultaneously
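As a sketch of the dimension-reduction idea, FAISS ships a PCA transform that can be trained on a sample of vectors and applied before indexing (the dimensions below are illustrative):

```python
import faiss
import numpy as np

original_dim, reduced_dim = 768, 128
vectors = np.random.random((5_000, original_dim)).astype("float32")  # stand-in embeddings

pca = faiss.PCAMatrix(original_dim, reduced_dim)
pca.train(vectors)               # learn the projection from sample vectors
reduced = pca.apply_py(vectors)  # (5000, 128) vectors ready for a smaller, faster index
```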
3. Caching Strategies¶
- Query Caching: Cache frequently asked queries
- Embedding Caching: Cache computed embeddings
- Result Caching: Cache search results for common patterns
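A minimal sketch of embedding caching using Python's built-in memoization, assuming the `sentence-transformers` library and a small public checkpoint:

```python
from functools import lru_cache
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

@lru_cache(maxsize=10_000)
def cached_embedding(text: str):
    # The expensive encode call runs only once per unique text;
    # the result is stored as a tuple so it is hashable and reusable.
    return tuple(model.encode(text))
```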
Advanced DPR Techniques¶
1. Hard Negative Mining¶
Improve training by finding challenging negative examples:
# Pseudo-code for hard negative mining
def mine_hard_negatives(question, positive_passage, candidate_passages, k=5):
    question_embedding = question_encoder(question)
    # Score passages that are similar to the question but are not the correct answer
    similarities = []
    for passage in candidate_passages:
        if passage != positive_passage:
            passage_embedding = context_encoder(passage)
            sim = dot_product(question_embedding, passage_embedding)
            similarities.append((passage, sim))
    # Select the top-k most similar (but incorrect) passages as hard negatives
    hard_negatives = sorted(similarities, key=lambda x: x[1], reverse=True)[:k]
    return [passage for passage, _ in hard_negatives]
2. Cross-Encoder Reranking¶
Use a more sophisticated model to rerank initial DPR results:
flowchart TD
A["Query"] --> B["DPR Retrieval<br/>(Fast, Approximate)"]
B --> C["Top-100 Candidates"]
C --> D["Cross-Encoder Reranking<br/>(Slow, Accurate)"]
D --> E["Top-10 Final Results"]
style B fill:#e3f2fd,stroke:#1976d2
style D fill:#fff3e0,stroke:#f57c00
style E fill:#c8e6c9,stroke:#1B5E20
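A minimal sketch of the reranking step, assuming the `sentence-transformers` library and a public MS MARCO cross-encoder checkpoint:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidate_passages, top_k=10):
    # The cross-encoder reads each (query, passage) pair jointly, which is slower
    # than comparing precomputed embeddings but considerably more accurate.
    scores = reranker.predict([(query, passage) for passage in candidate_passages])
    ranked = sorted(zip(candidate_passages, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```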
3. Multi-Vector Approaches¶
Use multiple embedding models for different aspects:
- ColBERT: Token-level embeddings for fine-grained matching
- Multi-aspect embeddings: Separate vectors for different semantic aspects
- Ensemble retrieval: Combine multiple retrieval models
Evaluation Metrics¶
1. Retrieval Accuracy¶
- Recall@k: Fraction of relevant documents in top-k results
- Precision@k: Fraction of retrieved documents that are relevant
- Mean Reciprocal Rank (MRR): Average of the reciprocal rank of the first relevant result, taken across queries
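A minimal sketch of computing Recall@k and MRR from lists of relevant and retrieved document IDs:

```python
def recall_at_k(relevant_ids, retrieved_ids, k):
    # Fraction of the relevant documents that appear in the top-k retrieved results
    return len(set(relevant_ids) & set(retrieved_ids[:k])) / max(len(relevant_ids), 1)

def mean_reciprocal_rank(relevant_per_query, retrieved_per_query):
    # Average of 1/rank of the first relevant result for each query (0 if none is retrieved)
    reciprocal_ranks = []
    for relevant_ids, retrieved_ids in zip(relevant_per_query, retrieved_per_query):
        rank = next((i + 1 for i, doc in enumerate(retrieved_ids) if doc in relevant_ids), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```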
2. Efficiency Metrics¶
- Query Latency: Time to process a single query
- Throughput: Queries processed per second
- Memory Usage: RAM required for index storage
- Index Build Time: Time to construct the search index
3. End-to-End RAG Metrics¶
- Answer Accuracy: Correctness of final generated answers
- Faithfulness: How well answers stick to retrieved context
- Relevance: How well retrieved context addresses the query
Common Challenges & Solutions¶
Challenge 1: Cold Start Problem¶
Problem: Poor performance with limited training data

Solutions:
- Use pre-trained models (e.g., sentence-transformers)
- Domain adaptation techniques
- Synthetic data generation
- Transfer learning from related domains
Challenge 2: Out-of-Domain Queries¶
Problem: Poor retrieval for queries outside the training distribution

Solutions:
- Diverse training data
- Query augmentation techniques
- Hybrid search (combine dense + sparse)
- Continuous learning from user feedback
Challenge 3: Scalability Issues¶
Problem: Performance degradation with growing document collections

Solutions:
- Hierarchical indexing
- Distributed search architectures
- Approximate search algorithms
- Efficient index compression
Challenge 4: Embedding Quality¶
Problem: Embeddings don't capture domain-specific semantics

Solutions:
- Domain-specific fine-tuning
- Contrastive learning on domain data
- Multi-task learning approaches
- Regular model evaluation and updates
Integration with RAG Pipeline¶
Vector stores and DPR fit into the broader [[310-Retrieval-Augmented-Generation-RAG|RAG]] architecture:
flowchart TD
A["Documents"] --> B["Text Preprocessing"]
B --> C["DPR Context Encoder"]
C --> D["Vector Store (FAISS)"]
E["User Query"] --> F["DPR Question Encoder"]
F --> G["Query Vector"]
G --> H["Vector Search"]
D --> H
H --> I["Retrieved Passages"]
I --> J["Generator (LLM)"]
J --> K["Final Answer"]
style D fill:#e3f2fd,stroke:#1976d2
style H fill:#fff3e0,stroke:#f57c00
style K fill:#c8e6c9,stroke:#1B5E20
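Pulling the earlier sketches together, a minimal end-to-end pass over a toy corpus, assuming the `transformers` DPR checkpoints used above and the `faiss` library (the generator call is left abstract):

```python
import faiss
import numpy as np
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

ctx_name = "facebook/dpr-ctx_encoder-single-nq-base"
q_name = "facebook/dpr-question_encoder-single-nq-base"
ctx_tok = DPRContextEncoderTokenizer.from_pretrained(ctx_name)
ctx_enc = DPRContextEncoder.from_pretrained(ctx_name)
q_tok = DPRQuestionEncoderTokenizer.from_pretrained(q_name)
q_enc = DPRQuestionEncoder.from_pretrained(q_name)

documents = [
    "Paris is the capital and largest city of France.",
    "London is the capital of England.",
    "Madrid is the capital of Spain.",
]

with torch.no_grad():
    # Encode and index the documents with the context encoder
    doc_vecs = np.vstack([
        ctx_enc(**ctx_tok(doc, return_tensors="pt")).pooler_output.numpy() for doc in documents
    ]).astype("float32")
    index = faiss.IndexFlatIP(doc_vecs.shape[1])
    index.add(doc_vecs)

    # Encode the query with the question encoder and search the index
    query_vec = q_enc(**q_tok("What is the capital of France?", return_tensors="pt")).pooler_output.numpy().astype("float32")
    scores, ids = index.search(query_vec, 2)

retrieved = [documents[i] for i in ids[0]]
# The retrieved passages would then be inserted into the generator's prompt.
```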
Next Steps¶
Understanding vector stores and DPR provides the foundation for building efficient retrieval systems. The next component to explore is the RAG Generator, which takes the retrieved context and produces coherent answers.
The combination of high-quality embeddings (DPR) and efficient search (FAISS) enables RAG systems to scale to millions of documents while maintaining fast response times and high accuracy.