313: Vector Stores & Dense Passage Retrieval (DPR)

Chapter Overview

The Retriever's ability to find relevant information quickly and accurately depends on two key technologies:

1. A high-quality embedding model to convert text into meaningful numerical vectors.
2. A high-performance vector store to store and search through those vectors.

Dense Passage Retrieval (DPR) and FAISS are cornerstone examples of these technologies.


Dense Passage Retrieval (DPR): The Embedding Model

DPR is a specialized model architecture, usually based on BERT, designed specifically for the task of retrieval. It is not a single model but a pair of models trained together (a loading sketch follows the list):

  • Context Encoder: This model is an expert at creating vector representations (embeddings) for passages of text (the potential answers).
  • Question Encoder: This model is an expert at creating embeddings for questions.
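For illustration, the pre-trained encoder pair released with DPR can be loaded through the Hugging Face transformers library. This is a minimal sketch assuming the public facebook/dpr-*-single-nq-base checkpoints:

# A minimal sketch of encoding with the public DPR checkpoints
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

q_tok = DPRQuestionEncoderTokenizer.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base")
ctx_tok = DPRContextEncoderTokenizer.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base")
ctx_enc = DPRContextEncoder.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base")

q_vec = q_enc(**q_tok("What is the capital of France?",
                      return_tensors="pt")).pooler_output
p_vec = ctx_enc(**ctx_tok("Paris is the capital and largest city of France.",
                          return_tensors="pt")).pooler_output
score = (q_vec @ p_vec.T).item()  # dot-product similarity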

How DPR is Trained

The two encoders are trained using a technique called contrastive learning. The goal is to train the models so that the vector for a question is mathematically very close (high dot-product similarity) to the vector of its correct answer passage, while simultaneously being very far from the vectors of all other "negative" (incorrect) passages.

This specialized training makes DPR highly effective for building the semantic search capability at the heart of RAG.

FAISS: The Vector Search Library

A Vector Store (or Vector Database) is a system designed for the efficient storage and searching of high-dimensional vectors. While a simple retriever could just calculate the distance between a query vector and every document vector one by one, this brute-force approach is prohibitively slow for millions of documents.

FAISS (Facebook AI Similarity Search) is a highly optimized open-source library that provides the algorithms for this efficient search. It is the engine inside many popular vector stores.

How FAISS Works (High-Level)

Instead of a brute-force search, FAISS uses indexing techniques to speed up the process dramatically. A common approach is Approximate Nearest Neighbor (ANN) search.

flowchart TD
    subgraph BruteForce ["Brute-Force Search (Slow)"]
        A["Query Vector"] --> B["Compare with<br/>Doc Vector 1"]
        A --> C["Compare with<br/>Doc Vector 2"]
        A --> D["Compare with<br/>Doc Vector ...n"]
    end

    subgraph FAISS ["FAISS ANN Search (Fast)"]
        E["Query Vector"] --> F["Find a few<br/>'candidate' clusters"]
        F --> G["Search only within<br/>those clusters"]
    end

    G --> H["⚡️ Fast, Approximate Results"]

    style A fill:#e3f2fd,stroke:#1976d2
    style E fill:#e3f2fd,stroke:#1976d2
    style D fill:#ffcdd2,stroke:#B71C1C
    style H fill:#c8e6c9,stroke:#1B5E20
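To make the cluster-then-search idea concrete, here is a minimal sketch of ANN search with a FAISS IVF index, where nprobe controls how many candidate clusters are actually searched (the corpus and parameter values are illustrative placeholders):

# A minimal sketch of ANN search with a FAISS IVF index
import numpy as np
import faiss

d = 768                                            # embedding dimension
xb = np.random.rand(100_000, d).astype("float32")  # placeholder corpus vectors

quantizer = faiss.IndexFlatIP(d)                   # assigns vectors to clusters
index = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
index.train(xb)                                    # learn the 1024 clusters
index.add(xb)

index.nprobe = 8                                   # search only 8 candidate clusters
query = np.random.rand(1, d).astype("float32")
scores, ids = index.search(query, 5)               # fast, approximate top-5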

FAISS Index Types

FAISS offers different index types optimized for different use cases (a construction sketch follows the list):

1. Flat (Exact Search)

  • Use case: Small datasets (< 10K vectors)
  • Pros: Perfect accuracy, simple
  • Cons: Slow for large datasets
  • Code: IndexFlatIP (Inner Product), IndexFlatL2 (L2 distance)

2. IVF (Inverted File)

  • Use case: Medium datasets (10K - 1M vectors)
  • Pros: Good balance of speed and accuracy
  • Cons: Requires training step
  • Code: IndexIVFFlat, IndexIVFPQ

3. HNSW (Hierarchical Navigable Small World)

  • Use case: Large datasets (1M+ vectors)
  • Pros: Very fast, good accuracy
  • Cons: Higher memory usage
  • Code: IndexHNSWFlat

4. PQ (Product Quantization)

  • Use case: Memory-constrained environments
  • Pros: Very low memory usage
  • Cons: Reduced accuracy
  • Code: IndexPQ, IndexIVFPQ
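As a rough guide to the constructors above, this sketch builds the remaining index types with the faiss library (the IVF variant was sketched in the previous section; dimensions and parameters are illustrative):

# Illustrative constructors for the index types above
import numpy as np
import faiss

d = 768
xb = np.random.rand(10_000, d).astype("float32")  # placeholder corpus vectors

flat = faiss.IndexFlatIP(d)        # 1. Flat: exact inner-product search
flat.add(xb)

hnsw = faiss.IndexHNSWFlat(d, 32)  # 3. HNSW: 32 = graph connectivity (M)
hnsw.add(xb)

pq = faiss.IndexPQ(d, 96, 8)       # 4. PQ: 96 sub-quantizers, 8 bits each
pq.train(xb)                       # PQ also requires a training step
pq.add(xb)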

DPR Training Process

Step 1: Data Preparation

DPR requires training triplets:

  • Question: User query
  • Positive Passage: Correct answer passage
  • Negative Passages: Incorrect but plausible passages

flowchart TD
    A["Training Data"] --> B["Question:<br/>'What is the capital of France?'"]
    A --> C["Positive Passage:<br/>'Paris is the capital and largest city of France...'"]
    A --> D["Negative Passages:<br/>'London is the capital of England...'<br/>'Madrid is the capital of Spain...'"]

    style B fill:#e3f2fd,stroke:#1976d2
    style C fill:#c8e6c9,stroke:#1B5E20
    style D fill:#ffcdd2,stroke:#B71C1C
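In code, one training example might be represented as a plain record like the following (a hypothetical shape, mirroring the diagram above):

# One DPR training triplet as a plain dictionary (hypothetical shape)
example = {
    "question": "What is the capital of France?",
    "positive_passage": "Paris is the capital and largest city of France...",
    "negative_passages": [
        "London is the capital of England...",
        "Madrid is the capital of Spain...",
    ],
}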

Step 2: Dual Encoder Architecture

flowchart TD
    subgraph QuestionEncoder ["Question Encoder (BERT)"]
        A["Question Text"] --> B["BERT Layers"] --> C["Question Vector"]
    end

    subgraph ContextEncoder ["Context Encoder (BERT)"]
        D["Passage Text"] --> E["BERT Layers"] --> F["Passage Vector"]
    end

    subgraph SimilarityComputation ["Similarity Computation"]
        C --> G["Dot Product<br/>Similarity Score"]
        F --> G
    end

    G --> H["Loss Function<br/>(Contrastive Learning)"]

    style C fill:#e3f2fd,stroke:#1976d2
    style F fill:#fff3e0,stroke:#f57c00
    style G fill:#f3e5f5,stroke:#7b1fa2

Step 3: Contrastive Learning Loss

The loss function ensures:

  • High similarity between question and positive passage
  • Low similarity between question and negative passages

\[\mathcal{L} = -\log \frac{e^{\mathrm{sim}(q, p^+)}}{\sum_{i} e^{\mathrm{sim}(q, p_i)}}\]

Where:

  • \(q\) = question vector
  • \(p^+\) = positive passage vector
  • \(p_i\) = all passage vectors (positive + negatives)
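In practice this is a cross-entropy over similarity scores, typically computed with in-batch negatives. Here is a minimal PyTorch sketch, assuming q_vecs and p_vecs are batches of question and passage vectors produced by the two encoders:

# A minimal in-batch contrastive loss sketch (PyTorch assumed)
import torch
import torch.nn.functional as F

def dpr_loss(q_vecs, p_vecs):
    # q_vecs: (B, d) question vectors; p_vecs: (B, d) passage vectors.
    # Row i of p_vecs is the positive for question i; every other row
    # in the batch serves as a negative ("in-batch negatives").
    scores = q_vecs @ p_vecs.T               # (B, B) dot-product similarities
    targets = torch.arange(q_vecs.size(0))   # positives sit on the diagonal
    return F.cross_entropy(scores, targets)  # = -log softmax at the positive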


Vector Store Implementation Patterns

1. Embedding-First Approach

# Pseudo-code for embedding-first pattern
import numpy as np
import faiss

def build_index(documents):
    # Encode every document up front, then index in one shot
    embeddings = np.array(
        [embedding_model.encode(doc) for doc in documents], dtype="float32"
    )

    # Build FAISS index; FAISS expects a contiguous float32 matrix,
    # not a Python list
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)
    return index

def search(query, index, k=5):
    # Queries must also be 2D float32 arrays, even for a single query
    query_embedding = np.array([embedding_model.encode(query)], dtype="float32")
    scores, indices = index.search(query_embedding, k)
    return indices[0], scores[0]

2. Streaming Approach

# Pseudo-code for streaming pattern
import numpy as np
import faiss

def build_streaming_index(document_stream, embedding_dimension):
    index = faiss.IndexFlatIP(embedding_dimension)

    # Encode and add documents batch by batch, so the full corpus
    # never has to sit in memory at once
    for batch in document_stream:
        embeddings = np.asarray(embedding_model.encode(batch), dtype="float32")
        index.add(embeddings)

    return index

3. Hybrid Search Pattern

Combines dense (vector) and sparse (keyword) search:

flowchart TD
    A["User Query"] --> B["Dense Search<br/>(Vector Similarity)"]
    A --> C["Sparse Search<br/>(BM25/TF-IDF)"]

    B --> D["Dense Results<br/>(Semantic matches)"]
    C --> E["Sparse Results<br/>(Keyword matches)"]

    D --> F["Result Fusion<br/>(Weighted combination)"]
    E --> F

    F --> G["Final Ranked Results"]

    style D fill:#e3f2fd,stroke:#1976d2
    style E fill:#fff3e0,stroke:#f57c00
    style G fill:#c8e6c9,stroke:#1B5E20
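One simple way to implement the fusion step is a weighted sum of normalized scores. This is an illustrative sketch: dense_scores and sparse_scores are hypothetical {doc_id: score} maps, and alpha is a tunable dense/sparse weight:

# A minimal sketch of weighted score fusion for hybrid search
def fuse_results(dense_scores, sparse_scores, alpha=0.7):
    # Min-max normalize each score set so the two scales are comparable
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    dense = normalize(dense_scores)
    sparse = normalize(sparse_scores)

    # Weighted combination; documents missing from one list score 0 there
    fused = {}
    for doc in set(dense) | set(sparse):
        fused[doc] = alpha * dense.get(doc, 0.0) + (1 - alpha) * sparse.get(doc, 0.0)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)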

Popular Vector Stores

1. Open Source Options

Chroma

  • Best for: Prototyping, small-scale applications
  • Pros: Simple API, lightweight, good for development
  • Cons: Limited scalability
  • Use case: Local development, proof of concepts

Weaviate

  • Best for: Production applications with complex filtering
  • Pros: GraphQL API, built-in ML models, schema support
  • Cons: More complex setup
  • Use case: Enterprise applications with structured data

Milvus

  • Best for: Large-scale production deployments
  • Pros: High performance, horizontal scaling, multiple language support
  • Cons: Complex deployment and management
  • Use case: High-throughput production systems

2. Managed/Cloud Options

Pinecone

  • Best for: Managed vector search without infrastructure concerns
  • Pros: Fully managed, easy integration, good performance
  • Cons: Vendor lock-in, cost at scale
  • Use case: Startups, rapid prototyping

Qdrant

  • Best for: High-performance applications with filtering requirements
  • Pros: Fast, good filtering capabilities, self-hosted option
  • Cons: Relatively new ecosystem
  • Use case: Performance-critical applications

AWS OpenSearch

  • Best for: Organizations already using AWS
  • Pros: Integrated with AWS ecosystem, managed service
  • Cons: AWS-specific, potentially higher costs
  • Use case: AWS-native applications

Performance Optimization Strategies

1. Index Optimization

flowchart TD
    A["Choose Index Type"] --> B["Flat: < 10K vectors"]
    A --> C["IVF: 10K-1M vectors"]
    A --> D["HNSW: > 1M vectors"]

    B --> E["Perfect Accuracy<br/>Slower Search"]
    C --> F["Good Balance<br/>Requires Training"]
    D --> G["Fast Search<br/>Higher Memory"]

    style B fill:#c8e6c9,stroke:#1B5E20
    style C fill:#fff3e0,stroke:#f57c00
    style D fill:#e3f2fd,stroke:#1976d2

2. Embedding Optimization

  • Dimension Reduction: Use techniques like PCA to reduce vector dimensions (see the sketch after this list)
  • Quantization: Trade accuracy for memory/speed using product quantization
  • Batch Processing: Process multiple queries simultaneously
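A minimal sketch of the dimension-reduction idea using faiss.PCAMatrix (all sizes are illustrative):

# Reduce 768-d vectors to 128-d before indexing
import numpy as np
import faiss

d_in, d_out = 768, 128
xb = np.random.rand(50_000, d_in).astype("float32")  # placeholder vectors

pca = faiss.PCAMatrix(d_in, d_out)
pca.train(xb)                     # learn the projection from the corpus
xb_small = pca.apply_py(xb)       # (50000, 128) reduced vectors

index = faiss.IndexFlatIP(d_out)  # index the smaller vectors
index.add(xb_small)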

3. Caching Strategies

  • Query Caching: Cache frequently asked queries
  • Embedding Caching: Cache computed embeddings (a minimal sketch follows this list)
  • Result Caching: Cache search results for common patterns
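As one example, embedding caching can be as simple as memoizing the encode call (a minimal sketch; embedding_model is a hypothetical encoder with an encode(text) method):

# A minimal embedding cache via memoization
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_encode(text):
    # Repeated queries hit the in-memory cache instead of the model
    return embedding_model.encode(text)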

Advanced DPR Techniques

1. Hard Negative Mining

Improve training by finding challenging negative examples:

# Pseudo-code for hard negative mining
import numpy as np

def mine_hard_negatives(question, positive_passage, candidate_passages, k=10):
    question_embedding = question_encoder(question)

    # Find passages that are similar to the question but are not the correct answer
    similarities = []
    for passage in candidate_passages:
        if passage != positive_passage:
            passage_embedding = context_encoder(passage)
            sim = float(np.dot(question_embedding, passage_embedding))
            similarities.append((passage, sim))

    # Select the top-k most similar (but incorrect) passages as hard negatives
    similarities.sort(key=lambda x: x[1], reverse=True)
    return [passage for passage, _ in similarities[:k]]

2. Cross-Encoder Reranking

Use a more sophisticated model to rerank initial DPR results:

flowchart TD
    A["Query"] --> B["DPR Retrieval<br/>(Fast, Approximate)"]
    B --> C["Top-100 Candidates"]
    C --> D["Cross-Encoder Reranking<br/>(Slow, Accurate)"]
    D --> E["Top-10 Final Results"]

    style B fill:#e3f2fd,stroke:#1976d2
    style D fill:#fff3e0,stroke:#f57c00
    style E fill:#c8e6c9,stroke:#1B5E20
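One way to implement the reranking stage is with the sentence-transformers CrossEncoder class; this sketch assumes the public cross-encoder/ms-marco-MiniLM-L-6-v2 checkpoint:

# A minimal cross-encoder reranking sketch
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_k=10):
    # Score each (query, passage) pair jointly -- slower than comparing
    # independent embeddings, but considerably more accurate
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]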

3. Multi-Vector Approaches

Use multiple embedding models for different aspects:

  • ColBERT: Token-level embeddings for fine-grained matching
  • Multi-aspect embeddings: Separate vectors for different semantic aspects
  • Ensemble retrieval: Combine multiple retrieval models


Evaluation Metrics

1. Retrieval Accuracy

  • Recall@k: Fraction of the relevant documents that appear in the top-k results
  • Precision@k: Fraction of the retrieved documents that are relevant
  • Mean Reciprocal Rank (MRR): Average, across queries, of the reciprocal rank of the first relevant result (Recall@k and MRR are sketched after this list)
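A minimal sketch of Recall@k and MRR, where results maps each query to its ranked list of document ids and relevant maps each query to its set of gold document ids:

# Illustrative implementations of Recall@k and MRR
def recall_at_k(results, relevant, k):
    per_query = [
        len(set(ranked[:k]) & relevant[q]) / len(relevant[q])
        for q, ranked in results.items()
    ]
    return sum(per_query) / len(per_query)

def mean_reciprocal_rank(results, relevant):
    rr = []
    for q, ranked in results.items():
        # Rank (1-based) of the first relevant document, if any
        rank = next((i + 1 for i, doc in enumerate(ranked) if doc in relevant[q]), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)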

2. Efficiency Metrics

  • Query Latency: Time to process a single query
  • Throughput: Queries processed per second
  • Memory Usage: RAM required for index storage
  • Index Build Time: Time to construct the search index

3. End-to-End RAG Metrics

  • Answer Accuracy: Correctness of final generated answers
  • Faithfulness: How well answers stick to retrieved context
  • Relevance: How well retrieved context addresses the query

Common Challenges & Solutions

Challenge 1: Cold Start Problem

Problem: Poor performance with limited training data

Solutions:

  • Use pre-trained models (e.g., sentence-transformers)
  • Domain adaptation techniques
  • Synthetic data generation
  • Transfer learning from related domains

Challenge 2: Out-of-Domain Queries

Problem: Poor retrieval for queries outside the training distribution

Solutions:

  • Diverse training data
  • Query augmentation techniques
  • Hybrid search (combine dense + sparse)
  • Continuous learning from user feedback

Challenge 3: Scalability Issues

Problem: Performance degradation with growing document collections

Solutions:

  • Hierarchical indexing
  • Distributed search architectures
  • Approximate search algorithms
  • Efficient index compression

Challenge 4: Embedding Quality

Problem: Embeddings don't capture domain-specific semantics

Solutions:

  • Domain-specific fine-tuning
  • Contrastive learning on domain data
  • Multi-task learning approaches
  • Regular model evaluation and updates


Integration with RAG Pipeline

Vector stores and DPR fit into the broader [[310-Retrieval-Augmented-Generation-RAG|RAG]] architecture:

flowchart TD
    A["Documents"] --> B["Text Preprocessing"]
    B --> C["DPR Context Encoder"]
    C --> D["Vector Store (FAISS)"]

    E["User Query"] --> F["DPR Question Encoder"]
    F --> G["Query Vector"]
    G --> H["Vector Search"]
    D --> H

    H --> I["Retrieved Passages"]
    I --> J["Generator (LLM)"]
    J --> K["Final Answer"]

    style D fill:#e3f2fd,stroke:#1976d2
    style H fill:#fff3e0,stroke:#f57c00
    style K fill:#c8e6c9,stroke:#1B5E20
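Putting the pieces together, a compact sketch of this pipeline might look like the following, where encode_passage, encode_question, and generate_answer are hypothetical stand-ins for the DPR encoders and the generator LLM:

# A minimal end-to-end sketch of the RAG retrieval path
import numpy as np
import faiss

def build_rag_index(passages):
    vecs = np.array([encode_passage(p) for p in passages], dtype="float32")
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)
    return index

def answer(query, index, passages, k=5):
    q_vec = np.array([encode_question(query)], dtype="float32")
    _, ids = index.search(q_vec, k)
    context = [passages[i] for i in ids[0]]
    return generate_answer(query, context)  # generator LLM consumes the context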

Next Steps

Understanding vector stores and DPR provides the foundation for building efficient retrieval systems. The next component to explore is the RAG Generator, which takes the retrieved context and produces coherent answers.

The combination of high-quality embeddings (DPR) and efficient search (FAISS) enables RAG systems to scale to millions of documents while maintaining fast response times and high accuracy.