313: Vector Stores & Dense Passage Retrieval (DPR)¶
Chapter Overview
The Retriever's ability to find relevant information quickly and accurately depends on two key technologies:
1. A high-quality embedding model to convert text into meaningful numerical vectors.
2. A high-performance vector store to store and search through those vectors.
Dense Passage Retrieval (DPR) and FAISS are cornerstone examples of these technologies.
Dense Passage Retrieval (DPR): The Embedding Model¶
DPR is a specialized model architecture, usually based on BERT, designed specifically for the task of retrieval. It's not a single model, but a pair of models trained together:
- Context Encoder: This model is an expert at creating vector representations (embeddings) for passages of text (the potential answers).
- Question Encoder: This model is an expert at creating embeddings for questions.
How DPR is Trained¶
The two encoders are trained using a technique called contrastive learning. The goal is to train the models so that the vector for a question is mathematically very close (high dot-product similarity) to the vector of its correct answer passage, while simultaneously being very far from the vectors of all other "negative" (incorrect) passages.
This specialized training makes DPR highly effective for building the semantic search capability at the heart of RAG.
FAISS: The Vector Search Library¶
A Vector Store (or Vector Database) is a system designed for the efficient storage and searching of high-dimensional vectors. While a simple retriever could calculate the distance between the query vector and every document vector one by one, this brute-force approach becomes prohibitively slow for millions of documents.
FAISS (Facebook AI Similarity Search) is a highly optimized open-source library that provides the algorithms for this efficient search. It is the engine inside many popular vector stores.
How FAISS Works (High-Level)¶
Instead of a brute-force search, FAISS uses indexing techniques to speed up the process dramatically. A common approach is Approximate Nearest Neighbor (ANN) search.
flowchart TD
subgraph BruteForce ["Brute-Force Search (Slow)"]
A["Query Vector"] --> B["Compare with<br/>Doc Vector 1"]
A --> C["Compare with<br/>Doc Vector 2"]
A --> D["Compare with<br/>Doc Vector ...n"]
end
subgraph FAISS ["FAISS ANN Search (Fast)"]
E["Query Vector"] --> F["Find a few<br/>'candidate' clusters"]
F --> G["Search only within<br/>those clusters"]
end
G --> H["⚡️ Fast, Approximate Results"]
style A fill:#e3f2fd,stroke:#1976d2
style E fill:#e3f2fd,stroke:#1976d2
style D fill:#ffcdd2,stroke:#B71C1C
style H fill:#c8e6c9,stroke:#1B5E20
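A minimal sketch of this cluster-then-search idea, using the `faiss` Python package with random vectors purely as stand-in data: an IVF index partitions the corpus into clusters at training time and then searches only `nprobe` of them per query.

```python
import faiss
import numpy as np

d = 128                                                          # embedding dimension
doc_vectors = np.random.random((10_000, d)).astype("float32")    # stand-in corpus

quantizer = faiss.IndexFlatL2(d)                 # used to assign vectors to clusters
index = faiss.IndexIVFFlat(quantizer, d, 100)    # partition into 100 clusters
index.train(doc_vectors)                         # learn the cluster centroids
index.add(doc_vectors)

index.nprobe = 8                                 # search only 8 candidate clusters per query
query = np.random.random((1, d)).astype("float32")
distances, ids = index.search(query, 5)          # approximate top-5 neighbours
```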
FAISS Index Types¶
FAISS offers different index types optimized for different use cases:
1. Flat Index (Exact Search)¶
- Use case: Small datasets (< 10K vectors)
- Pros: Perfect accuracy, simple
- Cons: Slow for large datasets
- Code: `IndexFlatIP` (Inner Product), `IndexFlatL2` (L2 distance)
2. IVF (Inverted File)¶
- Use case: Medium datasets (10K - 1M vectors)
- Pros: Good balance of speed and accuracy
- Cons: Requires training step
- Code: `IndexIVFFlat`, `IndexIVFPQ`
3. HNSW (Hierarchical Navigable Small World)¶
- Use case: Large datasets (1M+ vectors)
- Pros: Very fast, good accuracy
- Cons: Higher memory usage
- Code: `IndexHNSWFlat`
4. PQ (Product Quantization)¶
- Use case: Memory-constrained environments
- Pros: Very low memory usage
- Cons: Reduced accuracy
- Code: `IndexPQ`, `IndexIVFPQ`
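A minimal sketch of constructing these index types with FAISS's `index_factory` helper (the dimension and parameter values below are illustrative, not recommendations):

```python
import faiss

d = 64  # embedding dimension (illustrative)

flat = faiss.index_factory(d, "Flat")         # exact search, perfect accuracy
ivf  = faiss.index_factory(d, "IVF100,Flat")  # inverted file with 100 clusters (requires .train())
hnsw = faiss.index_factory(d, "HNSW32")       # graph-based ANN with 32 links per node
pq   = faiss.index_factory(d, "PQ8")          # product quantization into 8 sub-vectors (requires .train())
```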
DPR Training Process¶
Step 1: Data Preparation¶
DPR requires training triplets:
- Question: User query
- Positive Passage: Correct answer passage
- Negative Passages: Incorrect but plausible passages
flowchart TD
A["Training Data"] --> B["Question:<br/>'What is the capital of France?'"]
A --> C["Positive Passage:<br/>'Paris is the capital and largest city of France...'"]
A --> D["Negative Passages:<br/>'London is the capital of England...'<br/>'Madrid is the capital of Spain...'"]
style B fill:#e3f2fd,stroke:#1976d2
style C fill:#c8e6c9,stroke:#1B5E20
style D fill:#ffcdd2,stroke:#B71C1C
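In code, one training example might be represented like this (the field names are illustrative, not a fixed DPR format):

```python
# One illustrative DPR training triplet (field names are hypothetical)
training_example = {
    "question": "What is the capital of France?",
    "positive_passage": "Paris is the capital and largest city of France...",
    "negative_passages": [
        "London is the capital of England...",
        "Madrid is the capital of Spain...",
    ],
}
```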
Step 2: Dual Encoder Architecture¶
flowchart TD
subgraph QuestionEncoder ["Question Encoder (BERT)"]
A["Question Text"] --> B["BERT Layers"] --> C["Question Vector"]
end
subgraph ContextEncoder ["Context Encoder (BERT)"]
D["Passage Text"] --> E["BERT Layers"] --> F["Passage Vector"]
end
subgraph SimilarityComputation ["Similarity Computation"]
C --> G["Dot Product<br/>Similarity Score"]
F --> G
end
G --> H["Loss Function<br/>(Contrastive Learning)"]
style C fill:#e3f2fd,stroke:#1976d2
style F fill:#fff3e0,stroke:#f57c00
style G fill:#f3e5f5,stroke:#7b1fa2
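A minimal sketch of the dual-encoder forward pass, assuming the Hugging Face `transformers` library and the public `facebook/dpr-*` checkpoints:

```python
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

q_name = "facebook/dpr-question_encoder-single-nq-base"
ctx_name = "facebook/dpr-ctx_encoder-single-nq-base"
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(q_name)
q_encoder = DPRQuestionEncoder.from_pretrained(q_name)
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained(ctx_name)
ctx_encoder = DPRContextEncoder.from_pretrained(ctx_name)

with torch.no_grad():
    # Each encoder produces a single pooled vector per input text
    q_vec = q_encoder(**q_tokenizer("What is the capital of France?", return_tensors="pt")).pooler_output
    p_vec = ctx_encoder(**ctx_tokenizer("Paris is the capital and largest city of France.", return_tensors="pt")).pooler_output

similarity = (q_vec @ p_vec.T).item()  # dot-product similarity score
```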
Step 3: Contrastive Learning Loss¶
The loss function ensures:
- High similarity between question and positive passage
- Low similarity between question and negative passages

The standard DPR objective is the negative log-likelihood of the positive passage under a softmax over dot-product similarities:

\[
L(q, p^+, p_1, \ldots, p_n) = -\log \frac{\exp\left(\mathrm{sim}(q, p^+)\right)}{\sum_{i} \exp\left(\mathrm{sim}(q, p_i)\right)}
\]

Where:
- \(q\) = question vector
- \(p^+\) = positive passage vector
- \(p_i\) = all passage vectors (positive + negatives)
- \(\mathrm{sim}(q, p)\) = dot-product similarity \(q \cdot p\)
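A minimal PyTorch sketch of this objective in its common in-batch negatives form, where each question's positive passage also serves as a negative for every other question in the batch:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(question_vecs, passage_vecs):
    # question_vecs, passage_vecs: tensors of shape (batch_size, dim);
    # passage i is the positive for question i, all other rows act as negatives.
    scores = question_vecs @ passage_vecs.T   # (batch, batch) dot-product similarities
    targets = torch.arange(scores.size(0))    # the diagonal holds the positive pairs
    return F.cross_entropy(scores, targets)   # equivalent to -log softmax of the positive
```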
Vector Store Implementation Patterns¶
1. Embedding-First Approach¶
# Pseudo-code for embedding-first pattern: embed every document up front, then build the index
import faiss
import numpy as np

def build_index(documents, embedding_model):
    # Encode all documents into a float32 matrix of shape (num_docs, dim)
    embeddings = np.array(
        [embedding_model.encode(doc) for doc in documents], dtype="float32"
    )
    # Build FAISS index sized to the embedding dimension
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)
    return index

def search(query, index, embedding_model, k=5):
    # FAISS expects a 2D batch of query vectors, so wrap the single query
    query_embedding = np.array([embedding_model.encode(query)], dtype="float32")
    scores, indices = index.search(query_embedding, k)
    return indices[0], scores[0]
2. Streaming Approach¶
# Pseudo-code for streaming pattern: add document batches to the index as they arrive
import faiss
import numpy as np

def build_streaming_index(document_stream, embedding_model, embedding_dimension):
    index = faiss.IndexFlatIP(embedding_dimension)
    for batch in document_stream:
        # Encode each incoming batch and append it to the existing index
        embeddings = np.array(embedding_model.encode(batch), dtype="float32")
        index.add(embeddings)
    return index
3. Hybrid Search Pattern¶
Combines dense (vector) and sparse (keyword) search:
flowchart TD
A["User Query"] --> B["Dense Search<br/>(Vector Similarity)"]
A --> C["Sparse Search<br/>(BM25/TF-IDF)"]
B --> D["Dense Results<br/>(Semantic matches)"]
C --> E["Sparse Results<br/>(Keyword matches)"]
D --> F["Result Fusion<br/>(Weighted combination)"]
E --> F
F --> G["Final Ranked Results"]
style D fill:#e3f2fd,stroke:#1976d2
style E fill:#fff3e0,stroke:#f57c00
style G fill:#c8e6c9,stroke:#1B5E20
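A minimal sketch of one common fusion scheme, a weighted combination of min-max-normalized scores (the weighting and normalization choices are illustrative; reciprocal rank fusion is another popular option):

```python
def hybrid_fuse(dense_scores, sparse_scores, alpha=0.5):
    """Fuse two {doc_id: score} dicts into a single ranked list."""
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc_id: (s - lo) / span for doc_id, s in scores.items()}

    dense_n, sparse_n = normalize(dense_scores), normalize(sparse_scores)
    fused = {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```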
Popular Vector Store Solutions¶
1. Open Source Options¶
Chroma¶
- Best for: Prototyping, small-scale applications
- Pros: Simple API, lightweight, good for development
- Cons: Limited scalability
- Use case: Local development, proof of concepts
Weaviate¶
- Best for: Production applications with complex filtering
- Pros: GraphQL API, built-in ML models, schema support
- Cons: More complex setup
- Use case: Enterprise applications with structured data
Milvus¶
- Best for: Large-scale production deployments
- Pros: High performance, horizontal scaling, multiple language support
- Cons: Complex deployment and management
- Use case: High-throughput production systems
2. Managed/Cloud Options¶
Pinecone¶
- Best for: Managed vector search without infrastructure concerns
- Pros: Fully managed, easy integration, good performance
- Cons: Vendor lock-in, cost at scale
- Use case: Startups, rapid prototyping
Qdrant¶
- Best for: High-performance applications with filtering requirements
- Pros: Fast, good filtering capabilities, self-hosted option
- Cons: Relatively new ecosystem
- Use case: Performance-critical applications
AWS OpenSearch¶
- Best for: Organizations already using AWS
- Pros: Integrated with AWS ecosystem, managed service
- Cons: AWS-specific, potentially higher costs
- Use case: AWS-native applications
Performance Optimization Strategies¶
1. Index Optimization¶
flowchart TD
A["Choose Index Type"] --> B["Flat: < 10K vectors"]
A --> C["IVF: 10K-1M vectors"]
A --> D["HNSW: > 1M vectors"]
B --> E["Perfect Accuracy<br/>Slower Search"]
C --> F["Good Balance<br/>Requires Training"]
D --> G["Fast Search<br/>Higher Memory"]
style B fill:#c8e6c9,stroke:#1B5E20
style C fill:#fff3e0,stroke:#f57c00
style D fill:#e3f2fd,stroke:#1976d2
2. Embedding Optimization¶
- Dimension Reduction: Use techniques like PCA to reduce vector dimensions
- Quantization: Trade accuracy for memory/speed using product quantization
- Batch Processing: Process multiple queries simultaneously
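As a sketch of the dimension-reduction idea, FAISS ships a PCA transform that can be trained on a sample of vectors and applied before indexing (the dimensions below are illustrative):

```python
import faiss
import numpy as np

original_dim, reduced_dim = 768, 128
vectors = np.random.random((5_000, original_dim)).astype("float32")  # stand-in embeddings

pca = faiss.PCAMatrix(original_dim, reduced_dim)
pca.train(vectors)               # learn the projection from sample vectors
reduced = pca.apply_py(vectors)  # (5000, 128) vectors ready for a smaller, faster index
```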
3. Caching Strategies¶
- Query Caching: Cache frequently asked queries
- Embedding Caching: Cache computed embeddings
- Result Caching: Cache search results for common patterns
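A minimal sketch of embedding caching using Python's built-in memoization, assuming the `sentence-transformers` library and a small public checkpoint:

```python
from functools import lru_cache
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

@lru_cache(maxsize=10_000)
def cached_embedding(text: str):
    # The expensive encode call runs only once per unique text;
    # the result is stored as a tuple so it is hashable and reusable.
    return tuple(model.encode(text))
```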
Advanced DPR Techniques¶
1. Hard Negative Mining¶
Improve training by finding challenging negative examples:
# Pseudo-code for hard negative mining
def mine_hard_negatives(question, positive_passage, candidate_passages, k=5):
    question_embedding = question_encoder(question)
    # Score passages that are similar to the question but are not the correct answer
    similarities = []
    for passage in candidate_passages:
        if passage != positive_passage:
            passage_embedding = context_encoder(passage)
            sim = dot_product(question_embedding, passage_embedding)
            similarities.append((passage, sim))
    # Select the top-k most similar (but incorrect) passages as hard negatives
    hard_negatives = sorted(similarities, key=lambda x: x[1], reverse=True)[:k]
    return [passage for passage, _ in hard_negatives]
2. Cross-Encoder Reranking¶
Use a more sophisticated model to rerank initial DPR results:
flowchart TD
A["Query"] --> B["DPR Retrieval<br/>(Fast, Approximate)"]
B --> C["Top-100 Candidates"]
C --> D["Cross-Encoder Reranking<br/>(Slow, Accurate)"]
D --> E["Top-10 Final Results"]
style B fill:#e3f2fd,stroke:#1976d2
style D fill:#fff3e0,stroke:#f57c00
style E fill:#c8e6c9,stroke:#1B5E20
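A minimal sketch of the reranking step, assuming the `sentence-transformers` library and a public MS MARCO cross-encoder checkpoint:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidate_passages, top_k=10):
    # The cross-encoder reads each (query, passage) pair jointly, which is slower
    # than comparing precomputed embeddings but considerably more accurate.
    scores = reranker.predict([(query, passage) for passage in candidate_passages])
    ranked = sorted(zip(candidate_passages, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```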
3. Multi-Vector Approaches¶
Use multiple embedding models for different aspects:
- ColBERT: Token-level embeddings for fine-grained matching
- Multi-aspect embeddings: Separate vectors for different semantic aspects
- Ensemble retrieval: Combine multiple retrieval models
Evaluation Metrics¶
1. Retrieval Accuracy¶
- Recall@k: Fraction of relevant documents in top-k results
- Precision@k: Fraction of retrieved documents that are relevant
- Mean Reciprocal Rank (MRR): Average of the reciprocal rank of the first relevant result, taken across queries
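A minimal sketch of computing Recall@k and MRR from lists of relevant and retrieved document IDs:

```python
def recall_at_k(relevant_ids, retrieved_ids, k):
    # Fraction of the relevant documents that appear in the top-k retrieved results
    return len(set(relevant_ids) & set(retrieved_ids[:k])) / max(len(relevant_ids), 1)

def mean_reciprocal_rank(relevant_per_query, retrieved_per_query):
    # Average of 1/rank of the first relevant result for each query (0 if none is retrieved)
    reciprocal_ranks = []
    for relevant_ids, retrieved_ids in zip(relevant_per_query, retrieved_per_query):
        rank = next((i + 1 for i, doc in enumerate(retrieved_ids) if doc in relevant_ids), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```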
2. Efficiency Metrics¶
- Query Latency: Time to process a single query
- Throughput: Queries processed per second
- Memory Usage: RAM required for index storage
- Index Build Time: Time to construct the search index
3. End-to-End RAG Metrics¶
- Answer Accuracy: Correctness of final generated answers
- Faithfulness: How well answers stick to retrieved context
- Relevance: How well retrieved context addresses the query
Common Challenges & Solutions¶
Challenge 1: Cold Start Problem¶
Problem: Poor performance with limited training data

Solutions:
- Use pre-trained models (e.g., sentence-transformers)
- Domain adaptation techniques
- Synthetic data generation
- Transfer learning from related domains
Challenge 2: Out-of-Domain Queries¶
Problem: Poor retrieval for queries outside the training distribution

Solutions:
- Diverse training data
- Query augmentation techniques
- Hybrid search (combine dense + sparse)
- Continuous learning from user feedback
Challenge 3: Scalability Issues¶
Problem: Performance degradation with growing document collections

Solutions:
- Hierarchical indexing
- Distributed search architectures
- Approximate search algorithms
- Efficient index compression
Challenge 4: Embedding Quality¶
Problem: Embeddings don't capture domain-specific semantics

Solutions:
- Domain-specific fine-tuning
- Contrastive learning on domain data
- Multi-task learning approaches
- Regular model evaluation and updates
Integration with RAG Pipeline¶
Vector stores and DPR fit into the broader [[310-Retrieval-Augmented-Generation-RAG|RAG]] architecture:
flowchart TD
A["Documents"] --> B["Text Preprocessing"]
B --> C["DPR Context Encoder"]
C --> D["Vector Store (FAISS)"]
E["User Query"] --> F["DPR Question Encoder"]
F --> G["Query Vector"]
G --> H["Vector Search"]
D --> H
H --> I["Retrieved Passages"]
I --> J["Generator (LLM)"]
J --> K["Final Answer"]
style D fill:#e3f2fd,stroke:#1976d2
style H fill:#fff3e0,stroke:#f57c00
style K fill:#c8e6c9,stroke:#1B5E20
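Pulling the earlier sketches together, a minimal end-to-end pass over a toy corpus, assuming the `transformers` DPR checkpoints used above and the `faiss` library (the generator call is left abstract):

```python
import faiss
import numpy as np
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

ctx_name = "facebook/dpr-ctx_encoder-single-nq-base"
q_name = "facebook/dpr-question_encoder-single-nq-base"
ctx_tok = DPRContextEncoderTokenizer.from_pretrained(ctx_name)
ctx_enc = DPRContextEncoder.from_pretrained(ctx_name)
q_tok = DPRQuestionEncoderTokenizer.from_pretrained(q_name)
q_enc = DPRQuestionEncoder.from_pretrained(q_name)

documents = [
    "Paris is the capital and largest city of France.",
    "London is the capital of England.",
    "Madrid is the capital of Spain.",
]

with torch.no_grad():
    # Encode and index the documents with the context encoder
    doc_vecs = np.vstack([
        ctx_enc(**ctx_tok(doc, return_tensors="pt")).pooler_output.numpy() for doc in documents
    ]).astype("float32")
    index = faiss.IndexFlatIP(doc_vecs.shape[1])
    index.add(doc_vecs)

    # Encode the query with the question encoder and search the index
    query_vec = q_enc(**q_tok("What is the capital of France?", return_tensors="pt")).pooler_output.numpy().astype("float32")
    scores, ids = index.search(query_vec, 2)

retrieved = [documents[i] for i in ids[0]]
# The retrieved passages would then be inserted into the generator's prompt.
```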
Next Steps¶
Understanding vector stores and DPR provides the foundation for building efficient retrieval systems. The next component to explore is the RAG Generator, which takes the retrieved context and produces coherent answers.
The combination of high-quality embeddings (DPR) and efficient search (FAISS) enables RAG systems to scale to millions of documents while maintaining fast response times and high accuracy.