310: Retrieval-Augmented Generation (RAG)

Chapter Overview

Retrieval-Augmented Generation (RAG) is an AI framework that addresses the "knowledge problem" of Large Language Models. It enhances a model's capabilities by giving it access to external, private, or real-time information at the moment of inference.

RAG is the go-to technique when a model's failure is due to a lack of information, not a lack of reasoning ability. It is a cornerstone of modern AI Engineering.


The Core Problem: Static Knowledge

A Foundation Model's knowledge is frozen at the time of its training. It knows nothing about events that happened after its training cut-off date, and it has no access to your private company documents.

RAG solves this by creating a system that can retrieve relevant information from an external knowledge base and then augment the model's prompt with that information before generating a response.
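The core mechanic is prompt augmentation. Here is a minimal, framework-free sketch of the idea; `retrieve` and `llm` are hypothetical stand-ins for whatever search backend and LLM client you choose:

```python
def build_augmented_prompt(query: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved context with the user's question before generation."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Hypothetical usage -- `retrieve` and `llm` stand in for your own components:
# chunks = retrieve("What is our refund policy?", top_k=3)
# prompt = build_augmented_prompt("What is our refund policy?", chunks)
# answer = llm.generate(prompt)
```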


The RAG Pipeline at a Glance

A RAG system consists of two main components that work in sequence: a Retriever and a Generator.

```mermaid
graph TD
    A[User Query] --> B[Retriever]
    B --> C[Knowledge Base<br/>Vector Database]
    C --> D[Retrieved Context]
    D --> E[Generator LLM]
    A --> E
    E --> F[Grounded Response]

    subgraph "RAG System Flow"
        direction TB
        B
        E
    end

    style B fill:#e3f2fd,stroke:#1976d2
    style E fill:#e8f5e8,stroke:#388e3c
    style F fill:#c8e6c9,stroke:#1B5E20,stroke-width:2px
```

Why RAG is Essential for Production AI

RAG addresses several critical limitations of standalone LLMs:

  1. Knowledge Freshness: RAG can access real-time information that wasn't available during model training
  2. Private Data Access: Connect your model to proprietary databases, documents, and internal knowledge
  3. Source Attribution: Provide citations and references for generated responses
  4. Reduced Hallucinations: Ground responses in factual, retrievable information
  5. Domain Expertise: Specialize your AI system for specific industries or use cases

The Two-Phase RAG Architecture

Phase 1: Indexing (Offline)

This is the preparation phase where your knowledge base is processed and made searchable.

```mermaid
flowchart TD
    A[Source Documents<br/>PDFs, HTML, Databases] --> B[Document Loading]
    B --> C[Text Chunking<br/>Split into smaller pieces]
    C --> D[Embedding Generation<br/>Convert chunks to vectors]
    D --> E[Vector Store<br/>Index for fast retrieval]

    style A fill:#fce4ec,stroke:#c2185b
    style C fill:#fff3e0,stroke:#f57c00
    style D fill:#e3f2fd,stroke:#1976d2
    style E fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
```
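A minimal version of this phase, sketched with Chroma (introduced under "Key RAG Technologies" below). Chroma embeds documents with its default embedding model unless you configure a custom one; the example chunks and index path are placeholders:

```python
import chromadb

# Local on-disk vector store.
client = chromadb.PersistentClient(path="./rag_index")
collection = client.get_or_create_collection("knowledge_base")

# Assume the documents are already loaded and chunked
# (a chunking sketch appears at the end of this chapter).
chunks = [
    "RAG retrieves external context at inference time...",
    "The indexing phase embeds chunks into a vector store...",
]

# Chroma computes an embedding for each chunk and indexes it.
collection.add(
    documents=chunks,
    ids=[f"chunk-{i}" for i in range(len(chunks))],
)
```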

Phase 2: Querying (Online)

This is the real-time process that happens when a user asks a question.

```mermaid
flowchart TD
    A[User Query] --> B[Query Embedding<br/>Convert question to vector]
    B --> C[Similarity Search<br/>Find relevant chunks]
    C --> D[Context Selection<br/>Top-k most relevant]
    D --> E[Prompt Construction<br/>Query + Context]
    E --> F[LLM Generation<br/>Final answer]

    style A fill:#e8f5e8,stroke:#388e3c
    style C fill:#e3f2fd,stroke:#1976d2
    style F fill:#c8e6c9,stroke:#1B5E20,stroke-width:2px
```
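Continuing the indexing sketch above, here is one way the online phase can look. The model name is an assumption; any capable chat model works, and the prompt wording is illustrative:

```python
import chromadb
from openai import OpenAI

client = chromadb.PersistentClient(path="./rag_index")
collection = client.get_or_create_collection("knowledge_base")
llm = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

query = "What happens during the indexing phase?"

# Embed the query and fetch the top-k most similar chunks.
results = collection.query(query_texts=[query], n_results=3)
context = "\n\n".join(results["documents"][0])

# Construct the augmented prompt and generate a grounded answer.
response = llm.chat.completions.create(
    model="gpt-4o-mini",  # assumption: substitute your preferred chat model
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ],
)
print(response.choices[0].message.content)
```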

RAG vs. Fine-Tuning: When to Use Each

| Aspect | RAG | Fine-Tuning |
| --- | --- | --- |
| Use Case | External knowledge, real-time data | Behavior modification, specialized tasks |
| Implementation | Moderate complexity | High complexity |
| Cost | Lower ongoing costs | Higher training costs |
| Updates | Easy to update knowledge base | Requires retraining |
| Transparency | High (can show sources) | Low (black box) |

Common RAG Implementation Patterns

1. Simple RAG

Basic retrieval → generation pipeline suitable for most use cases.

2. Advanced RAG

Includes query rewriting, multiple retrieval rounds, and answer synthesis.
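To make the first of those additions concrete, here is a hedged sketch of query rewriting built on the Chroma and OpenAI clients from the querying example above. The function names and prompt wording are illustrative, not a fixed API:

```python
def rewrite_query(llm, query: str, n: int = 3) -> list[str]:
    """Ask the LLM for alternative phrasings of the user's question."""
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat model works
        messages=[{
            "role": "user",
            "content": f"Rewrite this search query {n} different ways, "
                       f"one rewrite per line:\n{query}",
        }],
    )
    lines = response.choices[0].message.content.strip().splitlines()
    return [query] + [line.strip() for line in lines if line.strip()][:n]

def multi_query_retrieve(collection, llm, query: str, k: int = 3) -> list[str]:
    """Retrieve for each rewrite, then merge and deduplicate the results."""
    merged: dict[str, None] = {}
    for q in rewrite_query(llm, query):
        results = collection.query(query_texts=[q], n_results=k)
        for doc in results["documents"][0]:
            merged.setdefault(doc, None)  # dict preserves first-seen order
    return list(merged)
```

Retrieving against several phrasings of the same question hedges against the user's wording happening to be a poor match for how the knowledge base expresses the answer.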

3. Agentic RAG

Uses AI agents to orchestrate complex retrieval strategies and multi-step reasoning.

4. Multimodal RAG

Extends RAG to handle images, audio, and other non-text data types.


Next Steps: Building Your RAG System

To implement a production-ready RAG system, you'll need to understand:

  1. The Retriever Component - How to build effective search and indexing
  2. The Generator Component - How to craft prompts that use retrieved context effectively
  3. RAG Evaluation and Optimization - How to measure and improve RAG performance

Key RAG Technologies and Tools

Vector Databases

  • Pinecone: Managed vector database service
  • Weaviate: Open-source vector database
  • Chroma: Lightweight vector database for prototyping
  • FAISS: Meta's open-source similarity-search library

Embedding Models

  • OpenAI Embeddings: text-embedding-3-large, text-embedding-3-small
  • Sentence Transformers: Open-source embedding models
  • Cohere Embed: Multilingual embedding service
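What all of these models do is map text to vectors so that semantically similar text lands close together. A small sketch with Sentence Transformers (one of the open-source options above); the model choice is just a common lightweight default:

```python
from sentence_transformers import SentenceTransformer, util

# A small general-purpose model; any embedding model from the
# list above plays the same role in a RAG pipeline.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I reset my password?",
    "Steps to recover account credentials",
    "The weather is nice today",
]
embeddings = model.encode(sentences)

# Cosine similarity: the two password-related sentences should score
# far higher with each other than either does with the third.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
```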

RAG Frameworks

  • LangChain: Comprehensive framework for LLM applications
  • LlamaIndex: Specialized for RAG and data ingestion
  • Haystack: Production-ready NLP framework

Starting Your RAG Journey

Begin with a simple RAG implementation using a managed service like Pinecone or a local solution like Chroma. Focus on getting the basics right: a good chunking strategy, an appropriate embedding model, and well-crafted prompts. You can always optimize and add complexity later.
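As a starting point for the chunking piece, one simple approach is fixed-size chunks that respect paragraph boundaries, with a character overlap so context isn't lost at the seams. The size and overlap values below are assumptions to tune against your own data:

```python
def chunk_text(text: str, max_chars: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks, preferring paragraph boundaries."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Close the current chunk once the next paragraph would overflow it.
        # (A single paragraph longer than max_chars passes through whole.)
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = current[-overlap:]  # carry overlap into the next chunk
        current = (current + "\n" + para).strip()
    if current:
        chunks.append(current)
    return chunks
```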