RAG (Retrieval-Augmented Generation)

Large language models are constrained by their training data cutoff and context window size. Retrieval-Augmented Generation (RAG) works around both limits by adding a retrieval step before generation: a search system pulls relevant documents or chunks from an external knowledge base and injects them into the LLM's context window.

How it works

A typical RAG pipeline has three stages:

Index: Documents are chunked into smaller pieces (paragraphs, pages, or semantic units) and converted into embedding vectors, then stored in a vector database.
Retrieve: The user's query is converted to a vector with the same embedding model used at index time. The vector database returns the chunks whose vectors are closest to it, commonly measured by cosine similarity.
Generate: The retrieved chunks are placed in the LLM's prompt alongside the query. The model answers using that retrieved context rather than relying only on what it learned in training.
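
The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the embed function is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and the sample chunks, the function names, and the prompt wording are all assumptions made for the example. The final LLM call is left out; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Index stage: store each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Retrieve stage: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Generate stage (input side): place retrieved chunks before the question."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Hypothetical knowledge base, already split into chunks.
chunks = [
    "The return window for online orders is 30 days.",
    "Gift cards cannot be refunded or exchanged.",
    "Standard shipping takes 3 to 5 business days.",
]
index = build_index(chunks)
query = "How many days is the return window?"
top = retrieve(index, query, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

In a real system the embed function would call an embedding model and the sorted scan would be replaced by an approximate nearest-neighbor lookup, but the data flow stays the same.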

Why developers use it

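RAG addresses three LLM limitations: knowledge cutoffs (the model can answer questions about recent or proprietary information that was never in its training data), hallucination (responses can be grounded in retrieved passages, which reduces fabricated answers but does not eliminate them), and context efficiency (only the relevant chunks are sent, instead of stuffing an entire knowledge base into every prompt).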

Common patterns

Naive RAG: Simple top-k retrieval and injection into the prompt. Useful as a baseline.
Hybrid search: Combine dense vector retrieval with sparse keyword-based retrieval (BM25). Improves recall for exact matches like product SKUs or proper nouns.
Reranking: Use a cross-encoder model to rerank the top-k retrieved chunks before passing to the LLM. Reduces irrelevant context.
Self-RAG: The LLM decides when to retrieve, then critiques both the retrieved passages and its own outputs before answering.
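
As one illustration of the hybrid search pattern, the dense and sparse result lists have to be merged into a single ranking. The sketch below uses reciprocal rank fusion (RRF), which is one common way to do this and not the only one; the text above does not prescribe a fusion method. The document IDs and the two input rankings are hypothetical, and k=60 is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by more than one retriever rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs from a vector search and a BM25 keyword search.
dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Because RRF works on ranks rather than raw scores, it avoids having to normalize cosine similarities against BM25 scores, which live on different scales. The fused list is also a natural input for a reranking step.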

Common failure modes

RAG quality depends heavily on chunking strategy and embedding model choice. Poorly chunked documents lose context at chunk boundaries, for example when a sentence and the definition it depends on land in different chunks. An embedding model that is poorly matched to the domain, or that differs between indexing and querying, produces unreliable similarity scores. Answer quality is typically limited by retrieval, not by the LLM generation step: if the right chunk is never retrieved, the model cannot use it. Retrieval can also introduce noise that degrades model performance, particularly when the retrieved context is only tangentially relevant or contradicts what the model learned in training.
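
One common way to soften the boundary problem is to let adjacent chunks overlap, so text near a cut appears in both neighbors. The sketch below is a simple word-based version under stated assumptions: the function name, the default sizes, and splitting on whitespace are illustrative choices, and real pipelines often split on tokens, sentences, or document structure instead.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of chunk_size words, sharing overlap words
    between each chunk and the next one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(sample, chunk_size=4, overlap=1)
for piece in pieces:
    print(piece)
```

Larger overlaps reduce the chance of splitting a thought in two, at the cost of a bigger index and more duplicated text in retrieved results.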
