RAG (Retrieval-Augmented Generation)
A technique that improves LLM responses by retrieving relevant documents from an external knowledge base before generation, reducing hallucinations and enabling access to information beyond training data.
Large language models are constrained by their training data cutoff and context window size. Retrieval-Augmented Generation (RAG) addresses this by adding a retrieval step before generation: a search system pulls relevant documents or chunks from an external knowledge base and injects them into the LLM's context window.
How it works
A typical RAG pipeline has three stages:
- Index: Documents are chunked into smaller pieces (paragraphs, pages, or semantic units) and converted into embedding vectors, then stored in a vector database.
- Retrieve: The user's query is converted to a vector using the same embedding model that indexed the documents. The vector database returns the chunks whose embeddings are most similar to the query, typically measured by cosine similarity.
- Generate: The retrieved chunks are inserted into the prompt alongside the query, and the model answers using that retrieved context.
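The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory list stands in for a vector database, and the function names (`build_index`, `retrieve`, `build_prompt`) are hypothetical. The final call to an LLM is left out; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: term counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(chunks):
    """Index stage: store each chunk next to its vector (stand-in for a vector DB)."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    """Retrieve stage: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, chunks):
    """Generate stage (input side): place retrieved chunks in the prompt with the query."""
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    docs = [
        "The refund window is 30 days from purchase.",
        "Shipping takes 5 business days.",
        "Our office is closed on public holidays.",
    ]
    index = build_index(docs)
    question = "How long is the refund window?"
    print(build_prompt(question, retrieve(index, question, k=1)))
```

Swapping `embed` for a real embedding model and the list for a vector database changes the quality of retrieval, but not the shape of the pipeline.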
Why developers use it
RAG addresses three LLM limitations: knowledge cutoffs (the model can answer questions about recent or proprietary data it was never trained on), hallucination (responses are grounded in retrieved passages, which reduces fabricated answers), and context efficiency (only the relevant chunks are sent, rather than an entire knowledge base in every prompt).
Common patterns
- Naive RAG: Simple top-k retrieval and injection into the prompt. Useful as a baseline.
- Hybrid search: Combine dense vector retrieval with sparse keyword-based retrieval (BM25). Improves recall for exact matches like product SKUs or proper nouns.
- Reranking: Use a cross-encoder model to rescore the top-k retrieved chunks and pass only the highest-ranked ones to the LLM. Reduces irrelevant context.
- Self-RAG: The LLM decides when retrieval is needed and critiques both the retrieved passages and its own output.
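For hybrid search, the dense and sparse retrievers each return their own ranked list, and those lists have to be merged somehow. One common way to do this is reciprocal rank fusion (RRF), sketched below. The source does not say which fusion method to use, so treat RRF as an assumption; the function name and the document ids are illustrative, and `k=60` is a conventional smoothing constant rather than a required value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one fused ranking.

    Each document scores 1 / (k + rank) for every list it appears in
    (rank starts at 1). Documents ranked well by several retrievers rise
    to the top. Ties are broken by document id for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]   # hypothetical ids from vector search
    sparse_hits = ["c", "a", "d"]  # hypothetical ids from BM25
    print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Because RRF uses only ranks, it avoids having to put cosine similarities and BM25 scores on a common scale.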
Common failure modes
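RAG quality depends heavily on two factors: chunking strategy and embedding model choice. Poorly chunked documents lose context at boundaries, for example when a sentence is cut in half or a pronoun is separated from the noun it refers to. An embedding model that is poorly suited to the domain, or that differs between indexing and querying, produces unreliable similarity scores. In practice, retrieval quality is usually the bottleneck, not the LLM generation step.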
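One common way to soften the boundary problem is to let consecutive chunks overlap, so that text near a cut appears in both neighbours. The sketch below is a minimal word-based splitter under that assumption. The function name and the default `size` and `overlap` values are illustrative, not recommendations from the source, and real pipelines often split on sentences or tokens instead of whitespace-separated words.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into chunks of up to `size` words, sharing `overlap` words
    between consecutive chunks."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_words(sample, size=4, overlap=1):
        print(chunk)
```

Larger overlaps preserve more context across boundaries at the cost of storing and retrieving more duplicated text.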
Models relevant to RAG (Retrieval-Augmented Generation)
- Command R+: Cohere's enterprise model purpose-built for RAG and production tool-calling pipelines
- Amazon Nova Pro: AWS's multimodal frontier model, natively integrated with Bedrock and the AWS ecosystem
- Gemini 2.5 Flash: Google's fastest and cheapest model with a 1M context, hard to beat on price/performance