RAG (Retrieval-Augmented Generation)

Retrieval-Augmented Generation, or RAG, is a widely used pattern for building language model applications on top of your own data. Instead of relying solely on the knowledge baked into a model during training, RAG retrieves relevant information from an external store at query time and supplies it to the model as context, so the model’s answer can be grounded in that material.

The pipeline has three stages. First, indexing: documents are chunked, embedded, and stored in a vector database. Second, retrieval: the user’s query is embedded and the database returns the most semantically similar chunks. Third, generation: those chunks are placed in the model’s prompt alongside the question, and the model produces an answer grounded in the retrieved evidence.
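The three stages can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions: a bag-of-words counter stands in for a real embedding model, an in-memory list stands in for the vector database, and the generation stage stops at building the prompt rather than calling a model. All function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=40):
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding (assumption): lowercase word counts, not a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents, max_words=40):
    """Stage 1, indexing: chunk each document and store (chunk, vector) pairs."""
    index = []
    for doc in documents:
        for piece in chunk(doc, max_words):
            index.append((piece, embed(piece)))
    return index


def retrieve(index, query, k=2):
    """Stage 2, retrieval: return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Stage 3, generation: assemble the prompt a model would receive."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical sample documents, used only for illustration.
DOCS = [
    "The refund window is 30 days from the delivery date.",
    "Support is available by email on weekdays.",
    "Shipping to Canada takes five to seven business days.",
]

question = "How long is the refund window?"
index = build_index(DOCS)
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real system, embed would call an embedding model and the index would live in a vector database, but the flow of data between the three stages would stay the same.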

RAG addresses several limitations of language models at once. It keeps answers current, since the vector store can be updated without retraining the model; it lets the model draw on private or proprietary data it never saw in training; it enables source citation; and it can reduce hallucination by giving the model real evidence to work from, although the quality of the answer still depends on retrieving the right chunks. This combination is why RAG has become a common default architecture for knowledge-intensive AI applications.