A retrieval pipeline is the end-to-end sequence of steps that takes a query and produces a set of relevant results. It typically consists of embedding the query, searching the vector index, applying filters, and optionally re-ranking the candidates. Because it is the runtime path every request travels to get its answer, its design governs both the speed and the quality of retrieval.
A typical pipeline begins by transforming the query — sometimes rewriting or expanding it — and embedding it into a vector. That vector is used for approximate-nearest-neighbour search, often combined with keyword search in a hybrid setup and constrained by metadata filters. The resulting candidates may then pass through a re-ranking stage that reorders them with a more precise model before the top results are returned to the application or fed to a language model.
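The stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the corpus, the `retrieve` function, and every helper name are hypothetical. The embedding is a toy term-count vector standing in for a real model, the vector search is exhaustive rather than approximate, the hybrid step uses reciprocal rank fusion as one possible way to combine rankings, and the re-ranker is a simple token-overlap score standing in for a more precise model.

```python
import math
import re

# Hypothetical toy corpus; each document carries metadata used for filtering.
DOCS = [
    {"id": 1, "text": "vector index for nearest neighbour search", "lang": "en"},
    {"id": 2, "text": "keyword search with an inverted index", "lang": "en"},
    {"id": 3, "text": "recherche vectorielle par index", "lang": "fr"},
    {"id": 4, "text": "re-ranking candidates with a cross encoder", "lang": "en"},
]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

VOCAB = sorted({tok for d in DOCS for tok in tokenize(d["text"])})
INDEX = {tok: i for i, tok in enumerate(VOCAB)}

def embed(text):
    """Toy embedding: term counts over the corpus vocabulary (stand-in for a model)."""
    vec = [0.0] * len(VOCAB)
    for tok in tokenize(text):
        if tok in INDEX:
            vec[INDEX[tok]] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank(docs, score):
    """Return document ids ordered by descending score."""
    return [d["id"] for d in sorted(docs, key=score, reverse=True)]

def rrf(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) over every ranking."""
    fused = {}
    for ranking in rankings:
        for pos, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + pos)
    return sorted(fused, key=fused.get, reverse=True)

def rerank_score(query, doc):
    """Stand-in for a more precise re-ranker: Jaccard overlap of token sets."""
    q, d = set(tokenize(query)), set(tokenize(doc["text"]))
    return len(q & d) / len(q | d) if q | d else 0.0

def retrieve(query, filters=None, n_candidates=3, top_k=2):
    query = query.strip()                    # 1. query transformation (trivial here)
    q_vec = embed(query)                     # 2. embed the query
    q_tokens = set(tokenize(query))
    # 3. metadata filter, applied before search in this sketch
    pool = [d for d in DOCS
            if all(d.get(key) == val for key, val in (filters or {}).items())]
    # 4. vector search (exhaustive here, where a real system would use ANN)
    by_vector = rank(pool, lambda d: cosine(q_vec, embed(d["text"])))
    # 5. keyword search, fused with the vector ranking for a hybrid result
    by_keyword = rank(pool, lambda d: len(q_tokens & set(tokenize(d["text"]))))
    candidates = rrf([by_vector, by_keyword])[:n_candidates]
    # 6. re-rank the shortlist and return the top results
    by_id = {d["id"]: d for d in pool}
    reranked = sorted(candidates,
                      key=lambda i: rerank_score(query, by_id[i]), reverse=True)
    return reranked[:top_k]

print(retrieve("Vector index search", filters={"lang": "en"}))  # [1, 2]
```

Filtering before search keeps the sketch short; a real system might instead filter inside the index or after retrieval, depending on how selective the filter is.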
Each stage offers levers for tuning the trade-off between speed, cost, and relevance, such as how many candidates the search returns or whether a re-ranker runs at all. A simple pipeline may be just embed-and-search, while a sophisticated one layers in query rewriting, hybrid retrieval, filtering, and re-ranking. Designing the pipeline well, which means choosing which stages to include and how to configure them, is central to building a retrieval system that is both fast and accurate.
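One way to make these choices explicit is to capture them in a configuration object. The sketch below is illustrative only: `PipelineConfig`, `plan`, the stage labels, and the default values are all hypothetical names and numbers chosen for the example, not part of any particular library. It contrasts the embed-and-search pipeline with a fully layered one.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical switches and sizes for a retrieval pipeline."""
    rewrite_query: bool = False
    hybrid: bool = False
    metadata_filter: bool = False
    rerank: bool = False
    n_candidates: int = 100   # candidates handed on by the search stage
    top_k: int = 10           # results returned to the caller

    def __post_init__(self):
        # Cannot return more results than were retrieved as candidates.
        if self.top_k > self.n_candidates:
            raise ValueError("top_k must not exceed n_candidates")

def plan(config: PipelineConfig) -> List[str]:
    """List the stages a request would pass through, in order."""
    stages = []
    if config.rewrite_query:
        stages.append("rewrite_query")
    stages += ["embed_query", f"vector_search(n={config.n_candidates})"]
    if config.hybrid:
        stages += ["keyword_search", "fuse_rankings"]
    if config.metadata_filter:
        stages.append("apply_filters")
    if config.rerank:
        stages.append("rerank")
    stages.append(f"return_top_{config.top_k}")
    return stages

SIMPLE = PipelineConfig()
FULL = PipelineConfig(rewrite_query=True, hybrid=True,
                      metadata_filter=True, rerank=True, n_candidates=200)

print(plan(SIMPLE))  # ['embed_query', 'vector_search(n=100)', 'return_top_10']
print(plan(FULL))
```

Keeping the configuration separate from the stage implementations makes it easier to compare variants, for example measuring whether adding a re-ranker improves relevance enough to justify its extra latency.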