Two-Stage Retrieval

Two-stage retrieval is an architecture that splits the search process into a fast first stage that retrieves a broad set of candidates, followed by a slower, more accurate second stage that re-ranks them. It combines the speed needed to search a large corpus with the precision needed to produce a high-quality final ordering.

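The first stage typically uses a bi-encoder with approximate-nearest-neighbour (ANN) search to pull a few dozen to a few hundred plausible candidates from millions of vectors. Because a bi-encoder embeds queries and documents independently, document vectors can be computed ahead of time and only the query has to be encoded at search time. This stage optimises for recall: making sure the genuinely relevant items are somewhere in the set. The second stage applies a more expensive model, usually a cross-encoder that reads the query and each candidate together, and optimises for precision by scoring and reordering those candidates.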
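The two stages can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not a production pipeline: `toy_embed` is a hypothetical stand-in for a bi-encoder (a bag-of-words vector that ignores word order), the brute-force dot product stands in for an ANN index, and `toy_pair_score` is a hypothetical stand-in for a cross-encoder (it sees query and document together and rewards an exact phrase match). All function names and the sample corpus are invented for the example.

```python
import numpy as np

def build_vocab(docs):
    """Map every token in the corpus to a vector dimension."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def toy_embed(text, vocab):
    """Hypothetical bi-encoder stand-in: L2-normalised bag-of-words vector."""
    vec = np.zeros(len(vocab))
    for token in text.lower().split():
        if token in vocab:  # tokens unseen in the corpus are ignored
            vec[vocab[token]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def toy_pair_score(query, doc):
    """Hypothetical cross-encoder stand-in: scores query and document jointly."""
    q_tokens = query.lower().split()
    d_tokens = doc.lower().split()
    if not q_tokens or not d_tokens:
        return 0.0
    overlap = sum(t in d_tokens for t in q_tokens) / len(q_tokens)
    # Word order matters here, unlike in the first stage.
    has_phrase = f" {' '.join(q_tokens)} " in f" {' '.join(d_tokens)} "
    return overlap + (1.0 if has_phrase else 0.0)

def first_stage(query, doc_vectors, vocab, k):
    """Recall-oriented stage: cheap vector similarity over the whole corpus.
    Brute force is used here; a real system would query an ANN index."""
    scores = doc_vectors @ toy_embed(query, vocab)
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist()

def second_stage(query, docs, candidate_ids, top_n):
    """Precision-oriented stage: expensive scoring, but only on k candidates."""
    ranked = sorted(candidate_ids,
                    key=lambda i: toy_pair_score(query, docs[i]),
                    reverse=True)
    return ranked[:top_n]

def two_stage_search(query, docs, doc_vectors, vocab, k=3, top_n=2):
    candidates = first_stage(query, doc_vectors, vocab, k)
    return second_stage(query, docs, candidates, top_n)

DOCS = [
    "man bites dog",
    "yesterday the dog bites man in the park",
    "cat sleeps on the sofa",
    "stock markets fell sharply on monday",
]
VOCAB = build_vocab(DOCS)
# Document vectors are computed once, ahead of query time.
DOC_VECTORS = np.stack([toy_embed(d, VOCAB) for d in DOCS])

print(first_stage("dog bites man", DOC_VECTORS, VOCAB, k=3))          # → [0, 1, 2]
print(two_stage_search("dog bites man", DOCS, DOC_VECTORS, VOCAB))    # → [1, 0]
```

In this toy corpus the first stage ranks "man bites dog" highest because the order-blind vectors match perfectly, while the second stage promotes the document that contains the query as an exact phrase. The first stage only has to get the right document into the candidate set; the second stage is responsible for the final order.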

This division resolves a fundamental tension: the most accurate relevance models are too slow to run across an entire database, while the fastest search methods are not precise enough on their own. By using each where it fits (fast retrieval for breadth, accurate re-ranking for the top results), two-stage retrieval delivers both scalability and quality, and it is a widely used architecture behind high-quality retrieval and retrieval-augmented generation (RAG) systems.
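A back-of-the-envelope cost comparison makes the tension concrete. The sketch below uses placeholder latencies that are assumptions chosen purely for illustration, not measurements of any real model or index; the function names are hypothetical as well.

```python
# ASSUMED placeholder figures, for illustration only.
PAIR_MS = 5.0          # assumed cost of one cross-encoder (query, document) score
FIRST_STAGE_MS = 10.0  # assumed cost of one ANN lookup over the whole corpus

def full_rerank_cost_ms(corpus_size, pair_ms=PAIR_MS):
    """Cost of scoring every document with the expensive model."""
    return corpus_size * pair_ms

def two_stage_cost_ms(candidates, pair_ms=PAIR_MS, first_stage_ms=FIRST_STAGE_MS):
    """Cost of one fast lookup plus expensive scoring of the candidates only."""
    return first_stage_ms + candidates * pair_ms

full = full_rerank_cost_ms(1_000_000)   # 5,000,000 ms, well over an hour per query
staged = two_stage_cost_ms(100)         # 510 ms per query
print(f"full re-rank: {full:.0f} ms, two-stage: {staged:.0f} ms")
```

Under these assumed numbers the expensive model's cost scales with the number of candidates rather than with the size of the corpus, which is what makes it affordable in the second stage.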