Designing a Retrieval Pipeline

A retrieval pipeline is the set of stages that turns a user query into a small, useful set of information for an AI system. A strong pipeline usually combines embedding, indexing, search, metadata filtering, fusion, and reranking, with query rewriting added when the original query is vague, conversational, or split across multiple intents. The main design challenge is balance: broader retrieval improves recall and answer quality, while more stages, larger candidate sets, and heavier rerankers increase latency and cost.

This guide explains how to assemble the core stages of a retrieval pipeline, where query rewriting and fusion fit, and how common design choices affect speed and quality. By the end, you should understand how to think about retrieval as a sequence of decisions rather than a single vector search call, and how to tune each stage for practical AI database applications.

What a Retrieval Pipeline Is Trying to Do

A retrieval pipeline exists because an AI application rarely needs every document in a dataset. It needs the right few pieces of information, in the right order, with enough surrounding context for a model or downstream system to use them safely. In a retrieval-augmented generation system, for example, the pipeline is responsible for finding grounding evidence before the language model produces an answer. In an enterprise search system, it may return ranked documents directly to the user.

The pipeline has two broad jobs. First, it must create a searchable representation of the data. That includes chunking content, generating embeddings, storing metadata, and building indexes that can support fast lookup. Second, it must process incoming queries. That includes understanding the query, searching one or more indexes, applying filters, merging candidate results, reranking them, and returning a final set of passages or records.

The important point is that retrieval is not one decision. A vector database can find semantically similar text, but it does not automatically know which filters matter, which query phrasing best captures user intent, or which candidate result is most useful for the final answer. Those choices belong to the pipeline design.

Once the goal is clear, the next question is how the pieces fit together. A good retrieval pipeline separates offline indexing work from online query-time work so the system can stay both accurate and fast.

The Core Stages of a Retrieval Pipeline

A retrieval pipeline usually has an ingestion side and a query side. The ingestion side prepares content before users search it. The query side decides what to retrieve at runtime. Keeping these responsibilities separate makes the system easier to reason about because some expensive work can happen once during indexing, while the time-sensitive work happens only when a query arrives.

Embedding

Embedding is the process of converting text, images, code, or other content into numeric vectors that represent meaning. In text retrieval, both stored content and user queries are embedded into the same vector space so the system can compare them by similarity. This is what allows a query such as “customer cancellation risk” to find passages about “churn” even when the exact words do not match.

Embedding quality depends on more than the model. It also depends on how content is chunked, whether titles and headings are included, whether tables are represented clearly, and whether the query has enough context. A weak chunking strategy can make a good embedding model look poor because the model is asked to represent fragments that are too short, too broad, or missing important context.

Indexing

Indexing stores the searchable representations of content. A retrieval system may keep a vector index for semantic similarity, an inverted index for keyword search, and structured fields for metadata filters. The index is where document IDs, chunk text, embeddings, timestamps, access permissions, document types, source labels, and other fields become available for search.

Index design affects both relevance and performance. Approximate nearest neighbor indexes such as HNSW are commonly used to make vector search fast at scale, but they introduce tuning decisions around recall, memory, and latency. Keyword indexes are often better at exact terms, identifiers, acronyms, error codes, and numerical language. Metadata fields make it possible to narrow the search space before or after retrieval.

Search

Search is the candidate generation stage. Its job is to retrieve a broad enough set of possible matches that later stages have good material to work with. Search can be vector-only, keyword-only, or hybrid. In many AI database systems, hybrid search is a strong default because it combines semantic matching with lexical precision.

A search stage should usually return more candidates than the final answer needs. If the final system needs five passages, the retrieval stage might fetch 30, 50, or 100 candidates before reranking. The right number depends on latency budget, corpus size, query complexity, and how strong the reranker is. Fetching too few candidates can miss the answer. Fetching too many can slow the system without improving quality.

Filtering

Filtering uses metadata to include or exclude content based on structured conditions. Common filters include user permissions, tenant IDs, document type, language, date range, geography, product area, and source reliability. Filtering is especially important in AI applications because a semantically similar result can still be wrong if it comes from the wrong customer account, time period, jurisdiction, or access boundary.

Filters can be applied before retrieval, during retrieval, or after retrieval. Pre-filtering can improve speed and safety by reducing the search space, but it can also hurt recall if the filter is too strict or if metadata is incomplete. Post-filtering can preserve broader semantic recall, but it may waste work and leave too few results after the filter removes candidates. Many production systems use a mix: mandatory filters such as permissions are applied early, while softer ranking preferences are handled later.

Reranking

Reranking takes an initial candidate set and scores it more carefully. A first-stage retriever may be optimized for speed, while a reranker can spend more computation comparing the query against each candidate in greater detail. Cross-encoder rerankers are a common example: instead of embedding the query and passage separately, they evaluate the query-passage pair together, which often improves precision but adds latency.

Reranking is most useful when the first-stage retriever returns a mixed candidate set that contains good answers alongside partial or loosely related matches. It can also help when results come from multiple retrieval methods, such as vector search and keyword search. The reranker acts as a final relevance judge before the system passes results to an LLM or displays them to a user.

These stages form the basic skeleton, but many real queries need help before they reach the search index. That is where query rewriting, decomposition, and expansion enter the design.

The Core Stages of a Retrieval Pipeline: Embedding, Indexing, Search, Filtering, Reranking. — Five stages turn a user query into ranked, grounded context.

Where Query Rewriting Fits

Query rewriting belongs before retrieval because it changes what the system searches for. The goal is not to invent a new user intent. The goal is to express the same intent in a form that the retrieval system can handle more effectively. This matters most when users ask short, vague, conversational, or follow-up questions that depend on previous context.

For example, in a chat application, a user might ask, “What about pricing?” after several earlier messages about a specific product. The search index cannot reliably interpret that phrase on its own. A query rewriting step might convert it into a standalone query such as “pricing details for the product discussed in the previous turn.” That rewritten query gives both keyword and vector retrieval more useful input.

There are several common rewriting patterns:

Standalone rewriting turns follow-up questions into complete queries that include missing context.
Query expansion adds related terms, synonyms, or domain-specific phrasing to improve recall.
Query decomposition breaks a complex question into smaller subqueries that can be searched separately.
Hypothetical answer generation creates a likely answer-like passage and embeds that passage to find similar content.

Rewriting improves quality when the original query is under-specified or poorly aligned with the indexed content. It can hurt quality when it changes the meaning, adds unsupported assumptions, or expands a precise query into something too broad. For that reason, many systems keep the original query alongside the rewritten query and search with both, rather than replacing the original entirely.

Once multiple query versions or multiple retrieval methods are in play, the system needs a way to combine the results. That is the role of fusion.

Where Fusion Fits

Fusion is the process of merging result lists from multiple searches into one candidate set. It usually happens after candidate retrieval and before reranking. A pipeline might run keyword search and vector search in parallel, retrieve candidates from each, and then use a fusion method to combine them. It might also run several rewritten queries and fuse the results from each query version.

Fusion is useful because different retrieval methods fail in different ways. Vector search can find conceptually related content but may miss exact identifiers, rare terms, or numbers. Keyword search can match exact language but may miss paraphrases. Query expansion can improve recall but may introduce noise. Fusion gives the pipeline a controlled way to benefit from several signals without choosing only one.

One widely used fusion approach is reciprocal rank fusion, often abbreviated as RRF. Instead of trying to compare raw scores from different retrievers, RRF uses the rank position of each result in each list. A document that appears near the top of several result lists receives a stronger combined score than a document that appears low in only one list. This makes fusion practical when vector scores and keyword scores are not directly comparable.

A common query-time order looks like this:

Receive the user query and conversation context.
Rewrite, expand, or decompose the query when needed.
Run vector, keyword, or hybrid searches, often in parallel.
Apply mandatory filters such as permissions and tenant boundaries.
Fuse results from multiple searches or query variants.
Rerank the fused candidate set.
Return the final passages, documents, or records.

Fusion gives the pipeline more candidate diversity, but it also makes tuning more important. The next design question is how each stage affects speed and quality, because every added stage has a cost.

How Design Choices Affect Speed and Quality

Retrieval design is a series of tradeoffs between recall, precision, latency, cost, and operational complexity. Recall means the answer is present somewhere in the retrieved candidate set. Precision means the top results are actually useful. A fast pipeline with poor recall produces confident-looking failures. A slow pipeline with too many heavyweight stages may be accurate but impractical for interactive use.

Chunk Size and Content Representation

Chunk size affects both retrieval quality and downstream usefulness. Small chunks can be precise because they focus on one idea, but they may lack enough context for the model to understand them. Large chunks preserve context but can blur multiple topics together, making embeddings less specific. Many teams tune chunking by evaluating real queries rather than relying on a universal chunk size.

Content representation also matters. Titles, section headings, table summaries, and source metadata often improve retrieval because they give each chunk more context. For structured or tabular content, a plain text dump may not be enough. The pipeline may need table-aware chunking, row-level metadata, or separate retrieval paths for structured fields.

Vector-Only Versus Hybrid Search

Vector-only search is often fast and effective for semantic questions, especially when users do not know the exact wording used in the source content. However, it can struggle with exact identifiers, product codes, quoted phrases, legal citations, dates, and numerical constraints. Keyword search handles those cases better because it rewards exact term overlap.

Hybrid search usually improves quality when the corpus contains a mix of conceptual language and exact terms. The cost is extra tuning. The system must decide how many candidates to retrieve from each method, how to fuse the results, and whether keyword or vector results should dominate in certain query types. A balanced hybrid setup is often a good starting point, then the weights and candidate counts can be adjusted using evaluation data.

Filtering Strategy

Filtering improves quality when it removes content that should never be considered. Permission filters, tenant filters, and source-type filters are examples of constraints that usually belong early in the pipeline. They reduce risk and often improve speed by narrowing the search space.

Other filters require more care. A strict date filter might remove the best background explanation. A document-type filter might exclude a relevant policy because it was mislabeled. If metadata quality is uneven, aggressive pre-filtering can damage recall. In those cases, it may be better to retrieve broadly, then apply filters or boosts during ranking.

Candidate Count

The number of candidates retrieved before reranking has a direct impact on both speed and quality. A larger candidate pool gives the reranker more chances to find the best evidence, but every additional candidate must be scored, stored, or passed through downstream processing. If the reranker is expensive, candidate count becomes one of the most important latency controls in the pipeline.

A practical approach is to start with a moderate candidate set, measure whether the correct answer appears in the candidates, and then tune from there. If recall is low, increase the candidate count or improve query rewriting and hybrid search. If recall is high but final answers are poor, the problem may be reranking, chunk quality, or answer synthesis rather than first-stage retrieval.

Reranker Strength

A stronger reranker can significantly improve the final ordering of results, especially when the first-stage retriever is broad. The tradeoff is latency and cost. A lightweight reranker may be good enough for simple search experiences, while a more powerful cross-encoder may be worthwhile for high-value tasks where relevance matters more than response time.

The best reranking strategy depends on the product experience. A customer support assistant may need answers in a few seconds, so it might rerank only the top 30 to 50 candidates. A research workflow may tolerate more latency if the final evidence quality is higher. The pipeline should match the user’s patience and the cost of being wrong.

These tradeoffs are easier to manage when the pipeline is designed as a measurable system. The goal is not to add every retrieval technique. The goal is to add the techniques that improve real retrieval outcomes for the corpus and users in front of you.

A Practical Pipeline Pattern

A practical retrieval pipeline starts simple and adds sophistication where evaluation shows a gap. The simplest useful version often includes cleaned content, embeddings, a vector index, metadata filters, and a top-k search. That may be enough for a narrow corpus with straightforward queries. As the corpus grows or the queries become more varied, the pipeline usually benefits from keyword search, hybrid fusion, query rewriting, and reranking.

For many AI database applications, a strong production-oriented pattern looks like this:

Ingest and normalize content. Extract text, preserve titles and headings, attach source metadata, and remove duplicated or low-quality content.
Chunk the content thoughtfully. Use chunk sizes and overlap that preserve meaning without mixing too many unrelated topics.
Create embeddings and indexes. Store vector embeddings, keyword-searchable text, and filterable metadata in the AI database or connected retrieval layer.
Rewrite only when useful. Turn follow-up or vague queries into standalone queries, but preserve the original query for retrieval and auditing.
Run hybrid candidate retrieval. Search vector and keyword indexes, with mandatory filters applied early.
Fuse candidate lists. Merge results from multiple query variants or retrieval methods into one candidate pool.
Rerank the candidates. Score the fused candidates against the user’s actual query and keep only the most relevant results.
Return compact context. Send the final passages or records with source metadata so the answer can be grounded and traceable.

This structure keeps the pipeline understandable. Each stage has a clear purpose: embedding represents meaning, indexing makes content searchable, search creates candidates, filters enforce constraints, fusion combines evidence, and reranking improves final precision. When something goes wrong, the team can inspect the stage most likely responsible instead of treating retrieval as a black box.

The pattern also gives teams a sensible way to tune over time. Instead of rebuilding the whole system, they can adjust chunking, candidate counts, filters, fusion rules, or reranking depth based on measured results.

A Practical Pipeline Pattern: 8-step diagram — Ingest and normalize, Chunk thoughtfully, Create embeddings and indexes, Rewrite when useful, Run hybrid retrieval, Fuse candidate lists, Rerank the candidates, Return compact context. — Start simple, then add sophistication where evaluation shows a gap.

How to Evaluate a Retrieval Pipeline

Evaluation is what turns retrieval design from guesswork into engineering. A pipeline may feel good in demos but still fail on edge cases, ambiguous questions, or exact-match queries. The only reliable way to improve it is to test it against representative queries and inspect where the correct evidence appears in the retrieved results.

Useful retrieval metrics include recall at k, mean reciprocal rank, precision at k, and normalized discounted cumulative gain. Recall at k answers a simple question: did the correct evidence appear in the first k results? Mean reciprocal rank rewards systems that place the first correct result higher. Precision at k measures how many of the returned results are useful. Normalized discounted cumulative gain is helpful when results have graded relevance rather than a simple right-or-wrong label.

Metrics should be paired with human review. Human reviewers can identify problems that pure metrics miss, such as stale content, duplicated chunks, misleading context, or results that are technically related but not useful. For RAG systems, teams should also evaluate whether the final generated answer uses the retrieved evidence accurately.

Good evaluation often reveals that different query types need different retrieval behavior. A broad conceptual question may benefit from vector search and expansion. A precise policy question may need keyword matching and strict metadata filters. A multi-part question may need decomposition. The pipeline does not have to treat all queries the same if routing improves results.

After evaluation is in place, optimization becomes more disciplined. The team can make one change at a time, measure its effect, and keep the changes that improve both user experience and system reliability.

Common Mistakes to Avoid

Retrieval pipelines often fail for predictable reasons. The most common mistake is assuming that vector search alone will solve every retrieval problem. Semantic similarity is powerful, but it does not replace exact matching, metadata quality, permissions, or careful ranking. A passage can be semantically close and still be the wrong result for the user’s situation.

Another mistake is adding advanced stages before the basics are working. Query rewriting, fusion, and reranking can improve a strong baseline, but they can also hide underlying problems. If the content is poorly chunked, metadata is unreliable, or the index omits important fields, advanced retrieval stages may only rearrange weak candidates.

Teams also sometimes optimize only for average latency. Average speed matters, but retrieval systems often feel slow because of tail latency: the worst queries trigger too many searches, too many rewritten variants, or too much reranking. A practical pipeline should set limits on candidate counts, rewritten queries, and reranking depth so complex requests do not overwhelm the system.

Finally, many teams skip source inspection. A retrieval pipeline should return enough metadata for developers and users to understand where results came from. Without source IDs, timestamps, document titles, and ranking explanations where possible, it becomes much harder to debug poor answers or build user trust.

Avoiding these mistakes makes the pipeline easier to operate. It also prepares the system for the most important design habit: tuning retrieval based on real use rather than assumptions.

FAQs

1. What is the most important stage in a retrieval pipeline?

There is no single most important stage for every system. In practice, chunking and indexing determine whether good candidates are available, while search, fusion, and reranking determine whether those candidates reach the top. A weak ingestion process cannot be fully fixed at query time, and a weak ranking process can bury good evidence even when it was retrieved.

2. Should every retrieval pipeline use hybrid search?

Not every pipeline needs hybrid search, but many benefit from it. Vector search is strong for semantic similarity, while keyword search is strong for exact terms, names, codes, and numbers. If users ask both conceptual and precise questions, hybrid search is usually worth testing. If the corpus and query patterns are narrow, vector-only or keyword-only retrieval may be enough.

3. Where should metadata filters be applied?

Mandatory filters such as permissions, tenant boundaries, and compliance constraints should usually be applied as early as possible. Softer filters, such as preferred document type or freshness, may work better as ranking signals or post-retrieval adjustments. The right choice depends on metadata quality and whether strict filtering risks removing relevant content.

4. When should query rewriting be used?

Query rewriting is useful when the user query is vague, conversational, missing context, or too complex for a single search. It is especially helpful for follow-up questions in chat and for multi-part questions that need decomposition. It should be used carefully because rewriting can reduce quality if it changes the user’s intent or expands a precise query too broadly.

5. What is the difference between fusion and reranking?

Fusion merges candidate lists from multiple searches or query variants into one list. Reranking then scores the resulting candidates more carefully against the user’s query. Fusion is mainly about combining signals and improving candidate coverage, while reranking is mainly about improving the final order and precision of the results.

6. How many candidates should be retrieved before reranking?

The right number depends on the corpus, latency budget, and reranker cost. Many systems start with a moderate candidate pool and tune from there. If the correct evidence is often missing, the candidate set may be too small or the search strategy may need improvement. If the correct evidence is present but ranked poorly, reranking or fusion may need attention.

Takeaway

A retrieval pipeline is best understood as a sequence of practical choices: how to represent content, how to index it, how to search it, how to filter it, how to combine result lists, and how to rerank the final candidates. This guidance is most useful for teams building AI database applications, RAG systems, enterprise search, support assistants, research tools, or knowledge retrieval workflows where both relevance and latency matter. A well-designed pipeline does not simply retrieve more content; it retrieves the right content quickly enough for the user experience and accurately enough for the application’s risk level.

Watch this video to learn more