A bi-encoder is an embedding architecture that encodes a query and a document independently, producing a separate vector for each, and then measures their similarity with a simple distance calculation. Because the two pieces never interact during encoding, the document vectors can be computed once, stored in a vector database, and reused across every future query.
This independence is what makes bi-encoders fast enough for retrieval at scale. When a query arrives, you only need to embed the query and run an approximate nearest-neighbour search against the pre-computed document vectors — no per-document model inference is required at query time. This is the standard architecture behind dense vector search.
The trade-off is precision. Because the query and document are encoded in isolation, a bi-encoder cannot model fine-grained interactions between specific query and document terms. This is why bi-encoders are often paired with a cross-encoder in a two-stage pipeline: the bi-encoder retrieves a broad candidate set quickly, then the slower but more accurate cross-encoder re-ranks the top candidates.