Sharding

Sharding is the practice of splitting a vector index across multiple machines, with each machine — each shard — holding a portion of the total vectors. Queries are sent to all shards in parallel, and their partial results are merged and re-ranked before being returned, allowing the system to scale beyond the capacity of any single machine.
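The query path described above (fan out, search each shard, merge) can be illustrated with a minimal sketch. It assumes in-memory dictionaries as shards and a brute-force squared-L2 scan standing in for whatever index a real shard would use; the names `search_shard` and `search_all` are hypothetical and not taken from any particular system.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_shard(shard, query, k):
    """Return the k nearest (distance, vector_id) pairs within one shard.

    Assumption: a brute-force scan stands in for the shard's real index.
    """
    scored = ((l2_squared(vec, query), vec_id) for vec_id, vec in shard.items())
    return heapq.nsmallest(k, scored)


def search_all(shards, query, k):
    """Send the query to every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, query, k), shards))
    # Each shard contributes at most k candidates; the global top k is
    # re-ranked from the union of those candidates.
    return heapq.nsmallest(k, (hit for part in partials for hit in part))


# Three toy shards, each holding a portion of the vectors.
shards = [
    {"a": [0.0, 0.0], "b": [5.0, 5.0]},
    {"c": [1.0, 0.0], "d": [9.0, 9.0]},
    {"e": [0.0, 2.0]},
]
top = search_all(shards, query=[0.0, 0.0], k=3)
```

Asking every shard for k results, rather than k divided by the number of shards, matters here: the true nearest neighbours may all live on the same shard, so each shard has to return a full candidate list for the merge to be exact with respect to the per-shard results.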

Sharding becomes necessary when a dataset is too large to fit on one node, or when searching the full dataset on one node is too slow to keep up with query volume. By distributing the data, sharding scales horizontally: adding shards increases how many vectors the system can hold, and because each shard searches only its own portion, the work of each query is spread across machines and each machine does less of it.
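A rough back-of-the-envelope calculation shows when a dataset stops fitting on one node. The sketch below assumes uncompressed vectors with 4-byte (float32) components; the `overhead` multiplier for index structures and metadata is a placeholder value chosen for illustration, not a measured figure, and the function name `shards_needed` is hypothetical.

```python
import math


def shards_needed(num_vectors, dim, node_memory_bytes,
                  bytes_per_component=4, overhead=1.5):
    """Estimate how many shards are needed to hold a dataset in memory.

    Assumptions: raw vectors stored uncompressed (4 bytes per component by
    default) and an illustrative `overhead` multiplier for index structures.
    """
    raw_bytes = num_vectors * dim * bytes_per_component
    total_bytes = raw_bytes * overhead
    return math.ceil(total_bytes / node_memory_bytes)


# Example: one billion 768-dimensional vectors on nodes with 64 GiB of memory.
estimate = shards_needed(1_000_000_000, 768, 64 * 2**30)
```

The estimate covers capacity only. Compression, quantization, or disk-backed indexes would change the numbers considerably.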

A key design decision is how vectors are assigned to shards. Random assignment spreads load evenly but scatters related vectors, so every query must fan out to all shards. Partitioning by an attribute such as tenant or category can improve query locality, because a query restricted to one attribute value can be routed to only the shards that hold it, but it risks unbalanced shards when some values account for far more vectors than others. Sharding is frequently combined with replication, in which each shard is copied to several machines for resilience and read scaling, and together they form the backbone of large-scale, distributed vector search. Managed services often handle sharding transparently.
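The trade-off between the two assignment strategies can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular system: a SHA-256 hash of the vector ID stands in for random assignment (it is deterministic but spreads IDs roughly uniformly), the tenant name is used as the partitioning attribute, and the function names and the skewed toy dataset are hypothetical.

```python
import hashlib
from collections import Counter


def shard_by_hash(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash (a stand-in for random assignment)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_by_attribute(tenant: str, num_shards: int) -> int:
    """Keep all vectors of one tenant on the same shard by hashing the tenant name."""
    return shard_by_hash(tenant, num_shards)


# Hypothetical skewed dataset: one tenant owns 900 of 1000 vectors.
records = [
    (f"vec-{i}", "tenant-big" if i < 900 else f"tenant-{i}")
    for i in range(1000)
]

NUM_SHARDS = 4
hash_load = Counter(shard_by_hash(vec_id, NUM_SHARDS) for vec_id, _ in records)
tenant_load = Counter(shard_by_attribute(tenant, NUM_SHARDS) for _, tenant in records)

print("per-shard load, hashed by vector ID:", sorted(hash_load.items()))
print("per-shard load, partitioned by tenant:", sorted(tenant_load.items()))
```

With ID hashing the load comes out close to even, but a query for one tenant still has to visit every shard. With tenant partitioning such a query touches a single shard, yet the shard holding the large tenant carries at least 900 of the 1000 vectors.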