LSH (Locality Sensitive Hashing)

LSH — Locality Sensitive Hashing — is a family of techniques for approximate similarity search that hash vectors so that similar ones are likely to land in the same bucket, while dissimilar ones land in different buckets. Unlike ordinary hashing, which scatters inputs to avoid collisions, LSH is designed to make similar items collide on purpose.

This property turns search into a lookup. To find vectors similar to a query, you hash the query and examine only the items sharing its bucket, rather than comparing against the whole dataset. This reduces search from a linear scan to something sub-linear, with the probability of finding true neighbours tunable by using multiple hash functions and tables.

Different LSH schemes target different metrics — random projections for cosine similarity, MinHash for Jaccard similarity, and others for Euclidean distance. Compared with graph methods like HNSW, LSH generally offers lower recall for a given query budget, but it has advantages in certain settings: it handles streaming data well, supports probabilistic guarantees, and requires no heavy build phase, making it useful for specific large-scale or theoretical applications.