Jaccard Similarity

A similarity metric measuring the overlap between two sets, defined as the size of their intersection divided by the size of their union.

Jaccard similarity measures how alike two sets are by dividing the size of their intersection by the size of their union: J(A, B) = |A ∩ B| / |A ∪ B|. The score is 1 when the two sets are identical and 0 when they have no elements in common. It captures overlap rather than geometric distance.
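As a minimal sketch of the definition, the intersection-over-union ratio can be computed directly with Python sets. The function name `jaccard` and the sample sentences are illustrative only, and returning 1.0 for two empty sets is an assumed convention, since the ratio is undefined in that case.

```python
def jaccard(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


# Illustrative inputs: the sets of words in two short sentences.
doc1 = "the quick brown fox".split()
doc2 = "the lazy brown dog".split()

# Shared words: {"the", "brown"}; six distinct words overall.
score = jaccard(doc1, doc2)
```

Here the two sentences share 2 of 6 distinct words, so the score is 2/6, or about 0.33.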

This makes it the natural metric for set-based or binary data rather than dense embeddings. It is well suited to comparing things like the set of words in two documents, the tags on two items, or the features two users have in common. Where cosine similarity and Euclidean distance compare positions in continuous space, Jaccard compares membership.

In retrieval systems, Jaccard similarity underpins techniques such as MinHash, a form of locality-sensitive hashing that estimates Jaccard similarity from compact signatures instead of comparing full sets, which makes tasks like near-duplicate detection practical across very large collections. While most semantic vector search uses cosine or dot-product metrics, Jaccard remains important wherever data is best represented as sets and overlap is the right notion of similarity.
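The MinHash idea can be sketched in a few lines. This is a simplified illustration, not a production implementation: the helper names (`_seeded_hash`, `minhash_signature`, `estimate_jaccard`), the use of seeded BLAKE2b as the hash family, and the signature length of 256 are all assumptions made for the example. The fraction of signature positions on which two sets agree approximates their Jaccard similarity.

```python
import hashlib


def _seeded_hash(item, seed):
    """Hash an item to a 64-bit integer, varying the result by seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=256):
    """For each seeded hash function, keep the minimum hash over the set."""
    items = set(items)
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Illustrative sets with a true Jaccard similarity of 50 / 150.
set_a = {f"item{i}" for i in range(0, 100)}
set_b = {f"item{i}" for i in range(50, 150)}

estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
```

The estimate is noisy and tightens as the signature grows. At scale, the benefit is that each set is hashed once and only the fixed-length signatures are stored and compared.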