Semantic Similarity

Semantic similarity is a measure of how alike two pieces of content are in meaning, computed as the distance or angle between their embedding vectors. It turns the intuitive notion that two texts mean roughly the same thing into a concrete number that a database can calculate and rank by.

The measurement relies on embeddings being constructed so that meaning maps to geometry. When an embedding model places semantically related content close together in vector space, the closeness of two vectors — usually quantified with cosine similarity or a related metric — becomes a direct proxy for how related their meanings are. High similarity means the items are conceptually close; low similarity means they are unrelated.

Semantic similarity is the foundation on which semantic search, recommendations, clustering, deduplication, and retrieval-augmented generation all rest. Every time a vector database returns the nearest neighbours to a query, it is ranking by semantic similarity. The quality of that ranking depends entirely on the embedding model, since a model that captures meaning well produces similarity scores that genuinely reflect relatedness.