TF-IDF

TF-IDF (Term Frequency–Inverse Document Frequency) is a classic statistical measure of how important a word is to a particular document within a larger collection. It scores a term by multiplying two factors: how often the term appears in the document (the term frequency), and how rare the term is across all documents (the inverse document frequency).

The intuition is that a word matters to a document if it appears frequently there but seldom elsewhere. Common words like "the" appear everywhere and carry little distinguishing information, so their inverse-document-frequency weight is low; a specialised term that shows up often in one document but rarely in the corpus gets a high weight, marking it as characteristic of that document.
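The intuition above can be made concrete with a minimal sketch. It assumes one common textbook formulation: term frequency as the term's count divided by the document length, and inverse document frequency as log(N / df), where N is the number of documents and df is the number of documents containing the term. Many variants exist (smoothed or differently normalised), so the function name tf_idf, the toy corpus, and the exact weighting here are illustrative assumptions rather than a canonical definition.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document.

    Assumed weighting: (count / doc length) * log(N / df), unsmoothed.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus, already split into lowercase tokens (hypothetical data).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the quantum qubit entangles the other qubit".split(),
]
scores = tf_idf(docs)

# "the" occurs in every document, so its IDF is log(3 / 3) = 0.
print(scores[2]["the"])
# "qubit" is frequent in the third document and absent elsewhere,
# so it receives the highest weight in that document.
print(max(scores[2], key=scores[2].get))
```

Under this weighting, a word that appears in every document scores exactly zero, while a term confined to a single document is boosted by the full log(N) factor.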

TF-IDF underlies traditional keyword search and produces sparse vector representations of documents. While the more refined BM25 has largely superseded it for ranking, TF-IDF remains foundational to understanding sparse retrieval and still appears in many systems. In the context of vector databases, it represents the lexical, keyword-matching tradition that hybrid search combines with dense semantic embeddings to get the best of both.
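As a rough illustration of the sparse, keyword-matching side described above, the sketch below stores each document as a dictionary holding only the terms it contains, then ranks documents by the dot product between a query vector and each document vector. The helper names build_index and search, the unsmoothed log(N / df) weighting, and the toy corpus are all assumptions made for this example; production systems typically use inverted indexes and a ranking function such as BM25.

```python
import math
from collections import Counter


def build_index(docs):
    """Return (idf, vectors): IDF per term and one sparse vector per document."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Only terms present in the document get an entry (sparse vector).
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in counts.items()})
    return idf, vectors


def search(query, idf, vectors):
    """Rank document indices by dot product with the query's TF-IDF vector."""
    # Query terms never seen in the corpus are ignored.
    q_counts = Counter(t for t in query if t in idf)
    q_vec = {t: c * idf[t] for t, c in q_counts.items()}
    scored = []
    for i, vec in enumerate(vectors):
        score = sum(w * vec.get(t, 0.0) for t, w in q_vec.items())
        scored.append((score, i))
    # Highest score first; ties broken by document index; drop non-matches.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in ranked if score > 0]


# Toy corpus of pre-tokenised documents (hypothetical data).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the vector index stores the sparse vector".split(),
]
idf, vectors = build_index(docs)

print(search("sparse vector the".split(), idf, vectors))
print(search("cat".split(), idf, vectors))
```

Only exact term overlap contributes to the score here, which is the behaviour hybrid search complements with dense embeddings: a query that uses a synonym absent from the documents would match nothing in this sketch.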