Quantisation

Quantisation is a family of compression techniques that reduce the memory needed to store vectors by representing them with less numerical precision. An embedding is typically stored with one 32-bit floating-point number per dimension; quantisation replaces these with lower-precision approximations, such as 8-bit integers, so each vector takes a fraction of the original space.
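As a rough illustration of swapping 32-bit floats for lower-precision values, the sketch below maps each dimension of a vector to a single byte and back again. The helper names (quantise, dequantise) and the sample vector are hypothetical, and the value range is taken from the vector itself for simplicity; a real system would more likely calibrate the range over the whole dataset.

```python
import struct

def quantise(vec, lo, hi):
    """Map each float in [lo, hi] to an integer code in 0..255 (one byte)."""
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if all values are equal
    codes = bytes(round((min(max(x, lo), hi) - lo) / scale) for x in vec)
    return codes, scale

def dequantise(codes, lo, scale):
    """Recover an approximation of the original floats from the byte codes."""
    return [lo + c * scale for c in codes]

# Hypothetical 6-dimensional embedding.
vec = [0.12, -0.58, 0.93, -1.0, 1.0, 0.0]
lo, hi = min(vec), max(vec)

codes, scale = quantise(vec, lo, hi)
approx = dequantise(codes, lo, scale)

# Worst-case rounding error is half of one quantisation step.
max_err = max(abs(a - b) for a, b in zip(vec, approx))

# Size of the same vector stored as float32: 4 bytes per dimension.
float32_bytes = len(struct.pack(f"{len(vec)}f", *vec))
```

Each dimension now occupies one byte instead of four, and the reconstruction error per dimension is bounded by half a quantisation step (scale / 2).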

The motivation is cost. High-dimensional vectors consume large amounts of memory, and holding billions of them in RAM is expensive. Quantisation trades a small, controlled loss of accuracy for a much smaller footprint: typically about four-fold when 32-bit floats are stored as 8-bit integers under scalar quantisation, and considerably more with product quantisation. Well-tuned schemes often lose only a percent or two of recall while shrinking memory many-fold, which is usually an excellent bargain at scale.
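The arithmetic behind these savings can be sketched as follows. The collection size, the dimensionality, and the product-quantisation layout (96 one-byte codes per vector) are assumptions chosen only for illustration, and index overhead and codebook storage are ignored.

```python
def footprint_bytes(num_vectors: int, dims: int, bits_per_dim: int) -> float:
    """Raw storage for the vectors alone, ignoring any index overhead."""
    return num_vectors * dims * bits_per_dim / 8

# Hypothetical collection: one billion 768-dimensional embeddings.
N, D = 1_000_000_000, 768

full = footprint_bytes(N, D, 32)  # 32-bit floats
int8 = footprint_bytes(N, D, 8)   # scalar quantisation to 8-bit integers

# Product quantisation, assuming 96 sub-vectors with a one-byte code each.
pq = N * 96

print(f"float32: {full / 1e12:.3f} TB")
print(f"int8:    {int8 / 1e12:.3f} TB ({full / int8:.0f}x smaller)")
print(f"PQ:      {pq / 1e12:.3f} TB ({full / pq:.0f}x smaller)")
```

Under these assumptions the full-precision vectors need about 3 TB, the 8-bit version about 0.77 TB, and the product-quantised codes under 0.1 TB.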

The main forms are scalar quantisation, which lowers the precision of each dimension independently, and product quantisation, which splits each vector into sub-vectors and replaces each one with the index of its nearest entry in a learned codebook, giving much higher compression. Many vector databases can apply quantisation automatically, and some keep the full-precision vectors on disk to re-score the top candidates, which recovers most of the lost accuracy while the bulk of the search still runs on the compact in-memory codes.
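A minimal sketch of product quantisation, assuming a plain k-means routine to learn each codebook and random Gaussian data as a stand-in for real embeddings. All function names and parameters here (train, encode, decode, m, k) are hypothetical and not taken from any particular library; production implementations are considerably more optimised.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

def kmeans(points, k, iters, rng):
    """Basic Lloyd's algorithm; returns k centroids (the codebook)."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(p, centroids)].append(p)
        for i, bucket in enumerate(buckets):
            if bucket:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def split(vec, m):
    """Cut vec into m equal sub-vectors (assumes len(vec) is divisible by m)."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]

def train(vectors, m, k, iters=10, seed=0):
    """Learn one codebook of k centroids for each of the m sub-spaces."""
    rng = random.Random(seed)
    parts = [split(v, m) for v in vectors]
    return [kmeans([p[j] for p in parts], k, iters, rng) for j in range(m)]

def encode(vec, codebooks):
    """Replace each sub-vector with the index of its nearest centroid (k <= 256)."""
    subs = split(vec, len(codebooks))
    return bytes(nearest(sub, cb) for sub, cb in zip(subs, codebooks))

def decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    return [x for c, cb in zip(codes, codebooks) for x in cb[c]]

# Stand-in data: 200 random 8-dimensional vectors.
rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]

codebooks = train(data, m=4, k=16)
codes = encode(data[0], codebooks)   # 4 bytes instead of 8 floats
approx = decode(codes, codebooks)    # lossy reconstruction
```

In this toy setup an 8-dimensional vector that would occupy 32 bytes as 32-bit floats is stored as 4 one-byte codes, plus the codebooks shared by the whole collection. A re-scoring step of the kind described above would search over the codes first and then recompute exact distances for the top candidates from the full-precision vectors.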