Multimodal Search

Searching across different data types such as text, images, audio, and video within a single shared embedding space, so a query in one modality can retrieve results in another.

Multimodal search is the ability to search across different types of content — text, images, audio, video — within a single system, where a query in one modality can retrieve results in another. You might search a photo library with words, find products by uploading a picture, or locate a video clip by describing it.

This is made possible by multimodal embeddings, which place content of different kinds into one shared vector space. Once an image and a sentence can be represented as comparable vectors, searching across modalities becomes the same nearest-neighbour operation as searching within a single modality. The vector database does not need to know whether a vector came from text or an image; it simply finds the closest matches.
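
As a minimal sketch of that idea, the example below runs one nearest-neighbour search over a small index holding vectors from several modalities. The three-dimensional vectors, the item names, and the `search` helper are all hypothetical stand-ins: in practice the vectors would come from a multimodal embedding model and the index would live in a vector database. Cosine similarity is used here as one common choice of distance measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy index. The vectors are hand-written placeholders for
# what a multimodal embedding model would produce for each item.
INDEX = [
    {"id": "photo_beach.jpg", "modality": "image", "vector": [0.9, 0.1, 0.0]},
    {"id": "caption_beach.txt", "modality": "text", "vector": [0.8, 0.2, 0.1]},
    {"id": "clip_city.mp4", "modality": "video", "vector": [0.1, 0.9, 0.2]},
    {"id": "podcast_finance.mp3", "modality": "audio", "vector": [0.0, 0.2, 0.9]},
]

def search(query_vector, index, k=2):
    """Return the k items closest to the query, regardless of modality."""
    scored = [
        (cosine_similarity(query_vector, item["vector"]), item)
        for item in index
    ]
    # Sort on the score only, so ties never fall back to comparing dicts.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(item["id"], item["modality"], score) for score, item in scored[:k]]

# Placeholder for the embedding of a text query such as "sunny beach".
query = [1.0, 0.0, 0.0]
results = search(query, INDEX)

for item_id, modality, score in results:
    print(f"{item_id} ({modality}): {score:.3f}")
```

Note that `search` never inspects the `modality` field when ranking; it is carried along only so the caller can see that a text-derived query vector returned both an image and a text item.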

Multimodal search opens up applications that pure text search cannot serve: visual discovery in e-commerce, searching media archives that lack good text labels, content moderation across formats, and richer assistants that can reason over images and audio alongside text. The main requirements are a capable multimodal embedding model and a database able to store and search the resulting vectors, sometimes managing multiple modalities side by side.