Multimodal Embedding

A multimodal embedding maps different types of data (text, images, audio, video) into a single shared vector space, so that content of different kinds can be compared directly. A picture of a beach and the phrase "a sandy shoreline" end up as nearby vectors, despite being entirely different media.

This shared representation is what makes cross-modal search possible. Because all modalities live in one space, you can search images with text, find videos similar to an image, or match audio to a description, all using an ordinary vector similarity measure such as cosine similarity. Models like CLIP popularized this approach for images and text, and successors have extended it to more modalities.
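As a rough illustration of text-to-image search in a shared space, the sketch below ranks a few image vectors against a text query vector by cosine similarity. The vectors are hand-written, hypothetical stand-ins for what an embedding model would produce, and the names cosine_similarity, search, image_index, and text_query are illustrative rather than taken from any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, top_k=2):
    """Return the names of the top_k indexed items most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

# Assumption: toy 3-dimensional vectors written by hand.
# A real multimodal model would output these, typically with hundreds of dimensions.
image_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}

# Stand-in for the text embedding of "a sandy shoreline".
text_query = [0.8, 0.2, 0.1]

results = search(text_query, image_index)
print(results)
```

The search function never needs to know which modality a vector came from. That is the practical consequence of the shared space: the same ranking code would serve an image query against a text index.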

The technical achievement is aligning fundamentally different kinds of data so that semantic correspondence becomes geometric proximity. This requires training on paired examples, such as images with captions or video with transcripts, so the model learns to place related content from different modalities close together and unrelated content farther apart. Multimodal embeddings underpin a growing class of applications, from visual product search to content moderation, where a single index can serve text, image, and audio queries alike.
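The text above does not prescribe a training objective. One widely used option, and the one CLIP is trained with, is a symmetric contrastive loss over a batch of pairs: each image should score highest against its own caption, and each caption against its own image. The sketch below is a minimal pure-Python version under stated assumptions. It treats the embeddings as given rather than produced by encoders, uses a fixed temperature of 0.07 (in CLIP the temperature is a learned parameter), and all function names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cross_entropy_row(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_vecs[i], text_vecs[i])."""
    imgs = [normalize(v) for v in image_vecs]
    txts = [normalize(v) for v in text_vecs]
    n = len(imgs)
    # logits[i][j]: temperature-scaled similarity of image i and text j.
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Image-to-text: for each image, the matching text is the correct "class".
    image_to_text = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-image: same idea over the columns of the similarity matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy_row(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Assumption: toy 2-d embeddings standing in for encoder outputs.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[0.9, 0.1], [0.1, 0.9]]      # each text lies near its image
mismatched_texts = [[0.1, 0.9], [0.9, 0.1]]   # pairs swapped

loss_matched = contrastive_loss(images, matched_texts)
loss_mismatched = contrastive_loss(images, mismatched_texts)
print(loss_matched < loss_mismatched)
```

The loss is low when paired items are already close and high when they are not, so minimizing it pulls matching image and text vectors together while pushing non-matching ones apart.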