CLIP

A multimodal model by OpenAI that maps images and text into a shared vector space, enabling cross-modal similarity search.

CLIP (Contrastive Language-Image Pre-training) is a model released by OpenAI that learns to place images and their text descriptions into the same vector space. After training on roughly 400 million image-text pairs, a photo of a dog and the phrase "a photograph of a dog" end up as nearby vectors, while unrelated pairs land far apart.

This shared embedding space is what makes cross-modal search possible. With CLIP, you can search an image collection using a text query, find images similar to another image, or match captions to pictures, all using ordinary vector similarity such as cosine similarity. It is a widely used embedding backbone for multimodal vector database applications such as visual product search and content moderation.
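To make the idea of "ordinary vector similarity" concrete, here is a minimal sketch of text-to-image search by cosine similarity. It assumes the embeddings have already been produced by a CLIP image and text encoder. The three-dimensional vectors, the file names, and the `cosine_search` helper are made up for illustration and do not come from any real model or library.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_search(query, index, top_k=2):
    """Rank the items in `index` (name -> vector) by cosine similarity to `query`."""
    q = normalize(query)
    scored = []
    for name, vec in index.items():
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((name, score))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy stand-ins for image embeddings (hypothetical values; real CLIP
# vectors have hundreds of dimensions).
image_index = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "cat.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}

# Toy stand-in for the text embedding of "a photograph of a dog".
text_query = [0.8, 0.2, 0.1]

results = cosine_search(text_query, image_index)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Because images and text live in the same space, the same function would serve image-to-image search if the query were an image embedding instead. A vector database performs the same ranking, typically with an approximate nearest-neighbor index rather than a full scan.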

CLIP’s contrastive training approach — pulling matching image-text pairs together and pushing mismatched ones apart — proved so effective that it spawned a whole family of successors. Together they established multimodal embedding as a practical, production-ready capability rather than a research curiosity.
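The pull-together, push-apart objective can be sketched as a symmetric cross-entropy loss over a batch's image-text similarity matrix, where the matching pair for image i is text i. This is an illustrative sketch under stated assumptions, not OpenAI's training code: the function names are hypothetical, the embeddings are toy values, and the temperature is fixed at 0.07 here, whereas CLIP learns it during training.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Matching pairs point the same way: the loss is close to zero.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])

# Each image is paired with the wrong caption: the loss is large.
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])

print(f"aligned loss:  {aligned:.6f}")
print(f"shuffled loss: {shuffled:.6f}")
```

Minimizing this loss raises the similarity on the diagonal (matching pairs) and lowers it everywhere else (mismatched pairs), which is what produces the shared embedding space described above.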