OpenAI CLIP: The Model That Learnt Zero-Shot Image Recognition Using Text
Disclaimer: The details in this post have been derived from the details shared online by the OpenAI Engineering Team. All credit for the technical details goes to the OpenAI Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
Imagine teaching a computer to recognize objects not by showing it millions of labeled photos, but by letting it browse the internet and learn from how people naturally describe images. That’s exactly what OpenAI’s CLIP does, and it represents a fundamental shift in how we teach machines to understand visual content.
CLIP (Contrastive Language-Image Pre-training) is a neural network that connects vision and language. Released in January 2021, it can classify images into any categories you want without being specifically trained for that task. Just tell it what you’re looking for in plain English, and it can recognize it. This “zero-shot” capability makes CLIP different from almost every computer vision system that came before it.
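To make the zero-shot idea concrete, here is a minimal sketch that classifies an image using the openly released CLIP weights through the Hugging Face transformers library. The checkpoint name, image path, and candidate labels are illustrative assumptions for this sketch, not details taken from the original article.

```python
# Zero-shot image classification sketch with CLIP.
# Assumes the Hugging Face `transformers` and `Pillow` packages are installed;
# the file name and labels below are placeholders, not from the article.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # any local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Encode the image and every candidate caption, then score image-text pairs.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds similarity scores between the image and each caption;
# softmax turns them into a probability distribution over the labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Note that no cat-versus-dog training set is involved: swapping in a different list of captions is all it takes to point the same model at a completely different classification task.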
In this article, we will look at how CLIP works and the problems it tries to solve.
The Problem CLIP Solves
Traditional computer vision followed a rigid formula: if you wanted a model to distinguish cats from dogs, you needed thousands of labeled photos of each. If you then wanted it to recognize different car models, you needed another expensive, purpose-built dataset. For reference, ImageNet, one of the most famous image datasets, required over 25,000 workers to label 14 million images.
This approach created three major problems:
First, datasets were expensive and time-consuming to build.
Second, models became narrow specialists. An ImageNet model could recognize 1,000 categories, but adapting it to new tasks required collecting more data and retraining.
Third, models were brittle outside their training data. The same ImageNet model that reached roughly 76% accuracy on standard benchmark photos could fall to around 37% when shown sketches of the very objects it had supposedly learned to recognize.
