Publication record · 18.cifr/2021.radford.clip-zero-shot
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability, since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative that leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn state-of-the-art image representations from scratch on a dataset of 400 million (image, text) pairs.
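The pre-training task described above pairs each image in a batch with its caption and scores all other captions as negatives, i.e. a symmetric contrastive (InfoNCE-style) objective over the batch. A minimal sketch of that loss, assuming precomputed embeddings and a fixed temperature (the paper's encoders and learnable temperature are omitted):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb, text_emb: (N, D) arrays where row i of each forms a
    matched (image, caption) pair. Hypothetical sketch: the real model
    produces these embeddings with learned image/text encoders and
    learns the temperature; here both are assumed given.
    """
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix: entry (i, j) scores image i vs. caption j
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy(l):
        # Row-wise softmax cross-entropy with the diagonal as the target:
        # each image must pick its own caption (and vice versa for l.T).
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned, mutually orthogonal pairs the loss is near zero; mismatching the pairs drives it up, which is what makes the task a useful training signal at scale.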
CLIP underperforms on fine-grained classification (e.g. car models, flower species) and on abstract reasoning tasks. The authors suggest improved data curation, better prompt engineering, and integration with few-shot learning as natural extensions. The scaling relationships between dataset size, model capacity, and zero-shot transfer performance remain open questions.
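The prompt-engineering direction builds on how zero-shot transfer works: each class name is turned into a text prompt (e.g. "a photo of a {label}"), and an image is assigned to the class whose prompt embedding is most similar. A minimal sketch with hypothetical precomputed embeddings standing in for the encoders:

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class prompt most similar to the image.

    Hypothetical sketch: in practice, image_emb comes from the image
    encoder and each entry of class_text_embs from encoding a prompt
    such as "a photo of a {class name}" with the text encoder.
    """
    # Normalize so similarity is cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb)
    sims = []
    for emb in class_text_embs:
        emb = emb / np.linalg.norm(emb)
        sims.append(float(image_emb @ emb))
    return int(np.argmax(sims))
```

Because classes are specified purely by text, changing the label set requires no retraining, which is the generality the abstract contrasts with fixed-category supervision; prompt wording directly shifts the text embeddings, which is why prompt engineering affects accuracy.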