Publication record · 18.cifr/2021.radford.clip-zero-shot
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability, since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative that leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn state-of-the-art image representations from scratch on a dataset of 400 million (image, text) pairs.
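The pre-training task described above pairs each image in a batch with its caption and scores all other captions as negatives, i.e. a symmetric contrastive (InfoNCE-style) objective over the batch. A minimal sketch of that loss, assuming precomputed embeddings and a fixed temperature (the paper's encoders and learnable temperature are omitted):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb, text_emb: (N, D) arrays where row i of each forms a
    matched (image, caption) pair. Hypothetical sketch: the real model
    produces these embeddings with learned image/text encoders and
    learns the temperature; here both are assumed given.
    """
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix: entry (i, j) scores image i vs. caption j
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy(l):
        # Row-wise softmax cross-entropy with the diagonal as the target:
        # each image must pick its own caption (and vice versa for l.T).
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned, mutually orthogonal pairs the loss is near zero; mismatching the pairs drives it up, which is what makes the task a useful training signal at scale.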
CLIP underperforms on fine-grained classification (e.g. car models, flower species) and on abstract reasoning tasks. The authors suggest improved data curation, better prompt engineering, and integration with few-shot learning as natural extensions. The scaling relationships between dataset size, model capacity, and zero-shot transfer performance remain open questions.
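The prompt-engineering direction builds on how zero-shot transfer works: each class name is turned into a text prompt (e.g. "a photo of a {label}"), and an image is assigned to the class whose prompt embedding is most similar. A minimal sketch with hypothetical precomputed embeddings standing in for the encoders:

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class prompt most similar to the image.

    Hypothetical sketch: in practice, image_emb comes from the image
    encoder and each entry of class_text_embs from encoding a prompt
    such as "a photo of a {class name}" with the text encoder.
    """
    # Normalize so similarity is cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb)
    sims = []
    for emb in class_text_embs:
        emb = emb / np.linalg.norm(emb)
        sims.append(float(image_emb @ emb))
    return int(np.argmax(sims))
```

Because classes are specified purely by text, changing the label set requires no retraining, which is the generality the abstract contrasts with fixed-category supervision; prompt wording directly shifts the text embeddings, which is why prompt engineering affects accuracy.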