In a blog post last week, LAION (Large-scale Artificial Intelligence Open Network) announced that it has trained three large-scale CLIP models, ViT-L/14, ViT-H/14 and ViT-g/14, with OpenCLIP. The release is believed to set a new benchmark for open models in image classification and generation.
CLIP models are typically trained in a self-supervised fashion on very large numbers of (image, text) pairs. According to the blog, the team produced the LAION-5B dataset, which contains roughly 5.8 billion closely related image-text pairs.
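For readers unfamiliar with this training setup, the sketch below illustrates the core contrastive objective under simplifying assumptions: an image encoder and a text encoder each embed a batch of matching pairs, and a symmetric cross-entropy loss pulls matching embeddings together while pushing mismatched ones apart. The function and tensor names are placeholders, not LAION's actual training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of
    matching (image, text) pairs.

    image_emb, text_emb: (batch, dim) embeddings from the image and text
    encoders; row i of each tensor comes from the same pair.
    """
    # L2-normalise so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the true pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```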

CLIP (Contrastive Language-Image Pre-training) is a neural network that efficiently learns visual concepts from natural language supervision. It can be applied to any visual classification benchmark simply by providing the names of the categories to be recognised, similar to the "zero-shot" capabilities of GPT-2 and GPT-3.
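As an illustration, a zero-shot classifier can be built by embedding the class names as text prompts and picking the class whose embedding is most similar to the image embedding. The sketch below uses the open_clip package; the model name, pretrained tag and image file are illustrative and should be checked against the released checkpoints.

```python
import torch
from PIL import Image
import open_clip

# Model name and pretrained tag are examples; see the OpenCLIP repository
# for the exact identifiers of the released LAION checkpoints.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-H-14")
model.eval()

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical file
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(labels[probs.argmax().item()])
```

No class-specific training is involved; changing the label list is enough to target a different benchmark, which is what makes the classification "zero-shot".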
OpenAI's earlier CLIP ViT-B/32 model was used to filter the dataset out of Common Crawl. The team believes that training the best open-source CLIP models on the LAION-5B dataset completes the open-source replication of the CLIP paper released by OpenAI in 2021.
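The filtering step can be pictured as scoring each candidate image against its alt-text with CLIP and keeping only pairs above a similarity threshold. The sketch below is a simplified illustration of that idea, not LAION's actual pipeline; the helper name and threshold value are assumptions.

```python
import torch

def filter_pairs(image_features, text_features, threshold=0.28):
    """Keep only (image, text) pairs whose CLIP cosine similarity passes a
    threshold. Features are (n, dim) tensors from a CLIP model, row-aligned;
    the 0.28 threshold is illustrative, not the value used for LAION-5B."""
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features * text_features).sum(dim=-1)  # pairwise cosine
    return similarity >= threshold  # boolean mask over the candidate pairs
```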
The new H/14 model aims to deliver top-level numbers, with applications beyond image generation in high-end classification and dataset creation. It achieves 78.0% zero-shot top-1 accuracy on ImageNet and 73.4% zero-shot image retrieval (Recall@5) on MS COCO, making it the best open-source CLIP model as of September 2022.
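For clarity, Recall@5 in text-to-image retrieval measures how often the correct image appears among the five nearest neighbours of a caption's embedding. A minimal sketch of that computation, assuming precomputed, L2-normalised embeddings:

```python
import torch

def recall_at_k(text_features, image_features, k=5):
    """Fraction of captions whose matching image (same row index) is among
    the k most similar images. Inputs are L2-normalised (n, dim) tensors."""
    similarity = text_features @ image_features.t()          # (n, n)
    topk = similarity.topk(k, dim=-1).indices                # (n, k)
    targets = torch.arange(similarity.size(0)).unsqueeze(1)  # (n, 1)
    hits = (topk == targets).any(dim=-1)
    return hits.float().mean().item()
```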
The models are expected to be used for many applications, such as CLIP guidance and conditioning, and are claimed to give better results with models like Stable Diffusion. They can further be used for swapping in a text encoder that works in the multilingual setting, expanding to other modalities, and extracting the knowledge from smaller CLIP models into a bigger one to help bootstrap the learning process; a rough sketch of the guidance idea follows below.
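As a rough illustration of CLIP guidance, a generator's intermediate image can be nudged toward a text prompt by following the gradient of the CLIP image-text similarity with respect to the image. The sketch below shows only this scoring-and-gradient step with a generic CLIP-style model object; it is not the conditioning mechanism built into Stable Diffusion itself.

```python
import torch
import torch.nn.functional as F

def clip_guidance_grad(model, image, text_tokens):
    """Gradient of the CLIP image-text similarity with respect to the image.

    `model` is any CLIP-style model exposing encode_image and encode_text
    (as OpenCLIP models do); `image` is a preprocessed (1, 3, H, W) tensor
    and `text_tokens` a tokenised prompt.
    """
    image = image.clone().requires_grad_(True)
    image_features = F.normalize(model.encode_image(image), dim=-1)
    text_features = F.normalize(model.encode_text(text_tokens), dim=-1)
    similarity = (image_features * text_features).sum()
    similarity.backward()
    # Ascending this gradient (scaled by a guidance strength) at each
    # sampling step pulls the generated image toward the prompt.
    return image.grad
```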
