Contrastive Language-Image Pre-training (CLIP) Summary

purnasai gudikandula
Jun 6, 2023

Both image and text in a single multi-modal embedding space.

I would like to keep my understanding short.

Objective:

Learning transferable visual models from natural language supervision: using text to refer to (or point at) images enables zero-shot transfer.

A dataset of 400 million image-text pairs is used for the pretraining task of predicting which caption goes with which image.

Learning directly from raw text using task-agnostic objectives such as

  • Auto-regressive language modeling (causal LM, a generative task, decoder-only models).
  • Masked language modeling (bidirectional, encoder or encoder-decoder models).

has scaled up and improved model capabilities. These standard text-to-text input-output interfaces enabled zero-shot transfer to downstream tasks of any kind, as GPT-2 and GPT-3 demonstrate.

Approach:

It is much easier to scale text supervision than the standard crowd-sourced labeling used for classic image-classification machine learning.

Learning from text also has an important advantage over unsupervised and self-supervised approaches: it not only learns image representations but also connects them to language, which is what enables zero-shot transfer.

Selecting an efficient pretraining method:

  • Shouldn’t take too much compute (Mahajan et al. (2018) required 19 GPU-years to train their ResNeXt101-32x48d to predict only the 1000 ImageNet classes).
  • Shouldn’t have a scalability problem (like VirTex). Both of the above approaches try to predict the exact words of the text accompanying the image, which is a difficult task due to the wide variety of descriptions, comments, and related text.
  • Shouldn’t need an extra order of magnitude of compute (image generative models learn high-quality image representations, but need an order of magnitude more compute than contrastive models).

The authors started with a bag-of-words (BOW) encoding baseline and later swapped the predictive objective (loss function) for a contrastive objective.

Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N×N possible (image, text) pairings actually occurred. To do this, the authors have CLIP learn a multi-modal embedding space by jointly training an image encoder and a text encoder to maximize the cosine similarity of the embeddings of the N real pairs while minimizing the similarity of the incorrect pairings.
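
A minimal sketch of this symmetric loss in PyTorch, following the pseudocode in the paper (the fixed temperature value here is just for illustration; in CLIP it is a learnable parameter):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (N, d) batches of projected embeddings
    image_emb = F.normalize(image_emb, dim=-1)          # unit norm -> dot product == cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature       # (N, N) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # real pairs sit on the diagonal
    loss_images = F.cross_entropy(logits, targets)      # classify the right text for each image
    loss_texts = F.cross_entropy(logits.T, targets)     # classify the right image for each text
    return (loss_images + loss_texts) / 2

# toy usage with random embeddings
print(clip_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512)).item())
```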

Image:

CLIP is trained from scratch, without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights. A non-linear projection between the representation and the contrastive embedding space is not used; instead, a linear projection maps each encoder’s representation to the multi-modal embedding space. A random square crop from resized images is the only data augmentation used.
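
A small sketch of those two choices, with assumed feature dimensions; `RandomResizedCrop` stands in here for the paper's random square crop of resized images:

```python
import torch.nn as nn
from torchvision import transforms

# Linear (not non-linear) projection heads into the shared embedding space.
image_proj = nn.Linear(2048, 512, bias=False)   # CNN feature dim -> embedding dim (dims assumed)
text_proj = nn.Linear(512, 512, bias=False)     # transformer width -> embedding dim

# A random square crop of the resized image is the only augmentation.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])
```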

A temperature parameter controls the range of the logits in the softmax. In the ResNet image encoders, the global average pooling layer is replaced with an attention pooling mechanism: a single “transformer-style” multi-head attention layer where the query is conditioned on the global-average-pooled representation of the image.
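
A rough sketch of that attention pooling idea in PyTorch, with assumed shapes (the released implementation also adds positional embeddings and its own projection layers):

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    # Simplified sketch: one multi-head attention layer whose query is the
    # global-average-pooled image representation attending over all spatial positions.
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, h*w, dim) spatial features from the CNN backbone
        query = feats.mean(dim=1, keepdim=True)      # global average pool -> (batch, 1, dim)
        pooled, _ = self.attn(query, feats, feats)   # Q conditioned on the pooled representation
        return pooled.squeeze(1)                     # (batch, dim)

pool = AttentionPool(dim=2048)
print(pool(torch.randn(2, 49, 2048)).shape)          # torch.Size([2, 2048])
```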

Text:

The text encoder is a Transformer: a 63M-parameter, 12-layer, 512-wide model with 8 attention heads. It operates on a lower-cased byte pair encoding (BPE) representation of the text with a 49,152-token vocabulary, and the max sequence length is capped at 76 to reduce computation. [SOS] and [EOS] tokens are added at the start and end of the text, and the activations at the [EOS] token are used as the feature representation of the text, which is layer-normalized and linearly projected into the multi-modal embedding space.

Masked self-attention is used in the text encoder to preserve the ability to initialize it with a pre-trained language model or to add language modeling as an auxiliary objective.
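
A compact sketch of that text tower, using the hyperparameters listed above and a causal mask; taking the feature at the position with the largest token id assumes [EOS] has the highest id in the vocabulary, as in OpenAI's released code:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    # Sketch: 12 layers, width 512, 8 heads, causal (masked) self-attention,
    # then layer norm and a linear projection of the [EOS]-token activations.
    def __init__(self, vocab_size=49152, width=512, layers=12, heads=8,
                 max_len=76, embed_dim=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, width)
        self.pos_emb = nn.Parameter(torch.zeros(max_len, width))
        layer = nn.TransformerEncoderLayer(width, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, layers)
        self.ln_final = nn.LayerNorm(width)
        self.proj = nn.Linear(width, embed_dim, bias=False)

    def forward(self, tokens):                                    # tokens: (batch, seq_len)
        seq_len = tokens.size(1)
        x = self.token_emb(tokens) + self.pos_emb[:seq_len]
        causal_mask = torch.full((seq_len, seq_len), float("-inf")).triu(1)
        x = self.transformer(x, mask=causal_mask)                 # masked self-attention
        x = self.ln_final(x)
        eos_idx = tokens.argmax(dim=-1)                           # assumes [EOS] is the largest token id
        feats = x[torch.arange(x.size(0)), eos_idx]               # activations at [EOS]
        return self.proj(feats)                                   # linear projection

enc = TextEncoder()
print(enc(torch.randint(0, 49152, (2, 76))).shape)                # torch.Size([2, 512])
```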

Training:

  • The learnable temperature parameter is initialized to 0.07 (see the sketch after this list).
  • A large minibatch size of 32,768 is used.
  • Mixed precision is used to accelerate training and save memory.
  • Gradient checkpointing and half-precision Adam statistics are also used to save memory.
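
A tiny sketch of how the learnable temperature is typically handled: stored as a log-scale parameter initialized to the equivalent of 0.07, exponentiated at use time, and clamped so the logits are never scaled by more than 100 (as in the paper):

```python
import torch
import torch.nn.functional as F

# Learnable temperature stored in log space, initialized to ln(1 / 0.07).
logit_scale = torch.nn.Parameter(torch.log(torch.tensor(1.0 / 0.07)))

image_emb = F.normalize(torch.randn(8, 512), dim=-1)
text_emb = F.normalize(torch.randn(8, 512), dim=-1)

# Exponentiate and clamp before scaling the cosine similarities.
scale = logit_scale.exp().clamp(max=100.0)
logits = scale * image_emb @ text_emb.T
print(logits.shape)   # torch.Size([8, 8])
```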

Code:

An easy implementation of CLIP and Faiss for image retrieval and indexing, using both text and image queries, is here:

Image similarity using CLIP
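
As a rough sketch of the same idea (the model name, image paths, and query string below are placeholders, not taken from that implementation), CLIP embeddings can be indexed with Faiss and queried with either text or image embeddings:

```python
import faiss
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# 1. Embed a collection of images and add them to a Faiss index.
images = [Image.open(p) for p in ["cat.jpg", "dog.jpg", "car.jpg"]]   # placeholder paths
with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
img_emb = F.normalize(img_emb, dim=-1).numpy()

index = faiss.IndexFlatIP(img_emb.shape[1])   # inner product == cosine similarity on unit vectors
index.add(img_emb)

# 2. Query the index with a text prompt (an image query works the same way
#    via get_image_features).
with torch.no_grad():
    txt_inputs = processor(text=["a photo of a dog"], return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**txt_inputs)
txt_emb = F.normalize(txt_emb, dim=-1).numpy()

scores, ids = index.search(txt_emb, 3)        # top-3 nearest images
print(ids, scores)
```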

Thank you for being here

Comment your views; if any mistakes are found, the blog will be updated accordingly. Do follow and share.
