Attention mechanisms, ViT, and multimodal vision-language models
Transformers, originally developed for NLP, have reshaped computer vision. The Vision Transformer (ViT) treats an image as a sequence of patches and processes them with standard self-attention, rivaling and often exceeding CNN performance on large-scale benchmarks.
The core of the transformer is the scaled dot-product attention operation:

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

where Q (queries), K (keys), and V (values) are linear projections of the input sequence. The √d_k scaling prevents dot products from growing large in magnitude and saturating the softmax.

Multi-head attention runs h independent attention functions in parallel and concatenates the outputs:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O

This lets the model jointly attend to information from different representation subspaces.
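A minimal NumPy sketch of the scaled dot-product formula above (a single head, no masking or batching; not a production implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ V                            # convex combination of value rows
```

Each output row is a weighted average of the rows of V, with weights given by the softmax over query-key similarities.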
ViT adapts the transformer encoder to images through three steps:
1. Patchify. Divide the H×W image into non-overlapping patches of size P×P. This produces N = HW/P² patches. Standard ViT-B/16 uses P = 16 on 224×224 images, giving N = 196 patches.

2. Embed. Flatten each patch to a vector and project it linearly to dimension D. Prepend a learnable [CLS] token whose final representation is used for classification. Add positional embeddings (learned 1D or 2D) to preserve spatial information:

z_0 = [x_cls; x_p^1 E; x_p^2 E; …; x_p^N E] + E_pos

3. Encode. Pass the token sequence through L transformer encoder layers. Classify using the [CLS] token output through an MLP head:

y = MLP(z_L^0)
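The patchify and embed steps can be sketched in NumPy as follows. The projection E, [CLS] token, and positional embeddings are zero placeholders here standing in for learned parameters; D = 768 is the ViT-B width:

```python
import numpy as np

def patchify(image, P=16):
    """Split an (H, W, C) image into N = HW/P^2 flattened P*P*C patch vectors."""
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0
    patches = image.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4)    # (H/P, W/P, P, P, C) patch grid
    return patches.reshape(-1, P * P * C)         # (N, P*P*C)

D = 768                                   # ViT-B embedding width
image = np.zeros((224, 224, 3))
patches = patchify(image)                 # (196, 768): N = 224*224 / 16^2 = 196
E = np.zeros((patches.shape[1], D))       # patch projection (placeholder for learned weights)
x_cls = np.zeros((1, D))                  # placeholder learnable [CLS] token
E_pos = np.zeros((patches.shape[0] + 1, D))  # placeholder positional embeddings
z0 = np.concatenate([x_cls, patches @ E], axis=0) + E_pos  # (197, 768) token sequence
```

Note that for P = 16 and C = 3 the flattened patch dimension P²C = 768 happens to equal D, but the linear projection E is still needed in general.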
ViT requires large training datasets (JFT-300M, ImageNet-21k) to outperform CNNs. On smaller datasets, convolutional inductive biases (translation equivariance, locality) give CNNs an advantage. Hybrid models combine CNN feature extractors with transformer encoders.
CLIP (Contrastive Language-Image Pretraining, OpenAI 2021) trains an image encoder and a text encoder jointly on 400 million (image, text) pairs from the internet. The objective aligns matching pairs close together and pushes non-matching pairs apart in a shared embedding space.
For a batch of N (image, text) pairs, CLIP maximizes the cosine similarity of the N correct pairs while minimizing similarity for the N² − N incorrect pairs:

L_CLIP = −(1 / 2N) Σ_{i=1}^{N} [ log( exp(f_i·g_i / τ) / Σ_j exp(f_i·g_j / τ) ) + log( exp(f_i·g_i / τ) / Σ_j exp(f_j·g_i / τ) ) ]

where f_i and g_i are the ℓ2-normalized image and text embeddings, and τ is a learned temperature.
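The symmetric loss above can be sketched in NumPy as follows (a fixed temperature stands in for the learned τ; assumes the embeddings are already ℓ2-normalized):

```python
import numpy as np

def clip_loss(f, g, tau=0.07):
    """Symmetric contrastive loss over N aligned (image, text) embedding pairs.

    f, g: (N, D) arrays of l2-normalized embeddings. tau is a placeholder
    for CLIP's learned temperature.
    """
    logits = f @ g.T / tau            # (N, N) pairwise cosine similarities / temperature
    idx = np.arange(len(f))

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)   # stable log-softmax
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    img_to_txt = log_softmax(logits, axis=1)[idx, idx]  # rows: each image vs all texts
    txt_to_img = log_softmax(logits, axis=0)[idx, idx]  # cols: each text vs all images
    return -(img_to_txt + txt_to_img).mean() / 2
```

When the two encoders map matching pairs to identical orthonormal vectors, both log-softmax terms approach zero and the loss goes to zero; mismatched embeddings drive it toward log N.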