---
license: mit
library_name: transformers
pipeline_tag: zero-shot-image-classification
base_model: openai/clip-vit-base-patch32
language:
- en
tags:
- clip
- vision-language
- compositional-reasoning
- contrastive-learning
- text-encoder
- sugarcrepe
- whatsup
- crepe
- valse
---
# READ-CLIP (ViT-B/32)
**READ-CLIP** is a CLIP model fine-tuned with **READ** (**RE**construction and **A**lignment of text **D**escriptions), a lightweight recipe that strengthens the compositional reasoning of vision–language models. This is the official checkpoint for the NeurIPS 2025 paper *"Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions."*
- 📄 **Paper:** [arXiv:2510.16540](https://arxiv.org/abs/2510.16540) (NeurIPS 2025)
- 💻 **Code:** [github.com/JiH00nKw0n/READ-CLIP](https://github.com/JiH00nKw0n/READ-CLIP)
- 🧩 **Base model:** [`openai/clip-vit-base-patch32`](https://huggingface.co/openai/clip-vit-base-patch32)
- 🌐 **Project page:** [jih00nkw0n.github.io/READ-CLIP](https://jih00nkw0n.github.io/READ-CLIP/)

![READ-CLIP overview and benchmark comparison](https://raw.githubusercontent.com/JiH00nKw0n/READ-CLIP/master/docs/static/teaser.png)
## Method
Contrastively trained CLIP models tend to behave like a bag of words, attending to individual tokens rather than the relationships between them. READ adds two auxiliary objectives on top of the standard contrastive loss during fine-tuning:
- **Token-level reconstruction** — a *frozen* T5 decoder (`google/t5-v1_1-large`) reconstructs related captions from the CLIP text embedding, forcing the embedding to retain word-relationship information.
- **Sentence-level alignment** — paraphrases of the same caption are pulled together in the embedding space, making representations robust to surface wording.

Both objectives are **training-only**. At inference, READ-CLIP is a drop-in `CLIPModel`: no decoder, no extra parameters, and the same compute as the original CLIP.
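
Schematically, the fine-tuning objective described above can be written as a weighted sum. The weights $\lambda_{\text{rec}}$ and $\lambda_{\text{align}}$ and all symbol names are illustrative assumptions, not values or notation taken from this card; the paper gives the exact formulation.

$$
\mathcal{L}_{\text{READ}} = \mathcal{L}_{\text{CLIP}} + \lambda_{\text{rec}}\,\mathcal{L}_{\text{rec}} + \lambda_{\text{align}}\,\mathcal{L}_{\text{align}}
$$

Here $\mathcal{L}_{\text{CLIP}}$ is the standard image–text contrastive loss, $\mathcal{L}_{\text{rec}}$ is the token-level reconstruction loss computed through the frozen T5 decoder, and $\mathcal{L}_{\text{align}}$ is the sentence-level loss that pulls paraphrase embeddings together. Only the CLIP parameters receive gradient updates in this picture, which is why nothing beyond a `CLIPModel` is needed at inference.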
## Usage
The checkpoint loads directly with `transformers` as a standard `CLIPModel`:
```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("Mayfull/READ-CLIP").to(device)
# The processor is loaded from the base checkpoint.
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")  # placeholder path: use any image

inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,  # a PIL.Image
    return_tensors="pt",
    padding=True,
).to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Probability of each caption given the image, shape (1, 2)
probs = outputs.logits_per_image.softmax(dim=-1)
```
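
Compositional benchmarks such as SugarCrepe typically pair each image with a correct caption and a minimally altered hard-negative caption, and count a hit when the correct caption scores higher. The following is a minimal sketch of that scoring rule in plain Python. The function names and the example scores are hypothetical, and the exact protocol differs per benchmark, so refer to each benchmark's own evaluation code for reported numbers.

```python
from typing import Sequence, Tuple


def prefers_positive(pos_score: float, neg_score: float) -> bool:
    """True when the correct caption scores strictly higher than the hard negative."""
    return pos_score > neg_score


def pairwise_accuracy(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (positive, negative) score pairs where the positive wins."""
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    hits = sum(prefers_positive(pos, neg) for pos, neg in score_pairs)
    return hits / len(score_pairs)


# Made-up image-text scores, one (positive, negative) pair per image.
example_pairs = [(25.1, 20.3), (18.0, 19.2), (30.4, 10.9), (5.5, 4.1)]
print(pairwise_accuracy(example_pairs))  # 0.75
```

To obtain one such pair from the model, pass `text=[positive_caption, negative_caption]` to the processor as in the snippet above and read `outputs.logits_per_image[0].tolist()`.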
## Results
Compositional reasoning accuracy (%) on five standard benchmarks (ViT-B/32 backbone):

| Benchmark | READ-CLIP | NegCLIP | FSC-CLIP |
|---------------------|:---------:|:-------:|:--------:|
| WhatsUp | **43.9** | 42.4 | 39.8 |
| VALSE | **76.2** | 73.7 | 74.4 |
| CREPE | 41.5 | 30.5 | **42.5** |
| SugarCrepe | **87.0** | 83.6 | 85.2 |
| SugarCrepe++ (ITT) | **69.8** | 65.0 | 67.9 |
| SugarCrepe++ (TOT) | **66.2** | 62.5 | 64.4 |
| **Average** | **64.1** | 59.6 | 62.4 |

See the [paper](https://arxiv.org/abs/2510.16540) for the full set of baselines and ablations.
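
The Average row is consistent with an unweighted mean over the six table rows, so SugarCrepe++ contributes two entries (ITT and TOT). The short check below reproduces it from the numbers in the table; the variable names are only illustrative.

```python
# Scores copied from the table, in row order:
# WhatsUp, VALSE, CREPE, SugarCrepe, SugarCrepe++ (ITT), SugarCrepe++ (TOT)
scores = {
    "READ-CLIP": [43.9, 76.2, 41.5, 87.0, 69.8, 66.2],
    "NegCLIP": [42.4, 73.7, 30.5, 83.6, 65.0, 62.5],
    "FSC-CLIP": [39.8, 74.4, 42.5, 85.2, 67.9, 64.4],
}

# Unweighted mean over the six rows, rounded to one decimal as in the table.
averages = {name: round(sum(vals) / len(vals), 1) for name, vals in scores.items()}

print(averages)  # {'READ-CLIP': 64.1, 'NegCLIP': 59.6, 'FSC-CLIP': 62.4}
```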
## Training
- **Backbone:** `openai/clip-vit-base-patch32` (ViT-B/32)
- **Data:** MS-COCO (Karpathy training split, ~113K image–caption pairs)
- **Schedule:** 5 epochs, global batch size 256, AdamW, lr 1e-5 (cosine), weight decay 0.1, bf16
- **Hardware:** 1× NVIDIA A100 (~2 GPU-hours), seed 2025
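
To give a feel for the schedule, here is a rough sketch of the step count and a cosine learning-rate decay built from the numbers above. Several things are assumptions rather than facts from this card: the dataset size is rounded to 113,000 pairs, the last partial batch of each epoch is kept, there is no warmup, and the learning rate decays to zero. The training code in the repository is the authoritative reference.

```python
import math

PEAK_LR = 1e-5       # from the card
EPOCHS = 5           # from the card
BATCH_SIZE = 256     # global batch size, from the card
NUM_PAIRS = 113_000  # assumption: "~113K" rounded


def total_steps(num_pairs: int = NUM_PAIRS,
                batch_size: int = BATCH_SIZE,
                epochs: int = EPOCHS) -> int:
    """Optimizer steps, assuming the last partial batch of each epoch is kept."""
    return math.ceil(num_pairs / batch_size) * epochs


def cosine_lr(step: int, total: int,
              peak_lr: float = PEAK_LR, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr to min_lr (assumed: no warmup, min_lr = 0)."""
    progress = min(max(step / total, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


steps = total_steps()
print(steps)  # 2210
for s in (0, steps // 2, steps):
    print(s, cosine_lr(s, steps))
```

Under these assumptions a run is about 2,210 optimizer steps, with the learning rate starting at 1e-5 and reaching half of that at the midpoint.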
## Citation
```bibtex
@inproceedings{kwon2025enhancing,
title={Enhancing Compositional Reasoning in {CLIP} via Reconstruction and Alignment of Text Descriptions},
author={Jihoon Kwon and Kyle Min and Jy-yong Sohn},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=6uKIm4bfEe}
}
```
## License
Released under the [MIT License](https://github.com/JiH00nKw0n/READ-CLIP/blob/master/LICENSE).