Zero-Shot Image Classification
Transformers
Safetensors
English
clip
vision-language
compositional-reasoning
contrastive-learning
text-encoder
sugarcrepe
whatsup
crepe
valse
Instructions to use Mayfull/READ-CLIP with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Mayfull/READ-CLIP with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("zero-shot-image-classification", model="Mayfull/READ-CLIP") pipe( "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parrots.png", candidate_labels=["animals", "humans", "landscape"], )# Load model directly from transformers import AutoProcessor, AutoModelForZeroShotImageClassification processor = AutoProcessor.from_pretrained("Mayfull/READ-CLIP") model = AutoModelForZeroShotImageClassification.from_pretrained("Mayfull/READ-CLIP") - Notebooks
- Google Colab
- Kaggle
File size: 4,133 Bytes
353cfa2 3b26125 353cfa2 3b26125 353cfa2 3b26125 974f270 353cfa2 3b26125 353cfa2 3b26125 353cfa2 3b26125 6e4859e 353cfa2 3b26125 353cfa2 3b26125 353cfa2 3b26125 353cfa2 3b26125 353cfa2 3b26125 353cfa2 3b26125 353cfa2 3b26125 353cfa2 3b26125 353cfa2 3b26125 353cfa2 3b26125 353cfa2 3b26125 353cfa2 3b26125 353cfa2 3b26125 353cfa2 3b26125 974f270 3b26125 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 | ---
license: mit
library_name: transformers
pipeline_tag: zero-shot-image-classification
base_model: openai/clip-vit-base-patch32
language:
- en
tags:
- clip
- vision-language
- compositional-reasoning
- contrastive-learning
- text-encoder
- sugarcrepe
- whatsup
- crepe
- valse
---
# READ-CLIP (ViT-B/32)
**READ-CLIP** is a CLIP model fine-tuned with **READ** (**RE**construction and **A**lignment of text **D**escriptions), a lightweight recipe that strengthens the compositional reasoning of vision–language models. This is the official checkpoint for the NeurIPS 2025 paper *"Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions."*
- 📄 **Paper:** [arXiv:2510.16540](https://arxiv.org/abs/2510.16540) (NeurIPS 2025)
- 💻 **Code:** [github.com/JiH00nKw0n/READ-CLIP](https://github.com/JiH00nKw0n/READ-CLIP)
- 🧩 **Base model:** [`openai/clip-vit-base-patch32`](https://huggingface.co/openai/clip-vit-base-patch32)
- 🌐 **Project page:** [jih00nkw0n.github.io/READ-CLIP](https://jih00nkw0n.github.io/READ-CLIP/)

## Method
Contrastively trained CLIP models tend to behave like a bag of words, attending to individual tokens rather than the relationships between them. READ adds two auxiliary objectives on top of the standard contrastive loss during fine-tuning:
- **Token-level reconstruction** — a *frozen* T5 decoder (`google/t5-v1_1-large`) reconstructs related captions from the CLIP text embedding, forcing the embedding to retain word-relationship information.
- **Sentence-level alignment** — paraphrases of the same caption are pulled together in the embedding space, making representations robust to surface wording.
Both objectives are **training-only**. At inference, READ-CLIP is a drop-in `CLIPModel`: no decoder, no extra parameters, and the same compute as the original CLIP.
## Usage
The checkpoint loads directly with `transformers` as a standard `CLIPModel`:
```python
import torch
from transformers import CLIPModel, CLIPProcessor
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("Mayfull/READ-CLIP").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
inputs = processor(
text=["a photo of a cat", "a photo of a dog"],
images=image, # a PIL.Image
return_tensors="pt",
padding=True,
).to(device)
with torch.no_grad():
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
```
## Results
Compositional reasoning accuracy on five standard benchmarks (ViT-B/32 backbone):
| Benchmark | READ-CLIP | NegCLIP | FSC-CLIP |
|---------------------|:---------:|:-------:|:--------:|
| WhatsUp | **43.9** | 42.4 | 39.8 |
| VALSE | **76.2** | 73.7 | 74.4 |
| CREPE | 41.5 | 30.5 | **42.5** |
| SugarCrepe | **87.0** | 83.6 | 85.2 |
| SugarCrepe++ (ITT) | **69.8** | 65.0 | 67.9 |
| SugarCrepe++ (TOT) | **66.2** | 62.5 | 64.4 |
| **Average** | **64.1** | 59.6 | 62.4 |
See the [paper](https://arxiv.org/abs/2510.16540) for the full set of baselines and ablations.
## Training
- **Backbone:** `openai/clip-vit-base-patch32` (ViT-B/32)
- **Data:** MS-COCO (Karpathy training split, ~113K image–caption pairs)
- **Schedule:** 5 epochs, global batch size 256, AdamW, lr 1e-5 (cosine), weight decay 0.1, bf16
- **Hardware:** 1× NVIDIA A100 (~2 GPU-hours), seed 2025
## Citation
```bibtex
@inproceedings{kwon2026enhancing,
title={Enhancing Compositional Reasoning in {CLIP} via Reconstruction and Alignment of Text Descriptions},
author={Jihoon Kwon and Kyle Min and Jy-yong Sohn},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2026},
url={https://openreview.net/forum?id=6uKIm4bfEe}
}
```
## License
Released under the [MIT License](https://github.com/JiH00nKw0n/READ-CLIP/blob/master/LICENSE).
|