--- license: mit library_name: transformers pipeline_tag: zero-shot-image-classification base_model: openai/clip-vit-base-patch32 language: - en tags: - clip - vision-language - compositional-reasoning - contrastive-learning - text-encoder - sugarcrepe - whatsup - crepe - valse --- # READ-CLIP (ViT-B/32) **READ-CLIP** is a CLIP model fine-tuned with **READ** (**RE**construction and **A**lignment of text **D**escriptions), a lightweight recipe that strengthens the compositional reasoning of vision–language models. This is the official checkpoint for the NeurIPS 2025 paper *"Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions."* - 📄 **Paper:** [arXiv:2510.16540](https://arxiv.org/abs/2510.16540) (NeurIPS 2025) - 💻 **Code:** [github.com/JiH00nKw0n/READ-CLIP](https://github.com/JiH00nKw0n/READ-CLIP) - 🧩 **Base model:** [`openai/clip-vit-base-patch32`](https://huggingface.co/openai/clip-vit-base-patch32) - 🌐 **Project page:** [jih00nkw0n.github.io/READ-CLIP](https://jih00nkw0n.github.io/READ-CLIP/) ![READ-CLIP overview and benchmark comparison](https://raw.githubusercontent.com/JiH00nKw0n/READ-CLIP/master/docs/static/teaser.png) ## Method Contrastively trained CLIP models tend to behave like a bag of words, attending to individual tokens rather than the relationships between them. READ adds two auxiliary objectives on top of the standard contrastive loss during fine-tuning: - **Token-level reconstruction** — a *frozen* T5 decoder (`google/t5-v1_1-large`) reconstructs related captions from the CLIP text embedding, forcing the embedding to retain word-relationship information. - **Sentence-level alignment** — paraphrases of the same caption are pulled together in the embedding space, making representations robust to surface wording. Both objectives are **training-only**. At inference, READ-CLIP is a drop-in `CLIPModel`: no decoder, no extra parameters, and the same compute as the original CLIP. ## Usage The checkpoint loads directly with `transformers` as a standard `CLIPModel`: ```python import torch from transformers import CLIPModel, CLIPProcessor device = "cuda" if torch.cuda.is_available() else "cpu" model = CLIPModel.from_pretrained("Mayfull/READ-CLIP").to(device) processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32") inputs = processor( text=["a photo of a cat", "a photo of a dog"], images=image, # a PIL.Image return_tensors="pt", padding=True, ).to(device) with torch.no_grad(): outputs = model(**inputs) probs = outputs.logits_per_image.softmax(dim=-1) ``` ## Results Compositional reasoning accuracy on five standard benchmarks (ViT-B/32 backbone): | Benchmark | READ-CLIP | NegCLIP | FSC-CLIP | |---------------------|:---------:|:-------:|:--------:| | WhatsUp | **43.9** | 42.4 | 39.8 | | VALSE | **76.2** | 73.7 | 74.4 | | CREPE | 41.5 | 30.5 | **42.5** | | SugarCrepe | **87.0** | 83.6 | 85.2 | | SugarCrepe++ (ITT) | **69.8** | 65.0 | 67.9 | | SugarCrepe++ (TOT) | **66.2** | 62.5 | 64.4 | | **Average** | **64.1** | 59.6 | 62.4 | See the [paper](https://arxiv.org/abs/2510.16540) for the full set of baselines and ablations. ## Training - **Backbone:** `openai/clip-vit-base-patch32` (ViT-B/32) - **Data:** MS-COCO (Karpathy training split, ~113K image–caption pairs) - **Schedule:** 5 epochs, global batch size 256, AdamW, lr 1e-5 (cosine), weight decay 0.1, bf16 - **Hardware:** 1× NVIDIA A100 (~2 GPU-hours), seed 2025 ## Citation ```bibtex @inproceedings{kwon2026enhancing, title={Enhancing Compositional Reasoning in {CLIP} via Reconstruction and Alignment of Text Descriptions}, author={Jihoon Kwon and Kyle Min and Jy-yong Sohn}, booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems}, year={2026}, url={https://openreview.net/forum?id=6uKIm4bfEe} } ``` ## License Released under the [MIT License](https://github.com/JiH00nKw0n/READ-CLIP/blob/master/LICENSE).