This model is fine-tuned from InternVL2-8B using Group Relative Policy Optimization (GRPO) with the Cross-modal Identity Mapping (CIM) reward, as described in our CVPR 2026 paper.
CIM is a reinforcement learning framework that improves image captioning by minimizing information loss during modality conversion. It uses two reward signals, Gallery Representation Consistency (GRC) and Query-gallery Image Relevance (QIR), to encourage large vision-language models (LVLMs) to generate fine-grained and precise captions without extra annotations.
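To make the training signal more concrete, here is a minimal sketch of how per-caption rewards could be turned into GRPO-style group-relative advantages. It is an illustration under stated assumptions, not the paper's implementation: the weighted-sum combination of GRC and QIR, the weights, the function names, and the example scores are all hypothetical, and the population standard deviation is used for simplicity (implementations differ on this detail).

```python
from statistics import mean, pstdev


def combined_reward(grc, qir, w_grc=0.5, w_qir=0.5):
    """Hypothetical combination of the two reward signals.

    The weighted sum and the default weights are assumptions made for
    illustration; the model card does not say how GRC and QIR are combined.
    """
    return w_grc * grc + w_qir * qir


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled captions.

    Each caption's advantage is its reward minus the group mean, divided by
    the group standard deviation (plus eps to avoid division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up (GRC, QIR) scores for four captions sampled for the same image.
scores = [(0.9, 0.8), (0.4, 0.5), (0.7, 0.6), (0.2, 0.3)]
rewards = [combined_reward(g, q) for g, q in scores]
advantages = group_relative_advantages(rewards)

for (g, q), a in zip(scores, advantages):
    print(f"GRC={g:.1f} QIR={q:.1f} advantage={a:+.3f}")
```

Captions that score above the group average receive positive advantages and are reinforced, while those below the average are discouraged. Because the normalization is relative to the group, no separate value model or human-annotated reference caption is needed for this step.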
@inproceedings{jia2026cross,
title = {Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning},
author = {Jia, Haonan and Dong, Shichao and Dong, Xin and Sun, Zenghui and Wang, Jin and Lan, Jinsong and Zhu, Xiaoyong and Zheng, Bo and Zhang, Kaifu},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages = {766--777},
year = {2026}
}
Base model: OpenGVLab/InternVL2-8B