CIM-InternVL2-8B

This model is fine-tuned from InternVL2-8B using Group Relative Policy Optimization (GRPO) with the Cross-modal Identity Mapping (CIM) reward, as described in our CVPR 2026 paper.

Overview

CIM is a reinforcement learning framework that improves image captioning by minimizing information loss during modality conversion. It uses two reward signals, Gallery Representation Consistency (GRC) and Query-gallery Image Relevance (QIR), to encourage large vision-language models (LVLMs) to generate fine-grained, precise captions without extra annotations.
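As a rough illustration of how two reward signals could feed a GRPO-style update, the sketch below combines a GRC score and a QIR score per sampled caption and then normalizes the rewards within the group. This is not the paper's implementation: the function names, the equal weights, the weighted-sum combination, and the example scores are all hypothetical, and how GRC and QIR are actually computed is not covered here. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the general GRPO recipe; the population standard deviation is used here as a simplifying assumption.

```python
from statistics import mean, pstdev


def combined_reward(grc, qir, w_grc=0.5, w_qir=0.5):
    """Hypothetical combination: a weighted sum of the two reward signals.

    The weights are illustrative assumptions, not values from the paper.
    """
    return w_grc * grc + w_qir * qir


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group.

    Each caption's advantage is its reward minus the group mean, divided by
    the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up (GRC, QIR) scores for three captions sampled for the same image.
scores = [(0.8, 0.6), (0.4, 0.9), (0.2, 0.3)]
rewards = [combined_reward(grc, qir) for grc, qir in scores]
advantages = group_relative_advantages(rewards)
```

Captions that score above the group average receive a positive advantage and are reinforced; those below it receive a negative one. If every caption in a group earns the same reward, all advantages are zero and the group contributes no learning signal.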

Citation

@inproceedings{jia2026cross,
    title     = {Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning},
    author    = {Jia, Haonan and Dong, Shichao and Dong, Xin and Sun, Zenghui and Wang, Jin and Lan, Jinsong and Zhu, Xiaoyong and Zheng, Bo and Zhang, Kaifu},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages     = {766--777},
    year      = {2026}
}