CIM-InternVL3-8B

This model is fine-tuned from InternVL3-8B-Instruct using GRPO with the Cross-modal Identity Mapping (CIM) reward, as described in our CVPR 2026 paper.

Overview

CIM is a reinforcement learning framework that improves image captioning by minimizing the information lost when an image is converted into text. It uses two reward signals, Gallery Representation Consistency (GRC) and Query-gallery Image Relevance (QIR), to encourage large vision-language models (LVLMs) to generate fine-grained and precise captions without extra annotations.
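
To make the training signal concrete, the sketch below shows how two per-caption reward scores could be combined and turned into group-relative advantages in the style of GRPO. It is an illustration under stated assumptions, not the paper's implementation: the weighted sum in `combined_reward`, the weights `w_grc` and `w_qir`, the use of the population standard deviation, and the numeric scores are all hypothetical placeholders. How GRC and QIR are actually computed is defined in the paper and is not reproduced here.

```python
from statistics import mean, pstdev


def combined_reward(grc, qir, w_grc=1.0, w_qir=1.0):
    """Combine the two reward scores for one caption.

    Assumption: a weighted sum with hypothetical weights; the paper may
    combine GRC and QIR differently.
    """
    return w_grc * grc + w_qir * qir


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of captions sampled for the same image.

    Each caption's advantage is its reward minus the group mean, divided by
    the group standard deviation (population std is an assumption here).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Placeholder scores for four captions sampled for a single image.
grc_scores = [0.2, 0.5, 0.9, 0.4]
qir_scores = [0.1, 0.6, 0.8, 0.5]

rewards = [combined_reward(g, q) for g, q in zip(grc_scores, qir_scores)]
advantages = group_relative_advantages(rewards)

for caption_idx, adv in enumerate(advantages):
    print(f"caption {caption_idx}: advantage = {adv:+.3f}")
```

Captions that score above the group average receive a positive advantage and are reinforced, while those below the average are discouraged. Because the baseline is the group mean, no separate value model is needed.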

Citation

@inproceedings{jia2026cross,
    title     = {Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning},
    author    = {Jia, Haonan and Dong, Shichao and Dong, Xin and Sun, Zenghui and Wang, Jin and Lan, Jinsong and Zhu, Xiaoyong and Zheng, Bo and Zhang, Kaifu},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages     = {766--777},
    year      = {2026}
}

Model size: 8B params. Tensor type: BF16. Weights format: Safetensors.