👀 PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking

PixelEyes enhances active visual search in MLLMs by delegating fine-grained localization to a specialized perception tool, thereby achieving efficient and accurate multi-turn visual reasoning.
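
The decoupling described above can be pictured as a loop in which a reasoner decides what to look for and a separate perception tool decides where it is. The following is a minimal sketch of that idea under stated assumptions, not the PixelEyes implementation: `visual_search`, `Step`, `crop`, `toy_reason`, and `toy_localize` are hypothetical names, the image is a toy nested list, and both the reasoner and the perception tool are stubs standing in for real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive
Image = List[List[int]]          # toy grayscale image as rows of pixel values


@dataclass
class Step:
    """One reasoner decision: either query the perception tool or answer."""
    query: Optional[str] = None   # referring expression to localize
    answer: Optional[str] = None  # final answer, ends the search


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def visual_search(
    image: Image,
    question: str,
    reason: Callable[[str, List[Image]], Step],
    localize: Callable[[Image, str], Box],
    max_turns: int = 4,
) -> Optional[str]:
    """Alternate reasoning and localization until the reasoner answers."""
    evidence: List[Image] = []
    for _ in range(max_turns):
        step = reason(question, evidence)
        if step.answer is not None:
            return step.answer
        # The reasoner only names what to look for; the tool finds where it is.
        box = localize(image, step.query)
        evidence.append(crop(image, box))
    return None  # turn budget exhausted without an answer


# Toy stand-ins (hypothetical) so the loop can be exercised without a model.
def toy_reason(question: str, evidence: List[Image]) -> Step:
    if not evidence:
        return Step(query="the bright patch")
    brightest = max(max(row) for row in evidence[-1])
    return Step(answer=f"brightest pixel: {brightest}")


def toy_localize(image: Image, query: str) -> Box:
    # Pretend the tool always finds a fixed 2x2 region for this query.
    return (1, 1, 3, 3)


image = [
    [0, 0, 0, 0],
    [0, 7, 9, 0],
    [0, 5, 8, 0],
    [0, 0, 0, 0],
]
result = visual_search(image, "How bright is the patch?", toy_reason, toy_localize)
print(result)
```

In this sketch the reasoner never handles coordinates: it emits a textual query, receives a cropped region back as evidence, and either asks again or answers. Swapping the stubs for an MLLM and a referring segmentation model would keep the same control flow.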

This repository contains the model checkpoint for PixelEyes, presented in the paper PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking.

Citation

If you find this work helpful in your research, please cite our paper:

@misc{gong2026pixeleyesdecouplingperceptionreasoning,
      title={PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking}, 
      author={Dengxian Gong and Yuanzheng Wu and Haobo Yuan and Zhengdong Hu and Tao Zhang and Yikang Zhou and Shihao Chen and Quanzhu Niu and Kai Wang and Jason Li and Haochen Wang and Lu Qi and Shunping Ji and Ming-Hsuan Yang},
      year={2026},
      eprint={2607.00115},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2607.00115}, 
}

Model size: 9B params (Safetensors). Tensor type: BF16.
