arxiv:2602.02185

Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Published on Feb 2 · Submitted by Yu Zeng on Feb 3
Abstract

The Vision-DeepResearch benchmark addresses limitations in evaluating the visual and textual search capabilities of multimodal models by introducing realistic evaluation conditions and improving visual retrieval through a multi-round cropped-search workflow.

AI-generated summary

Multimodal Large Language Models (MLLMs) have advanced visual question answering (VQA) and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities remains difficult, and existing benchmarks have two major limitations. First, they are not visual search-centric: answers that should require visual search are often leaked through textual cues in the question or can be inferred from the prior world knowledge of current MLLMs. Second, they adopt an overly idealized evaluation scenario: on the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench), comprising 2,000 VQA instances. All questions are created via a careful, multi-stage curation pipeline with rigorous expert review, and are designed to assess the behavior of Vision-DeepResearch systems under realistic, real-world conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi-round cropped-search workflow, which is shown to effectively improve model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep-research systems. The code will be released at https://github.com/Osilly/Vision-DeepResearch.
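The multi-round cropped-search workflow is only described at a high level in the abstract. Below is a minimal sketch of how such a loop could be structured, assuming the input is a PIL image and using hypothetical helper functions (`propose_crops`, `visual_search`, `try_answer`) as stand-ins for the MLLM calls and the visual search engine; the actual released code may organize this differently.

```python
# Minimal sketch of a multi-round cropped-search loop (an assumption about the
# workflow's structure, not the paper's released implementation). The helpers
# below are hypothetical placeholders for the MLLM region-proposal call, the
# image-search engine, and the MLLM answering step; `image` is assumed to be a
# PIL.Image.Image.

def propose_crops(image, question, evidence):
    """Hypothetical: ask the MLLM for (left, top, right, bottom) boxes worth searching."""
    raise NotImplementedError("wire this to your MLLM of choice")

def visual_search(region):
    """Hypothetical: query a reverse-image-search engine with a cropped region."""
    raise NotImplementedError("wire this to your image-search backend")

def try_answer(image, question, evidence):
    """Hypothetical: ask the MLLM for an answer given retrieved evidence, or None if unsure."""
    raise NotImplementedError("wire this to your MLLM of choice")

def multi_round_cropped_search(image, question, max_rounds=3):
    """Crop, search, and accumulate evidence over several rounds before answering."""
    evidence = []
    for _ in range(max_rounds):
        # Each round, the model picks sub-regions likely to contain searchable entities.
        for box in propose_crops(image, question, evidence):
            region = image.crop(box)  # PIL: box = (left, top, right, bottom)
            evidence.extend(visual_search(region))
        answer = try_answer(image, question, evidence)
        if answer is not None:
            return answer
    # Fall back to answering with whatever evidence was gathered.
    return try_answer(image, question, evidence)
```

The point this illustrates is that searching with model-selected crops, rather than the full image, avoids relying on the near-exact full-image matches that the abstract identifies as an overly idealized setting.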

Community

Paper author and submitter:

We introduce the Vision-DeepResearch Benchmark (VDR-Bench) to address two key limitations of existing multimodal deep-research benchmarks: (1) they are not visual-search-centric, allowing many instances to be solved without genuine visual retrieval; and (2) they rely on overly idealized retrieval settings that fail to reflect noisy, real-world search engines. To this end, VDR-Bench comprises 2,000 instances curated with full human involvement and rigorous solvability verification, complementing existing benchmarks by enforcing realistic visual search and evidence-grounded reasoning.

