Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation
Abstract
Mind-Brush presents a unified agentic framework for text-to-image generation that dynamically retrieves multimodal evidence and employs reasoning tools to better capture implicit user intentions and perform complex knowledge reasoning.
While text-to-image generation has achieved unprecedented fidelity, the vast majority of existing models function fundamentally as static text-to-pixel decoders. Consequently, they often fail to grasp implicit user intentions. Although emerging unified understanding-generation models have improved intent comprehension, they still struggle to accomplish tasks involving complex knowledge reasoning within a single model. Moreover, constrained by static internal priors, these models remain unable to adapt to the evolving dynamics of the real world. To bridge these gaps, we introduce Mind-Brush, a unified agentic framework that transforms generation into a dynamic, knowledge-driven workflow. Simulating a human-like 'think-research-create' paradigm, Mind-Brush actively retrieves multimodal evidence to ground out-of-distribution concepts and employs reasoning tools to resolve implicit visual constraints. To rigorously evaluate these capabilities, we propose Mind-Bench, a comprehensive benchmark comprising 500 distinct samples spanning real-time news, emerging concepts, and domains such as mathematical reasoning and Geo-Reasoning. Extensive experiments demonstrate that Mind-Brush significantly enhances the capabilities of unified models, realizing a zero-to-one capability leap for the Qwen-Image baseline on Mind-Bench, while achieving superior results on established benchmarks like WISE and RISE.
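The abstract describes the workflow only at a high level. Below is a minimal, hypothetical sketch of such a 'think-research-create' loop; every interface name (analyze_intent, search, resolve, compose_prompt, generate) and the Evidence container are illustrative assumptions, not Mind-Brush's actual API.

```python
# Minimal sketch of a "think-research-create" agentic generation loop.
# All interfaces below are hypothetical stand-ins, not Mind-Brush's real API.
from dataclasses import dataclass, field


@dataclass
class Evidence:
    passages: list[str] = field(default_factory=list)  # retrieved text snippets
    images: list[str] = field(default_factory=list)    # retrieved reference image URLs


def think_research_create(prompt: str, llm, searcher, reasoner, t2i):
    # 1. Think: turn the raw prompt into an explicit statement of user intent.
    intent = llm.analyze_intent(prompt)

    # 2. Research: fetch multimodal evidence for concepts outside the
    #    generator's static priors (e.g., breaking news, emerging entities).
    evidence = Evidence()
    if intent.get("needs_external_knowledge", False):
        evidence = searcher.search(intent["queries"])

    # 3. Reason: call external tools to resolve implicit visual constraints
    #    (counts, geometry, geography) before any pixels are produced.
    constraints = reasoner.resolve(intent, evidence)

    # 4. Create: compile intent, evidence, and constraints into a grounded
    #    prompt and hand it to the underlying text-to-image model.
    grounded_prompt = llm.compose_prompt(prompt, evidence, constraints)
    return t2i.generate(grounded_prompt, reference_images=evidence.images)
```

The point of the sketch is the ordering: retrieval and reasoning happen before generation, so the image model receives an already grounded prompt instead of having to resolve intent and fresh knowledge from its static priors.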
Community
- Mind-Brush Framework: A novel agentic paradigm that unifies Intent Analysis, Multi-modal Search, and Knowledge Reasoning into a seamless "Think-Research-Create" workflow for image generation.
- Mind-Bench: A specialized benchmark designed to evaluate generative models on dynamic external knowledge and complex logical deduction, exposing the reasoning gaps in current SOTA multimodal models (a purely illustrative item sketch follows this list).
- Superior Performance: Mind-Brush delivers a zero-to-one capability leap for the Qwen-Image baseline on Mind-Bench and achieves superior results on established benchmarks such as WISE and RISE.
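Mind-Bench's item format is not detailed on this page; the sketch below is a purely illustrative guess at how a benchmark sample might be structured. Only the domain categories are taken from the description above (and may not be exhaustive); every field name and the example content are invented for illustration.

```python
# Hypothetical Mind-Bench item layout; field names and example content are
# illustrative guesses, only the domain categories come from the paper text.
from dataclasses import dataclass


@dataclass
class MindBenchItem:
    prompt: str               # generation request needing external knowledge or reasoning
    domain: str               # e.g. "real-time news", "emerging concepts",
                              #      "mathematical reasoning", "geo-reasoning"
    requires_search: bool     # whether fresh multimodal evidence must be retrieved
    requires_reasoning: bool  # whether implicit visual constraints must be deduced


# Invented example in the spirit of the "real-time news" category:
example = MindBenchItem(
    prompt="Draw the podium of the most recent Formula 1 race with correct team colors.",
    domain="real-time news",
    requires_search=True,
    requires_reasoning=False,
)
```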
Librarian Bot: the following papers, recommended by the Semantic Scholar API, are similar to this paper.
- Unified Thinker: A General Reasoning Modular Core for Image Generation (2026)
- Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs (2026)
- ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning (2025)
- Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts (2026)
- Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models (2026)
- TIR-Flow: Active Video Search and Reasoning with Frozen VLMs (2026)
- SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning (2025)
