From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models
Abstract
Diagnostic-driven Progressive Evolution enables continuous improvement of large multimodal models through iterative diagnosis and targeted data generation guided by identified weaknesses.
As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision-making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test-driven error exposure and feedback-based correction outperform repetitive practice, we propose Diagnostic-driven Progressive Evolution (DPE), a spiral loop in which diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality-control massive unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes failures to specific weaknesses, dynamically adjusts the data mixture, and guides agents to generate weakness-focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, indicating that DPE is a scalable paradigm for continual LMM training under open task distributions. Our code, models, and data are publicly available at https://github.com/hongruijia/DPE.
Community
DPE (Diagnostic-driven Progressive Evolution) is a self-evolving training framework for Large Multimodal Models (LMMs). Inspired by the "diagnose-and-correct" mechanism in educational psychology, DPE moves beyond indiscriminate data expansion. It prioritizes the diagnosis of capability gaps to steer targeted data generation and mixture optimization, effectively breaking the multimodal long-tail bottleneck.
🌟 Key Features
- Adaptive Diagnosis Mechanism: Before each evolution cycle, a diagnostic agent analyzes the model's failure patterns to identify specific weaknesses and capability blind spots. This insight dynamically optimizes the training data mixture.
- Tool-Use Data Evolution: Instead of relying on static datasets or simple text rewriting, DPE employs a multi-agent system equipped with image search and editing tools to source and annotate diverse visual content from external pools.
- High Efficiency: Broad improvements in multimodal reasoning can be achieved with only ~1,000 targeted training examples.
- Enhanced Stability: The closed loop of diagnosis, generation, and reinforcement significantly improves training stability and mitigates capability regression on long-tail tasks such as mathematics and OCR.
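The cycle described above can be sketched in a few lines. This is a minimal, hypothetical illustration of one DPE evolution round, assuming a capability-tagged eval set and pluggable generation/training routines; all function names here (`diagnose`, `reweight_mixture`, `dpe_cycle`) are illustrative stand-ins, not the authors' actual API.

```python
from collections import Counter

# Hypothetical sketch of the DPE spiral loop:
# diagnose -> reweight mixture -> generate targeted data -> reinforce.

def diagnose(model, eval_set):
    """Attribute each failed sample to its capability tag (e.g. 'math', 'ocr')."""
    failures = Counter()
    for sample in eval_set:
        if model(sample["input"]) != sample["answer"]:
            failures[sample["capability"]] += 1
    return failures

def reweight_mixture(failures, capabilities, floor=0.5):
    """Blend a uniform floor with failure-proportional weights.

    The floor keeps every capability represented so strengths do not
    regress while weaknesses receive extra training mass.
    """
    total = sum(failures.values()) or 1
    return {
        c: floor / len(capabilities) + (1 - floor) * failures.get(c, 0) / total
        for c in capabilities
    }

def dpe_cycle(model, eval_set, capabilities, generate, train, rounds=3):
    """Run several diagnose->generate->reinforce iterations."""
    for _ in range(rounds):
        failures = diagnose(model, eval_set)          # 1. find blind spots
        mixture = reweight_mixture(failures, capabilities)  # 2. adjust data mixture
        model = train(model, generate(mixture))       # 3-4. generate and reinforce
    return model
```

In this sketch, `generate` would stand in for the paper's multi-agent, tool-using data pipeline and `train` for the RL update; the only concrete logic shown is the failure attribution and mixture reweighting.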
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation (2026)
- DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning (2026)
- Dr. Zero: Self-Evolving Search Agents without Training Data (2026)
- MEDVISTAGYM: A Scalable Training Environment for Thinking with Medical Images via Tool-Integrated Reinforcement Learning (2026)
- Glance-or-Gaze: Incentivizing LMMs to Adaptively Focus Search via Reinforcement Learning (2026)
- Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization (2026)
- iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models (2026)
Models citing this paper: 6
Datasets citing this paper: 0
Spaces citing this paper: 0