Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation
Abstract
ViGOS is a visually grounded on-policy self-distillation framework for multimodal large language models that improves image-grounded behavior by using specialized teachers for different stages of reasoning and handling invalid rollouts.
On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.
Community
ViGOS: Seeing Before Reasoning for Shortcut-Resilient Multimodal OPSD
MLLMs can reason impressively, but do they really look before they reason?
Vanilla multimodal on-policy self-distillation may let the privileged reference answer leak into dense token supervision, pushing the model toward answer-compatible rationales before visual evidence is grounded.
ViGOS fixes this with a simple but powerful idea: see first, reason second.
The student first writes an explicit visual description, supervised by an image-only perception teacher. Then, only after this visual prefix is in place, a privileged reasoning teacher guides the reasoning and final answer. A reference teacher is used only as a fallback for malformed rollouts, and all teachers are removed at inference time.
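The routing described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `<description>` tags, the validity rule, and the names `Routing` and `route_teachers` are all assumptions introduced here, since the post does not specify the rollout format. It only shows which teacher would supervise each token of a student rollout.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical delimiters for the visual description; the real format may differ.
DESC_OPEN, DESC_CLOSE = "<description>", "</description>"


@dataclass
class Routing:
    valid: bool          # whether the rollout followed the assumed format
    teachers: List[str]  # one teacher label per rollout token


def route_teachers(tokens: List[str]) -> Routing:
    """Assign a teacher label to every token of one student rollout."""
    opens = [i for i, t in enumerate(tokens) if t == DESC_OPEN]
    closes = [i for i, t in enumerate(tokens) if t == DESC_CLOSE]

    # Assumed validity rule: exactly one description block, placed at the
    # start, and followed by at least one reasoning/answer token.
    valid = (
        len(opens) == 1
        and len(closes) == 1
        and opens[0] == 0
        and opens[0] < closes[0] < len(tokens) - 1
    )

    if not valid:
        # Malformed rollout: fall back to the reference teacher everywhere.
        return Routing(False, ["reference"] * len(tokens))

    # Valid rollout: the image-only perception teacher covers the description,
    # the privileged reasoning teacher covers everything after it.
    cut = closes[0] + 1
    labels = ["perception"] * cut + ["reasoning"] * (len(tokens) - cut)
    return Routing(True, labels)


rollout = [DESC_OPEN, "a", "red", "cube", DESC_CLOSE, "so", "answer", "A"]
print(route_teachers(rollout).teachers)
```

In a training loop, these labels would select which teacher's token distribution serves as the dense target at each position. Because the reasoning teacher scores tokens on the same student prefix, the description it conditions on is the one the student actually wrote.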
Results: ViGOS keeps the main OPSD gains on multimodal reasoning benchmarks while improving image-grounded behavior in shortcut-prone settings. On Qwen2.5-VL backbones, ViGOS reaches 71.97 mean Pass@5 on 3B and 75.60 on 7B, and achieves the best ViLP prior-conflict scores across all tested settings, helping models trust the image when priors are wrong.
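For readers unfamiliar with the metric, here is a sketch of how a mean Pass@5 score could be computed. It assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples with c correct; the post does not state the exact evaluation protocol, so the function names and the sample counts below are illustrative only.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct,
    given that c of the n samples are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative counts only, not results from the paper.
toy = [(10, 1), (10, 0), (5, 2)]
print(round(100 * mean_pass_at_k(toy, k=5), 2))
```

When exactly five samples are drawn per problem, the estimator reduces to a simple check of whether any of the five is correct.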
One-line pitch:
ViGOS teaches MLLMs to ground visual evidence before reasoning, reducing shortcuts without sacrificing strong answer guidance.
Links
- Project Page: https://oedosoldier.github.io/ViGOS/
- Paper: https://arxiv.org/abs/2606.19120
- Code: https://github.com/OedoSoldier/ViGOS
- ViGOS-3B: https://huggingface.co/OedoSoldier/ViGOS-3B
- ViGOS-7B: https://huggingface.co/OedoSoldier/ViGOS-7B