arxiv:2606.19120

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Published on Jun 17 · Submitted by OedoSoldier on Jun 18

Abstract

ViGOS is a visually grounded on-policy self-distillation framework for multimodal large language models that improves image-grounded behavior by using specialized teachers for different stages of reasoning and handling invalid rollouts.

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.
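The teacher routing described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tag names (`<description>`, `<reasoning>`, `<answer>`), the validity check, and the function name `route_teachers` are all hypothetical, since this page does not specify the rollout format.

```python
import re
from typing import List, Tuple

# Hypothetical rollout format (an assumption; the page does not give the real tags):
# <description>...</description><reasoning>...</reasoning><answer>...</answer>
ROLLOUT_PATTERN = re.compile(
    r"^<description>(?P<description>.+?)</description>\s*"
    r"<reasoning>(?P<reasoning>.+?)</reasoning>\s*"
    r"<answer>(?P<answer>.+?)</answer>\s*$",
    re.DOTALL,
)


def route_teachers(rollout: str) -> List[Tuple[str, str]]:
    """Return (teacher_name, text_span) pairs for one student rollout.

    Valid rollouts: the description goes to the image-only perception teacher,
    the reasoning and final answer go to the privileged reasoning teacher.
    Invalid rollouts: the whole rollout goes to the reference teacher.
    """
    match = ROLLOUT_PATTERN.match(rollout.strip())
    if match is None:
        # Malformed output: fall back to the reference teacher for format recovery.
        return [("reference", rollout)]
    return [
        ("perception", match.group("description")),
        ("reasoning", match.group("reasoning")),
        ("reasoning", match.group("answer")),
    ]


valid = (
    "<description>A red cube on a table.</description>"
    "<reasoning>Only one cube is visible.</reasoning>"
    "<answer>1</answer>"
)
for teacher, span in route_teachers(valid):
    print(f"{teacher}: {span}")
```

In this sketch, the reference teacher never sees a well-formed rollout, which mirrors the abstract's statement that it is used only to recover the output format.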

Community

Paper author Paper submitter

🚀 ViGOS: Seeing Before Reasoning for Shortcut-Resilient Multimodal OPSD

MLLMs can reason impressively, but do they really look before they reason? 👀
Vanilla multimodal on-policy self-distillation may let the privileged reference answer leak into dense token supervision, pushing the model toward answer-compatible rationales before visual evidence is grounded.

ViGOS fixes this with a simple but powerful idea: see first, reason second. ✨
The student first writes an explicit visual description, supervised by an image-only perception teacher. Then, only after this visual prefix is in place, a privileged reasoning teacher guides the reasoning and final answer. A reference teacher is used only as a fallback for malformed rollouts, and all teachers are removed at inference time.
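To make "dense token supervision" concrete, here is a rough sketch of a per-token distillation loss in which each position is scored against the teacher assigned to its segment. The divergence direction (KL from teacher to student), the mean reduction, and the names `kl_divergence` and `routed_distillation_loss` are assumptions for illustration; the actual objective is not given on this page.

```python
import math
from typing import Dict, List, Sequence


def kl_divergence(teacher: Sequence[float], student: Sequence[float]) -> float:
    """KL(teacher || student) over one shared vocabulary (assumed direction)."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0.0)


def routed_distillation_loss(
    student_probs: List[Sequence[float]],
    teacher_probs: Dict[str, List[Sequence[float]]],
    segments: List[str],
) -> float:
    """Mean per-token KL, where segments[i] names the teacher for token i.

    teacher_probs maps a teacher name (e.g. "perception" or "reasoning") to that
    teacher's next-token distributions on the same student prefix.
    """
    per_token = [
        kl_divergence(teacher_probs[name][i], student_probs[i])
        for i, name in enumerate(segments)
    ]
    return sum(per_token) / len(per_token)


# Toy example with a two-word vocabulary and two tokens.
student = [[0.5, 0.5], [0.5, 0.5]]
teachers = {
    "perception": [[1.0, 0.0], [0.5, 0.5]],
    "reasoning": [[0.5, 0.5], [0.5, 0.5]],
}
loss = routed_distillation_loss(student, teachers, ["perception", "reasoning"])
print(round(loss, 4))
```

Only the first token contributes here, because the reasoning teacher agrees with the student on the second token.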

📈 Results: ViGOS keeps the main OPSD gains on multimodal reasoning benchmarks while improving image-grounded behavior in shortcut-prone settings. On Qwen2.5-VL backbones, ViGOS reaches 71.97 mean Pass@5 on 3B and 75.60 on 7B, and achieves the best ViLP prior-conflict scores across all tested settings, helping models trust the image when priors are wrong. 🔥
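For readers unfamiliar with Pass@k, the snippet below shows one commonly used unbiased estimator: given n sampled answers of which c are correct, it estimates the probability that at least one of k samples is correct. Whether the paper computes Pass@5 this way is an assumption, and `pass_at_k` is an illustrative name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k as 1 - C(n - c, k) / C(n, k).

    n: number of sampled answers, c: number of correct ones, k: sample budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k answers are wrong, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=1, k=5))  # 0.5
```

A mean Pass@5 would then be the average of this quantity over problems and, presumably, over benchmarks.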

One-line pitch:
🧠➡️👁️ ViGOS teaches MLLMs to ground visual evidence before reasoning, reducing shortcuts without sacrificing strong answer guidance.


