arxiv:2606.19120

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Published on Jun 17

Submitted by OedoSoldier on Jun 18


Authors:

Sihan Wang ,

Abstract

ViGOS is a visually grounded on-policy self-distillation framework for multimodal large language models. It improves image-grounded behavior by assigning separate teachers to the visual-description stage and the reasoning stage, and by using a fallback teacher for invalid rollouts.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.
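The teacher assignment described in the abstract can be sketched as a small routing function. This is a minimal illustration under stated assumptions, not the authors' implementation: the tag names (`<description>`, `<reasoning>`, `<answer>`), the `Span` type, and the `route_rollout` helper are all hypothetical, and the sketch works on character offsets, whereas a real training loop would map these spans to token positions.

```python
import re
from dataclasses import dataclass

# Hypothetical output format: these tag names are assumptions for illustration,
# not the format used in the paper.
ROLLOUT_PATTERN = re.compile(
    r"\A\s*<description>(?P<description>.+?)</description>\s*"
    r"<reasoning>(?P<reasoning>.+?)</reasoning>\s*"
    r"<answer>(?P<answer>.+?)</answer>\s*\Z",
    re.DOTALL,
)
DESCRIPTION_CLOSE = "</description>"


@dataclass(frozen=True)
class Span:
    """A half-open character range [start, end) supervised by one teacher."""
    start: int
    end: int
    teacher: str


def route_rollout(rollout: str) -> list[Span]:
    """Assign each part of a student rollout to the teacher that supervises it."""
    match = ROLLOUT_PATTERN.match(rollout)
    if match is None:
        # Invalid rollout: the reference teacher supervises the whole sequence
        # so the student can recover the output format.
        return [Span(0, len(rollout), "reference")]
    # The description segment ends right after its closing tag.
    boundary = match.end("description") + len(DESCRIPTION_CLOSE)
    return [
        # Image-only perception teacher: supervises the visual description.
        Span(0, boundary, "perception"),
        # Privileged reasoning teacher: supervises reasoning and final answer,
        # continuing from the same student-written prefix.
        Span(boundary, len(rollout), "reasoning"),
    ]


if __name__ == "__main__":
    valid = (
        "<description>A red cube on a table.</description>"
        "<reasoning>There is exactly one cube.</reasoning>"
        "<answer>1</answer>"
    )
    for span in route_rollout(valid):
        print(span.teacher, repr(valid[span.start:span.end]))
    print(route_rollout("The answer is 1"))
```

Keeping the routing separate from the loss makes the key property easy to check: a valid rollout never sends description text to a teacher that sees the reference target.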


Community


OedoSoldier (paper author, paper submitter), about 18 hours ago

🚀 ViGOS: Seeing Before Reasoning for Shortcut-Resilient Multimodal OPSD

MLLMs can reason impressively — but do they really look before they reason? 👀
Vanilla multimodal on-policy self-distillation may let the privileged reference answer leak into dense token supervision, pushing the model toward answer-compatible rationales before visual evidence is grounded.

ViGOS fixes this with a simple but powerful idea: see first, reason second. ✨
The student first writes an explicit visual description, supervised by an image-only perception teacher. Then, only after this visual prefix is in place, a privileged reasoning teacher guides the reasoning and final answer. A reference teacher is used only as a fallback for malformed rollouts — and all teachers are removed at inference time.
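One way to write down the see-first, reason-second supervision is as a per-segment distillation loss. The notation here is an assumption made for illustration and is not taken from the paper: $I$ is the image, $x$ the question, $r$ the reference target, $y = (d, z)$ a student rollout split into a description $d$ and a reasoning-plus-answer part $z$, $\pi_\theta$ the student, $\bar\pi$ a teacher distribution, and $D$ an unspecified token-level divergence (the source does not state which divergence or direction is used). The conditioning context of the fallback reference teacher, $c_{\mathrm{ref}}$, is also left abstract, and whether the perception teacher sees the question in addition to the image is not stated.

```latex
\mathcal{L}(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid I, x)} \left[ \ell(y) \right],
\qquad
\ell(y) =
\begin{cases}
\displaystyle
\sum_{t \in d} D\!\left( \bar\pi(\cdot \mid I, y_{<t}) \,\middle\|\, \pi_\theta(\cdot \mid I, x, y_{<t}) \right)
+ \sum_{t \in z} D\!\left( \bar\pi(\cdot \mid I, x, r, y_{<t}) \,\middle\|\, \pi_\theta(\cdot \mid I, x, y_{<t}) \right)
& \text{if } y \text{ is valid}, \\[2ex]
\displaystyle
\sum_{t} D\!\left( \bar\pi(\cdot \mid c_{\mathrm{ref}}, y_{<t}) \,\middle\|\, \pi_\theta(\cdot \mid I, x, y_{<t}) \right)
& \text{otherwise}.
\end{cases}
```

The split is visible in the first sum: $r$ does not appear in the perception teacher's context, so the description tokens cannot be steered by the reference target. At inference time only $\pi_\theta(\cdot \mid I, x, y_{<t})$ is used.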

📈 Results: ViGOS keeps the main OPSD gains on multimodal reasoning benchmarks while improving image-grounded behavior in shortcut-prone settings. On Qwen2.5-VL backbones, ViGOS reaches 71.97 mean Pass@5 on 3B and 75.60 on 7B, and achieves the best ViLP prior-conflict scores across all tested settings — helping models trust the image when priors are wrong. 🔥
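The post does not say how Pass@5 is computed. For readers unfamiliar with the metric, here is a minimal sketch of the commonly used unbiased pass@k estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct. Whether ViGOS uses this estimator or simply draws exactly five samples per problem is an assumption left open here; the function names and the example counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k (as a percentage) over (n, c) pairs, one pair per problem."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical per-problem counts (n samples, c correct), for illustration only.
    print(mean_pass_at_k([(5, 0), (5, 2), (10, 1)], k=5))
```

With exactly five samples per problem (`n == k == 5`), the estimator reduces to "1 if any sample is correct, else 0".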

One-line pitch:
🧠➡️👁️ ViGOS teaches MLLMs to ground visual evidence before reasoning — reducing shortcuts without sacrificing strong answer guidance.

🔗 Links

Project Page: https://oedosoldier.github.io/ViGOS/
Paper: https://arxiv.org/abs/2606.19120
Code: https://github.com/OedoSoldier/ViGOS
ViGOS-3B: https://huggingface.co/OedoSoldier/ViGOS-3B
ViGOS-7B: https://huggingface.co/OedoSoldier/ViGOS-7B


Get this paper in your agent:

hf papers read 2606.19120

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
