ViGOS-7B: Visual Grounding On-Policy Self-Distillation

Model Details

Field            Value
Model name       ViGOS-7B
Repository ID    OedoSoldier/ViGOS-7B
Model family     ViGOS
Model type       Multimodal image-text-to-text / vision-language reasoning model
Base model       Qwen/Qwen2.5-VL-7B-Instruct
Training method  Segment-wise multimodal on-policy self-distillation
Weight format    Merged full weights
Training data    LMMs-Lab-Turtle/Vision-SR1-47K
Output format    <description>...</description><think>...</think>\boxed{...}
Paper            Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation
Authors          Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han
Code             https://github.com/OedoSoldier/ViGOS
License          Apache License 2.0

This repository hosts only the 7B-scale ViGOS model. The 3B-scale model belongs in a separate Hugging Face repository with its own model card.

Model Summary

ViGOS stands for Visual Grounding On-Policy Self-Distillation. It is a multimodal post-training method for reducing shortcut behavior in on-policy self-distillation (OPSD) for vision-language models. In vanilla OPSD, the privileged teacher sees the reference answer while supervising the entire student rollout. For multimodal large language models (MLLMs), this can make the dense training signal overly answer-driven before the model has grounded its response in image evidence.

ViGOS changes the supervision path by asking the student to first produce a visual description, then reason, then answer:

<description> visual description </description>
<think> reasoning process </think>
\boxed{FINAL ANSWER}

For valid training rollouts, ViGOS uses segment-wise teachers:

  • an image-only perception teacher supervises the description tokens;
  • a privileged reasoning teacher supervises reasoning and final-answer tokens after the student-generated description prefix exists;
  • a reference fallback teacher is used only for invalid or malformed rollouts to recover the required output format.
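
The routing described in the list above can be sketched as follows. This is a minimal illustration, not the training code: the regular expression, the function name `assign_teachers`, and the teacher labels are hypothetical, and the paper's actual validity check for rollouts may be stricter or looser than this pattern.

```python
import re

# Hypothetical pattern for the documented layout:
# <description>...</description><think>...</think>\boxed{...}
ROLLOUT_RE = re.compile(
    r"\s*<description>(?P<description>.*?)</description>\s*"
    r"<think>(?P<think>.*?)</think>\s*"
    r"(?P<answer>\\boxed\{.*\})\s*",
    re.DOTALL,
)


def assign_teachers(rollout: str) -> dict[str, str]:
    """Map each segment of a rollout to the teacher role that supervises it."""
    if ROLLOUT_RE.fullmatch(rollout) is None:
        # Invalid or malformed rollout: the reference fallback teacher
        # supervises it to recover the required output format.
        return {"full_rollout": "reference_fallback"}
    return {
        "description": "perception",  # image-only perception teacher
        "think": "reasoning",         # privileged reasoning teacher
        "answer": "reasoning",        # final-answer tokens, same teacher
    }
```

A well-formed rollout is split across the perception and reasoning teachers, while anything that does not match the layout is handed to the fallback teacher as a whole.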

At inference time, all teachers, reference answers, and segment masks are removed. The model receives only the image, the question or instruction, and the output-format prompt.

Intended Use

This model is intended for research and development in multimodal reasoning tasks, including visual question answering, visual math and diagram reasoning, OCR- or chart-grounded reasoning, spatial reasoning, visual grounding, and shortcut/prior-sensitivity analysis.

Out-of-Scope Use

This model should not be used as the sole decision-maker in high-stakes settings such as medical diagnosis, legal judgment, financial decision-making, safety-critical robotics, surveillance, identity verification, or other contexts where hallucinated or incorrect visual reasoning could cause harm.

How to Use

Install the dependencies:

pip install git+https://github.com/huggingface/transformers accelerate
pip install "qwen-vl-utils[decord]"

Then load the model and run inference:

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "OedoSoldier/ViGOS-7B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image_path = "path/to/image.jpg"
question = "What is the answer to the visual question?"

prompt = f"""Problem: {question}

You are tasked with analyzing an image to generate a detailed description that can help you answer the question. First analyze the image and produce a self-contained description, detailed enough to lead to the correct answer. Do not include the final answer in the description. Wrap the entire description in <description> </description> tags.

Next, reason step by step based on the image description and the image, and enclose this part within <think> </think> tags.

Finally, provide a single word or phrase answer to the question in \\boxed{{}}.
The output format should be: <description> image description here </description><think> reasoning process here </think> \\boxed{{FINAL ANSWER here}}.
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }
]

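# Render the chat messages into the prompt string expected by the model.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
# Load the image (and any video) inputs referenced in the messages.
image_inputs, video_inputs = process_vision_info(messages)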
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Sampling settings match the evaluation decoding settings reported below.
with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=4096,
        do_sample=True,
        temperature=1.0,
        top_p=0.90,
        top_k=20,
    )

# Drop the prompt tokens so that only newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]

print(output_text)

Recommended Answer Extraction

For benchmark-style evaluation, the paper extracts the final answer from the last \boxed{...} span. Outputs without a parseable final answer are counted as incorrect.
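
One way to implement this extraction is sketched below. The paper's exact parser is not reproduced here, so the function name `extract_last_boxed` and the handling of nested or unbalanced braces are assumptions: the sketch returns the contents of the last `\boxed{...}` span whose braces balance, and `None` when no such span exists.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last complete \\boxed{...} span, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    while start != -1:
        depth = 1
        i = start + len(marker)
        # Walk forward until the brace opened by the marker is closed.
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth == 0:
            return text[start + len(marker): i - 1].strip()
        # Unbalanced braces: try the previous occurrence of the marker.
        start = text.rfind(marker, 0, start)
    return None
```

Counting brace depth rather than using a simple regular expression keeps answers such as `\frac{1}{2}` intact. A `None` result corresponds to an output without a parseable final answer, which is counted as incorrect.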

Training Details

The paper trains this model for one epoch on Vision-SR1-47K using 8 NVIDIA A100 GPUs. The student is trained on on-policy rollouts, and the frozen teacher roles are used only to score the student-generated prefixes during training.

Parameter                  Value
Training epochs            1
GPUs                       8 × A100
Effective batch size       32
Optimizer                  Fused AdamW
Learning rate              5e-6
LR scheduler               Linear
Maximum gradient norm      0.1
Precision                  bf16
Distributed training       ZeRO-2
Maximum prompt length      32,768
Maximum completion length  4,096
LoRA rank                  64
LoRA alpha                 128
LoRA dropout               0.05
LoRA target modules        q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Rollout temperature        1.1
Rollout top-p / top-k      0.95 / 20
λ_perc                     1.0
λ_rea                      1.0
λ_ref                      2.0
Distillation temperature   1.0
KL clipping                0.05
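
For readers who want to reuse the adapter settings, the LoRA rows of the table above can be collected as keyword arguments. This is only a sketch: the source does not say which library configured the adapter, so the key names are an assumption chosen to match the argument names of `peft.LoraConfig`, and the scaling factor assumes the standard LoRA parameterization.

```python
# Values copied from the training table; the dictionary layout is illustrative.
lora_kwargs = {
    "r": 64,                # LoRA rank
    "lora_alpha": 128,      # LoRA alpha
    "lora_dropout": 0.05,   # LoRA dropout
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# Under the standard LoRA parameterization (an assumption here), the
# low-rank update is scaled by lora_alpha / r.
lora_scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(lora_scaling)  # 2.0
```

Because the released weights are merged, these settings matter only for reproducing training, not for inference.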

Evaluation Protocol

For the eight main benchmarks, the paper samples five stochastic responses per example and reports Pass@5 / Avg@5. Pass@5 checks whether at least one of the five sampled answers is correct, while Avg@5 is the mean correctness across all five samples.
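
The two metrics can be computed from per-sample correctness flags as in the sketch below. The function name `pass_and_avg_at_k` and the input layout are hypothetical, and how each sample is judged correct (answer matching or grading) is left outside the sketch.

```python
def pass_and_avg_at_k(correct: list[list[bool]]) -> tuple[float, float]:
    """Return (Pass@k, Avg@k) in percent.

    correct[i][j] is True if sample j for example i was judged correct.
    """
    if not correct:
        raise ValueError("need at least one example")
    # Pass@k: fraction of examples with at least one correct sample.
    pass_at_k = sum(any(samples) for samples in correct) / len(correct)
    # Avg@k: per-example mean correctness, averaged over examples.
    avg_at_k = sum(sum(samples) / len(samples) for samples in correct) / len(correct)
    return 100.0 * pass_at_k, 100.0 * avg_at_k
```

For instance, with two examples where only one of the ten sampled answers is correct, Pass@5 is 50 percent (one example has a hit) while Avg@5 is 10 percent.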

For ViLP, the paper generates one response per prompt and reports Score / Prior. Score measures accuracy on visually diagnostic questions where the model must use the image, and Prior measures accuracy on prior-aligned questions where the common visual-language prior is correct.

Evaluation decoding settings:

Parameter                                      Value
Maximum generated tokens                       4,096
Number of samples per main benchmark question  5
Temperature                                    1.0
Top-p                                          0.90
Top-k                                          20
Random seed                                    42

Evaluation Results

Main Benchmarks

Pass@5 / Avg@5, in percent:

Benchmark                 ViGOS-7B
MM-Vet                    72.02 / 54.40
MMMU                      80.11 / 51.42
MMMU-Pro                  64.81 / 36.48
MathVerse                 68.91 / 44.77
MathVista                 80.90 / 58.78
MMSI                      61.10 / 25.58
RealWorldQA               85.88 / 62.88
CV-Bench                  91.09 / 73.58
Mean across 8 benchmarks  75.60 / 50.99
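
As a sanity check, the mean row can be reproduced from the per-benchmark scores. The sketch assumes the mean is an unweighted average over the eight benchmarks, which the card does not state explicitly; under that assumption the reported values are recovered.

```python
# (Pass@5, Avg@5) per benchmark, copied from the table above.
scores = {
    "MM-Vet": (72.02, 54.40),
    "MMMU": (80.11, 51.42),
    "MMMU-Pro": (64.81, 36.48),
    "MathVerse": (68.91, 44.77),
    "MathVista": (80.90, 58.78),
    "MMSI": (61.10, 25.58),
    "RealWorldQA": (85.88, 62.88),
    "CV-Bench": (91.09, 73.58),
}

# Unweighted (macro) average across benchmarks.
mean_pass = sum(p for p, _ in scores.values()) / len(scores)
mean_avg = sum(a for _, a in scores.values()) / len(scores)
print(f"{mean_pass:.2f} / {mean_avg:.2f}")  # 75.60 / 50.99
```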

Prior-Sensitive ViLP Results

Score / Prior, in percent:

Setting  ViGOS-7B
ViLP-F   62.67 / 97.00
ViLP-P   61.67 / 91.67

Ethical Considerations

Users should validate the model carefully before deployment. The model can generate plausible but incorrect visual descriptions and rationales. In user-facing applications, consider presenting only concise final answers, or clearly marking generated descriptions and rationales as model-generated rather than authoritative evidence.

Citation

Please cite the ViGOS paper if you use this model or method.

@misc{wang2026seeing,
  title={Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation},
  author={Wang, Sihan and Liu, Xiyao and Liu, Lianqing and Han, Zhi},
  year={2026},
  eprint={2606.19120},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2606.19120}
}