Instructions for using OedoSoldier/ViGOS-7B with libraries and local apps.

Libraries

How to use OedoSoldier/ViGOS-7B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="OedoSoldier/ViGOS-7B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("OedoSoldier/ViGOS-7B")
model = AutoModelForMultimodalLM.from_pretrained("OedoSoldier/ViGOS-7B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use OedoSoldier/ViGOS-7B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "OedoSoldier/ViGOS-7B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OedoSoldier/ViGOS-7B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/OedoSoldier/ViGOS-7B

SGLang

How to use OedoSoldier/ViGOS-7B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "OedoSoldier/ViGOS-7B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OedoSoldier/ViGOS-7B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "OedoSoldier/ViGOS-7B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OedoSoldier/ViGOS-7B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

ViGOS-7B: Visual Grounding On-Policy Self-Distillation

Model Details

Field	Value
Model name	`ViGOS-7B`
Repository ID	`OedoSoldier/ViGOS-7B`
Model family	ViGOS
Model type	Multimodal image-text-to-text / vision-language reasoning model
Base model	`Qwen/Qwen2.5-VL-7B-Instruct`
Training method	Segment-wise multimodal on-policy self-distillation
Weight format	Merged full weights
Training data	LMMs-Lab-Turtle/Vision-SR1-47K
Output format	`<description>...</description><think>...</think>\boxed{...}`
Paper	Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation
Authors	Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han
Code	https://github.com/OedoSoldier/ViGOS
License	Apache License 2.0

This repository is for the 7B-scale ViGOS model only. The 3B-scale model should use a separate Hugging Face repository and model card.

Model Summary

ViGOS stands for Visual Grounding On-Policy Self-Distillation. It is a multimodal post-training method for reducing shortcut behavior in on-policy self-distillation (OPSD) for vision-language models. In vanilla OPSD, a privileged teacher sees the reference answer while it supervises the entire student rollout. For multimodal large language models (MLLMs), this can make the dense training signal overly answer-driven before the model has grounded its response in image evidence.

ViGOS changes the supervision path by asking the student to first produce a visual description, then reason, then answer:

<description> visual description </description>
<think> reasoning process </think>
\boxed{FINAL ANSWER}

For valid training rollouts, ViGOS uses segment-wise teachers:

- an image-only perception teacher supervises the description tokens;
- a privileged reasoning teacher supervises reasoning and final-answer tokens after the student-generated description prefix exists;
- a reference fallback teacher is used only for invalid or malformed rollouts to recover the required output format.
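
The routing described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it operates on raw text instead of token masks, and `ROLLOUT_PATTERN`, `route_segments`, and the teacher labels are hypothetical names. Excluding the tags themselves from the supervised spans is also an assumption made here for simplicity.

```python
import re

# Hypothetical pattern for a valid rollout in the required output format.
ROLLOUT_PATTERN = re.compile(
    r"\s*<description>(?P<description>.*?)</description>\s*"
    r"<think>(?P<think>.*?)</think>\s*"
    r"(?P<answer>\\boxed\{.*\})\s*",
    re.DOTALL,
)


def route_segments(rollout: str) -> list[tuple[str, str]]:
    """Return (teacher_label, text) pairs for one student rollout."""
    match = ROLLOUT_PATTERN.fullmatch(rollout)
    if match is None:
        # Invalid or malformed rollout: everything goes to the fallback teacher.
        return [("reference_fallback", rollout)]
    return [
        ("perception", match["description"]),  # image-only teacher
        ("reasoning", match["think"]),         # privileged teacher
        ("reasoning", match["answer"]),        # privileged teacher
    ]


valid = (
    "<description>A red cube on a table.</description>"
    "<think>One cube is visible.</think>\\boxed{1}"
)
for teacher, text in route_segments(valid):
    print(f"{teacher}: {text}")
print(route_segments("The answer is 1"))
```

A real training loop would map these character spans onto token positions to build per-segment loss masks; that step depends on the tokenizer and is omitted here.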

At inference time, all teachers, reference answers, and segment masks are removed. The model receives only the image, the question or instruction, and the output-format prompt.

Intended Use

This model is intended for research and development in multimodal reasoning tasks, including visual question answering, visual math and diagram reasoning, OCR- or chart-grounded reasoning, spatial reasoning, visual grounding, and shortcut/prior-sensitivity analysis.

Out-of-Scope Use

This model should not be used as the sole decision-maker in high-stakes settings such as medical diagnosis, legal judgment, financial decision-making, safety-critical robotics, surveillance, identity verification, or other contexts where hallucinated or incorrect visual reasoning could cause harm.

How to Use

pip install git+https://github.com/huggingface/transformers accelerate
pip install "qwen-vl-utils[decord]"

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "OedoSoldier/ViGOS-7B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image_path = "path/to/image.jpg"
question = "What is the answer to the visual question?"

prompt = f"""Problem: {question}

You are tasked with analyzing an image to generate a detailed description that can help you answer the question. First analyze the image and produce a self-contained description, detailed enough to lead to the correct answer. Do not include the final answer in the description. Wrap the entire description in <description> </description> tags.

Next, reason step by step based on the image description and the image, and enclose this part within <think> </think> tags.

Finally, provide a single word or phrase answer to the question in \\boxed{{}}.
The output format should be: <description> image description here </description><think> reasoning process here </think> \\boxed{{FINAL ANSWER here}}.
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=4096,
        do_sample=True,
        temperature=1.0,
        top_p=0.90,
        top_k=20,
    )

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]

print(output_text)
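
Downstream code usually needs the description, the reasoning, and the final answer separately. The helper below is a sketch that assumes the model followed the output format shown earlier; `extract_boxed` and `parse_vigos_output` are hypothetical names and are not part of the ViGOS repository or of any library.

```python
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    content = []
    # Walk forward and track brace depth so nested braces (e.g. \frac{1}{2}) survive.
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(content).strip()
        content.append(ch)
    return None  # braces never closed


def parse_vigos_output(output_text: str) -> dict:
    """Split a response into description, reasoning, and final answer."""

    def tag(name: str) -> Optional[str]:
        found = re.search(rf"<{name}>(.*?)</{name}>", output_text, re.DOTALL)
        return found.group(1).strip() if found else None

    return {
        "description": tag("description"),
        "reasoning": tag("think"),
        "answer": extract_boxed(output_text),
    }


example = (
    "<description> A clock showing 3:00 </description>"
    "<think> The hour hand points at 3. </think> \\boxed{3}"
)
print(parse_vigos_output(example))
```

Any field that cannot be found comes back as `None`, which lets the caller detect malformed generations instead of silently scoring them.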

Training Details

The paper trains this model for one epoch on Vision-SR1-47K using 8 NVIDIA A100 GPUs. The student is trained on on-policy rollouts, and the frozen teacher roles are used only to score the student-generated prefixes during training.

Parameter	Value
Training epochs	1
GPUs	8 × A100
Effective batch size	32
Optimizer	Fused AdamW
Learning rate	5e-6
LR scheduler	Linear
Maximum gradient norm	0.1
Precision	bf16
Distributed training	ZeRO-2
Maximum prompt length	32,768
Maximum completion length	4,096
LoRA rank	64
LoRA alpha	128
LoRA dropout	0.05
LoRA target modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
Rollout temperature	1.1
Rollout top-p / top-k	0.95 / 20
λ_perc	1.0
λ_rea	1.0
λ_ref	2.0
Distillation temperature	1.0
KL clipping	0.05
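
For readers who want to reuse the adapter settings, the LoRA rows of the table can be written as a plain dictionary. The key names mirror the arguments of `peft.LoraConfig`, but whether the authors used PEFT is an assumption; the values are copied from the table. The two derived quantities use only standard definitions: the usual LoRA scaling factor alpha / rank, and the number of samples each GPU contributes per optimizer step.

```python
# Values copied from the hyperparameter table; key names are an assumption
# chosen to match peft.LoraConfig arguments.
lora_kwargs = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

num_gpus = 8
effective_batch_size = 32

# Standard LoRA scaling applied to the low-rank update: alpha / rank.
lora_scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]

# Samples per GPU per optimizer step. How this splits between per-device
# batch size and gradient accumulation is not stated in the table.
samples_per_gpu_per_step = effective_batch_size // num_gpus

print(lora_scaling, samples_per_gpu_per_step)
```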

Evaluation Protocol

For the eight main benchmarks, the paper samples five stochastic responses per example and reports Pass@5 / Avg@5. Pass@5 checks whether at least one of the five sampled answers is correct, while Avg@5 is the mean correctness across all five samples.
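
The two metrics can be computed from a per-question list of correctness flags. The sketch below follows the definitions in the paragraph above; the function names and the input layout (one list of booleans per question) are assumptions for illustration, not the paper's evaluation code.

```python
def pass_at_k(correct: list[list[bool]]) -> float:
    """Percentage of questions with at least one correct sample."""
    return 100.0 * sum(any(samples) for samples in correct) / len(correct)


def avg_at_k(correct: list[list[bool]]) -> float:
    """Mean per-question fraction of correct samples, in percent."""
    per_question = [sum(samples) / len(samples) for samples in correct]
    return 100.0 * sum(per_question) / len(per_question)


# Two questions, five sampled responses each.
flags = [
    [True, False, False, False, False],   # solved once out of five
    [False, False, False, False, False],  # never solved
]
print(f"Pass@5 = {pass_at_k(flags):.2f}, Avg@5 = {avg_at_k(flags):.2f}")
# Pass@5 = 50.00, Avg@5 = 10.00
```

Pass@5 rewards a question as soon as any sample is right, so it is always at least as high as Avg@5 on the same data.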

For ViLP, the paper generates one response per prompt and reports Score / Prior. Score measures accuracy on visually diagnostic questions where the model must use the image, and Prior measures accuracy on prior-aligned questions where the common visual-language prior is correct.

Evaluation decoding settings:

Parameter	Value
Maximum generated tokens	4,096
Number of samples per main benchmark question	5
Temperature	1.0
Top-p	0.90
Top-k	20
Random seed	42

Evaluation Results

Main Benchmarks

Pass@5 / Avg@5, in percent:

Benchmark	ViGOS-7B
MM-Vet	72.02 / 54.40
MMMU	80.11 / 51.42
MMMU-Pro	64.81 / 36.48
MathVerse	68.91 / 44.77
MathVista	80.90 / 58.78
MMSI	61.10 / 25.58
RealWorldQA	85.88 / 62.88
CV-Bench	91.09 / 73.58
Mean across 8 benchmarks	75.60 / 50.99
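
As a consistency check, the reported mean is the unweighted average of the eight rows above. The snippet recomputes it from the table values; the variable names are illustrative only.

```python
# (Pass@5, Avg@5) per benchmark, copied from the table above.
results = {
    "MM-Vet": (72.02, 54.40),
    "MMMU": (80.11, 51.42),
    "MMMU-Pro": (64.81, 36.48),
    "MathVerse": (68.91, 44.77),
    "MathVista": (80.90, 58.78),
    "MMSI": (61.10, 25.58),
    "RealWorldQA": (85.88, 62.88),
    "CV-Bench": (91.09, 73.58),
}

mean_pass = sum(p for p, _ in results.values()) / len(results)
mean_avg = sum(a for _, a in results.values()) / len(results)

print(f"{mean_pass:.2f} / {mean_avg:.2f}")  # 75.60 / 50.99
```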

Prior-Sensitive ViLP Results

Score / Prior, in percent:

Setting	ViGOS-7B
ViLP-F	62.67 / 97.00
ViLP-P	61.67 / 91.67

Ethical Considerations

Users should validate the model carefully before deployment. The model can generate plausible but incorrect visual descriptions and rationales. In user-facing applications, consider presenting only concise final answers, or clearly marking generated descriptions and rationales as model-generated rather than authoritative evidence.

Citation

Please cite the ViGOS paper if you use this model or method.

@misc{wang2026seeing,
  title={Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation},
  author={Wang, Sihan and Liu, Xiyao and Liu, Lianqing and Han, Zhi},
  year={2026},
  eprint={2606.19120},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2606.19120}
}
