SDG Detector — SFT Stage

arXiv

Stage-1 supervised fine-tuning of Qwen/Qwen3-VL-4B-Instruct on the SDG-30K training split. This checkpoint backs the "SDG (SFT)" row of Table 1 in the SDG paper.

The model emits structured defect-grounding output:

<think>
[Caption Understanding -> Visual Analysis -> Defect Spotting -> Localization]
</think>
<answer>
[
  {"box_2d": [y0, x0, y1, x1], "label": "artifact" | "misalignment", "desc": "..."},
  ...
]
</answer>

box_2d uses the [0, 1000] normalized convention (top, left, bottom, right).

Training summary

field value
base model Qwen/Qwen3-VL-4B-Instruct
training data SDG-30K train split (~85,770 prompt-response pairs after CoT distillation + jitter)
epochs 1 effective (5,360 steps × effective batch 16)
learning rate 3.0e-5, cosine, 5% warmup
coord jitter ±10 px, per-epoch resampling
vision encoder frozen
precision bf16
hardware 16 × A100-80G (2 nodes × 8)
optimizer state not redistributed (release-only checkpoint)

Quick start

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

ckpt = "P1n3/sdg-detector-sft"
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForImageTextToText.from_pretrained(
    ckpt, dtype=torch.bfloat16, device_map="auto",
)

The exact prompt template lives in the supplementary archive at sdg_detector/train/constants.py (question_template_registry).

Stage-2 GRPO Checkpoint

The Stage-2 GRPO detector is released separately as a merged full checkpoint: P1n3/sdg-detector-grpo. It can be loaded directly with transformers and should not be attached as a PEFT adapter.

License

cc-by-nc-4.0. Derivative of Qwen/Qwen3-VL-4B-Instruct (Apache-2.0). Released for non-commercial research use only.

Citation

@article{zhang2026and,
  title={Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback},
  author={Zhang, Huaisong and Yu, Hao and Zhang, Yuxuan and Wang, Jiahe and Chen, Xinrui and Cao, Haoxiang and Lu, Feng and Zhang, Wendong and Yu, Changqian and Yuan, Chun},
  journal={arXiv preprint arXiv:2606.06113},
  year={2026}
}
Downloads last month
-
Safetensors
Model size
4B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for P1n3/sdg-detector-sft

Finetuned
(299)
this model
Finetunes
1 model

Collection including P1n3/sdg-detector-sft

Paper for P1n3/sdg-detector-sft