COMPASS-VLM Phase 2
Development of a Japanese Financial VLM through Integration of Reasoning Enhancement and Document Comprehension (推論強化と文書読解の統合による日本語金融VLMの開発)
This model is the Phase 2 checkpoint of the COMPASS project. Starting from the Phase 1 Japanese VLM (Yana/compass-vlm-phase1), Phase 2 enhances the LLM component's mathematical and step-by-step reasoning capabilities via two consecutive stages: Supervised Fine-Tuning (SFT) on reasoning traces distilled from a Qwen3-30B teacher, followed by Direct Preference Optimization (DPO) against synthetically corrupted responses.
The resulting checkpoint retains the full VLM architecture (SigLIP-v2 + MLP projector + fine-tuned LLM-JP-4-8B), and serves as the bridge to the final financial domain adaptation in Phase 3.
Phase 2 was primarily implemented and executed by Genshin Kakimoto, within the COMPASS project led jointly with Atsushi Yanagisawa.
- 📦 Code: github.com/AtsushiYanaigsawa768/Compass (see phase2/)
- 📚 Collection: Yana/compass
- ⬅️ Previous stage: Yana/compass-vlm-phase1
- ➡️ Next stage: Yana/compass-vlm (Phase 3, financial domain)
Model Details
| Item | Value |
|---|---|
| Model type | Vision-Language Model (LLaVA-OneVision-style) with reasoning-enhanced LLM |
| Parameters | ~9B |
| Precision | BF16 |
| Primary language | Japanese (with English math-reasoning capability) |
| Training paradigm | SFT + DPO (LoRA), adapters merged back into the base model |
| License | Apache-2.0 (see License) |
Architecture
The visual pipeline is inherited unchanged from Phase 1. Phase 2 only updates the LLM:
```
Input Image ──► SigLIP-v2 Vision Encoder ──► MLP Projector ──┐
                                                             ├──► LLM-JP-4-8B (SFT + DPO) ──► Output Text
Input Text ──────────────────────────────────────────────────┘
```
| Component | Model | Status in Phase 2 |
|---|---|---|
| Vision Encoder | google/siglip2-so400m-patch14-384 | Frozen (inherited from Phase 1) |
| MLP Projector | Linear(1152→4096) → GELU → Linear(4096→4096) | Frozen (inherited from Phase 1) |
| LLM | LLM-JP-4-8B (Phase 1 merged) | Fine-tuned via SFT-LoRA, then DPO-LoRA; adapters merged |
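The projector's shape can be sketched in PyTorch (a minimal illustration inferred from the table above; the class and variable names here are hypothetical, and the real implementation lives in the repository):

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps SigLIP-v2 patch features (dim 1152) into the LLM embedding space (dim 4096)."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, 1152) -> (batch, num_patches, 4096)
        return self.proj(patch_features)

# Example: 2 images, each yielding 27 x 27 = 729 patches at 384 px / patch size 14
tokens = MLPProjector()(torch.randn(2, 729, 1152))
```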
Training Procedure
Phase 2 follows a four-step recipe: knowledge distillation (Step 0), supervised fine-tuning (Step 1), DPO pair generation (Step 2), and direct preference optimization (Step 3).
Step 0 — Knowledge Distillation (Data Generation)
A large teacher model (Qwen3-30B) generates XML-structured reasoning traces over a broad pool of mathematical reasoning datasets. The resulting data is released as:
- Yana/ft-llm-2026-reasoning-sft — ~597k samples
- Yana/ft-llm-2026-reasoning-dpo — ~130k preference pairs
Source reasoning datasets include:
| Dataset | Approx. size |
|---|---|
| GSM8K | 7.5k |
| MATH (Hendrycks) | 12.5k |
| SVAMP | 1k |
| AQuA-RAT | 100k |
| MathInstruct (TIGER-Lab) | 262k |
| MGSM-ja | 250 |
| OpenR1-Math | 10k |
| Orca Math | 200k |
| NuminaMath-CoT | 50k |
| Open Math Reasoning | 100k |
| OpenHermes-DPO, UltraFeedback | DPO source data |
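The distillation request for one source problem can be illustrated with a small helper (a hypothetical sketch; the actual generation scripts, which drive the Qwen3-30B teacher through vLLM, are in phase2/):

```python
SYSTEM_PROMPT = (
    "You are an advanced mathematical AI assistant. "
    "Your task is to solve the given math problem step-by-step and provide a final answer. "
    "Wrap your output in <Problem>, <Thinking>, and <Answer> tags."
)

def build_teacher_messages(problem: str) -> list[dict]:
    """Chat messages sent to the teacher model for one source problem."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": problem},
    ]

def is_valid_trace(text: str) -> bool:
    """Keep only teacher outputs whose XML tag structure is complete."""
    return all(f"<{t}>" in text and f"</{t}>" in text
               for t in ("Problem", "Thinking", "Answer"))

msgs = build_teacher_messages("What is 17 * 24?")
```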
Step 1 — Supervised Fine-Tuning (SFT)
LoRA fine-tuning of the LLM on distilled reasoning traces, followed by adapter merging.
| Parameter | Value |
|---|---|
| Base model | Phase 1 merged LLM (Yana/compass-vlm-phase1's LLM component) |
| Dataset | Yana/ft-llm-2026-reasoning-sft |
| Learning rate | 2e-4 |
| Global batch size | 64 |
| Micro batch size | 2 |
| Epochs | 1 |
| Max sequence length | 2048–4096 |
| LoRA rank (r) | 32 |
| LoRA alpha | 64 |
| Optimizer | AdamW |
| Warmup ratio | 0.03 |
| Mixed precision | BF16 |
| Gradient checkpointing | Enabled |
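The hyperparameters above map onto a PEFT/TRL configuration roughly as follows (a config sketch under the assumption that TRL's SFTConfig and PEFT's LoraConfig are used; the target modules and accumulation split are illustrative, not taken from the release):

```python
from peft import LoraConfig
from trl import SFTConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative
    task_type="CAUSAL_LM",
)
sft_config = SFTConfig(
    learning_rate=2e-4,
    per_device_train_batch_size=2,   # micro batch size
    gradient_accumulation_steps=8,   # 2 * 8 * 4 GPUs = global batch 64
    num_train_epochs=1,
    max_seq_length=4096,             # renamed max_length in newer TRL versions
    warmup_ratio=0.03,
    bf16=True,
    gradient_checkpointing=True,
    output_dir="sft-out",
)
```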
Step 2 — DPO Pair Generation
Each SFT sample is turned into a (prompt, chosen, rejected) triple, where rejected is synthesized from chosen by one of three corruption strategies:
| Strategy | Weight | Description |
|---|---|---|
| `omit_thinking` | 0.34 | Remove the contents of the `<Thinking>` tag entirely |
| `tamper_thinking_numbers` | 0.33 | Corrupt numerical values inside the reasoning |
| `tamper_answer` | 0.33 | Change the final answer while keeping the reasoning |
Seed: 42. Samples that fail XML tag validation can optionally be filtered with --require_tags.
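The three corruption strategies can be sketched in plain Python (a simplified illustration; the actual script in phase2/ handles more edge cases, and `--require_tags` filtering happens before corruption):

```python
import random
import re

rng = random.Random(42)  # seed used in the release

def omit_thinking(text: str) -> str:
    """Strategy 1: drop the reasoning body, keeping the empty tag pair."""
    return re.sub(r"<Thinking>.*?</Thinking>", "<Thinking></Thinking>", text, flags=re.S)

def _bump_digits(segment: str) -> str:
    """Perturb every integer in a segment by +1 (one possible corruption)."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + 1), segment)

def tamper_thinking_numbers(text: str) -> str:
    """Strategy 2: corrupt numerical values inside <Thinking>."""
    return re.sub(r"(<Thinking>)(.*?)(</Thinking>)",
                  lambda m: m.group(1) + _bump_digits(m.group(2)) + m.group(3),
                  text, flags=re.S)

def tamper_answer(text: str) -> str:
    """Strategy 3: change the final answer while keeping the reasoning."""
    return re.sub(r"(<Answer>)(.*?)(</Answer>)",
                  lambda m: m.group(1) + _bump_digits(m.group(2)) + m.group(3),
                  text, flags=re.S)

STRATEGIES = [(omit_thinking, 0.34), (tamper_thinking_numbers, 0.33), (tamper_answer, 0.33)]

def make_rejected(chosen: str) -> str:
    """Sample one strategy by weight and corrupt the chosen response."""
    fn = rng.choices([f for f, _ in STRATEGIES], weights=[w for _, w in STRATEGIES])[0]
    return fn(chosen)
```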
Step 3 — Direct Preference Optimization (DPO)
LoRA-based DPO starting from the SFT-merged model.
| Parameter | Value |
|---|---|
| Base model | SFT-merged LLM (output of Step 1) |
| Reference model | Same as base (standard DPO setup) |
| Dataset | Yana/ft-llm-2026-reasoning-dpo |
| Learning rate | 5e-6 |
| Global batch size | 32 |
| Micro batch size | 1 |
| Epochs | 1 |
| DPO β (beta) | 0.1 |
| Max length | 2048 (tunable) |
| Optimizer | AdamW |
| Mixed precision | BF16 |
| Attention implementation | Flash-Attention 2 (recommended) |
After DPO, the LoRA adapter is merged back into the LLM, and the LLM is recomposed with the frozen Phase 1 vision tower and projector to produce the final Phase 2 VLM published here.
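In TRL terms, the recipe above corresponds roughly to this configuration (a sketch assuming TRL's DPOTrainer is used; with a PEFT adapter the frozen base acts as the reference model, and the DPO-LoRA rank is illustrative since it is not stated above):

```python
from peft import LoraConfig
from trl import DPOConfig

dpo_lora = LoraConfig(r=32, lora_alpha=64, task_type="CAUSAL_LM")  # rank illustrative
dpo_config = DPOConfig(
    beta=0.1,
    learning_rate=5e-6,
    per_device_train_batch_size=1,   # micro batch size
    gradient_accumulation_steps=8,   # 1 * 8 * 4 GPUs = global batch 32
    num_train_epochs=1,
    max_length=2048,
    bf16=True,
    output_dir="dpo-out",
)
# After training, the adapter is folded back in, e.g. trainer.model.merge_and_unload(),
# before recomposing the LLM with the frozen Phase 1 vision tower and projector.
```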
Compute
| Training Stage | GPUs (min) | VRAM / GPU | Recommended |
|---|---|---|---|
| Distillation (Qwen3-30B teacher, vLLM) | 1 | 40 GB | 8× A100 80 GB |
| SFT (LoRA, 8B) | 4 | 40 GB | 4× A100 40 GB |
| DPO (LoRA, 8B) | 4 | 40 GB | 4× A100 40 GB |
Reasoning Output Format
The model is trained to wrap its reasoning in explicit XML tags, which makes post-hoc parsing and answer extraction straightforward.
System prompt (English):
```
You are an advanced mathematical AI assistant.
Your task is to solve the given math problem step-by-step and provide a final answer.
```
System prompt (Japanese equivalent):
```
あなたは高度な数学AIアシスタントです。
与えられた数学問題をステップバイステップで解き、最終回答を提示してください。
```
Expected output structure:
```
<Problem>
(Restatement of the problem)
</Problem>
<Thinking>
(Step-by-step reasoning)
</Thinking>
<Answer>
\boxed{final_answer}
</Answer>
```
For image-grounded inputs, the Phase 1 chat template and <image> token are still used; the above reasoning format is layered on top.
Intended Use
Direct Use
- Japanese image captioning and VQA (inherited from Phase 1)
- Mathematical word-problem solving with explicit chain-of-thought
- Any task that benefits from structured `<Problem>`/`<Thinking>`/`<Answer>` responses
Downstream Use
- Starting point for Phase 3 — financial domain fine-tuning on TAT-QA / ConvFinQA / FinQA / Japanese financial QA (final model: Yana/compass-vlm)
- Base for further domain-specific SFT or DPO runs
Out-of-Scope Use
- High-stakes decisions (medical, legal, financial advisory) without human oversight.
- Arithmetic-heavy agent loops without external verification — the model can confidently produce wrong numbers.
- Non-Japanese / non-English usage is not evaluated.
Evaluation
Phase 2 targets reasoning quality rather than vision-grounded tasks. The end-to-end COMPASS pipeline (Phase 1 → Phase 2 → Phase 3) is evaluated on:
- GSM8K — English math reasoning
- JP Harness (5 tasks) — Japanese financial multiple-choice
- EDINET Bench (3 tasks) — Japanese financial classification
Phase 2 is typically compared against the Phase 1 starting point to isolate the gain from reasoning training. See the project repository and blog for numbers.
Limitations and Biases
- Reasoning training is almost entirely on mathematical problems; improvements on non-math reasoning (commonsense, multi-hop QA) are likely smaller and were not explicitly measured.
- DPO rejected responses are synthetic corruptions of the chosen response, not model-generated failures. This is efficient but may not cover all realistic failure modes.
- English-language math data dominates the distilled corpus (MGSM-ja is the only explicitly Japanese math dataset); Japanese math reasoning coverage is therefore limited.
- Visual capabilities are unchanged from Phase 1 — no additional VQA or OCR training was performed here.
- The teacher model (Qwen3-30B) imposes a soft ceiling: systematic errors or stylistic quirks of the teacher may be inherited by the student.
How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Yana/compass-vlm-phase2"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```
For the full VLM inference pipeline (image preprocessing with SigLIP-v2, <image> token expansion, AnyRes handling, and XML-tagged prompting), please refer to the phase2/ directory in the GitHub repository.
Citation
```bibtex
@misc{compass2026,
  title        = {COMPASS: Development of a Japanese Financial VLM through
                  Integration of Reasoning Enhancement and Document Comprehension},
  author       = {Yanagisawa, Atsushi and Kakimoto, Genshin},
  year         = {2026},
  howpublished = {\url{https://github.com/AtsushiYanaigsawa768/Compass}},
  note         = {FT-LLM 2026 free-form task}
}
```
Please also cite upstream works as appropriate:
- DPO — Rafailov et al., arXiv:2305.18290
- LoRA — Hu et al., arXiv:2106.09685
- GSM8K — Cobbe et al., arXiv:2110.14168
- Qwen (teacher) — arXiv:2309.16609
- Phase 1 dependencies: LLaVA-1.5, LLaVA-OneVision, SigLIP, LLM-JP
License
This model is released under the Apache License 2.0.
Note on training data and Japanese copyright law: Under Article 30-4 of the Japanese Copyright Act, the use of copyrighted works for the purpose of information analysis — including machine learning model training — is a permitted use that does not require authorization from, or trigger license conditions of, the copyright holders. Training of this model (both SFT and DPO stages) was conducted in Japan on this basis; the resulting model weights are redistributed under Apache-2.0.
Downstream users are responsible for complying with the licenses of any datasets or images they use for further fine-tuning or evaluation.
Acknowledgements
Built on top of outstanding open-source work, including:
- LLM-JP-4-8B-Instruct — base LLM (via Phase 1)
- SigLIP-v2 — vision encoder (via Phase 1)
- Qwen3-30B — teacher model for reasoning distillation
- TRL, PEFT, Transformers, vLLM, Accelerate, DeepSpeed
- Reasoning datasets: GSM8K, MATH, SVAMP, AQuA-RAT, MathInstruct, MGSM, Orca Math, NuminaMath, OpenR1-Math, OpenHermes, UltraFeedback