COMPASS-VLM Phase 2
Development of a Japanese Financial VLM through Integration of Reasoning Enhancement and Document Comprehension (推論強化と文書読解の統合による日本語金融VLMの開発)
This model is the Phase 2 checkpoint of the COMPASS project. Starting from the Phase 1 Japanese VLM (Yana/compass-vlm-phase1), Phase 2 enhances the LLM component's mathematical and step-by-step reasoning capabilities via two consecutive stages: Supervised Fine-Tuning (SFT) on reasoning traces distilled from a Qwen3-30B teacher, followed by Direct Preference Optimization (DPO) against synthetically corrupted responses.
The resulting checkpoint retains the full VLM architecture (SigLIP-v2 + MLP projector + fine-tuned LLM-JP-4-8B), and serves as the bridge to the final financial domain adaptation in Phase 3.
Phase 2 was primarily implemented and executed by Genshin Kakimoto, within the COMPASS project led jointly with Atsushi Yanagisawa.
- 📦 Code: github.com/AtsushiYanaigsawa768/Compass (see phase2/)
- 📚 Collection: Yana/compass
- ⬅️ Previous stage: Yana/compass-vlm-phase1
- ➡️ Next stage: Yana/compass-vlm (Phase 3, financial domain)
Model Details
| Item | Value |
|---|---|
| Model type | Vision-Language Model (LLaVA-OneVision-style) with reasoning-enhanced LLM |
| Parameters | ~9B |
| Precision | BF16 |
| Primary language | Japanese (with English math-reasoning capability) |
| Training paradigm | SFT + DPO (LoRA), adapters merged back into the base model |
| License | Apache-2.0 (see License) |
Architecture
The visual pipeline is inherited unchanged from Phase 1. Phase 2 only updates the LLM:
```
Input Image ──► SigLIP-v2 Vision Encoder ──► MLP Projector ──┐
                                                             ├──► LLM-JP-4-8B (SFT + DPO) ──► Output Text
Input Text ──────────────────────────────────────────────────┘
```
| Component | Model | Status in Phase 2 |
|---|---|---|
| Vision Encoder | google/siglip2-so400m-patch14-384 | Frozen (inherited from Phase 1) |
| MLP Projector | Linear(1152→4096) → GELU → Linear(4096→4096) | Frozen (inherited from Phase 1) |
| LLM | LLM-JP-4-8B (Phase 1 merged) | Fine-tuned via SFT-LoRA, then DPO-LoRA; adapters merged |
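The projector's shape can be sketched in PyTorch (a minimal illustration inferred from the table above; the class and variable names here are hypothetical, and the real implementation lives in the repository):

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps SigLIP-v2 patch features (dim 1152) into the LLM embedding space (dim 4096)."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, 1152) -> (batch, num_patches, 4096)
        return self.proj(patch_features)

# Example: 2 images, each yielding 27 x 27 = 729 patches at 384 px / patch size 14
tokens = MLPProjector()(torch.randn(2, 729, 1152))
```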
Training Procedure
Phase 2 follows a four-step recipe: knowledge distillation (Step 0), supervised fine-tuning (Step 1), DPO pair generation (Step 2), and direct preference optimization (Step 3).
Step 0 — Knowledge Distillation (Data Generation)
A large teacher model (Qwen3-30B) generates XML-structured reasoning traces over a broad pool of mathematical reasoning datasets. The resulting data is released as:
- Yana/ft-llm-2026-reasoning-sft — ~597k samples
- Yana/ft-llm-2026-reasoning-dpo — ~130k preference pairs
Source reasoning datasets include:
| Dataset | Approx. size |
|---|---|
| GSM8K | 7.5k |
| MATH (Hendrycks) | 12.5k |
| SVAMP | 1k |
| AQuA-RAT | 100k |
| MathInstruct (TIGER-Lab) | 262k |
| MGSM-ja | 250 |
| OpenR1-Math | 10k |
| Orca Math | 200k |
| NuminaMath-CoT | 50k |
| Open Math Reasoning | 100k |
| OpenHermes-DPO, UltraFeedback | DPO source data |
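The distillation request for one source problem can be illustrated with a small helper (a hypothetical sketch; the actual generation scripts, which drive the Qwen3-30B teacher through vLLM, are in phase2/):

```python
SYSTEM_PROMPT = (
    "You are an advanced mathematical AI assistant. "
    "Your task is to solve the given math problem step-by-step and provide a final answer. "
    "Wrap your output in <Problem>, <Thinking>, and <Answer> tags."
)

def build_teacher_messages(problem: str) -> list[dict]:
    """Chat messages sent to the teacher model for one source problem."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": problem},
    ]

def is_valid_trace(text: str) -> bool:
    """Keep only teacher outputs whose XML tag structure is complete."""
    return all(f"<{t}>" in text and f"</{t}>" in text
               for t in ("Problem", "Thinking", "Answer"))

msgs = build_teacher_messages("What is 17 * 24?")
```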
Step 1 — Supervised Fine-Tuning (SFT)
LoRA fine-tuning of the LLM on distilled reasoning traces, followed by adapter merging.
| Parameter | Value |
|---|---|
| Base model | Phase 1 merged LLM (Yana/compass-vlm-phase1's LLM component) |
| Dataset | Yana/ft-llm-2026-reasoning-sft |
| Learning rate | 2e-4 |
| Global batch size | 64 |
| Micro batch size | 2 |
| Epochs | 1 |
| Max sequence length | 2048–4096 |
| LoRA rank (r) | 32 |
| LoRA alpha | 64 |
| Optimizer | AdamW |
| Warmup ratio | 0.03 |
| Mixed precision | BF16 |
| Gradient checkpointing | Enabled |
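The hyperparameters above map onto a PEFT/TRL configuration roughly as follows (a config sketch under the assumption that TRL's SFTConfig and PEFT's LoraConfig are used; the target modules and accumulation split are illustrative, not taken from the release):

```python
from peft import LoraConfig
from trl import SFTConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative
    task_type="CAUSAL_LM",
)
sft_config = SFTConfig(
    learning_rate=2e-4,
    per_device_train_batch_size=2,   # micro batch size
    gradient_accumulation_steps=8,   # 2 * 8 * 4 GPUs = global batch 64
    num_train_epochs=1,
    max_seq_length=4096,             # renamed max_length in newer TRL versions
    warmup_ratio=0.03,
    bf16=True,
    gradient_checkpointing=True,
    output_dir="sft-out",
)
```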
Step 2 — DPO Pair Generation
Each SFT sample is turned into a (prompt, chosen, rejected) triple, where rejected is synthesized from chosen by one of three corruption strategies:
| Strategy | Weight | Description |
|---|---|---|
| `omit_thinking` | 0.34 | Remove the contents of the `<Thinking>` tag entirely |
| `tamper_thinking_numbers` | 0.33 | Corrupt numerical values inside the reasoning |
| `tamper_answer` | 0.33 | Change the final answer while keeping the reasoning |
Seed: 42. Samples that fail XML tag validation can optionally be filtered with --require_tags.
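The three corruption strategies can be sketched in plain Python (a simplified illustration; the actual script in phase2/ handles more edge cases, and `--require_tags` filtering happens before corruption):

```python
import random
import re

rng = random.Random(42)  # seed used in the release

def omit_thinking(text: str) -> str:
    """Strategy 1: drop the reasoning body, keeping the empty tag pair."""
    return re.sub(r"<Thinking>.*?</Thinking>", "<Thinking></Thinking>", text, flags=re.S)

def _bump_digits(segment: str) -> str:
    """Perturb every integer in a segment by +1 (one possible corruption)."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + 1), segment)

def tamper_thinking_numbers(text: str) -> str:
    """Strategy 2: corrupt numerical values inside <Thinking>."""
    return re.sub(r"(<Thinking>)(.*?)(</Thinking>)",
                  lambda m: m.group(1) + _bump_digits(m.group(2)) + m.group(3),
                  text, flags=re.S)

def tamper_answer(text: str) -> str:
    """Strategy 3: change the final answer while keeping the reasoning."""
    return re.sub(r"(<Answer>)(.*?)(</Answer>)",
                  lambda m: m.group(1) + _bump_digits(m.group(2)) + m.group(3),
                  text, flags=re.S)

STRATEGIES = [(omit_thinking, 0.34), (tamper_thinking_numbers, 0.33), (tamper_answer, 0.33)]

def make_rejected(chosen: str) -> str:
    """Sample one strategy by weight and corrupt the chosen response."""
    fn = rng.choices([f for f, _ in STRATEGIES], weights=[w for _, w in STRATEGIES])[0]
    return fn(chosen)
```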
Step 3 — Direct Preference Optimization (DPO)
LoRA-based DPO starting from the SFT-merged model.
| Parameter | Value |
|---|---|
| Base model | SFT-merged LLM (output of Step 1) |
| Reference model | Same as base (standard DPO setup) |
| Dataset | Yana/ft-llm-2026-reasoning-dpo |
| Learning rate | 5e-6 |
| Global batch size | 32 |
| Micro batch size | 1 |
| Epochs | 1 |
| DPO β (beta) | 0.1 |
| Max length | 2048 (tunable) |
| Optimizer | AdamW |
| Mixed precision | BF16 |
| Attention implementation | Flash-Attention 2 (recommended) |
After DPO, the LoRA adapter is merged back into the LLM, and the LLM is recomposed with the frozen Phase 1 vision tower and projector to produce the final Phase 2 VLM published here.
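In TRL terms, the recipe above corresponds roughly to this configuration (a sketch assuming TRL's DPOTrainer is used; with a PEFT adapter the frozen base acts as the reference model, and the DPO-LoRA rank is illustrative since it is not stated above):

```python
from peft import LoraConfig
from trl import DPOConfig

dpo_lora = LoraConfig(r=32, lora_alpha=64, task_type="CAUSAL_LM")  # rank illustrative
dpo_config = DPOConfig(
    beta=0.1,
    learning_rate=5e-6,
    per_device_train_batch_size=1,   # micro batch size
    gradient_accumulation_steps=8,   # 1 * 8 * 4 GPUs = global batch 32
    num_train_epochs=1,
    max_length=2048,
    bf16=True,
    output_dir="dpo-out",
)
# After training, the adapter is folded back in, e.g. trainer.model.merge_and_unload(),
# before recomposing the LLM with the frozen Phase 1 vision tower and projector.
```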
Compute
| Training Stage | GPUs (min) | VRAM / GPU | Recommended |
|---|---|---|---|
| Distillation (Qwen3-30B teacher, vLLM) | 1 | 40 GB | 8× A100 80 GB |
| SFT (LoRA, 8B) | 4 | 40 GB | 4× A100 40 GB |
| DPO (LoRA, 8B) | 4 | 40 GB | 4× A100 40 GB |
Reasoning Output Format
The model is trained to wrap its reasoning in explicit XML tags, which makes post-hoc parsing and answer extraction straightforward.
System prompt (English):
```
You are an advanced mathematical AI assistant.
Your task is to solve the given math problem step-by-step and provide a final answer.
```
System prompt (Japanese equivalent):
```
あなたは高度な数学AIアシスタントです。
与えられた数学問題をステップバイステップで解き、最終回答を提示してください。
```
Expected output structure:
```
<Problem>
(Restatement of the problem)
</Problem>
<Thinking>
(Step-by-step reasoning)
</Thinking>
<Answer>
\boxed{final_answer}
</Answer>
```
For image-grounded inputs, the Phase 1 chat template and <image> token are still used; the above reasoning format is layered on top.
Intended Use
Direct Use
- Japanese image captioning and VQA (inherited from Phase 1)
- Mathematical word-problem solving with explicit chain-of-thought
- Any task that benefits from structured `<Problem>`/`<Thinking>`/`<Answer>` responses
Downstream Use
- Starting point for Phase 3 — financial domain fine-tuning on TAT-QA / ConvFinQA / FinQA / Japanese financial QA (final model: Yana/compass-vlm)
- Base for further domain-specific SFT or DPO runs
Out-of-Scope Use
- High-stakes decisions (medical, legal, financial advisory) without human oversight.
- Arithmetic-heavy agent loops without external verification — the model can confidently produce wrong numbers.
- Non-Japanese / non-English usage is not evaluated.
Evaluation
Phase 2 targets reasoning quality rather than vision-grounded tasks. The end-to-end COMPASS pipeline (Phase 1 → Phase 2 → Phase 3) is evaluated on:
- GSM8K — English math reasoning
- JP Harness (5 tasks) — Japanese financial multiple-choice
- EDINET Bench (3 tasks) — Japanese financial classification
Phase 2 is typically compared against the Phase 1 starting point to isolate the gain from reasoning training. See the project repository and blog for numbers.
Limitations and Biases
- Reasoning training is almost entirely on mathematical problems; improvements on non-math reasoning (commonsense, multi-hop QA) are likely smaller and were not explicitly measured.
- DPO rejected responses are synthetic corruptions of the chosen response, not model-generated failures. This is efficient but may not cover all realistic failure modes.
- English-language math data dominates the distilled corpus (MGSM-ja is the only explicitly Japanese math dataset); Japanese math reasoning coverage is therefore limited.
- Visual capabilities are unchanged from Phase 1 — no additional VQA or OCR training was performed here.
- The teacher model (Qwen3-30B) imposes a soft ceiling: systematic errors or stylistic quirks of the teacher may be inherited by the student.
How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Yana/compass-vlm-phase2"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```
For the full VLM inference pipeline (image preprocessing with SigLIP-v2, <image> token expansion, AnyRes handling, and XML-tagged prompting), please refer to the phase2/ directory in the GitHub repository.
Citation
```bibtex
@misc{compass2026,
  title        = {COMPASS: Development of a Japanese Financial VLM through
                  Integration of Reasoning Enhancement and Document Comprehension},
  author       = {Yanagisawa, Atsushi and Kakimoto, Genshin},
  year         = {2026},
  howpublished = {\url{https://github.com/AtsushiYanaigsawa768/Compass}},
  note         = {FT-LLM 2026 free-form task}
}
```
Please also cite upstream works as appropriate:
- DPO — Rafailov et al., arXiv:2305.18290
- LoRA — Hu et al., arXiv:2106.09685
- GSM8K — Cobbe et al., arXiv:2110.14168
- Qwen (teacher) — arXiv:2309.16609
- Phase 1 dependencies: LLaVA-1.5, LLaVA-OneVision, SigLIP, LLM-JP
License
This model is released under the Apache License 2.0.
Note on training data and Japanese copyright law: Under Article 30-4 of the Japanese Copyright Act, the use of copyrighted works for the purpose of information analysis — including machine learning model training — is a permitted use that does not require authorization from, or trigger license conditions of, the copyright holders. Training of this model (both SFT and DPO stages) was conducted in Japan on this basis; the resulting model weights are redistributed under Apache-2.0.
Downstream users are responsible for complying with the licenses of any datasets or images they use for further fine-tuning or evaluation.
Acknowledgements
Built on top of outstanding open-source work, including:
- LLM-JP-4-8B-Instruct — base LLM (via Phase 1)
- SigLIP-v2 — vision encoder (via Phase 1)
- Qwen3-30B — teacher model for reasoning distillation
- TRL, PEFT, Transformers, vLLM, Accelerate, DeepSpeed
- Reasoning datasets: GSM8K, MATH, SVAMP, AQuA-RAT, MathInstruct, MGSM, Orca Math, NuminaMath, OpenR1-Math, OpenHermes, UltraFeedback