COMPASS-VLM Phase 2

Development of a Japanese Financial VLM through Integration of Reasoning Enhancement and Document Comprehension (推論強化と文書読解の統合による日本語金融VLMの開発)

This model is the Phase 2 checkpoint of the COMPASS project. Starting from the Phase 1 Japanese VLM (Yana/compass-vlm-phase1), Phase 2 enhances the LLM component's mathematical and step-by-step reasoning capabilities via two consecutive stages: Supervised Fine-Tuning (SFT) on reasoning traces distilled from a Qwen3-30B teacher, followed by Direct Preference Optimization (DPO) against synthetically corrupted responses.

The resulting checkpoint retains the full VLM architecture (SigLIP-v2 + MLP projector + fine-tuned LLM-JP-4-8B), and serves as the bridge to the final financial domain adaptation in Phase 3.

Phase 2 was implemented and executed primarily by Genshin Kakimoto, within the COMPASS project led jointly with Atsushi Yanagisawa.


Model Details

Item Value
Model type Vision-Language Model (LLaVA-OneVision-style) with reasoning-enhanced LLM
Parameters ~9B
Precision BF16
Primary language Japanese (with English math-reasoning capability)
Training paradigm SFT + DPO (LoRA), adapters merged back into the base model
License Apache-2.0 (see License)

Architecture

The visual pipeline is inherited unchanged from Phase 1. Phase 2 only updates the LLM:

Input Image ──► SigLIP-v2 Vision Encoder ──► MLP Projector ──┐
                                                             ├──► LLM-JP-4-8B (SFT + DPO) ──► Output Text
Input Text ──────────────────────────────────────────────────┘

Component Model Status in Phase 2
Vision Encoder google/siglip2-so400m-patch14-384 Frozen (inherited from Phase 1)
MLP Projector Linear(1152→4096) → GELU → Linear(4096→4096) Frozen (inherited from Phase 1)
LLM LLM-JP-4-8B (Phase 1 merged) Fine-tuned via SFT-LoRA, then DPO-LoRA; adapters merged
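For reference, the frozen projector above can be sketched in PyTorch (a minimal illustration of the Linear → GELU → Linear stack described in the table, not the project's actual module):

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps SigLIP-v2 patch embeddings (1152-d) into the LLM hidden space (4096-d)."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, 1152) -> (batch, num_patches, 4096)
        return self.proj(patch_embeds)
```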

Training Procedure

Phase 2 follows a four-step recipe: knowledge distillation (Step 0) → SFT (Step 1) → DPO pair generation (Step 2) → DPO (Step 3).

Step 0 — Knowledge Distillation (Data Generation)

A large teacher model (Qwen3-30B) generates XML-structured reasoning traces over a broad pool of mathematical reasoning datasets. The resulting traces are released as the Yana/ft-llm-2026-reasoning-sft dataset; the preference pairs derived from them in Step 2 are released as Yana/ft-llm-2026-reasoning-dpo.

Source reasoning datasets include:

Dataset Approx. size
GSM8K 7.5k
MATH (Hendrycks) 12.5k
SVAMP 1k
AQuA-RAT 100k
MathInstruct (TIGER-Lab) 262k
MGSM-ja 250
OpenR1-Math 10k
Orca Math 200k
NuminaMath-CoT 50k
Open Math Reasoning 100k
OpenHermes-DPO, UltraFeedback (DPO source data)

Step 1 — Supervised Fine-Tuning (SFT)

LoRA fine-tuning of the LLM on distilled reasoning traces, followed by adapter merging.

Parameter Value
Base model Phase 1 merged LLM (Yana/compass-vlm-phase1's LLM component)
Dataset Yana/ft-llm-2026-reasoning-sft
Learning rate 2e-4
Global batch size 64
Micro batch size 2
Epochs 1
Max sequence length 2048–4096
LoRA rank (r) 32
LoRA alpha 64
Optimizer AdamW
Warmup ratio 0.03
Mixed precision BF16
Gradient checkpointing Enabled

Step 2 — DPO Pair Generation

Each SFT sample is turned into a (prompt, chosen, rejected) triple, where rejected is synthesized from chosen by one of three corruption strategies:

Strategy Weight Description
omit_thinking 0.34 Remove the contents of the <Thinking> tag entirely
tamper_thinking_numbers 0.33 Corrupt numerical values inside the reasoning
tamper_answer 0.33 Change the final answer while keeping the reasoning

Seed: 42. Samples that fail XML tag validation can optionally be filtered with --require_tags.
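The three strategies can be sketched as plain string transformations over the XML-tagged responses (a hypothetical reimplementation for illustration; weights and seed follow the table above, but the actual generation script lives in the repository):

```python
import random
import re

random.seed(42)  # seed from the card

def omit_thinking(resp: str) -> str:
    """Remove the contents of the <Thinking> tag entirely."""
    return re.sub(r"<Thinking>.*?</Thinking>",
                  "<Thinking>\n</Thinking>", resp, flags=re.DOTALL)

def tamper_thinking_numbers(resp: str) -> str:
    """Corrupt numerical values inside the <Thinking> block."""
    def corrupt(m: re.Match) -> str:
        body = re.sub(r"\d+",
                      lambda n: str(int(n.group()) + random.randint(1, 9)),
                      m.group(1))
        return f"<Thinking>{body}</Thinking>"
    return re.sub(r"<Thinking>(.*?)</Thinking>", corrupt, resp, flags=re.DOTALL)

def tamper_answer(resp: str) -> str:
    """Change the final answer while keeping the reasoning intact."""
    return re.sub(r"\\boxed\{([^}]*)\}", r"\\boxed{\1 + 1}", resp)

STRATEGIES = [(omit_thinking, 0.34),
              (tamper_thinking_numbers, 0.33),
              (tamper_answer, 0.33)]

def make_pair(prompt: str, chosen: str) -> dict:
    """Build one (prompt, chosen, rejected) DPO triple."""
    fns, weights = zip(*STRATEGIES)
    strategy = random.choices(fns, weights=weights, k=1)[0]
    return {"prompt": prompt, "chosen": chosen, "rejected": strategy(chosen)}
```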

Step 3 — Direct Preference Optimization (DPO)

LoRA-based DPO starting from the SFT-merged model.

Parameter Value
Base model SFT-merged LLM (output of Step 1)
Reference model Same as base (standard DPO setup)
Dataset Yana/ft-llm-2026-reasoning-dpo
Learning rate 5e-6
Global batch size 32
Micro batch size 1
Epochs 1
DPO β (beta) 0.1
Max length 2048 (tunable)
Optimizer AdamW
Mixed precision BF16
Attention implementation Flash-Attention 2 (recommended)
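For reference, DPO optimizes the standard preference objective (with β = 0.1 as above, and the SFT-merged model serving as both the policy initialization and the frozen reference π_ref):

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

where y_w is the chosen response and y_l the synthetically corrupted rejected response from Step 2.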

After DPO, the LoRA adapter is merged back into the LLM, and the LLM is recomposed with the frozen Phase 1 vision tower and projector to produce the final Phase 2 VLM published here.

Compute

Training Stage GPUs (min) VRAM / GPU Recommended
Distillation (Qwen3-30B teacher, vLLM) 1 40 GB 8× A100 80 GB
SFT (LoRA, 8B) 4 40 GB 4× A100 40 GB
DPO (LoRA, 8B) 4 40 GB 4× A100 40 GB

Reasoning Output Format

The model is trained to wrap its reasoning in explicit XML tags, which makes post-hoc parsing and answer extraction straightforward.

System prompt (English):

You are an advanced mathematical AI assistant.
Your task is to solve the given math problem step-by-step and provide a final answer.

System prompt (Japanese equivalent):

あなたは高度な数学AIアシスタントです。
与えられた数学問題をステップバイステップで解き、最終回答を提示してください。

Expected output structure:

<Problem>
(Restatement of the problem)
</Problem>

<Thinking>
(Step-by-step reasoning)
</Thinking>

<Answer>
\boxed{final_answer}
</Answer>

For image-grounded inputs, the Phase 1 chat template and <image> token are still used; the above reasoning format is layered on top.


Intended Use

Direct Use

  • Japanese image captioning and VQA (inherited from Phase 1)
  • Mathematical word-problem solving with explicit chain-of-thought
  • Any task that benefits from structured <Problem>/<Thinking>/<Answer> responses

Downstream Use

  • Starting point for Phase 3 — financial domain fine-tuning on TAT-QA / ConvFinQA / FinQA / Japanese financial QA (final model: Yana/compass-vlm)
  • Base for further domain-specific SFT or DPO runs

Out-of-Scope Use

  • High-stakes decisions (medical, legal, financial advisory) without human oversight.
  • Arithmetic-heavy agent loops without external verification — the model can confidently produce wrong numbers.
  • Non-Japanese / non-English usage is not evaluated.

Evaluation

Phase 2 targets reasoning quality rather than vision-grounded tasks. The end-to-end COMPASS pipeline (Phase 1 → Phase 2 → Phase 3) is evaluated on:

  • GSM8K — English math reasoning
  • JP Harness (5 tasks) — Japanese financial multiple-choice
  • EDINET Bench (3 tasks) — Japanese financial classification

Phase 2 is typically compared against the Phase 1 starting point to isolate the gain from reasoning training. See the project repository and blog for numbers.


Limitations and Biases

  • Reasoning training is almost entirely on mathematical problems; improvements on non-math reasoning (commonsense, multi-hop QA) are likely smaller and were not explicitly measured.
  • DPO rejected responses are synthetic corruptions of the chosen response, not model-generated failures. This is efficient but may not cover all realistic failure modes.
  • English-language math data dominates the distilled corpus (MGSM-ja is the only explicitly Japanese math dataset); Japanese math reasoning coverage is therefore limited.
  • Visual capabilities are unchanged from Phase 1 — no additional VQA or OCR training was performed here.
  • The teacher model (Qwen3-30B) imposes a soft ceiling: systematic errors or stylistic quirks of the teacher may be inherited by the student.

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Yana/compass-vlm-phase2"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
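Text-only reasoning queries then pair the reasoning system prompt (see Reasoning Output Format above) with the user's problem. A minimal sketch of the message construction, assuming the chat template is applied via tokenizer.apply_chat_template:

```python
SYSTEM_PROMPT = (
    "You are an advanced mathematical AI assistant.\n"
    "Your task is to solve the given math problem step-by-step "
    "and provide a final answer."
)

def build_messages(problem: str) -> list[dict]:
    """Chat-style messages for a text-only reasoning query."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": problem},
    ]

# The messages would then be rendered with
# tokenizer.apply_chat_template(messages, add_generation_prompt=True,
# return_tensors="pt") and passed to model.generate().
```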

For the full VLM inference pipeline (image preprocessing with SigLIP-v2, <image> token expansion, AnyRes handling, and XML-tagged prompting), please refer to the phase2/ directory in the GitHub repository.


Citation

@misc{compass2026,
  title  = {COMPASS: Development of a Japanese Financial VLM through
            Integration of Reasoning Enhancement and Document Comprehension},
  author = {Yanagisawa, Atsushi and Kakimoto, Genshin},
  year   = {2026},
  howpublished = {\url{https://github.com/AtsushiYanaigsawa768/Compass}},
  note   = {FT-LLM 2026 free-form task}
}

Please also cite upstream works (base models and source datasets) as appropriate.


License

This model is released under the Apache License 2.0.

Note on training data and Japanese copyright law: Under Article 30-4 of the Japanese Copyright Act, the use of copyrighted works for the purpose of information analysis — including machine learning model training — is a permitted use that does not require authorization from, or trigger license conditions of, the copyright holders. Training of this model (both SFT and DPO stages) was conducted in Japan on this basis; the resulting model weights are redistributed under Apache-2.0.

Downstream users are responsible for complying with the licenses of any datasets or images they use for further fine-tuning or evaluation.


Acknowledgements

This project builds on outstanding open-source work, including SigLIP-v2, LLM-JP, Qwen3, and the open math-reasoning datasets listed above.
