Synthetic Traces, Real Reasoning

How Procedurally Generated Document Challenges Transfer to Real-World Scientific Papers

Author: Botoshi | Date: April 2026




Abstract

We present evidence that fine-tuning a 7B parameter language model on fully synthetic, procedurally generated document reasoning traces produces substantial gains on real-world scientific paper comprehension. A Qwen 2.5 7B model trained on 4,421 synthetic traces more than doubles its accuracy on real arXiv papers in DACR-Bench, from 18.9% to 40.0%, and increases the full-challenge pass rate from 0% to 44%. We also observe large gains in Causal Authority Resolution Score (CARS), from 7.7% to 46.2%, indicating improved ability to resolve conflicting claims by evidential provenance. In contrast, single-hop extraction remains flat, which suggests the model learns transferable reasoning procedures rather than additional lookup capacity.

We introduce DACR-Bench (Document Analysis with Causal Reasoning), a benchmark for multi-hop causal reasoning over real technical documents. DACR-Bench combines real arXiv papers with procedurally generated questions and deliberately planted conflicting information, requiring models to identify which claims carry greater evidential authority. We define CARS as the metric that isolates a model's ability to resolve these contradictions by tracing evidential provenance.
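
In practice, CARS can be read as accuracy restricted to the planted-conflict questions (the 7.7% → 46.2% figures match the conflict-resolution row in the skill table below). A minimal sketch of the computation follows; the field names (`skill`, `answer`, `authoritative_claim`) are illustrative, not the actual DACR-Bench schema.

```python
# Minimal sketch of how CARS could be computed, assuming each evaluation
# record carries the question's skill tag, the model's answer, and the claim
# the benchmark marks as evidentially authoritative. Field names are
# illustrative, not the actual DACR-Bench schema.
def causal_authority_resolution_score(records):
    """Fraction of planted-conflict questions where the model sides with
    the evidentially authoritative claim."""
    conflicts = [r for r in records if r["skill"] == "conflict_resolution"]
    if not conflicts:
        return 0.0
    hits = sum(
        r["answer"].strip().lower() == r["authoritative_claim"].strip().lower()
        for r in conflicts
    )
    return hits / len(conflicts)
```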

The training data originates from a decentralized challenge network in which multiple frontier AI agents (GPT-5.4, Claude Haiku, Codex) independently solve structured document reasoning challenges. Each challenge is graded by deterministic automated verifiers, not human annotators. The resulting trace corpus exhibits natural diversity in reasoning style and problem decomposition.
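
Because grading is fully automated, a verifier only needs a reproducible decision rule. The sketch below illustrates what such a check might look like, assuming each challenge ships an expected answer and a numeric tolerance; both the function and the tolerance rule are hypothetical, not the challenge network's actual grading code.

```python
# Hypothetical deterministic verifier: exact match after whitespace/case
# normalization, with a relative-tolerance fallback for numeric answers.
# This mirrors the idea of automated, annotator-free grading; it is not
# the challenge network's published implementation.
def verify(expected: str, submitted: str, rel_tol: float = 1e-3) -> bool:
    norm = lambda s: " ".join(s.strip().lower().split())
    if norm(expected) == norm(submitted):
        return True
    try:
        a, b = float(expected), float(submitted)
    except ValueError:
        return False
    return abs(a - b) <= rel_tol * max(abs(a), abs(b), 1e-12)
```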

Key Results

All results are reported on 9 real arXiv document challenges (90 questions); synthetic engine-generated challenges are excluded from evaluation.

Overall Performance

| Metric | Baseline | Fine-tuned | Delta (pp) |
|---|---|---|---|
| Answer accuracy | 18.9% | 40.0% | +21.1 |
| Pass rate | 0/9 (0%) | 4/9 (44.4%) | +44.4 |
| Causal authority (CARS) | 7.7% | 46.2% | +38.5 |

Skill-Specific Transfer

| Skill Category | Baseline | Fine-tuned | Delta (pp) |
|---|---|---|---|
| Direct extraction | 58.3% | 58.3% | +0.0 |
| Multi-hop bridge | 5.6% | 33.3% | +27.8 |
| Computation | 0.0% | 28.6% | +28.6 |
| Conditional filtered | 0.0% | 42.9% | +42.9 |
| Conflict resolution | 7.7% | 46.2% | +38.5 |
| Cross-section synthesis | 0.0% | 40.0% | +40.0 |

Domain Transfer (no training data from any evaluation domain)

| Domain | Relative Improvement |
|---|---|
| Physics | +80% |
| Economics | +50% |
| Medicine | +40% |
| Chemistry | +30% |

Training Configuration

  • Base model: Qwen 2.5 7B Instruct
  • Method: QLoRA (4-bit NF4, rank 32, alpha 64)
  • Loss: Completion-only (assistant response only)
  • Data: 4,421 reasoning traces from 4 domains, 3 frontier models
  • Training time: 2.8 hours on 1x H100
  • Epochs: 1
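
The configuration above can be reproduced, in spirit, with the standard transformers + peft + trl stack. The sketch below mirrors the listed hyperparameters (4-bit NF4 quantization, LoRA rank 32 / alpha 64, completion-only loss, 1 epoch); the target modules, batch size, and response template are assumptions, and exact trl argument names vary between versions.

```python
# Minimal QLoRA fine-tuning sketch matching the configuration bullets above.
# Assumptions (not stated in the card): LoRA target modules, batch size, and
# the Qwen chat response template. trl argument names differ across versions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM

model_id = "Qwen/Qwen2.5-7B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NF4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

lora_config = LoraConfig(
    r=32, lora_alpha=64, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
)

# Completion-only loss: mask everything before the assistant turn so the
# loss is computed over the assistant response only.
collator = DataCollatorForCompletionOnlyLM(
    response_template="<|im_start|>assistant\n", tokenizer=tokenizer
)

# Training traces; assumes the dataset exposes a chat-formatted "text" field.
dataset = load_dataset(
    "botcoinmoney/domain-agnostic-causal-reasoning-tuning", split="train"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    data_collator=collator,
    args=SFTConfig(
        output_dir="qwen2.5-7b-dacr-qlora",
        num_train_epochs=1,
        per_device_train_batch_size=2,     # assumed
        gradient_accumulation_steps=8,     # assumed
        bf16=True,
    ),
)
trainer.train()
```

At 7B scale with a single-epoch pass over ~4.4k traces, a run of this shape completes in a few hours on one H100, consistent with the 2.8-hour figure above.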

Related Repositories

| Resource | Link |
|---|---|
| DACR-Bench (benchmark code + data) | github.com/botcoinmoney/dacr-bench |
| Training & evaluation code | github.com/botcoinmoney/synthetic-to-real-reasoning |
| Evaluation results dataset | HF: botcoinmoney/dacr-bench-results |
| Training data | HF: botcoinmoney/domain-agnostic-causal-reasoning-tuning |

Citation

```bibtex
@article{botoshi2026synthetic,
  title={Synthetic Traces, Real Reasoning: How Procedurally Generated Document Challenges Transfer to Real-World Scientific Papers},
  author={Botoshi},
  year={2026},
  note={Available at \url{https://huggingface.co/botcoinmoney/dacr-paper}}
}
```

License

Apache-2.0
