Synthetic Traces, Real Reasoning
How Procedurally Generated Document Challenges Transfer to Real-World Scientific Papers
Author: Botoshi | Date: April 2026
Abstract
We present evidence that fine-tuning a 7B-parameter language model on fully synthetic, procedurally generated document reasoning traces produces substantial gains on real-world scientific paper comprehension. A Qwen 2.5 7B model trained on 4,421 synthetic traces more than doubles its accuracy on real arXiv papers in DACR-Bench, from 18.9% to 40.0%, and increases the full-challenge pass rate from 0% to 44%. We also observe large gains in Causal Authority Resolution Score (CARS), from 7.7% to 46.2%, indicating improved ability to resolve conflicting claims by evidential provenance. In contrast, single-hop extraction accuracy remains flat, which suggests the model learns transferable reasoning procedures rather than additional lookup capacity.
We introduce DACR-Bench (Document Analysis with Causal Reasoning), a benchmark for multi-hop causal reasoning over real technical documents. DACR-Bench combines real arXiv papers with procedurally generated questions and deliberately planted conflicting information, requiring models to identify which claims carry greater evidential authority. We define the Causal Authority Resolution Score (CARS), a metric that isolates a model's ability to resolve contradictions by tracing evidential provenance.
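CARS is defined above only informally. As a hedged illustration of how such a metric could be computed, the sketch below scores a conflict question as resolved only when the model's answer matches the claim with higher evidential authority; field names are hypothetical and this is not the benchmark's actual implementation.

```python
# Hypothetical sketch of a CARS-style scorer: among questions containing a
# deliberately planted conflict, count an answer as correct only when it
# sides with the higher-authority claim.
def cars(results):
    """results: list of dicts with keys 'has_conflict', 'answer', and
    'authoritative_claim' (illustrative schema, not DACR-Bench's).

    Returns the fraction of conflict questions resolved in favour of the
    higher-authority claim; 0.0 when there are no conflict questions.
    """
    conflicts = [r for r in results if r.get("has_conflict")]
    if not conflicts:
        return 0.0
    resolved = sum(
        1 for r in conflicts
        if r["answer"].strip().lower() == r["authoritative_claim"].strip().lower()
    )
    return resolved / len(conflicts)
```

Under this definition, resolving 6 of 13 conflict questions would give 6/13 ≈ 46.2%, the fine-tuned CARS figure reported below.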
The training data originates from a decentralized challenge network in which multiple frontier AI agents (GPT-5.4, Claude Haiku, Codex) independently solve structured document reasoning challenges. Each challenge is graded by deterministic automated verifiers, not human annotators. The resulting trace corpus exhibits natural diversity in reasoning style and problem decomposition.
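The verifiers are described only as deterministic. A minimal sketch of one plausible grading rule, exact match after whitespace and case normalization, is shown below; the network's real verifiers may apply richer checks.

```python
def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so grading is reproducible."""
    return " ".join(text.lower().split())

def grade(predicted: str, expected: str) -> bool:
    """Deterministic pass/fail with no human annotator in the loop."""
    return normalize(predicted) == normalize(expected)
```

A rule this strict makes every trace's label mechanically checkable, which is what allows the corpus to be built from agent outputs without human review.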
Key Results
All results are reported on 9 real arXiv document challenges (90 questions); synthetic engine-generated challenges are excluded from evaluation.
Overall Performance
| Metric | Baseline | Fine-tuned | Delta |
|---|---|---|---|
| Answer accuracy | 18.9% | 40.0% | +21.1 pp |
| Pass rate | 0/9 | 4/9 | +4 (+44.4 pp) |
| Causal authority (CARS) | 7.7% | 46.2% | +38.5 pp |
Skill-Specific Transfer
| Skill Category | Baseline | Fine-tuned | Delta |
|---|---|---|---|
| Direct extraction | 58.3% | 58.3% | +0.0 pp |
| Multi-hop bridge | 5.6% | 33.3% | +27.8 pp |
| Computation | 0.0% | 28.6% | +28.6 pp |
| Conditional filtered | 0.0% | 42.9% | +42.9 pp |
| Conflict resolution | 7.7% | 46.2% | +38.5 pp |
| Cross-section synthesis | 0.0% | 40.0% | +40.0 pp |
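Per-skill accuracies like those in the table can be recomputed from per-question pass/fail grades by grouping on category. A hedged sketch, with an illustrative input schema rather than the evaluation code's actual one:

```python
from collections import defaultdict

def accuracy_by_skill(graded):
    """graded: iterable of (skill_category, passed) pairs.

    Returns a dict mapping each skill category to its fraction of
    questions passed.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for skill, passed in graded:
        totals[skill] += 1
        hits[skill] += bool(passed)
    return {skill: hits[skill] / totals[skill] for skill in totals}
```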
Domain Transfer (no training data from any evaluation domain)
| Domain | Relative Improvement |
|---|---|
| Physics | +80% |
| Economics | +50% |
| Medicine | +40% |
| Chemistry | +30% |
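The domain-transfer figures are relative improvements, i.e. (fine-tuned − baseline) / baseline, not percentage-point deltas: a hypothetical jump from 20% to 36% accuracy would count as +80%. A one-line helper for illustration:

```python
def relative_improvement(baseline: float, finetuned: float) -> float:
    """Relative gain over baseline, e.g. 0.20 -> 0.36 yields 0.80 (+80%)."""
    return (finetuned - baseline) / baseline
```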
Training Configuration
- Base model: Qwen 2.5 7B Instruct
- Method: QLoRA (4-bit NF4, rank 32, alpha 64)
- Loss: Completion-only (assistant response only)
- Data: 4,421 reasoning traces from 4 domains, 3 frontier models
- Training time: 2.8 hours on 1x H100
- Epochs: 1
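Completion-only loss means gradients flow only through the assistant-response tokens; prompt tokens are excluded from the loss, which in Hugging Face-style trainers is done by setting their labels to -100. A minimal sketch of that masking step, independent of any particular trainer:

```python
IGNORE_INDEX = -100  # label value that HF-style cross-entropy losses skip

def mask_prompt(input_ids, prompt_len):
    """Return labels where the first prompt_len positions are ignored,
    so the loss is computed only on the completion tokens."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])
```

The exact boundary detection (where the prompt ends and the assistant response begins) depends on the chat template; the sketch assumes the prompt length is already known.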
Related Repositories
| Resource | Link |
|---|---|
| DACR-Bench (benchmark code + data) | github.com/botcoinmoney/dacr-bench |
| Training & evaluation code | github.com/botcoinmoney/synthetic-to-real-reasoning |
| Evaluation results dataset | HF: botcoinmoney/dacr-bench-results |
| Training data | HF: botcoinmoney/domain-agnostic-causal-reasoning-tuning |
Citation
```bibtex
@article{botoshi2026synthetic,
  title={Synthetic Traces, Real Reasoning: How Procedurally Generated Document Challenges Transfer to Real-World Scientific Papers},
  author={Botoshi},
  year={2026},
  note={Available at \url{https://huggingface.co/botcoinmoney/dacr-paper}}
}
```
License
Apache-2.0