Synthetic Traces, Real Reasoning

How Procedurally Generated Document Challenges Transfer to Real-World Scientific Papers

Author: Botoshi | Date: April 2026




Abstract

We present evidence that fine-tuning a 7B parameter language model on fully synthetic, procedurally generated document reasoning traces produces substantial gains on real-world scientific paper comprehension. A Qwen 2.5 7B model trained on 4,421 synthetic traces more than doubles its accuracy on real arXiv papers in DACR-Bench, from 18.9% to 40.0%, and increases the full-challenge pass rate from 0% to 44%. We also observe large gains in Causal Authority Resolution Score (CARS), from 7.7% to 46.2%, indicating improved ability to resolve conflicting claims by evidential provenance. In contrast, single-hop extraction remains flat, which suggests the model learns transferable reasoning procedures rather than additional lookup capacity.

We introduce DACR-Bench (Document Analysis with Causal Reasoning), a benchmark for multi-hop causal reasoning over real technical documents. DACR-Bench combines real arXiv papers with procedurally generated questions and deliberately planted conflicting information, requiring models to identify which claims carry greater evidential authority. We define CARS as the metric that isolates a model's ability to resolve these contradictions by tracing evidential provenance.
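
In practice, CARS can be read as accuracy restricted to the planted-conflict questions (the 7.7% → 46.2% figures match the conflict-resolution row in the skill table below). A minimal sketch of the computation follows; the field names (`skill`, `answer`, `authoritative_claim`) are illustrative, not the actual DACR-Bench schema.

```python
# Minimal sketch of how CARS could be computed, assuming each evaluation
# record carries the question's skill tag, the model's answer, and the claim
# the benchmark marks as evidentially authoritative. Field names are
# illustrative, not the actual DACR-Bench schema.
def causal_authority_resolution_score(records):
    """Fraction of planted-conflict questions where the model sides with
    the evidentially authoritative claim."""
    conflicts = [r for r in records if r["skill"] == "conflict_resolution"]
    if not conflicts:
        return 0.0
    hits = sum(
        r["answer"].strip().lower() == r["authoritative_claim"].strip().lower()
        for r in conflicts
    )
    return hits / len(conflicts)
```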

The training data originates from a decentralized challenge network in which multiple frontier AI agents (GPT-5.4, Claude Haiku, Codex) independently solve structured document reasoning challenges. Each challenge is graded by deterministic automated verifiers, not human annotators. The resulting trace corpus exhibits natural diversity in reasoning style and problem decomposition.
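
Because grading is fully automated, a verifier only needs a reproducible decision rule. The sketch below illustrates what such a check might look like, assuming each challenge ships an expected answer and a numeric tolerance; both the function and the tolerance rule are hypothetical, not the challenge network's actual grading code.

```python
# Hypothetical deterministic verifier: exact match after whitespace/case
# normalization, with a relative-tolerance fallback for numeric answers.
# This mirrors the idea of automated, annotator-free grading; it is not
# the challenge network's published implementation.
def verify(expected: str, submitted: str, rel_tol: float = 1e-3) -> bool:
    norm = lambda s: " ".join(s.strip().lower().split())
    if norm(expected) == norm(submitted):
        return True
    try:
        a, b = float(expected), float(submitted)
    except ValueError:
        return False
    return abs(a - b) <= rel_tol * max(abs(a), abs(b), 1e-12)
```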

Key Results

All results are reported on 9 real arXiv document challenges (90 questions); synthetic engine-generated challenges are excluded from evaluation.

Overall Performance

| Metric | Baseline | Fine-tuned | Delta (pp) |
|---|---|---|---|
| Answer accuracy | 18.9% | 40.0% | +21.1 |
| Pass rate | 0/9 (0%) | 4/9 (44.4%) | +44.4 |
| Causal authority (CARS) | 7.7% | 46.2% | +38.5 |

Skill-Specific Transfer

| Skill Category | Baseline | Fine-tuned | Delta (pp) |
|---|---|---|---|
| Direct extraction | 58.3% | 58.3% | +0.0 |
| Multi-hop bridge | 5.6% | 33.3% | +27.8 |
| Computation | 0.0% | 28.6% | +28.6 |
| Conditional filtered | 0.0% | 42.9% | +42.9 |
| Conflict resolution | 7.7% | 46.2% | +38.5 |
| Cross-section synthesis | 0.0% | 40.0% | +40.0 |

Domain Transfer (no training data from any evaluation domain)

| Domain | Relative Improvement |
|---|---|
| Physics | +80% |
| Economics | +50% |
| Medicine | +40% |
| Chemistry | +30% |

Training Configuration

  • Base model: Qwen 2.5 7B Instruct
  • Method: QLoRA (4-bit NF4, rank 32, alpha 64)
  • Loss: Completion-only (assistant response only)
  • Data: 4,421 reasoning traces from 4 domains, 3 frontier models
  • Training time: 2.8 hours on 1x H100
  • Epochs: 1
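
The configuration above can be reproduced, in spirit, with the standard transformers + peft + trl stack. The sketch below mirrors the listed hyperparameters (4-bit NF4 quantization, LoRA rank 32 / alpha 64, completion-only loss, 1 epoch); the target modules, batch size, and response template are assumptions, and exact trl argument names vary between versions.

```python
# Minimal QLoRA fine-tuning sketch matching the configuration bullets above.
# Assumptions (not stated in the card): LoRA target modules, batch size, and
# the Qwen chat response template. trl argument names differ across versions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM

model_id = "Qwen/Qwen2.5-7B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NF4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

lora_config = LoraConfig(
    r=32, lora_alpha=64, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
)

# Completion-only loss: mask everything before the assistant turn so the
# loss is computed over the assistant response only.
collator = DataCollatorForCompletionOnlyLM(
    response_template="<|im_start|>assistant\n", tokenizer=tokenizer
)

# Training traces; assumes the dataset exposes a chat-formatted "text" field.
dataset = load_dataset(
    "botcoinmoney/domain-agnostic-causal-reasoning-tuning", split="train"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    data_collator=collator,
    args=SFTConfig(
        output_dir="qwen2.5-7b-dacr-qlora",
        num_train_epochs=1,
        per_device_train_batch_size=2,     # assumed
        gradient_accumulation_steps=8,     # assumed
        bf16=True,
    ),
)
trainer.train()
```

At 7B scale with a single-epoch pass over ~4.4k traces, a run of this shape completes in a few hours on one H100, consistent with the 2.8-hour figure above.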

Related Repositories

| Resource | Link |
|---|---|
| DACR-Bench (benchmark code + data) | github.com/botcoinmoney/dacr-bench |
| Training & evaluation code | github.com/botcoinmoney/synthetic-to-real-reasoning |
| Evaluation results dataset | HF: botcoinmoney/dacr-bench-results |
| Training data | HF: botcoinmoney/domain-agnostic-causal-reasoning-tuning |

Citation

```bibtex
@article{botoshi2026synthetic,
  title={Synthetic Traces, Real Reasoning: How Procedurally Generated Document Challenges Transfer to Real-World Scientific Papers},
  author={Botoshi},
  year={2026},
  note={Available at \url{https://huggingface.co/botcoinmoney/dacr-paper}}
}
```

License

Apache-2.0
