v0.0 Spike — Composer Replication Framework

Decomposed from the framework synthesis (framework/composer-replication-framework.md). Goal of v0.0: prove that the trace-replay multi-teacher distillation channel adds signal on top of plain GRPO, on the smallest viable model. If the spike validates the claim, we move to v0.1 (full Composer recipe). If it invalidates the claim, the framework still has value (the Composer recipe alone), but the novel claim is dead and we reorient.

Risk-ordered decomposition

| # | Spike | Validates (Given / When / Then) | Why this risk first | Status |
|---|---|---|---|---|
| 001 | 001-teacher-replay-cost | Given a frozen 100-step agentic-coding trace and a state at step t, when N=3 frozen teachers (Opus 4.7 / GPT-5 / DeepSeek V4 Pro) are queried via OpenRouter for next-action distributions, then total per-trace teacher cost is < $5 and wallclock per step is < 30 s. | If teachers cost $50+/trace or take 5 min/step, the channel is unviable regardless of whether it improves training. Kill-switch first. | 🟢 VALIDATED (2026-05-25): $0.98/trace, p95 lat 20.5s, 0 errors |
| 005 | 005-integrated-trainer-skeleton | Given the SDPO loss math (lifted from siyan-zhao/OPSD) and the teacher-disagreement DPO-pair extractor, when we wire them into a GRPOTrainer subclass with α/β channel weights, then unit tests cover loss differentiability + correctness, and ablating any channel via α=0/β=0 reduces to GRPO. | Proves the integration architecture compiles before paying GPU costs. Cheap (no GPU, no API). | 🟢 SKELETON-VALIDATED + COMPOSITION-VERIFIED: 38/38 unit tests pass; 5-step gradient run on tiny model decreases loss with all 3 channels active |
| 002a | 002a-trace-collection-trl | Given Qwen3-7B base + TRL GRPOTrainer + a SWE-bench-lite OpenEnv, when we run 100 rollouts, then all rollouts emit complete (state_t, action_t, reward_t) tuples to JSONL with no truncation or schema drift. | Without a clean trace stream, no signal to replay. Validates TRL+OpenEnv plumbing. | 📋 planned |
| 002b | 002b-trace-collection-prime-rl | Same as 002a but with PRIME-RL substrate. | Comparison: which framework's trace export is cleaner? | 📋 planned |
| 003 | 003-dpo-pairs-from-disagreement | Given N=3 teacher action distributions per trace step and the student's own action, when we extract preference pairs by "majority of teachers > student" + "student > minority", then the resulting DPO dataset has ≥ 5 pairs/trace and a non-trivial KL distance from random pairs. | The reward shape needs to actually carry signal, not just exist. Spike 005 already verified the extraction logic; spike 003 measures signal density on real traces. | 📋 planned |
| 004 | 004-ab-train-grpo-vs-trace-replay-dpo | Given the trace dataset from 002, when we train two Qwen3-7B variants, (A) plain GRPO baseline and (B) GRPO + trace-replay-DPO, and evaluate on SWE-bench-lite, then variant (B) outperforms (A) by ≥ 2 pt pass@1 with statistical significance. | The terminal experiment that validates or invalidates the v0.0 claim. | 📋 planned |
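
The kill-switch check in spike 001 comes down to two threshold comparisons per replayed trace. A minimal sketch follows, assuming a hypothetical `StepReplay` record that holds the summed teacher cost and wallclock for one step. The thresholds come from the 001 row; using p95 as the latency statistic is an assumption taken from the reported status, since the criterion itself only says "wallclock per step".

```python
from dataclasses import dataclass

# Thresholds taken from spike 001's Given / When / Then row.
MAX_COST_PER_TRACE_USD = 5.0
MAX_SECONDS_PER_STEP = 30.0


@dataclass
class StepReplay:
    """One replayed trace step (hypothetical record shape, not the repo's schema)."""
    cost_usd: float      # summed across all teachers queried at this step
    wallclock_s: float   # wallclock until the slowest teacher answered


def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = max(1, -(-95 * len(ordered) // 100))  # ceil(0.95 * n)
    return ordered[rank - 1]


def replay_gate(steps):
    """Return (passed, cost_per_trace_usd, p95_latency_s) for one replayed trace."""
    total_cost = sum(s.cost_usd for s in steps)
    latency = p95([s.wallclock_s for s in steps])
    passed = total_cost < MAX_COST_PER_TRACE_USD and latency < MAX_SECONDS_PER_STEP
    return passed, total_cost, latency
```

Calling `replay_gate` on the 100 step records of a trace yields the two numbers that the status column reports, plus a pass/fail flag.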

Spike order rationale

  1. 001 (teacher cost) first — single most likely thing to kill the framework. Cheap to run (~$5–20), takes ~1 hour, no GPU.
  2. 002a / 002b in parallel — independent feasibility checks for the two competing trace-collection substrates. ~half a day each. Compare verdicts head-to-head.
  3. 003 reward-shape check — once we have any trace + teacher data, validate the DPO-pair extraction works as a reward signal before paying for the full A/B training run.
  4. 004 the actual experiment — only run after 001/002/003 all green. Costs the GPU budget; should not be wasted on a framework that already failed an earlier feasibility gate.
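
The pair-extraction rule that spike 003 measures ("majority of teachers > student" plus "student > minority") can be sketched as below. This is one reading of the rule, not the extractor from spike 005: it assumes each teacher's action distribution has already been reduced to a single top action string, and it assumes that "student > minority" applies only when the student sided with the teacher majority. `PreferencePair` and `extract_pairs` are hypothetical names.

```python
from collections import Counter
from typing import NamedTuple


class PreferencePair(NamedTuple):
    step: int
    chosen: str
    rejected: str


def extract_pairs(step, teacher_actions, student_action):
    """Build DPO preference pairs for one trace step from teacher disagreement."""
    counts = Counter(teacher_actions)
    majority_action, votes = counts.most_common(1)[0]
    if votes * 2 <= len(teacher_actions):
        return []  # no strict teacher majority: treat the step as carrying no signal
    pairs = []
    if student_action != majority_action:
        # "majority of teachers > student"
        pairs.append(PreferencePair(step, majority_action, student_action))
    else:
        # "student > minority": the student agreed with the majority,
        # so its action is preferred over each dissenting teacher action.
        for action in counts:
            if action != majority_action:
                pairs.append(PreferencePair(step, student_action, action))
    return pairs
```

Under this reading, a step where all teachers and the student agree yields no pair, which is one reason pair density per trace is worth measuring on real traces before the A/B run.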

Out of scope for v0.0 (deferred to v0.1)

  • Composer hint-distill = SDPO/OPSD (per-turn KL from a hint-conditioned forward pass). Cursor's secret sauce. Code is published at github.com/siyan-zhao/OPSD; paper arXiv:2601.20802. Lift the loss for v0.1 — see docs/COMPOSER_RECIPE_MAPPING.md § "Implementation handles for v0.1" for the concrete plan.
  • The Feature Deletion environment (use SWE-bench-lite as the env in v0.0)
  • DiLoCo / decentralized training (single-node FSDP2 is fine at 7B)
  • Monarch / Forge (use Ray + verifiers, the PRIME-RL stack)
  • MoE base (use dense Qwen3-7B; saner v0.0 target)
  • VOI gating, tiered teachers (do the full N=3 query at every step in v0.0; cost mitigation is a v0.1 optimization)

Why deferring SDPO/hint-distill to v0.1 is the right call:

  1. The novel claim is trace-replay (channel 3). The Composer recipe is already published; SDPO is already published with code. Validating SDPO at 7B is engineering, not novel research.
  2. The hint-generator open question (templates vs. LLM-driven hints) is unresolved. v0.0 with hardcoded tool-call templates only validates the easy case.
  3. Spike 001's economic verdict gates only the trace-replay channel. SDPO has no per-step API cost.
  4. A 4-arm A/B at 7B costs ~$600 vs. ~$300 for the 2-arm. Not worth it for v0.0.

v0.1 will run a 3-arm A/B: RLVR vs. RLVR + SDPO vs. RLVR + SDPO + trace-replay-DPO at 32B once we know v0.0's trace-replay verdict.
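
Because each channel can be ablated by setting its weight to zero, the three arms can be written as three weight settings over one loss. The sketch below assumes an additive weighted sum, assumes α weights the SDPO term and β weights the trace-replay-DPO term (the source names the weights but not the mapping), and uses 1.0 as a placeholder for "channel on". `ChannelWeights`, `total_loss`, and `ARMS` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelWeights:
    alpha: float  # assumed: weight on the SDPO hint-distillation loss
    beta: float   # assumed: weight on the trace-replay DPO loss


def total_loss(rl_loss, sdpo_loss, dpo_loss, w):
    """Weighted sum of the three channels; zero weights recover the RL-only loss."""
    return rl_loss + w.alpha * sdpo_loss + w.beta * dpo_loss


# The three v0.1 arms as weight settings (1.0 is a placeholder, not a tuned value).
ARMS = {
    "rlvr": ChannelWeights(alpha=0.0, beta=0.0),
    "rlvr+sdpo": ChannelWeights(alpha=1.0, beta=0.0),
    "rlvr+sdpo+trace-replay-dpo": ChannelWeights(alpha=1.0, beta=1.0),
}
```

Expressing arms as weights keeps a single trainer code path, so differences between arms cannot come from divergent training loops.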

Budget

| Item | Estimate | Source |
|---|---|---|
| Teacher API calls (OpenRouter) | ~$50–150 | 100 traces × ~50 step replays × 3 teachers × ~$0.005/call |
| GPU compute (Qwen3-7B fine-tune × 2 variants) | ~$60–120 | Modal A100-80GB, ~8 hr each variant |
| Dev wallclock | ~5–7 days | Single operator |
| Total | ~$200 + dev time | Cheapest viable falsification of the novel claim |
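
The budget arithmetic can be checked in a few lines. All inputs below are copied from the table and from spike 001's reported status; the GPU hourly rate is not stated in the source, so it is only derived here as the rate the estimate implies.

```python
# Inputs copied from the budget table.
TRACES = 100
STEP_REPLAYS_PER_TRACE = 50
TEACHERS = 3
USD_PER_CALL = 0.005

teacher_calls = TRACES * STEP_REPLAYS_PER_TRACE * TEACHERS
teacher_point_usd = teacher_calls * USD_PER_CALL  # point estimate inside ~$50-150

# Cross-check against spike 001's measured $0.98/trace. That trace had 100 steps,
# while the table assumes ~50 step replays, so this is only a rough comparison.
measured_projection_usd = TRACES * 0.98

TEACHER_RANGE_USD = (50, 150)
GPU_RANGE_USD = (60, 120)
total_low = TEACHER_RANGE_USD[0] + GPU_RANGE_USD[0]
total_high = TEACHER_RANGE_USD[1] + GPU_RANGE_USD[1]

# Two variants at ~8 hr each; the hourly rate is implied, not quoted.
GPU_HOURS = 2 * 8
implied_usd_per_gpu_hour = tuple(usd / GPU_HOURS for usd in GPU_RANGE_USD)

print(teacher_calls, teacher_point_usd, total_low, total_high, implied_usd_per_gpu_hour)
```

The point estimate for teacher calls lands near $75 and the summed ranges span $110 to $270, which brackets the ~$200 total.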

Success criteria for v0.0

  • 001: $/trace + s/step verdict in 001-teacher-replay-cost/README.md
  • 002a, 002b: clean JSONL + verdict on which substrate to use for v0.1
  • 003: DPO-pair stats verdict
  • 004: A/B pass@1 with confidence interval, reported as plain text and as a chart
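
For the "clean JSONL" criterion of 002a and 002b, a check along the following lines could flag truncation and schema drift. It assumes each line is a JSON object whose keys are exactly `state_t`, `action_t`, and `reward_t`; the source names the tuple but not the on-disk schema, so the key names and the strict key match are assumptions, and `validate_trace_lines` is a hypothetical helper.

```python
import json

# Assumed on-disk keys, taken from the (state_t, action_t, reward_t) tuple.
REQUIRED_KEYS = {"state_t", "action_t", "reward_t"}


def validate_trace_lines(lines):
    """Return a list of (line_number, problem) for a JSONL trace stream."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON (possibly truncated)"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
        elif set(record) != REQUIRED_KEYS:
            problems.append((number, "schema drift: keys " + ", ".join(sorted(record))))
    return problems
```

An empty result over all 100 rollouts from a substrate would count as a clean stream under these assumptions.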

If 004 is VALIDATED → publish the result and write the v0.1 plan. If PARTIAL (e.g., only some teacher mixes work) → narrow the claim and re-spike with the working subset. If INVALIDATED → close the trace-replay channel as a research direction; the v0.1 framework still ships with the Composer-only recipe.
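
One way to turn per-task results into the "≥ 2 pt pass@1 with statistical significance" decision is a paired bootstrap over tasks. The sketch below is an assumption about how significance could be operationalized (a 95% interval on B minus A that excludes zero), not the project's chosen test. It does not cover the PARTIAL outcome, which depends on per-teacher-mix runs.

```python
import random


def pass_at_1_diff_ci(passed_a, passed_b, n_boot=2000, seed=0):
    """Paired bootstrap over tasks.

    passed_a / passed_b: per-task booleans for variants A and B, same task order.
    Returns (point, ci_low, ci_high) for B minus A, in percentage points.
    """
    if not passed_a or len(passed_a) != len(passed_b):
        raise ValueError("need two non-empty result lists of equal length")
    n = len(passed_a)
    per_task = [int(b) - int(a) for a, b in zip(passed_a, passed_b)]
    point = 100.0 * sum(per_task) / n
    rng = random.Random(seed)  # fixed seed so the reported interval is reproducible
    diffs = sorted(
        100.0 * sum(rng.choices(per_task, k=n)) / n for _ in range(n_boot)
    )
    ci_low = diffs[n_boot * 25 // 1000]
    ci_high = diffs[n_boot * 975 // 1000 - 1]
    return point, ci_low, ci_high


def verdict(point, ci_low, min_gain_pt=2.0):
    """VALIDATED only if the gain meets the threshold and the interval excludes zero."""
    if point >= min_gain_pt and ci_low > 0.0:
        return "VALIDATED"
    return "NOT VALIDATED"
```

Resampling tasks in pairs keeps each task's difficulty matched across the two variants, which usually gives a tighter interval than bootstrapping each variant independently.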

Citations

All five primary research notes (research/01..05*.md) cite the source papers and code repos that informed each design choice. The sources most relevant at spike time:

  • Cursor (2026): Composer 2.5 blog post — recipe shape and the targeted-RL hint-distillation idea
  • Microsoft (2024): rStar / rStar-Math — closest precedent to trace-replay (single-teacher MCTS)
  • Hugging Face (2025): TRL GRPOTrainer + OpenEnv integration — algorithm reference
  • Prime Intellect (2026): PRIME-RL + INTELLECT-2 — production decentralized substrate