# v0.0 Spike: Composer Replication Framework

> Decomposed from the framework synthesis (`framework/composer-replication-framework.md`).
>
> Goal of v0.0: **prove the trace-replay multi-teacher distillation channel adds signal on top of plain GRPO**, on the smallest viable model.
>
> If the spike validates, we move to v0.1 (full Composer recipe). If it invalidates, the framework still has value (Composer recipe alone) but the novel claim is dead and we reorient.

## Risk-ordered decomposition

| # | Spike | Validates (Given / When / Then) | Why this risk first | Status |
|---|-------|----------------------------------|---------------------|--------|
| **001** | `001-teacher-replay-cost` | **Given** a frozen 100-step agentic-coding trace and a state at step `t`, **when** N=3 frozen teachers (Opus 4.7 / GPT-5 / DeepSeek V4 Pro) are queried via OpenRouter for next-action distributions, **then** total per-trace teacher cost is < $5 and wallclock per step is < 30 s. | If teachers cost $50+/trace or take 5 min/step, the channel is unviable regardless of whether it improves training. **Kill-switch first.** | 🟢 **VALIDATED** (2026-05-25): $0.98/trace, p95 latency 20.5 s, 0 errors |
| **005** | `005-integrated-trainer-skeleton` | **Given** the SDPO loss math (lifted from `siyan-zhao/OPSD`) and the teacher-disagreement DPO-pair extractor, **when** we wire them into a `GRPOTrainer` subclass with α/β channel weights, **then** unit tests cover loss differentiability + correctness, and ablating any channel via α=0/β=0 reduces to GRPO. | Proves the integration architecture compiles before paying GPU costs. Cheap (no GPU, no API). | 🟢 **SKELETON-VALIDATED + COMPOSITION-VERIFIED**: 38/38 unit tests pass; 5-step gradient run on tiny model decreases loss with all 3 channels active |
| **002a** | `002a-trace-collection-trl` | **Given** Qwen3-7B base + TRL `GRPOTrainer` + a SWE-bench-lite OpenEnv, **when** we run 100 rollouts, **then** all rollouts emit complete `(state_t, action_t, reward_t)` tuples to JSONL with no truncation or schema drift. | Without a clean trace stream, no signal to replay. Validates TRL+OpenEnv plumbing. | 📋 planned |
| **002b** | `002b-trace-collection-prime-rl` | Same as 002a but with PRIME-RL substrate. | Comparison: which framework's trace export is cleaner? | 📋 planned |
| **003** | `003-dpo-pairs-from-disagreement` | **Given** N=3 teacher action distributions per trace step and the student's own action, **when** we extract preference pairs by "majority of teachers > student" + "student > minority", **then** the resulting DPO dataset has ≥ 5 pairs/trace and a non-trivial KL distance from random pairs. | The reward shape needs to actually carry signal, not just exist. Spike 005 already verified the *extraction logic*; spike 003 measures *signal density on real traces*. | 📋 planned |
| **004** | `004-ab-train-grpo-vs-trace-replay-dpo` | **Given** the trace dataset from 002, **when** we train two Qwen3-7B variants, (A) plain GRPO baseline and (B) GRPO + trace-replay-DPO, and evaluate on SWE-bench-lite, **then** variant (B) outperforms (A) by ≥ 2 pt pass@1 with statistical significance. | The terminal experiment that validates or invalidates the v0.0 claim. | 📋 planned |

## Spike order rationale

1. **001 (teacher cost) first**: single most likely thing to kill the framework. Cheap to run (~$5–20), takes ~1 hour, no GPU.
2. **002a / 002b in parallel**: independent feasibility checks for the two competing trace-collection substrates. ~half a day each. Compare verdicts head-to-head.
3. **003 reward-shape check**: once we have *any* trace + teacher data, validate the DPO-pair extraction works as a reward signal before paying for the full A/B training run.
4. **004 the actual experiment**: only run after 001/002/003 all green. Costs the GPU budget; should not be wasted on a framework that already failed an earlier feasibility gate.

## Out of scope for v0.0 (deferred to v0.1)

- **Composer hint-distill = SDPO/OPSD** (per-turn KL from a hint-conditioned forward pass). Cursor's secret sauce. **Code is published** at [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD); paper [arXiv:2601.20802](https://arxiv.org/abs/2601.20802). Lift the loss for v0.1; see `docs/COMPOSER_RECIPE_MAPPING.md` § "Implementation handles for v0.1" for the concrete plan.
- The Feature Deletion environment (use SWE-bench-lite as the env in v0.0)
- DiLoCo / decentralized training (single-node FSDP2 is fine at 7B)
- Monarch / Forge (use Ray + verifiers, the PRIME-RL stack)
- MoE base (use dense Qwen3-7B; saner v0.0 target)
- VOI gating, tiered teachers (do the full N=3 query at every step in v0.0; cost mitigation is a v0.1 optimization)

**Why deferring SDPO/hint-distill to v0.1 is the right call:**

1. The novel claim is trace-replay (channel 3). The Composer recipe is already published; SDPO is already published with code. Validating SDPO at 7B is engineering, not novel research.
2. The hint-generator open question (templates vs. LLM-driven hints) is unresolved. v0.0 with hardcoded tool-call templates only validates the easy case.
3. Spike 001's economic verdict gates only the trace-replay channel. SDPO has no per-step API cost.
4. A 4-arm A/B at 7B costs ~$600 vs. ~$300 for the 2-arm. Not worth it for v0.0.

v0.1 will run a 3-arm A/B: **RLVR** vs. **RLVR + SDPO** vs. **RLVR + SDPO + trace-replay-DPO** at 32B once we know v0.0's trace-replay verdict.
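
To make the trace-replay channel concrete, the pair-extraction rule that spike 003 measures ("majority of teachers > student" plus "student > minority") can be sketched as below. This is an illustration under stated assumptions, not the extractor verified in spike 005: the function name `extract_pairs`, the `prompt`/`chosen`/`rejected` keys, and the reduction of each teacher's action distribution to its single top action are all hypothetical. It also assumes that "student > minority" pairs are emitted only when the student agrees with the teacher majority.

```python
from collections import Counter


def extract_pairs(state, teacher_actions, student_action):
    """Build preference pairs for one trace step.

    Assumption: each teacher is represented by its top-1 action (a string),
    not by a full next-action distribution.
    """
    if not teacher_actions:
        return []

    counts = Counter(teacher_actions)
    majority_action, votes = counts.most_common(1)[0]

    # Require a strict majority; with no consensus we emit nothing.
    if votes <= len(teacher_actions) / 2:
        return []

    pairs = []
    if student_action != majority_action:
        # "majority of teachers > student"
        pairs.append(
            {"prompt": state, "chosen": majority_action, "rejected": student_action}
        )
    else:
        # "student > minority": the student sided with the majority,
        # so each dissenting teacher action becomes a rejected sample.
        for action in counts:
            if action != majority_action:
                pairs.append(
                    {"prompt": state, "chosen": student_action, "rejected": action}
                )
    return pairs
```

With teachers voting `["A", "A", "B"]`, a student that chose `"C"` yields one pair preferring `"A"` over `"C"`, while a student that chose `"A"` yields one pair preferring `"A"` over `"B"`. A three-way teacher split yields no pairs, which is one reason pairs/trace has to be measured on real traces.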
## Budget

| Item | Estimate | Source |
|---|---|---|
| Teacher API calls (OpenRouter) | ~$50–150 | 100 traces × ~50 step replays × 3 teachers × ~$0.005/call |
| GPU compute (Qwen3-7B fine-tune × 2 variants) | ~$60–120 | Modal A100-80GB, ~8 hr each variant |
| Dev wallclock | ~5–7 days | Single operator |
| **Total** | **~$200 + dev time** | Cheapest viable falsification of the novel claim |

## Success criteria for v0.0

- 001: $/trace + s/step verdict in `001-teacher-replay-cost/README.md`
- 002a, 002b: clean JSONL + verdict on which substrate to use for v0.1
- 003: DPO-pair stats verdict
- 004: A/B pass@1 with confidence interval, plain text and chart

Outcomes for 004:

- If **VALIDATED** → publish the result, write v0.1 plan.
- If **PARTIAL** (e.g., only some teacher mixes work) → narrow the claim, re-spike with the working subset.
- If **INVALIDATED** → close the trace-replay channel as a research direction; v0.1 framework still ships with Composer-only recipe.

## Citations

All five primary research notes (`research/01..05*.md`) cite the source papers and code repos that informed each design choice. Particular emphasis for spike-time:

- Cursor (2026): Composer 2.5 blog post (recipe shape and the targeted-RL hint-distillation idea)
- Microsoft (2024): rStar / rStar-Math (closest precedent to trace-replay; single-teacher MCTS)
- Hugging Face (2025): TRL `GRPOTrainer` + OpenEnv integration (algorithm reference)
- Prime Intellect (2026): PRIME-RL + INTELLECT-2 (production decentralized substrate)
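
The 004 success criterion asks for an A/B pass@1 difference with a confidence interval. One way to get that interval is a paired bootstrap over tasks, since both variants are evaluated on the same SWE-bench-lite instances. The sketch below is an assumption about how the analysis could be done, not the project's evaluation code: the function name `paired_bootstrap_ci` and the 0/1 per-task result lists are hypothetical.

```python
import random


def paired_bootstrap_ci(passes_a, passes_b, n_resamples=2000, level=0.95, seed=0):
    """Return (point, lower, upper) for pass@1(B) - pass@1(A), in points.

    passes_a / passes_b: per-task 0/1 results, aligned so index i is the
    same task for both variants (hypothetical input format).
    """
    if len(passes_a) != len(passes_b) or not passes_a:
        raise ValueError("need two non-empty, equally long result lists")

    rng = random.Random(seed)
    n = len(passes_a)

    diffs = []
    for _ in range(n_resamples):
        # Resample tasks with replacement, keeping the A/B pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(passes_b[i] - passes_a[i] for i in idx) / n
        diffs.append(100.0 * delta)
    diffs.sort()

    # Percentile interval at the requested level.
    lower = diffs[int((1 - level) / 2 * n_resamples)]
    upper = diffs[int((1 + level) / 2 * n_resamples) - 1]
    point = 100.0 * (sum(passes_b) - sum(passes_a)) / n
    return point, lower, upper
```

Under this sketch, a result would support the "≥ 2 pt pass@1 with statistical significance" target if the point estimate is at least 2 and the lower bound stays above 0. Whether to require the lower bound itself to clear 2 points is a judgment call the plan leaves open.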