v0.0 Spike — Composer Replication Framework
Decomposed from the framework synthesis (framework/composer-replication-framework.md). Goal of v0.0: prove the trace-replay multi-teacher distillation channel adds signal on top of plain GRPO, on the smallest viable model. If the spike validates, we move to v0.1 (full Composer recipe). If it invalidates, the framework still has value (Composer recipe alone) but the novel claim is dead and we reorient.
Risk-ordered decomposition
| # | Spike | Validates (Given / When / Then) | Why this risk first | Status |
|---|---|---|---|---|
| 001 | 001-teacher-replay-cost | Given a frozen 100-step agentic-coding trace and a state at step t, when N=3 frozen teachers (Opus 4.7 / GPT-5 / DeepSeek V4 Pro) are queried via OpenRouter for next-action distributions, then total per-trace teacher cost is < $5 and wallclock per step is < 30 s. | If teachers cost $50+/trace or take 5 min/step, the channel is unviable regardless of whether it improves training. Kill-switch first. | 🟢 VALIDATED (2026-05-25): $0.98/trace, p95 lat 20.5s, 0 errors |
| 005 | 005-integrated-trainer-skeleton | Given the SDPO loss math (lifted from siyan-zhao/OPSD) and the teacher-disagreement DPO-pair extractor, when we wire them into a GRPOTrainer subclass with α/β channel weights, then unit tests cover loss differentiability + correctness, and ablating any channel via α=0/β=0 reduces to GRPO. | Proves the integration architecture compiles before paying GPU costs. Cheap (no GPU, no API). | 🟢 SKELETON-VALIDATED + COMPOSITION-VERIFIED: 38/38 unit tests pass; 5-step gradient run on tiny model decreases loss with all 3 channels active |
| 002a | 002a-trace-collection-trl | Given Qwen3-7B base + TRL GRPOTrainer + a SWE-bench-lite OpenEnv, when we run 100 rollouts, then all rollouts emit complete (state_t, action_t, reward_t) tuples to JSONL with no truncation or schema drift. | Without a clean trace stream, no signal to replay. Validates TRL+OpenEnv plumbing. | 📋 planned |
| 002b | 002b-trace-collection-prime-rl | Same as 002a but with PRIME-RL substrate. | Comparison: which framework's trace export is cleaner? | 📋 planned |
| 003 | 003-dpo-pairs-from-disagreement | Given N=3 teacher action distributions per trace step and the student's own action, when we extract preference pairs by "majority of teachers > student" + "student > minority", then the resulting DPO dataset has ≥ 5 pairs/trace and a non-trivial KL distance from random pairs. | The reward shape needs to actually carry signal, not just exist. Spike 005 already verified the extraction logic; spike 003 measures signal density on real traces. | 📋 planned |
| 004 | 004-ab-train-grpo-vs-trace-replay-dpo | Given the trace dataset from 002, when we train two Qwen3-7B variants, (A) plain GRPO baseline and (B) GRPO + trace-replay-DPO, and evaluate on SWE-bench-lite, then variant (B) outperforms (A) by ≥ 2 pt pass@1 with statistical significance. | The terminal experiment that validates or invalidates the v0.0 claim. | 📋 planned |
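The pair-extraction rule in row 003 can be illustrated with a minimal sketch. It is not the extractor from spike 005: `extract_pairs` and `PreferencePair` are hypothetical names, each teacher's distribution is reduced to its single top action, and "student > minority" is read as applying only when the student sides with the teacher majority. All three choices are assumptions.

```python
from collections import Counter
from typing import NamedTuple


class PreferencePair(NamedTuple):
    chosen: str
    rejected: str


def extract_pairs(teacher_actions: list[str], student_action: str) -> list[PreferencePair]:
    """Return preference pairs for one trace step (hypothetical helper)."""
    counts = Counter(teacher_actions)
    majority_action, votes = counts.most_common(1)[0]
    # Require a strict majority; a full split among teachers gives no consensus.
    if votes * 2 <= len(teacher_actions):
        return []
    pairs = []
    if student_action != majority_action:
        # "majority of teachers > student"
        pairs.append(PreferencePair(majority_action, student_action))
    else:
        # "student > minority": the student agreed with the majority,
        # so every dissenting teacher action becomes a rejected sample.
        for action in counts:
            if action != majority_action:
                pairs.append(PreferencePair(student_action, action))
    return pairs


if __name__ == "__main__":
    print(extract_pairs(["edit_file", "edit_file", "run_tests"], "run_tests"))
    print(extract_pairs(["edit_file", "edit_file", "run_tests"], "edit_file"))
```

Under this reading, a step where all three teachers and the student agree yields no pair, which is one reason the ≥ 5 pairs/trace threshold has to be measured on real traces.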
Spike order rationale
- 001 (teacher cost) first — single most likely thing to kill the framework. Cheap to run (~$5–20), takes ~1 hour, no GPU.
- 002a / 002b in parallel — independent feasibility checks for the two competing trace-collection substrates. ~half a day each. Compare verdicts head-to-head.
- 003 reward-shape check — once we have any trace + teacher data, validate the DPO-pair extraction works as a reward signal before paying for the full A/B training run.
- 004 the actual experiment — only run after 001/002/003 all green. Costs the GPU budget; should not be wasted on a framework that already failed an earlier feasibility gate.
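The "clean trace stream" gate for 002a/002b could be checked mechanically before comparing substrates. The sketch below is one way to do that with the standard library only. The field names `state`, `action`, and `reward` and the helper `validate_trace_lines` are assumptions for illustration, not the schema either framework actually exports.

```python
import json

# Assumed field names for one (state_t, action_t, reward_t) record.
REQUIRED_KEYS = {"state", "action", "reward"}


def validate_trace_lines(lines):
    """Return (line_number, problem) tuples for a JSONL trace stream."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # A half-written final line is the typical truncation symptom.
            problems.append((lineno, "not valid JSON (possible truncation)"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "record is not a JSON object"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        extra = record.keys() - REQUIRED_KEYS
        if missing:
            problems.append((lineno, f"missing keys: {sorted(missing)}"))
        if extra:
            # Unexpected keys are reported as schema drift.
            problems.append((lineno, f"unexpected keys: {sorted(extra)}"))
    return problems


if __name__ == "__main__":
    sample = [
        '{"state": "s0", "action": "ls", "reward": 0.0}',
        '{"state": "s1", "action": "cat README.md"}',
        '{"state": "s2", "act',
    ]
    for lineno, problem in validate_trace_lines(sample):
        print(f"line {lineno}: {problem}")
```

Running the same check over both substrates' output gives a like-for-like count of bad records for the head-to-head verdict.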
Out of scope for v0.0 (deferred to v0.1)
- Composer hint-distill = SDPO/OPSD (per-turn KL from a hint-conditioned forward pass). Cursor's secret sauce. Code is published at github.com/siyan-zhao/OPSD; paper arXiv:2601.20802. Lift the loss for v0.1; see docs/COMPOSER_RECIPE_MAPPING.md § "Implementation handles for v0.1" for the concrete plan.
- The Feature Deletion environment (use SWE-bench-lite as the env in v0.0)
- DiLoCo / decentralized training (single-node FSDP2 is fine at 7B)
- Monarch / Forge (use Ray + verifiers, the PRIME-RL stack)
- MoE base (use dense Qwen3-7B; saner v0.0 target)
- VOI gating, tiered teachers (do the full N=3 query at every step in v0.0; cost mitigation is a v0.1 optimization)
Why deferring SDPO/hint-distill to v0.1 is the right call:
- The novel claim is trace-replay (channel 3). The Composer recipe is already published; SDPO is already published with code. Validating SDPO at 7B is engineering, not novel research.
- The hint-generator open question (templates vs. LLM-driven hints) is unresolved. Running SDPO in v0.0 with hardcoded tool-call templates would validate only the easy case.
- Spike 001's economic verdict gates only the trace-replay channel. SDPO has no per-step API cost.
- A 4-arm A/B at 7B costs ~$600 vs. ~$300 for the 2-arm. Not worth it for v0.0.
v0.1 will run a 3-arm A/B: RLVR vs. RLVR + SDPO vs. RLVR + SDPO + trace-replay-DPO at 32B once we know v0.0's trace-replay verdict.
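The three arms can be viewed as settings of the α/β channel weights from spike 005. The sketch below assumes α weights the SDPO loss and β weights the trace-replay DPO loss; the source does not state which symbol maps to which channel. The additive form, the names `ChannelWeights`, `ARMS`, and `total_loss`, and the weight value 1.0 are placeholders, not the trainer's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelWeights:
    alpha: float  # assumed: weight on the SDPO hint-distill loss
    beta: float   # assumed: weight on the trace-replay DPO loss


# Placeholder weights; real values would come from tuning.
ARMS = {
    "rlvr": ChannelWeights(alpha=0.0, beta=0.0),
    "rlvr+sdpo": ChannelWeights(alpha=1.0, beta=0.0),
    "rlvr+sdpo+trace-replay-dpo": ChannelWeights(alpha=1.0, beta=1.0),
}


def total_loss(base_rl: float, sdpo: float, dpo: float, w: ChannelWeights) -> float:
    """Combine per-channel scalar losses; zero weights leave only the base RL loss."""
    return base_rl + w.alpha * sdpo + w.beta * dpo


if __name__ == "__main__":
    for name, weights in ARMS.items():
        print(name, total_loss(0.7, 0.2, 0.1, weights))
```

With both weights at zero the combined loss equals the base RL loss, which mirrors the ablation property the spike 005 unit tests are described as checking.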
Budget
| Item | Estimate | Source |
|---|---|---|
| Teacher API calls (OpenRouter) | ~$50–150 | 100 traces × ~50 step replays × 3 teachers × ~$0.005/call |
| GPU compute (Qwen3-7B fine-tune × 2 variants) | ~$60–120 | Modal A100-80GB, ~8 hr each variant |
| Dev wallclock | ~5–7 days | Single operator |
| Total | ~$200 + dev time | Cheapest viable falsification of the novel claim |
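The arithmetic behind the table can be reproduced in a few lines. The teacher figures are taken from the Source column. The GPU row gives no hourly rate, so the sketch back-solves the rate implied by the quoted range, assuming one GPU per variant (an assumption, not stated in the table).

```python
# Teacher API line: figures from the Source column.
TRACES = 100
STEPS_REPLAYED_PER_TRACE = 50
TEACHERS = 3
USD_PER_CALL = 0.005

teacher_calls = TRACES * STEPS_REPLAYED_PER_TRACE * TEACHERS
teacher_usd = teacher_calls * USD_PER_CALL

# GPU line: 2 variants at ~8 hr each; one GPU per variant is assumed.
GPU_HOURS = 2 * 8
GPU_USD_LOW, GPU_USD_HIGH = 60, 120
implied_rate_low = GPU_USD_LOW / GPU_HOURS
implied_rate_high = GPU_USD_HIGH / GPU_HOURS

# Summing the quoted low and high ends of both cash lines.
TEACHER_USD_LOW, TEACHER_USD_HIGH = 50, 150
total_low = TEACHER_USD_LOW + GPU_USD_LOW
total_high = TEACHER_USD_HIGH + GPU_USD_HIGH

if __name__ == "__main__":
    print(f"teacher calls: {teacher_calls}, point estimate: ${teacher_usd:.2f}")
    print(f"implied GPU rate: ${implied_rate_low:.2f} to ${implied_rate_high:.2f} per GPU-hour")
    print(f"cash total range: ${total_low} to ${total_high}")
```

The teacher point estimate falls inside the quoted ~$50–150 range, and the summed range brackets the ~$200 total.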
Success criteria for v0.0
- 001: $/trace + s/step verdict in 001-teacher-replay-cost/README.md
- 002a, 002b: clean JSONL + verdict on which substrate to use for v0.1
- 003: DPO-pair stats verdict
- 004: A/B pass@1 with confidence interval, plain text and chart
If 004 is VALIDATED → publish the result, write v0.1 plan. If PARTIAL (e.g., only some teacher mixes work) → narrow the claim, re-spike with the working subset. If INVALIDATED → close the trace-replay channel as a research direction; v0.1 framework still ships with Composer-only recipe.
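One way to produce the confidence interval for 004 is a paired bootstrap over tasks, since both variants are scored on the same benchmark instances. The sketch below is a generic percentile bootstrap, not the project's evaluation code. `paired_bootstrap_ci` is a hypothetical helper, and the resample count, confidence level, and seed are arbitrary defaults.

```python
import random


def paired_bootstrap_ci(passed_a, passed_b, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap CI, in points, for pass@1(B) - pass@1(A) on shared tasks."""
    if len(passed_a) != len(passed_b) or not passed_a:
        raise ValueError("both variants must be scored on the same non-empty task list")
    n = len(passed_a)
    rng = random.Random(seed)
    # Per-task difference: +1 if only B passes, -1 if only A passes, else 0.
    per_task = [int(b) - int(a) for a, b in zip(passed_a, passed_b)]
    point = 100.0 * sum(per_task) / n
    resampled = []
    for _ in range(n_resamples):
        sample = rng.choices(per_task, k=n)  # resample tasks with replacement
        resampled.append(100.0 * sum(sample) / n)
    resampled.sort()
    tail = (1.0 - confidence) / 2.0
    lower = resampled[int(tail * n_resamples)]
    upper = resampled[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return point, lower, upper


if __name__ == "__main__":
    # Toy outcomes for illustration only; not experimental results.
    a = [True, False, False, True, False, False, True, False]
    b = [True, True, False, True, False, True, True, False]
    point, lower, upper = paired_bootstrap_ci(a, b)
    print(f"delta pass@1 = {point:.1f} pt, CI [{lower:.1f}, {upper:.1f}]")
```

One possible reading of the 004 criterion is that the point estimate is at least 2 pt and the lower bound of the interval is above zero; the exact decision rule is left to the spike's own write-up.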
Citations
All five primary research notes (research/01..05*.md) cite the source papers and code repos that informed each design choice. Particular emphasis for spike-time:
- Cursor (2026): Composer 2.5 blog post. Recipe shape and the targeted-RL hint-distillation idea.
- Microsoft (2024): rStar / rStar-Math. Closest precedent to trace-replay (single-teacher MCTS).
- Hugging Face (2025): TRL GRPOTrainer + OpenEnv integration. Algorithm reference.
- Prime Intellect (2026): PRIME-RL + INTELLECT-2. Production decentralized substrate.