# Composer-Replication Framework: HF model → Composer-2.5-class agentic coder via decentralized RL
> **Status:** Research synthesis (2026-05-25). Pre-spike. No code yet.
> **Goal:** Build a framework that takes any HuggingFace base model and RL-post-trains it to Composer-2.5 quality on agentic coding (or any agentic domain), using decentralized DiLoCo-shape compute, Meta's Monarch/Forge orchestration, an OpenEnv environment registry, VeRL/TRL algorithm primitives, and a novel **trace-replay multi-teacher distillation** signal.
> **Underlying research:** see `~/wiki/research/post-training-framework/{01..05}*.md` (5 deep-dives, ~2000-2500 words each, by 5 different model families: Gemini 3.1 Pro / DeepSeek V4 Pro / GPT-5 / Sonnet 4.6 / Kimi K2-Thinking).
## TL;DR
| Component | Decision | Rationale |
|---|---|---|
| **Base model** | HF MoE (Kimi K2.5, DeepSeek-V3.2, Qwen3-Max-MoE) OR dense (Qwen3-32B, Llama-3-70B) | Composer-style requires MLA+MoE for fast/cheap serving; dense is simpler for v0.1 |
| **Algorithm core** | GRPO + DAPO patches + Composer-style **on-policy distillation hint loss** | DAPO solves GRPO's length/std biases; Composer's hint-loss is the secret sauce |
| **Training framework** | **PRIME-RL** (Prime Intellect) as substrate; **TRL** for algorithm correctness; borrow **VeRL's 3D-HybridEngine** patterns | PRIME-RL ships the orchestrator/trainer/inference split + decentralized story; TRL has cleanest GRPO+OpenEnv; VeRL's reshard logic is the production reference |
| **Distributed sync** | **PRIME-RL's vLLM↔FSDP2 weight broadcast (SHARDCAST)** for v0.1; bolt on **Streaming DiLoCo** outer loop only when scaling beyond one cluster | DiLoCo isn't useful when training fits on one cluster. Add it when going multi-DC. |
| **Environments** | **OpenEnv + verifiers (Hub-hosted)** with Cursor-style "Anyrun" sandboxes | OpenEnv is the emerging standard; HF + Meta backing; MCP tool-calling RFC landing |
| **Reward signal** | Three-channel: (1) RLVR (tests pass), (2) **Composer hint-distill = SDPO/OPSD** (single-model, hint-conditioned self-teacher), (3) **Trace-replay multi-teacher PRM** (your novel idea — N external teachers) | Composer's hint-distill is published as SDPO ([arXiv:2601.20802](https://arxiv.org/abs/2601.20802) + [code](https://github.com/siyan-zhao/OPSD)); we lift it for v0.1. Channel (3) is genuinely novel and stacks on top. **They are TWO different mechanisms**, not competing implementations — see `docs/COMPOSER_RECIPE_MAPPING.md` for the precise distinction. |
| **Trace-replay novelty** | Genuinely under-explored. Closest precedent: rStar-Math (single-teacher MCTS counterfactuals). Multi-teacher *frozen-trace replay* is open territory | Worth publishing if it works |
| **Orchestration** | Monarch (when it matures) or Ray (today) for the actor mesh; **OpenEnv** for the env contract | Forge has been "development-paused" — borrow patterns, don't depend on it |
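To make the "GRPO + DAPO patches" row concrete, here is a minimal pure-Python sketch of three pieces commonly associated with those methods: group-normalized advantages (GRPO), dropping zero-signal groups (DAPO's dynamic sampling), and token-level loss aggregation (DAPO). The function names and the use of population standard deviation are illustrative assumptions, not code from TRL, VeRL, or this repository.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its prompt group.

    `rewards` holds one scalar per rollout sampled for the same prompt.
    Population std is an assumption; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_group(rewards):
    """DAPO-style dynamic sampling: skip groups whose rewards are all equal.

    Such groups have all-zero advantages and contribute no gradient.
    """
    return max(rewards) != min(rewards)


def token_level_loss(per_token_losses):
    """DAPO-style aggregation: average over every token in the batch.

    Averaging per-sequence means first would down-weight tokens in long
    rollouts; flattening treats every token equally.
    """
    flat = [loss for seq in per_token_losses for loss in seq]
    return sum(flat) / len(flat)
```

For a batch with one 1-token rollout (loss 1.0) and one 3-token rollout (all losses 0.0), the token-level mean is 0.25, whereas averaging the per-sequence means would give 0.5. That gap is the length effect the token-level aggregation is meant to remove.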
## What Composer 2.5 actually is, and what we're trying to replicate
From `01-composer-2.5.md`:
- **Base:** Moonshot's Kimi K2.5 — 1T total / 32B active MoE, MLA attention, DeepSeek-V3-derived, MuonClip optimizer, 256K native ctx.
- **85% of total compute is post-training.** Pretraining is just the cheap starting point.
- **The recipe (5 stages):**
  1. **Continued pretraining** on heavily code-weighted data. Lower pretraining loss → better downstream RL.
  2. **Synthetic data at scale**: 25× more synthetic tasks vs Composer 2. The headline trick is **"Feature Deletion"**: take a repo with passing tests, delete features, force the agent to reconstruct them. Tests are the verifiable reward.
  3. **Realistic environment RL**: async sandboxes (their "Anyrun" system) with the *exact same tool harness* the model uses in production. Train on terse, ambiguous prompts requiring multi-file edits.
  4. **🔑 Targeted RL with textual feedback (on-policy distillation).** When a 100K-token rollout has a localized error (wrong tool name, style violation), Cursor:
     - Generates a text hint correcting the error
     - Inserts the hint at the error turn
     - Runs forward pass with hint → "Teacher" logits
     - Runs forward pass without hint → "Student" logits
     - Applies KL divergence loss to pull Student toward Teacher *only at that turn*
     - This sidesteps the credit-assignment nightmare of long-horizon scalar rewards
  5. **Sharded Muon + Dual Mesh HSDP**: separate sharding meshes for expert vs non-expert weights, optimized for Blackwell.
- **Result:** ~69% on Terminal-Bench 2.0 (parity with GPT-5.5) at $0.50/$2.50 per 1M input/output tokens (5-10× cheaper than peers).
**Replicating this means cloning stages 1-4. Stage 5 is just MLOps.** And stage 4, the hint-distillation trick, is the *least obvious* and probably the most important.
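The stage 4 mechanism can be sketched as a masked per-token KL term. The version below is a minimal pure-Python illustration that assumes forward KL(teacher ‖ student), treats the hinted pass as a fixed target, and uses a hypothetical `error_mask` to mark the tokens of the error turn. Cursor has not published the exact divergence or masking rule, so all of these are assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_teacher_student(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def hint_distill_loss(teacher_logits, student_logits, error_mask):
    """Mean KL over positions flagged as part of the error turn.

    teacher_logits: per-position logits from the pass WITH the hint in context
                    (treated as a constant target, i.e. no gradient).
    student_logits: per-position logits from the pass WITHOUT the hint.
    error_mask:     1 for tokens in the error turn, 0 elsewhere (hypothetical).
    """
    terms = [
        kl_teacher_student(t, s)
        for t, s, m in zip(teacher_logits, student_logits, error_mask)
        if m
    ]
    return sum(terms) / len(terms) if terms else 0.0
```

Positions outside the mask contribute nothing, which is what keeps the signal local to the error turn instead of spreading one scalar reward across the whole rollout.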
## How the 5 component pieces fit together
For the **rigorous integration architecture**, see [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md). It covers the exact extension points in TRL (`GRPOTrainer._compute_loss` subclass) and VeRL (`@register_adv_est` + `DataProto`), the OPSD loss `generalized_jsd_loss` lifted from `siyan-zhao/OPSD`, and the per-channel sequence diagrams. A working code skeleton lives at [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/), with **38 passing unit tests** that verify the SDPO loss math, the trace-replay DPO-pair extraction, the data collator, and an end-to-end 5-step gradient run that decreases loss with all 3 channels active.
The high-level topology:
```
        ┌───────────────────────────────────────────┐
        │          OpenEnv Environment Hub          │
        │  (HF Hub, Docker images, MCP tool-calling)│
        │  - Anyrun-style code sandbox              │
        │  - SWE-Gym, SWE-Bench-Verified envs       │
        │  - "Feature Deletion" auto-grader env     │
        └────────────────┬──────────────────────────┘
                         │ rollouts (verifiers protocol)
                         ▼
┌────────────────────────────────────────────────────────────┐
│                     ORCHESTRATOR (CPU)                     │
│  - Schedules rollouts across inference workers             │
│  - Assembles training batches                              │
│  - Routes hint-distillation pairs (Composer-style)         │
│  - Routes trace-replay teacher queries (NOVEL)             │
│  - Built on Monarch (future) or Ray (today)                │
└────┬──────────────────────────┬──────────────────────────┬─┘
     │ rollout requests         │ training batches         │ teacher queries
     ▼                          ▼                          ▼
┌─────────────────────┐  ┌────────────────────┐  ┌────────────────────────┐
│  INFERENCE POOL     │  │  TRAINER (GPU)     │  │  TEACHER POOL          │
│  (vLLM / SGLang)    │  │  - FSDP2 sharded   │  │  - Frozen N teachers   │
│  - Student policy   │  │  - GRPO + DAPO     │  │  - HF Inference,       │
│  - Auto-resharded   │  │  - +Hint distill   │  │    OpenRouter, vLLM    │
│    via SHARDCAST    │  │    KL loss         │  │  - Diverse families    │
│  - Async tool waits │  │  - +PRM/DPO from   │  │    (Anthropic / OpenAI │
│    don't block GPU  │  │    trace-replay    │  │    / DeepSeek / Qwen)  │
└─────────────────────┘  └────────────────────┘  └────────────────────────┘
                                   │
                                   │ pseudo-gradients (every H steps)
                                   ▼
                   ┌────────────────────────────────┐
                   │  OUTER LOOP (DiLoCo, optional) │
                   │  - Only when training spans    │
                   │    multiple clusters / DCs     │
                   │  - Streaming variant for       │
                   │    bandwidth-limited links     │
                   └────────────────────────────────┘
```
### Why this stack
**PRIME-RL is the right substrate** (`02-diloco-family.md`). It's the only framework that already implements the orchestrator/trainer/inference split *for RL* with a proven decentralized story (INTELLECT-2: a 32B QwQ-based model trained on globally distributed compute). Their `verifiers` library is the same env contract we'd want anyway. Their GRPO + AIPO importance-sampling correction handles the inevitable train↔inference logprob drift.
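The train↔inference logprob drift mentioned above is usually handled by reweighting each token with a capped importance ratio. The sketch below shows that generic idea only; it is not PRIME-RL's AIPO formulation, and the function names and the `cap` value are illustrative assumptions.

```python
import math


def truncated_is_weights(trainer_logprobs, inference_logprobs, cap=2.0):
    """Per-token importance ratio pi_trainer / pi_inference, capped at `cap`.

    The inference engine (e.g. vLLM) and the trainer can assign slightly
    different log-probabilities to the same sampled token; the ratio
    corrects for that, and the cap bounds the variance it introduces.
    """
    return [
        min(math.exp(lt - li), cap)
        for lt, li in zip(trainer_logprobs, inference_logprobs)
    ]


def weighted_token_loss(token_losses, weights):
    """Importance-weighted mean of per-token losses."""
    return sum(w * l for w, l in zip(weights, token_losses)) / len(token_losses)
```

When both engines agree on a token's log-probability the weight is exactly 1.0, so the correction only acts where the two actually diverge.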
**TRL provides the cleanest algorithm reference** (`04-verl-trl.md`). `GRPOTrainer`, `OnlineDPOTrainer`, and the new OpenEnv integration (Oct 2025) are well-tested. We'd lift the *loss math* from TRL but run on PRIME-RL's distributed substrate.
**VeRL's 3D-HybridEngine is the production benchmark** for resharding between training-FSDP and inference-TP layouts. PRIME-RL does this too but VeRL has more battle-testing at 70B+. We borrow the resharding pattern, not the framework.
**Monarch + OpenEnv is the future bet, Ray + verifiers is today** (`03-monarch-torchforge-openenv.md`). Forge is "development-paused" per Meta's banner — they're consolidating on TorchTitan. Don't build on Forge directly. But Monarch (the actor mesh) and OpenEnv (the env standard) are alive and well. v0.1 of our framework uses Ray + verifiers (PRIME-RL's stack); v0.2 swaps in Monarch + OpenEnv when those mature.
**DiLoCo is dormant infra until we scale beyond one cluster.** Original DiLoCo / Streaming DiLoCo / OpenDiLoCo all assume an outer loop *across data centers*. INTELLECT-2 used DiLoCo-shape sync between geographically distributed inference workers, but the actual *trainer* is still single-cluster FSDP2. We'd add Streaming DiLoCo only when:
- Training compute exceeds one cluster, OR
- We're recruiting volunteer compute (INTELLECT-1 model)
For v0.1: skip DiLoCo. Single-cluster PRIME-RL. The token budget is the bottleneck, not the trainer.
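For reference when the outer loop does become relevant, here is a minimal sketch of one DiLoCo-style outer step on flat parameter lists: each worker runs H local steps, the pseudo-gradient is the averaged difference between the round's starting parameters and each worker's result, and an outer optimizer with Nesterov momentum applies it. The function name, the flat-list representation, and the default hyperparameters are illustrative assumptions rather than settings taken from this project.

```python
def outer_step(global_params, worker_params, momentum_buf, outer_lr=0.7, mu=0.9):
    """One DiLoCo-style outer update.

    global_params: list[float], parameters at the start of the round.
    worker_params: list of list[float], each worker's parameters after its
                   H local (inner-optimizer) steps.
    momentum_buf:  list[float], outer momentum state (zeros on the first round).
    Returns (new_global_params, new_momentum_buf).
    """
    n = len(worker_params)
    # Pseudo-gradient: average over workers of (start-of-round - local result).
    pseudo_grad = [
        sum(g - w[i] for w in worker_params) / n
        for i, g in enumerate(global_params)
    ]
    # SGD with Nesterov momentum as the outer optimizer.
    new_buf = [mu * b + d for b, d in zip(momentum_buf, pseudo_grad)]
    new_params = [
        g - outer_lr * (d + mu * b)
        for g, d, b in zip(global_params, pseudo_grad, new_buf)
    ]
    return new_params, new_buf
```

With `outer_lr=1.0` and `mu=0.0` the update reduces to plain parameter averaging across workers, which is a convenient sanity check. Workers only exchange parameters once per round, which is why the scheme tolerates slow cross-cluster links.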
## Your trace-replay distillation idea: where it fits
From `05-trace-replay-distillation.md`:
> No published work systematically replays each step of frozen agentic traces with multiple teachers to harvest step-level supervision. While rStar uses MCTS for counterfactual evaluation and multi-teacher distillation exists, the **frozen-trace replay mechanism** is new territory.
**The closest published precedents:**
| Work | What they do | What you'd add |
|---|---|---|
| **rStar / rStar-Math** (Microsoft) | MCTS at training time, single teacher branches at each step | Replay pre-existing traces, *multiple* teachers, no MCTS at training time |
| **Math-Shepherd / OmegaPRM** | Process reward models from rollout-and-check | Step-level *teacher disagreement* as the reward signal |
| **Magpie / OpenThoughts** | Synthetic data from one strong teacher | Per-step distillation from N teachers on real traces |
| **MoA (Mixture of Agents)** | Multi-teacher *response-level* aggregation | Per-step (sub-response) aggregation |
**The novel claim:**
1. Take agentic traces (yours, or SWE-Gym, OpenHands, Cursor session exports if you can get them).
2. At each step `t`, replay the *exact same state* with N frozen teachers.
3. Get N candidate `action_t` distributions.
4. Use disagreement / agreement as a **per-step reward signal** for the student model.
**This stacks beautifully with Composer's hint-distillation — but they are TWO DIFFERENT MECHANISMS, not competing implementations of the same idea.** Composer's hint-distill (= SDPO / OPSD, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802) + [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), code at [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)) uses **a single model** as both teacher and student, with the teacher just being "the model with a hint inserted into context." Trace-replay-distill uses **N external pretrained models** as teachers. Together:
- Composer's hint-loss = **same-model self-teacher with hint context** pulls student at error sites (~1 extra forward pass / cheap, no API)
- Trace-replay-loss = **N external teachers** pull student at all sites (or only at high-uncertainty sites with VOI (value-of-information) gating; ~$0.30/trace with gating per spike 001)
These are **complementary**, not competing. Both give per-step KL signals that bypass the long-horizon credit assignment problem, but they tap different supervision sources. v0.1 of the framework runs both simultaneously. See [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) for the precise mathematical distinction and the implementation-handle table.
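The four-step trace-replay claim above can be sketched as follows. This is a simplified illustration: teachers are modeled as hypothetical callables that map a state to a single discrete action (the text speaks of action distributions), and the per-step reward is the fraction of teachers that agree with the student. None of these names come from the repository.

```python
def replay_step(state, teachers):
    """Query every frozen teacher on the exact same state."""
    return [teacher(state) for teacher in teachers]


def agreement_reward(student_action, teacher_actions):
    """Per-step reward: fraction of teachers that chose the student's action."""
    matches = sum(1 for a in teacher_actions if a == student_action)
    return matches / len(teacher_actions)


def score_trace(trace, teachers):
    """Score a frozen trace step by step.

    trace: list of (state, student_action) pairs recorded from a rollout.
    The trace is never re-executed; each state is replayed as-is, so no
    environment or MCTS is needed at scoring time.
    """
    return [
        agreement_reward(action, replay_step(state, teachers))
        for state, action in trace
    ]


# Usage with three stand-in teachers (real ones would call model APIs).
teachers = [lambda s: "edit", lambda s: "edit", lambda s: "run_tests"]
trace = [("state-0", "edit"), ("state-1", "run_tests")]
step_rewards = score_trace(trace, teachers)
```

Exact-match agreement is the crudest possible comparison. For free-form tool calls a real implementation would need some notion of action equivalence, which this sketch leaves open.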
**Cost mitigation** (the report does this analysis well):
- VOI gating (only query teachers when student entropy is high) → 60-80% savings
- Tiered teachers (cheap teacher first, escalate on disagreement) → 2-3× savings
- Combined: ~$3/trace instead of ~$64/trace at the 1000-step / 8-teacher baseline
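The gating rule and the cost arithmetic in the list above can be sketched in a few lines. The entropy threshold and the assumption that gating and tiering savings multiply are illustrative choices, not values from the underlying report.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of the student's action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_query_teachers(student_probs, threshold=1.0):
    """VOI gate: only pay for teacher calls when the student is uncertain."""
    return entropy(student_probs) >= threshold


def gated_cost(baseline_cost, gating_savings, tier_factor):
    """Per-trace cost after gating (fractional savings) and tiering (divisor).

    Assumes the two mitigations combine multiplicatively.
    """
    return baseline_cost * (1 - gating_savings) / tier_factor


best_case = gated_cost(64.0, 0.80, 3.0)   # 80% gating savings, 3x tiering
worst_case = gated_cost(64.0, 0.60, 2.0)  # 60% gating savings, 2x tiering
```

Under that multiplicative assumption, the quoted ranges give roughly $4 to $13 per trace from the ~$64 baseline, so the ~$3 figure would need savings slightly beyond the stated best case and is best read as approximate.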
**Reward shape options** (also in the report):
1. Plurality vote (binary, simple)
2. Weighted consensus
3. **DPO preference pairs** ← recommended for v0.1: avoids reward model
4. Variance-weighted (uncertainty-aware)
5. **Trained PRM** ← recommended for production: amortizes cost
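Option 3 can be illustrated with a small extraction routine. The pairing rule here is an assumption made for the sketch, not the rule used in the repository's spike: a pair is emitted only when a plurality of teachers agree on an action that differs from the student's, with the teachers' action as `chosen` and the student's as `rejected`. The `prompt` / `chosen` / `rejected` keys follow the preference-pair layout that DPO trainers such as TRL's commonly expect.

```python
from collections import Counter


def extract_dpo_pairs(trace, teacher_actions_per_step, min_votes=2):
    """Turn step-level teacher disagreement with the student into DPO pairs.

    trace: list of (state, student_action) pairs from a frozen rollout.
    teacher_actions_per_step: for each step, the list of teacher actions.
    min_votes: minimum number of teachers that must back the chosen action.
    """
    pairs = []
    for (state, student_action), teacher_actions in zip(
        trace, teacher_actions_per_step
    ):
        top_action, votes = Counter(teacher_actions).most_common(1)[0]
        # Skip steps where teachers are split or already agree with the student.
        if votes >= min_votes and top_action != student_action:
            pairs.append(
                {"prompt": state, "chosen": top_action, "rejected": student_action}
            )
    return pairs
```

Because the output is plain preference pairs, no separate reward model has to be trained, which is the reason this option is suggested for v0.1.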
## Proposed phase plan
### v0.0 — proof of concept (1-2 weeks)
**Goal:** Prove the trace-replay-distillation channel adds signal on top of plain GRPO.
- Pick smallest viable base: Qwen3-7B or Qwen3-Coder-7B
- Use TRL's `GRPOTrainer` directly, no decentralization yet
- Environment: a single OpenEnv-compatible task (start with `swe-bench-lite` via verifiers, or stand up the "Feature Deletion" env on a small repo)
- Trace source: 100 student rollouts, frozen as JSON
- Replay each step with N=3 teachers (Claude Opus 4.7, GPT-5, DeepSeek-V4-Pro via OpenRouter — this is what you have)
- Reward channel: DPO pairs from teacher-disagreement at step level
- **A/B comparison:** plain GRPO vs GRPO + trace-replay-DPO. Measure: SWE-bench-lite pass rate, train wallclock, teacher token cost.
- Skip Composer hint-distill and DiLoCo for now — those are v0.1+.
### v0.1 — Composer-style recipe (1-2 months)
**Goal:** All three reward channels (RLVR, hint-distill, trace-replay), plus the OpenEnv environment.
- Migrate to PRIME-RL substrate: orchestrator + FSDP2 trainer + vLLM inference
- Build the **"Feature Deletion" env** as a first-class OpenEnv-compatible environment (this is genuinely useful as a public artifact)
- Implement the **hint-distillation loss**: error detector → text hint generator → KL distill at error turns
- Bake in **trace-replay-DPO** as the third channel
- Scale base to Qwen3-32B or Qwen3-Coder-30B-A3B (MoE)
- Single cluster, no DiLoCo
- Target: match Cursor's ~50% SWE-bench-multilingual at 32B scale
### v0.2 — decentralized scaling (3-6 months)
**Goal:** Run the v0.1 recipe across multiple clusters / volunteer compute.
- Add Streaming DiLoCo outer loop for trainer-side multi-cluster sync
- Add SHARDCAST for inference-pool weight broadcast across DCs
- Add TOPLOC-style verifiable inference if running with untrusted workers
- Migrate orchestration from Ray to Monarch when Monarch's K8s story matures
- Migrate environment hosting from inline-Docker to OpenEnv Hub
- Target: re-run v0.1 recipe but with 2-3 geographic clusters or 4-6 volunteer pods
## Open questions I'd want answered before starting
1. **Hint generator architecture** — Cursor never says how their text hints are generated. Templates? Smaller model? Same model with introspection prompt? This is the biggest reproducibility gap. Probably worth a separate spike.
2. **Trace data source** — Do you have your own agent traces to replay (e.g., from your dogfood / kanban-orchestrator runs)? Or do we synthesize from public datasets (SWE-Gym, OpenHands)? Quality of replay signal depends heavily on this.
3. **Teacher diversity vs cost** — Is N=3 (Anthropic + OpenAI + DeepSeek) sufficient, or do we need N=8 (add Google, xAI, Qwen, Kimi, MiniMax)? Probably try N=3 in v0.0 and ablate.
4. **Hardware target for v0.1** — single 8×H100 node? 2× 8×H100? What's available to you? This decides whether the Megatron-LM path matters or FSDP2 is fine.
5. **MoE vs dense** — Composer's whole serving-cost story depends on MoE (1T total / 32B active). Going MoE adds expert sharding complexity. Dense Qwen3-32B might be the saner v0.1 target.
## What we should NOT do
- **Don't build on TorchForge.** Meta paused it. Lift patterns, not dependencies.
- **Don't try to replicate Composer's exact training mix.** ~85% of their compute is post-training; you don't have that budget. Replicate the *recipe shape*, not the scale.
- **Don't add DiLoCo before you need it.** Single-cluster training is fine until token budget says otherwise.
- **Don't forget the reward-hacking safeguards.** Cursor's blog mentions models learning to decompile bytecode to reconstruct deleted APIs. Plan for adversarial reward hacking from day 1.
- **Don't skip RLVR ground-truth.** The trace-replay channel is *additional signal*, not a replacement for "tests pass."
## Sources
All five research notes:
- `~/wiki/research/post-training-framework/01-composer-2.5.md` (Cursor recipe deep-dive)
- `~/wiki/research/post-training-framework/02-diloco-family.md` (DiLoCo / OpenDiLoCo / PRIME-RL / INTELLECT-2)
- `~/wiki/research/post-training-framework/03-monarch-torchforge-openenv.md` (Meta's stack)
- `~/wiki/research/post-training-framework/04-verl-trl.md` (algorithm libraries)
- `~/wiki/research/post-training-framework/05-trace-replay-distillation.md` (your novelty assessment)
Each was authored by a different model family (Gemini 3.1 Pro, DeepSeek V4 Pro, GPT-5, Sonnet 4.6, Kimi K2-Thinking) for cross-family signal. Convergent findings across reports:
- **GRPO+DAPO is the consensus algorithm** (3/4 reports, the 4th doesn't compare)
- **PRIME-RL is the most production-ready decentralized substrate** (2 reports independently)
- **OpenEnv is the env-format winner** (3 reports converge)
- **Trace-replay-with-N-teachers is genuinely under-explored** (the trace-replay report's primary finding)
## Next-step decision
Three paths from here:
1. **Spike v0.0** — `skill_view('spike')` then build the smallest possible "GRPO + trace-replay-DPO" comparison on Qwen3-7B. ~1 week. Cheapest signal on whether the novelty actually adds value.
2. **Plan first** — `skill_view('writing-plans')` then write a full implementation plan as a markdown plan doc with phases / subagent assignments. ~2 hours. Useful if you want to dispatch this as a kanban-orchestrator job.
3. **Deeper research first** — there are several open questions above (hint generator, trace data source). Could dispatch another scatter to nail those down before any code.
My recommendation is **(1) Spike v0.0**, because the trace-replay-distillation idea is the highest-novelty piece and the cheapest to falsify. If trace-replay-DPO doesn't beat plain-GRPO on a 7B model with 100 traces and 3 teachers, the framework still has value (Composer recipe + PRIME-RL + OpenEnv), but the novel claim is dead and we should reorient. If it works, you publish.