# Composer-Replication Framework: HF model → Composer-2.5-class agentic coder via decentralized RL

> **Status:** Research synthesis (2026-05-25). Pre-spike. No code yet.
>
> **Goal:** Build a framework that takes any HuggingFace base model and RL-post-trains it to Composer-2.5 quality on agentic coding (or any agentic domain), using decentralized DiLoCo-shape compute, Meta's Monarch/Forge orchestration, an OpenEnv environment registry, VeRL/TRL algorithm primitives, and a novel **trace-replay multi-teacher distillation** signal.
>
> **Underlying research:** see `~/wiki/research/post-training-framework/{01..05}*.md` (5 deep-dives, ~2000-2500 words each, by 5 different model families: Gemini 3.1 Pro / DeepSeek V4 Pro / GPT-5 / Sonnet 4.6 / Kimi K2-Thinking).

## TL;DR

| Component | Decision | Rationale |
|---|---|---|
| **Base model** | HF MoE (Kimi K2.5, DeepSeek-V3.2, Qwen3-Max-MoE) OR dense (Qwen3-32B, Llama-3-70B) | Composer-style requires MLA+MoE for fast/cheap serving; dense is simpler for v0.1 |
| **Algorithm core** | GRPO + DAPO patches + Composer-style **on-policy distillation hint loss** | DAPO targets GRPO's length/std biases; Composer's hint-loss is the secret sauce |
| **Training framework** | **PRIME-RL** (Prime Intellect) as substrate; **TRL** for algorithm correctness; borrow **VeRL's 3D-HybridEngine** patterns | PRIME-RL ships the orchestrator/trainer/inference split + decentralized story; TRL has the cleanest GRPO+OpenEnv; VeRL's reshard logic is the production reference |
| **Distributed sync** | **PRIME-RL's vLLM↔FSDP2 weight broadcast (SHARDCAST)** for v0.1; bolt on **Streaming DiLoCo** outer loop only when scaling beyond one cluster | DiLoCo isn't useful when training fits one node. Add it when going multi-DC. |
| **Environments** | **OpenEnv + verifiers (Hub-hosted)** with Cursor-style "Anyrun" sandboxes | OpenEnv is the emerging standard; HF + Meta backing; MCP tool-calling RFC landing |
| **Reward signal** | Three-channel: (1) RLVR (tests pass), (2) **Composer hint-distill = SDPO/OPSD** (single-model, hint-conditioned self-teacher), (3) **Trace-replay multi-teacher PRM** (your novel idea: N external teachers) | Composer's hint-distill is published as SDPO ([arXiv:2601.20802](https://arxiv.org/abs/2601.20802) + [code](https://github.com/siyan-zhao/OPSD)); we lift it for v0.1. Channel (3) is genuinely novel and stacks on top. **They are TWO different mechanisms**, not competing implementations; see `docs/COMPOSER_RECIPE_MAPPING.md` for the precise distinction. |
| **Trace-replay novelty** | Genuinely under-explored. Closest precedent: rStar-Math (single-teacher MCTS counterfactuals). Multi-teacher *frozen-trace replay* is open territory | Worth publishing if it works |
| **Orchestration** | Monarch (when it matures) or Ray (today) for the actor mesh; **OpenEnv** for the env contract | Forge has been "development-paused"; borrow patterns, don't depend on it |

## What Composer 2.5 actually is, and what we're trying to replicate

From `01-composer-2.5.md`:

- **Base:** Moonshot's Kimi K2.5: 1T total / 32B active MoE, MLA attention, DeepSeek-V3-derived, MuonClip optimizer, 256K native ctx.
- **85% of total compute is post-training.** Pretraining is just the cheap starting point.
- **The recipe (5 stages):**
  1. **Continued pretraining** on heavily code-weighted data. Lower pretraining loss → better downstream RL.
  2. **Synthetic data at scale:** 25× more synthetic tasks vs Composer 2. The headline trick is **"Feature Deletion"**: take a repo with passing tests, delete features, force the agent to reconstruct them. Tests are the verifiable reward.
  3. **Realistic environment RL:** async sandboxes (their "Anyrun" system) with the *exact same tool harness* the model uses in production. Train on terse, ambiguous prompts requiring multi-file edits.
  4. **🔑 Targeted RL with textual feedback (on-policy distillation).** When a 100K-token rollout has a localized error (wrong tool name, style violation), Cursor:
     - Generates a text hint correcting the error
     - Inserts the hint at the error turn
     - Runs forward pass with hint → "Teacher" logits
     - Runs forward pass without hint → "Student" logits
     - Applies KL divergence loss to pull Student toward Teacher *only at that turn*

     This sidesteps the credit-assignment nightmare of long-horizon scalar rewards.
  5. **Sharded Muon + Dual Mesh HSDP:** separate sharding meshes for expert vs non-expert weights, optimized for Blackwell.
- **Result:** ~69% Terminal-Bench 2.0 (parity with GPT-5.5), $0.50/$2.50 per 1M input/output tokens (5-10× cheaper than peers).

**Replicating this means cloning stages 1-4. Stage 5 is just MLOps.** And stage 4, the hint-distillation trick, is the *least obvious* and probably the most important.

## How the 5 component pieces fit together

For the **rigorous integration architecture** (exact extension points in TRL (`GRPOTrainer._compute_loss` subclass), VeRL (`@register_adv_est` + `DataProto`), the OPSD loss `generalized_jsd_loss` lifted from `siyan-zhao/OPSD`, and the per-channel sequence diagrams), see [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md).

A working code skeleton with **38 passing unit tests** is at [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/). The tests verify the SDPO loss math, the trace-replay DPO-pair extraction, the data collator, and an end-to-end 5-step gradient run that decreases loss with all 3 channels active.

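The stage-4 hint-distillation step from the recipe above can be sketched in plain Python. This is a minimal illustration, not Cursor's implementation and not the `generalized_jsd_loss` from `siyan-zhao/OPSD`. The function names (`log_softmax`, `token_kl`, `hint_distill_loss`), the forward KL direction, and the per-token `error_mask` are all assumptions made for the example; in practice the two logit sequences would come from forward passes with and without the hint.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kl(teacher_logits: Sequence[float], student_logits: Sequence[float]) -> float:
    """KL(teacher || student) at a single token position (direction is an assumption)."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def hint_distill_loss(
    teacher_logits: Sequence[Sequence[float]],  # forward pass WITH the hint
    student_logits: Sequence[Sequence[float]],  # forward pass WITHOUT the hint
    error_mask: Sequence[int],                  # 1 only for tokens in the error turn
) -> float:
    """Mean per-token KL, restricted to positions flagged by error_mask."""
    terms = [
        token_kl(t, s)
        for t, s, keep in zip(teacher_logits, student_logits, error_mask)
        if keep
    ]
    return sum(terms) / len(terms) if terms else 0.0


# Toy example: two token positions, vocabulary of size 3.
teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
student = [[2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
print(hint_distill_loss(teacher, student, error_mask=[0, 1]))
```

Only the masked position contributes, so the rest of a long rollout leaves the loss untouched; that locality is the point of the technique as the recipe describes it.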
The high-level topology:

```
            ┌───────────────────────────────────────────┐
            │          OpenEnv Environment Hub          │
            │  (HF Hub, Docker images, MCP tool-calling)│
            │  - Anyrun-style code sandbox              │
            │  - SWE-Gym, SWE-Bench-Verified envs       │
            │  - "Feature Deletion" auto-grader env     │
            └────────────────┬──────────────────────────┘
                             │ rollouts (verifiers protocol)
                             ▼
   ┌────────────────────────────────────────────────────────────┐
   │                     ORCHESTRATOR (CPU)                     │
   │  - Schedules rollouts across inference workers             │
   │  - Assembles training batches                              │
   │  - Routes hint-distillation pairs (Composer-style)         │
   │  - Routes trace-replay teacher queries (NOVEL)             │
   │  - Built on Monarch (future) or Ray (today)                │
   └────┬──────────────────────────┬──────────────────────────┬─┘
        │ rollout requests         │ training batches         │ teacher queries
        ▼                          ▼                          ▼
┌─────────────────────┐  ┌────────────────────┐  ┌────────────────────────┐
│  INFERENCE POOL     │  │  TRAINER (GPU)     │  │  TEACHER POOL          │
│  (vLLM / SGLang)    │  │  - FSDP2 sharded   │  │  - Frozen N teachers   │
│  - Student policy   │  │  - GRPO + DAPO     │  │  - HF Inference,       │
│  - Auto-resharded   │  │  - +Hint distill   │  │    OpenRouter, vLLM    │
│    via SHARDCAST    │  │    KL loss         │  │  - Diverse families    │
│  - Async tool waits │  │  - +PRM/DPO from   │  │    (Anthropic / OpenAI │
│    don't block GPU  │  │    trace-replay    │  │    / DeepSeek / Qwen)  │
└─────────────────────┘  └────────────────────┘  └────────────────────────┘
                                   │
                                   │ pseudo-gradients (every H steps)
                                   ▼
                   ┌────────────────────────────────┐
                   │ OUTER LOOP (DiLoCo, optional)  │
                   │  - Only when training spans    │
                   │    multiple clusters / DCs     │
                   │  - Streaming variant for       │
                   │    bandwidth-limited links     │
                   └────────────────────────────────┘
```

### Why this stack

**PRIME-RL is the right substrate** (`02-diloco-family.md`). It's the only framework that already implements the orchestrator/trainer/inference split *for RL* with proven decentralized story (INTELLECT-2: 32B QwQ-trained globally). Their `verifiers` library is the same env contract we'd want anyway.
Their GRPO + AIPO importance-sampling correction handles the inevitable train↔inference logprob drift.

**TRL provides the cleanest algorithm reference** (`04-verl-trl.md`). `GRPOTrainer`, `OnlineDPOTrainer`, and the new OpenEnv integration (Oct 2025) are well-tested. We'd lift the *loss math* from TRL but run on PRIME-RL's distributed substrate.

**VeRL's 3D-HybridEngine is the production benchmark** for resharding between training-FSDP and inference-TP layouts. PRIME-RL does this too, but VeRL has more battle-testing at 70B+. We borrow the resharding pattern, not the framework.

**Monarch + OpenEnv is the future bet, Ray + verifiers is today** (`03-monarch-torchforge-openenv.md`). Forge is "development-paused" per Meta's banner; they're consolidating on TorchTitan. Don't build on Forge directly. But Monarch (the actor mesh) and OpenEnv (the env standard) are alive and well. v0.1 of our framework uses Ray + verifiers (PRIME-RL's stack); v0.2 swaps in Monarch + OpenEnv when those mature.

**DiLoCo is dormant infra until we scale beyond one cluster.** Original DiLoCo / Streaming DiLoCo / OpenDiLoCo all assume an outer loop *across data centers*. INTELLECT-2 used DiLoCo-shape sync between geographically distributed inference workers, but the actual *trainer* is still single-cluster FSDP2. We'd add Streaming DiLoCo only when:

- Training compute exceeds one cluster, OR
- We're recruiting volunteer compute (INTELLECT-1 model)

For v0.1: skip DiLoCo. Single-cluster PRIME-RL. The token budget is the bottleneck, not the trainer.

## Your trace-replay distillation idea: where it fits

From `05-trace-replay-distillation.md`:

> No published work systematically replays each step of frozen agentic traces with multiple teachers to harvest step-level supervision. While rStar uses MCTS for counterfactual evaluation and multi-teacher distillation exists, the **frozen-trace replay mechanism** is new territory.

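The frozen-trace replay described in the quote can be sketched as follows. The `Teacher` callable type, the trace schema (`state` / `action` keys), and the stub teachers are hypothetical stand-ins for real model clients. Agreement is measured here by simple plurality over exact-match action strings, which is only one of several possible choices.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical teacher interface: serialized environment state -> proposed action.
Teacher = Callable[[str], str]


def replay_step(state: str, teachers: Sequence[Teacher]) -> list[str]:
    """Show the same frozen state to every teacher and collect their proposals."""
    return [teacher(state) for teacher in teachers]


def plurality(actions: Sequence[str]) -> tuple[str, float]:
    """Return the most common action and the fraction of teachers that chose it."""
    action, votes = Counter(actions).most_common(1)[0]
    return action, votes / len(actions)


def replay_trace(trace: Sequence[dict], teachers: Sequence[Teacher]) -> list[dict]:
    """Annotate each step of a frozen trace with teacher consensus statistics."""
    annotated = []
    for step in trace:
        proposals = replay_step(step["state"], teachers)
        consensus, agreement = plurality(proposals)
        annotated.append({
            "state": step["state"],
            "student_action": step["action"],
            "teacher_actions": proposals,
            "consensus": consensus,
            "agreement": agreement,
        })
    return annotated


# Stub teachers for illustration only; real ones would call model endpoints.
teachers = [
    lambda s: "run_tests" if "edit done" in s else "open_file",
    lambda s: "run_tests" if "edit done" in s else "open_file",
    lambda s: "open_file",
]
trace = [
    {"state": "task: fix bug", "action": "open_file"},
    {"state": "edit done", "action": "commit"},
]
for row in replay_trace(trace, teachers):
    print(row["student_action"], row["consensus"], round(row["agreement"], 2))
```

Because the trace is frozen, no environment step runs during replay: each teacher only sees the recorded state, so the per-step cost is one teacher query per teacher.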
**The closest published precedents:**

| Work | What they do | What you'd add |
|---|---|---|
| **rStar / rStar-Math** (Microsoft) | MCTS at training time, single teacher branches at each step | Replay pre-existing traces, *multiple* teachers, no MCTS at training time |
| **Math-Shepherd / OmegaPRM** | Process reward models from rollout-and-check | Step-level *teacher disagreement* as the reward signal |
| **Magpie / OpenThoughts** | Synthetic data from one strong teacher | Per-step distillation from N teachers on real traces |
| **MoA (Mixture of Agents)** | Multi-teacher *response-level* aggregation | Per-step (sub-response) aggregation |

**The novel claim:**

1. Take agentic traces (yours, or SWE-Gym, OpenHands, Cursor session exports if you can get them).
2. At each step `t`, replay the *exact same state* with N frozen teachers.
3. Get N candidate `action_t` distributions.
4. Use disagreement / agreement as a **per-step reward signal** for the student model.

**This stacks beautifully with Composer's hint-distillation, but they are TWO DIFFERENT MECHANISMS, not competing implementations of the same idea.** Composer's hint-distill (= SDPO / OPSD, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802) + [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), code at [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)) uses **a single model** as both teacher and student, with the teacher just being "the model with a hint inserted into context." Trace-replay-distill uses **N external pretrained models** as teachers. Together:

- Composer's hint-loss = **same-model self-teacher with hint context** pulls student at error sites (~1 extra forward pass; cheap, no API)
- Trace-replay-loss = **N external teachers** pull student at all sites (or high-uncertainty sites with VOI gating; ~$0.30/trace with gating per spike 001)

These are **complementary**, not competing.
Both give per-step KL signals that bypass the long-horizon credit assignment problem, but they tap different supervision sources. v0.1 of the framework runs both simultaneously. See [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) for the precise mathematical distinction and the implementation-handle table.

**Cost mitigation** (the report does this analysis well):

- VOI gating (only query teachers when student entropy is high) → 60-80% savings
- Tiered teachers (cheap teacher first, escalate on disagreement) → 2-3× savings
- Combined: ~$3/trace instead of ~$64/trace at the 1000-step / 8-teacher baseline

**Reward shape options** (also in the report):

1. Plurality vote (binary, simple)
2. Weighted consensus
3. **DPO preference pairs** ← recommended for v0.1: avoids reward model
4. Variance-weighted (uncertainty-aware)
5. **Trained PRM** ← recommended for production: amortizes cost

## Proposed phase plan

### v0.0: proof of concept (1-2 weeks)

**Goal:** Prove the trace-replay-distillation channel adds signal on top of plain GRPO.

- Pick smallest viable base: Qwen3-7B or Qwen3-Coder-7B
- Use TRL's `GRPOTrainer` directly, no decentralization yet
- Environment: a single OpenEnv-compatible task (start with `swe-bench-lite` via verifiers, or stand up the "Feature Deletion" env on a small repo)
- Trace source: 100 student rollouts, frozen as JSON
- Replay each step with N=3 teachers (Claude Opus 4.7, GPT-5, DeepSeek-V4-Pro via OpenRouter; this is what you have)
- Reward channel: DPO pairs from teacher-disagreement at step level
- **A/B comparison:** plain GRPO vs GRPO + trace-replay-DPO. Measure: SWE-bench-lite pass rate, train wallclock, teacher token cost.
- Skip Composer hint-distill and DiLoCo for now; those are v0.1+.

### v0.1: Composer-style recipe (1-2 months)

**Goal:** All three reward channels (RLVR, hint-distill, trace-replay), plus the OpenEnv environment.

- Migrate to PRIME-RL substrate: orchestrator + FSDP2 trainer + vLLM inference
- Build the **"Feature Deletion" env** as a first-class OpenEnv-compatible environment (this is genuinely useful as a public artifact)
- Implement the **hint-distillation loss**: error detector → text hint generator → KL distill at error turns
- Bake in **trace-replay-DPO** as the third channel
- Scale base to Qwen3-32B or Qwen3-Coder-30B-A3B (MoE)
- Single cluster, no DiLoCo
- Target: match Cursor's ~50% SWE-bench-multilingual at 32B scale

### v0.2: decentralized scaling (3-6 months)

**Goal:** Run the v0.1 recipe across multiple clusters / volunteer compute.

- Add Streaming DiLoCo outer loop for trainer-side multi-cluster sync
- Add SHARDCAST for inference-pool weight broadcast across DCs
- Add TOPLOC-style verifiable inference if running with untrusted workers
- Migrate orchestration from Ray to Monarch when Monarch's K8s story matures
- Migrate environment hosting from inline-Docker to OpenEnv Hub
- Target: re-run v0.1 recipe but with 2-3 geographic clusters or 4-6 volunteer pods

## Open questions I'd want answered before starting

1. **Hint generator architecture:** Cursor never says how their text hints are generated. Templates? Smaller model? Same model with introspection prompt? This is the biggest reproducibility gap. Probably worth a separate spike.
2. **Trace data source:** Do you have your own agent traces to replay (e.g., from your dogfood / kanban-orchestrator runs)? Or do we synthesize from public datasets (SWE-Gym, OpenHands)? Quality of replay signal depends heavily on this.
3. **Teacher diversity vs cost:** Is N=3 (Anthropic + OpenAI + DeepSeek) sufficient, or do we need N=8 (add Google, xAI, Qwen, Kimi, MiniMax)? Probably try N=3 in v0.0 and ablate.
4. **Hardware target for v0.1:** single 8×H100 node? 2× 8×H100? What's available to you? This decides whether the Megatron-LM path matters or FSDP2 is fine.
5. **MoE vs dense:** Composer's whole serving-cost story depends on MoE (1T total / 32B active). Going MoE adds expert sharding complexity. Dense Qwen3-32B might be the saner v0.1 target.

## What we should NOT do

- **Don't build on TorchForge.** Meta paused it. Lift patterns, not dependencies.
- **Don't try to replicate Composer's exact training mix.** ~85% of their compute is post-training; you don't have that budget. Replicate the *recipe shape*, not the scale.
- **Don't add DiLoCo before you need it.** Single-cluster training is fine until token budget says otherwise.
- **Don't forget the reward-hacking safeguards.** Cursor's blog mentions models learning to decompile bytecode to reconstruct deleted APIs. Plan for adversarial reward hacking from day 1.
- **Don't skip RLVR ground-truth.** The trace-replay channel is *additional signal*, not a replacement for "tests pass."

## Sources

All five research notes:

- `~/wiki/research/post-training-framework/01-composer-2.5.md` (Cursor recipe deep-dive)
- `~/wiki/research/post-training-framework/02-diloco-family.md` (DiLoCo / OpenDiLoCo / PRIME-RL / INTELLECT-2)
- `~/wiki/research/post-training-framework/03-monarch-torchforge-openenv.md` (Meta's stack)
- `~/wiki/research/post-training-framework/04-verl-trl.md` (algorithm libraries)
- `~/wiki/research/post-training-framework/05-trace-replay-distillation.md` (your novelty assessment)

Each was authored by a different model family (Gemini 3.1 Pro, DeepSeek V4 Pro, GPT-5, Sonnet 4.6, Kimi K2-Thinking) for cross-family signal. Convergent findings across reports:

- **GRPO+DAPO is the consensus algorithm** (3/4 reports, the 4th doesn't compare)
- **PRIME-RL is the most production-ready decentralized substrate** (2 reports independently)
- **OpenEnv is the env-format winner** (3 reports converge)
- **Trace-replay-with-N-teachers is genuinely under-explored** (the trace-replay report's primary finding)

## Next-step decision

Three paths from here:

1. **Spike v0.0:** `skill_view('spike')` then build the smallest possible "GRPO + trace-replay-DPO" comparison on Qwen3-7B. ~1 week. Cheapest signal on whether the novelty actually adds value.
2. **Plan first:** `skill_view('writing-plans')` then write a full implementation plan as a markdown plan doc with phases / subagent assignments. ~2 hours. Useful if you want to dispatch this as a kanban-orchestrator job.
3. **Deeper research first:** there are several open questions above (hint generator, trace data source). Could dispatch another scatter to nail those down before any code.

My recommendation is **(1) Spike v0.0**, because the trace-replay-distillation idea is the highest-novelty piece and the cheapest to falsify. If trace-replay-DPO doesn't beat plain GRPO on a 7B model with 100 traces and 3 teachers, the framework still has value (Composer recipe + PRIME-RL + OpenEnv), but the novel claim is dead and we should reorient. If it works, you publish.
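
As a starting point for the spike, the step-level DPO-pair extraction could look like the sketch below. The notes above do not specify how pairs are formed, so this assumes one particular rule: when enough teachers agree on an action that differs from the student's, the consensus action becomes `chosen` and the student's action becomes `rejected`. The `PreferencePair` type, the step field names, and the `min_agreement` threshold are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PreferencePair:
    prompt: str    # the frozen state shown to student and teachers
    chosen: str    # teacher consensus action
    rejected: str  # the student's original action


def extract_pairs(steps: Sequence[dict], min_agreement: float = 0.6) -> list[PreferencePair]:
    """Turn replayed steps into preference pairs (assumed rule, see lead-in)."""
    pairs = []
    for step in steps:
        votes = Counter(step["teacher_actions"])
        consensus, count = votes.most_common(1)[0]
        agreement = count / len(step["teacher_actions"])
        # Skip steps where teachers are split or the student already matches them.
        if agreement < min_agreement or consensus == step["student_action"]:
            continue
        pairs.append(PreferencePair(step["state"], consensus, step["student_action"]))
    return pairs


# Toy replayed steps; in the spike these would come from the 100 frozen rollouts.
steps = [
    {"state": "s0", "student_action": "open_file",
     "teacher_actions": ["open_file", "open_file", "open_file"]},
    {"state": "s1", "student_action": "commit",
     "teacher_actions": ["run_tests", "run_tests", "open_file"]},
    {"state": "s2", "student_action": "commit",
     "teacher_actions": ["a", "b", "c"]},
]
pairs = extract_pairs(steps)
print(pairs)
```

The prompt/chosen/rejected fields mirror the layout that preference datasets for DPO commonly use, so the output should be straightforward to adapt to whichever trainer the spike settles on. Raising `min_agreement` trades pair count for label confidence, which is one of the knobs the A/B comparison could ablate.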