# Composer-Replication Framework: HF model → Composer-2.5-class agentic coder via decentralized RL

> **Status:** Research synthesis (2026-05-25). Pre-spike. No code yet.
> **Goal:** Build a framework that takes any HuggingFace base model and RL-post-trains it to Composer-2.5 quality on agentic coding (or any agentic domain), using decentralized DiLoCo-shape compute, Meta's Monarch/Forge orchestration, an OpenEnv environment registry, VeRL/TRL algorithm primitives, and a novel **trace-replay multi-teacher distillation** signal.
> **Underlying research:** see `~/wiki/research/post-training-framework/{01..05}*.md` (5 deep-dives, ~2000-2500 words each, by 5 different model families: Gemini 3.1 Pro / DeepSeek V4 Pro / GPT-5 / Sonnet 4.6 / Kimi K2-Thinking).

## TL;DR

| Component | Decision | Rationale |
|---|---|---|
| **Base model** | HF MoE (Kimi K2.5, DeepSeek-V3.2, Qwen3-Max-MoE) OR dense (Qwen3-32B, Llama-3-70B) | Composer-style requires MLA+MoE for fast/cheap serving; dense is simpler for v0.1 |
| **Algorithm core** | GRPO + DAPO patches + Composer-style **on-policy distillation hint loss** | DAPO solves GRPO's length/std biases; Composer's hint-loss is the secret sauce |
| **Training framework** | **PRIME-RL** (Prime Intellect) as substrate; **TRL** for algorithm correctness; borrow **VeRL's 3D-HybridEngine** patterns | PRIME-RL ships the orchestrator/trainer/inference split + decentralized story; TRL has cleanest GRPO+OpenEnv; VeRL's reshard logic is the production reference |
| **Distributed sync** | **PRIME-RL's vLLM↔FSDP2 weight broadcast (SHARDCAST)** for v0.1; bolt on **Streaming DiLoCo** outer loop only when scaling beyond one cluster | DiLoCo isn't useful when training fits one node. Add it when going multi-DC. |
| **Environments** | **OpenEnv + verifiers (Hub-hosted)** with Cursor-style "Anyrun" sandboxes | OpenEnv is the emerging standard; HF + Meta backing; MCP tool-calling RFC landing |
| **Reward signal** | Three-channel: (1) RLVR (tests pass), (2) **Composer hint-distill = SDPO/OPSD** (single-model, hint-conditioned self-teacher), (3) **Trace-replay multi-teacher PRM** (your novel idea — N external teachers) | Composer's hint-distill is published as SDPO ([arXiv:2601.20802](https://arxiv.org/abs/2601.20802) + [code](https://github.com/siyan-zhao/OPSD)); we lift it for v0.1. Channel (3) is genuinely novel and stacks on top. **They are TWO different mechanisms**, not competing implementations — see `docs/COMPOSER_RECIPE_MAPPING.md` for the precise distinction. |
| **Trace-replay novelty** | Genuinely under-explored. Closest precedent: rStar-Math (single-teacher MCTS counterfactuals). Multi-teacher *frozen-trace replay* is open territory | Worth publishing if it works |
| **Orchestration** | Monarch (when it matures) or Ray (today) for the actor mesh; **OpenEnv** for the env contract | Forge has been "development-paused" — borrow patterns, don't depend on it |

## What Composer 2.5 actually is, and what we're trying to replicate

From `01-composer-2.5.md`:

- **Base:** Moonshot's Kimi K2.5 — 1T total / 32B active MoE, MLA attention, DeepSeek-V3-derived, MuonClip optimizer, 256K native ctx.
- **85% of total compute is post-training.** Pretraining is just the cheap starting point.
- **The recipe (5 stages):**
  1. **Continued pretraining** on heavily code-weighted data. Lower pretraining loss → better downstream RL.
  2. **Synthetic data at scale** — 25× more synthetic tasks vs Composer 2. The headline trick: **"Feature Deletion"** — take a repo with passing tests, delete features, force the agent to reconstruct them. Tests are the verifiable reward.
  3. **Realistic environment RL** — async sandboxes (their "Anyrun" system) with the *exact same tool harness* the model uses in production. Train on terse, ambiguous prompts requiring multi-file edits.
  4. **🔑 Targeted RL with textual feedback (on-policy distillation).** When a 100K-token rollout has a localized error (wrong tool name, style violation), Cursor:
     - Generates a text hint correcting the error
     - Inserts the hint at the error turn
     - Runs forward pass with hint → "Teacher" logits
     - Runs forward pass without hint → "Student" logits
     - Applies KL divergence loss to pull Student toward Teacher *only at that turn*
     - This sidesteps the credit-assignment nightmare of long-horizon scalar rewards
  5. **Sharded Muon + Dual Mesh HSDP** — separate sharding meshes for expert vs non-expert weights, optimized for Blackwell.
- **Result:** ~69% Terminal-Bench 2.0 (parity with GPT-5.5), $0.50/$2.50 per 1M input/output tokens (5-10× cheaper than peers).

**Replicating this means cloning stages 1-4. Stage 5 is just MLOps.** And stage 4, the hint-distillation trick, is the *least obvious* and probably the most important.
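
The stage-4 mechanism reduces to a KL term restricted to the tokens of the corrected turn. Below is a minimal pure-Python sketch, assuming per-token logits are already available for both forward passes. The function names, the forward KL direction (teacher to student), and the mean reduction are illustrative assumptions, since Cursor has not published these details. A real trainer would compute this on tensors with the teacher pass detached from the gradient.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def hint_distill_loss(
    teacher_logits: Sequence[Sequence[float]],  # forward pass WITH the hint in context
    student_logits: Sequence[Sequence[float]],  # forward pass WITHOUT the hint
    turn_mask: Sequence[int],                   # 1 for tokens in the error turn, else 0
) -> float:
    """Mean KL(teacher || student) over the masked token positions only."""
    total, count = 0.0, 0
    for t_row, s_row, keep in zip(teacher_logits, student_logits, turn_mask):
        if not keep:
            continue  # tokens outside the error turn contribute nothing
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
        count += 1
    return total / count if count else 0.0
```

Because the mask zeroes every position outside the error turn, the rest of the 100K-token rollout carries no gradient from this term, which is what keeps the signal localized.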

## How the 5 component pieces fit together

For the **rigorous integration architecture** — exact extension points in TRL (`GRPOTrainer._compute_loss` subclass), VeRL (`@register_adv_est` + `DataProto`), the OPSD loss `generalized_jsd_loss` lifted from `siyan-zhao/OPSD`, and the per-channel sequence diagrams — see [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md). A working code skeleton with **38 passing unit tests** verifying the SDPO loss math, the trace-replay DPO-pair extraction, the data collator, and an end-to-end 5-step gradient run that decreases loss with all 3 channels active is at [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/).

The high-level topology:

```
                    ┌───────────────────────────────────────────┐
                    │           OpenEnv Environment Hub         │
                    │  (HF Hub, Docker images, MCP tool-calling)│
                    │  - Anyrun-style code sandbox              │
                    │  - SWE-Gym, SWE-Bench-Verified envs       │
                    │  - "Feature Deletion" auto-grader env     │
                    └────────────────┬──────────────────────────┘
                                     │ rollouts (verifiers protocol)
                                     ▼
        ┌────────────────────────────────────────────────────────────┐
        │                    ORCHESTRATOR (CPU)                      │
        │  - Schedules rollouts across inference workers            │
        │  - Assembles training batches                             │
        │  - Routes hint-distillation pairs (Composer-style)        │
        │  - Routes trace-replay teacher queries (NOVEL)            │
        │  - Built on Monarch (future) or Ray (today)               │
        └────┬──────────────────────────┬──────────────────────────┬─┘
             │ rollout requests         │ training batches         │ teacher queries
             ▼                          ▼                          ▼
   ┌─────────────────────┐   ┌────────────────────┐   ┌────────────────────────┐
   │  INFERENCE POOL     │   │  TRAINER (GPU)     │   │  TEACHER POOL          │
   │  (vLLM / SGLang)    │   │  - FSDP2 sharded   │   │  - Frozen N teachers   │
   │  - Student policy   │   │  - GRPO + DAPO     │   │  - HF Inference,       │
   │  - Auto-resharded   │   │  - +Hint distill   │   │    OpenRouter, vLLM    │
   │    via SHARDCAST    │   │    KL loss         │   │  - Diverse families    │
   │  - Async tool waits │   │  - +PRM/DPO from   │   │    (Anthropic / OpenAI │
   │    don't block GPU  │   │    trace-replay    │   │     / DeepSeek / Qwen) │
   └─────────────────────┘   └────────────────────┘   └────────────────────────┘
                                     │
                                     │ pseudo-gradients (every H steps)
                                     ▼
                    ┌────────────────────────────────┐
                    │  OUTER LOOP (DiLoCo, optional) │
                    │  - Only when training spans    │
                    │    multiple clusters / DCs     │
                    │  - Streaming variant for       │
                    │    bandwidth-limited links     │
                    └────────────────────────────────┘
```

### Why this stack

**PRIME-RL is the right substrate** (`02-diloco-family.md`). It's the only framework that already implements the orchestrator/trainer/inference split *for RL* with a proven decentralized story (INTELLECT-2, a 32B model RL-trained from QwQ on globally distributed compute). Their `verifiers` library is the same env contract we'd want anyway. Their GRPO + AIPO importance-sampling correction handles the inevitable train↔inference logprob drift.

**TRL provides the cleanest algorithm reference** (`04-verl-trl.md`). `GRPOTrainer`, `OnlineDPOTrainer`, and the new OpenEnv integration (Oct 2025) are well-tested. We'd lift the *loss math* from TRL but run on PRIME-RL's distributed substrate.
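
To make "the loss math" concrete, here is a pure-Python sketch of the group-relative advantage at the heart of GRPO, plus two DAPO modifications: dropping zero-variance groups (dynamic sampling) and an asymmetric clip range (clip-higher). This is not TRL's code. The function names are hypothetical, the population standard deviation is a simplifying choice, and the clip values are the DAPO paper's defaults as best I recall them, so treat them as assumptions.

```python
import statistics
from typing import List, Optional


def group_advantages(rewards: List[float], eps: float = 1e-6) -> Optional[List[float]]:
    """Z-score the rewards of one prompt's rollout group.

    Returns None when every rollout got the same reward: such a group has no
    learning signal, and DAPO-style dynamic sampling drops it from the batch.
    """
    if len(set(rewards)) <= 1:
        return None
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_objective(
    ratio: float,            # pi_new(token) / pi_old(token)
    advantage: float,
    eps_low: float = 0.2,    # assumed default
    eps_high: float = 0.28,  # assumed default; larger than eps_low ("clip-higher")
) -> float:
    """PPO-style clipped surrogate for one token (to be maximized)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```

For example, `group_advantages([1.0, 0.0, 1.0, 0.0])` yields advantages close to `+1` and `-1`, while a group where all tests pass (or all fail) returns `None` and is resampled rather than trained on.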

**VeRL's 3D-HybridEngine is the production benchmark** for resharding between training-FSDP and inference-TP layouts. PRIME-RL does this too but VeRL has more battle-testing at 70B+. We borrow the resharding pattern, not the framework.

**Monarch + OpenEnv is the future bet, Ray + verifiers is today** (`03-monarch-torchforge-openenv.md`). Forge is "development-paused" per Meta's banner — they're consolidating on TorchTitan. Don't build on Forge directly. But Monarch (the actor mesh) and OpenEnv (the env standard) are alive and well. v0.1 of our framework uses Ray + verifiers (PRIME-RL's stack); v0.2 swaps in Monarch + OpenEnv when those mature.

**DiLoCo is dormant infra until we scale beyond one cluster.** Original DiLoCo / Streaming DiLoCo / OpenDiLoCo all assume an outer loop *across data centers*. INTELLECT-2 used DiLoCo-shape sync between geographically distributed inference workers, but the actual *trainer* is still single-cluster FSDP2. We'd add Streaming DiLoCo only when:
- Training compute exceeds one cluster, OR
- We're recruiting volunteer compute (INTELLECT-1 model)

For v0.1: skip DiLoCo. Single-cluster PRIME-RL. The token budget is the bottleneck, not the trainer.
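
For reference when the outer loop does become necessary, the DiLoCo update can be sketched in a few lines: each worker trains locally for H steps, the displacement from the shared starting point is averaged into a pseudo-gradient, and an outer optimizer applies it. The sketch below uses flat lists of floats and Nesterov-momentum SGD as the outer optimizer. The function name and the default `outer_lr` and `momentum` values are illustrative assumptions, not settings taken from PRIME-RL.

```python
from typing import List, Tuple


def diloco_outer_step(
    global_params: List[float],        # shared weights at the start of the round
    worker_params: List[List[float]],  # each worker's weights after H local steps
    momentum_buf: List[float],         # outer-optimizer state, same shape as params
    outer_lr: float = 0.7,             # illustrative default
    momentum: float = 0.9,             # illustrative default
) -> Tuple[List[float], List[float]]:
    """One outer update: average pseudo-gradients, then a Nesterov SGD step."""
    n = len(worker_params)
    new_params, new_buf = [], []
    for j, theta in enumerate(global_params):
        # Pseudo-gradient: how far the average worker moved from the shared start.
        pseudo_grad = sum(theta - w[j] for w in worker_params) / n
        buf = momentum * momentum_buf[j] + pseudo_grad
        step = pseudo_grad + momentum * buf  # Nesterov look-ahead
        new_params.append(theta - outer_lr * step)
        new_buf.append(buf)
    return new_params, new_buf
```

With `outer_lr=1.0` and `momentum=0.0` this reduces to plain parameter averaging across workers, which is a handy sanity check when wiring it up.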

## Your trace-replay distillation idea: where it fits

From `05-trace-replay-distillation.md`:

> No published work systematically replays each step of frozen agentic traces with multiple teachers to harvest step-level supervision. While rStar uses MCTS for counterfactual evaluation and multi-teacher distillation exists, the **frozen-trace replay mechanism** is new territory.

**The closest published precedents:**

| Work | What they do | What you'd add |
|---|---|---|
| **rStar / rStar-Math** (Microsoft) | MCTS at training time, single teacher branches at each step | Replay pre-existing traces, *multiple* teachers, no MCTS at training time |
| **Math-Shepherd / OmegaPRM** | Process reward models from rollout-and-check | Step-level *teacher disagreement* as the reward signal |
| **Magpie / OpenThoughts** | Synthetic data from one strong teacher | Per-step distillation from N teachers on real traces |
| **MoA (Mixture of Agents)** | Multi-teacher *response-level* aggregation | Per-step (sub-response) aggregation |

**The novel claim:**
1. Take agentic traces (yours, or SWE-Gym, OpenHands, Cursor session exports if you can get them).
2. At each step `t`, replay the *exact same state* with N frozen teachers.
3. Get N candidate `action_t` distributions.
4. Use disagreement / agreement as a **per-step reward signal** for the student model.
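
The four steps above can be sketched as a replay loop. This is a deliberately simplified illustration: each teacher is modeled as a hypothetical callable that maps a state string to a single proposed action (rather than a full action distribution), and agreement is measured by exact string match. A real implementation would need a fuzzier notion of action equivalence.

```python
from typing import Callable, Dict, List

# Hypothetical teacher interface: frozen state in, one proposed action out.
Teacher = Callable[[str], str]


def agreement(student_action: str, proposals: List[str]) -> float:
    """Fraction of teachers whose proposal matches the student's action."""
    if not proposals:
        return 0.0
    return sum(p == student_action for p in proposals) / len(proposals)


def replay_rewards(trace: List[Dict[str, str]], teachers: List[Teacher]) -> List[float]:
    """Replay every step of a frozen trace and score it by teacher agreement.

    Each trace step is a dict with hypothetical keys "state" and "action".
    """
    rewards = []
    for step in trace:
        # Every teacher sees the exact same frozen state the student saw.
        proposals = [teacher(step["state"]) for teacher in teachers]
        rewards.append(agreement(step["action"], proposals))
    return rewards
```

Because the trace is frozen, teacher proposals never alter later states, so all steps can be replayed independently and in parallel.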

**This stacks beautifully with Composer's hint-distillation — but they are TWO DIFFERENT MECHANISMS, not competing implementations of the same idea.** Composer's hint-distill (= SDPO / OPSD, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802) + [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), code at [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)) uses **a single model** as both teacher and student, with the teacher just being "the model with a hint inserted into context." Trace-replay-distill uses **N external pretrained models** as teachers. Together:

- Composer's hint-loss = **same-model self-teacher with hint context**; pulls the student at error sites (~1 extra forward pass, so cheap; no API calls)
- Trace-replay-loss = **N external teachers**; pulls the student at all sites, or only at high-uncertainty sites with value-of-information (VOI) gating (~$0.30/trace with gating, per spike 001)

These are **complementary**, not competing. Both give per-step signals (a KL term for hint-distill; DPO pairs or a PRM score for trace-replay) that bypass the long-horizon credit assignment problem, but they tap different supervision sources. v0.1 of the framework runs both simultaneously. See [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) for the precise mathematical distinction and the implementation-handle table.

**Cost mitigation** (the report does this analysis well):
- VOI gating (only query teachers when student entropy is high) → 60-80% savings
- Tiered teachers (cheap teacher first, escalate on disagreement) → 2-3× savings
- Combined: ~$3/trace instead of ~$64/trace at the 1000-step / 8-teacher baseline
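
A small sketch of the gating rule and the cost arithmetic, under stated assumptions: the entropy threshold is an arbitrary placeholder, the function names are hypothetical, and the per-query price of $0.008 is simply the $64 baseline divided by 1000 steps × 8 teachers. The cost model is purely multiplicative.

```python
import math
from typing import Sequence


def entropy_nats(probs: Sequence[float]) -> float:
    """Shannon entropy of the student's next-action distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_query_teachers(student_probs: Sequence[float], threshold_nats: float = 1.0) -> bool:
    """VOI gate: only pay for teachers when the student is uncertain."""
    return entropy_nats(student_probs) >= threshold_nats


def cost_per_trace(
    steps: int,
    teachers: int,
    price_per_query: float,
    queried_fraction: float = 1.0,  # share of steps that pass the VOI gate
    tier_savings: float = 1.0,      # e.g. 2.0 means tiering halves the spend
) -> float:
    """Expected teacher spend for one trace under a multiplicative model."""
    return steps * teachers * price_per_query * queried_fraction / tier_savings
```

`cost_per_trace(1000, 8, 0.008)` reproduces the ~$64 baseline. Under this model, the optimistic end of both ranges (querying 20% of steps, 3× tiering) comes to about $4.27 per trace, so the ~$3 figure presumably reflects savings this sketch does not capture.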

**Reward shape options** (also in the report):
1. Plurality vote (binary, simple)
2. Weighted consensus
3. **DPO preference pairs** ← recommended for v0.1: avoids reward model
4. Variance-weighted (uncertainty-aware)
5. **Trained PRM** ← recommended for production: amortizes cost
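
Option 3 can be illustrated with a pair-extraction rule and the standard DPO loss. This is a sketch, not the extraction logic in the spike: the `min_agreement` threshold, the function names, and the choice to always use the student's action as the rejected sample are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def dpo_pair(
    student_action: str,
    teacher_actions: List[str],
    min_agreement: float = 0.5,  # assumed threshold for a usable consensus
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) when a teacher majority disagrees with the student."""
    if not teacher_actions:
        return None
    consensus, votes = Counter(teacher_actions).most_common(1)[0]
    if votes / len(teacher_actions) <= min_agreement:
        return None  # teachers are split: no reliable preference
    if consensus == student_action:
        return None  # student already agrees: nothing to correct
    return consensus, student_action


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log sigmoid(beta * log-ratio margin)."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Stable evaluation of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

No reward model appears anywhere in this path: the preference comes directly from the teacher vote, which is why this option suits v0.1.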

## Proposed phase plan

### v0.0 — proof of concept (1-2 weeks)

**Goal:** Prove the trace-replay-distillation channel adds signal on top of plain GRPO.

- Pick smallest viable base: Qwen3-7B or Qwen3-Coder-7B
- Use TRL's `GRPOTrainer` directly, no decentralization yet
- Environment: a single OpenEnv-compatible task (start with `swe-bench-lite` via verifiers, or stand up the "Feature Deletion" env on a small repo)
- Trace source: 100 student rollouts, frozen as JSON
- Replay each step with N=3 teachers (Claude Opus 4.7, GPT-5, DeepSeek-V4-Pro via OpenRouter — this is what you have)
- Reward channel: DPO pairs from teacher-disagreement at step level
- **A/B comparison:** plain GRPO vs GRPO + trace-replay-DPO. Measure: SWE-bench-lite pass rate, train wallclock, teacher token cost.
- Skip Composer hint-distill and DiLoCo for now — those are v0.1+.
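
One possible shape for the frozen-trace JSON is sketched below. Every field name here is hypothetical; the only requirement from the plan above is that each step stores enough state for a teacher to be replayed against it.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class TraceStep:
    state: str        # serialized context the student saw at this step
    action: str       # the action the student actually took
    observation: str  # what the environment returned


@dataclass
class FrozenTrace:
    task_id: str
    steps: List[TraceStep]


def dump_trace(trace: FrozenTrace) -> str:
    """Serialize a trace to a JSON string with stable key ordering."""
    return json.dumps(asdict(trace), sort_keys=True)


def load_trace(payload: str) -> FrozenTrace:
    """Rebuild a FrozenTrace from the JSON produced by dump_trace."""
    raw = json.loads(payload)
    return FrozenTrace(
        task_id=raw["task_id"],
        steps=[TraceStep(**step) for step in raw["steps"]],
    )
```

Freezing traces to disk once means the A/B arms and any later teacher ablations all replay identical states.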

### v0.1 — Composer-style recipe (1-2 months)

**Goal:** All three reward channels (RLVR, hint-distill, trace-replay), plus the OpenEnv environment.

- Migrate to PRIME-RL substrate: orchestrator + FSDP2 trainer + vLLM inference
- Build the **"Feature Deletion" env** as a first-class OpenEnv-compatible environment (this is genuinely useful as a public artifact)
- Implement the **hint-distillation loss**: error detector → text hint generator → KL distill at error turns
- Bake in **trace-replay-DPO** as the third channel
- Scale base to Qwen3-32B or Qwen3-Coder-30B-A3B (MoE)
- Single cluster, no DiLoCo
- Target: match Cursor's ~50% SWE-bench-multilingual at 32B scale

### v0.2 — decentralized scaling (3-6 months)

**Goal:** Run the v0.1 recipe across multiple clusters / volunteer compute.

- Add Streaming DiLoCo outer loop for trainer-side multi-cluster sync
- Add SHARDCAST for inference-pool weight broadcast across DCs
- Add TOPLOC-style verifiable inference if running with untrusted workers
- Migrate orchestration from Ray to Monarch when Monarch's K8s story matures
- Migrate environment hosting from inline-Docker to OpenEnv Hub
- Target: re-run v0.1 recipe but with 2-3 geographic clusters or 4-6 volunteer pods

## Open questions I'd want answered before starting

1. **Hint generator architecture** — Cursor never says how their text hints are generated. Templates? Smaller model? Same model with introspection prompt? This is the biggest reproducibility gap. Probably worth a separate spike.
2. **Trace data source** — Do you have your own agent traces to replay (e.g., from your dogfood / kanban-orchestrator runs)? Or do we synthesize from public datasets (SWE-Gym, OpenHands)? Quality of replay signal depends heavily on this.
3. **Teacher diversity vs cost** — Is N=3 (Anthropic + OpenAI + DeepSeek) sufficient, or do we need N=8 (add Google, xAI, Qwen, Kimi, MiniMax)? Probably try N=3 in v0.0 and ablate.
4. **Hardware target for v0.1** — single 8×H100 node? 2× 8×H100? What's available to you? This decides whether the Megatron-LM path matters or FSDP2 is fine.
5. **MoE vs dense** — Composer's whole serving-cost story depends on MoE (1T total / 32B active). Going MoE adds expert sharding complexity. Dense Qwen3-32B might be the saner v0.1 target.

## What we should NOT do

- **Don't build on TorchForge.** Meta paused it. Lift patterns, not dependencies.
- **Don't try to replicate Composer's exact training mix.** ~85% of their compute is post-training; you don't have that budget. Replicate the *recipe shape*, not the scale.
- **Don't add DiLoCo before you need it.** Single-cluster training is fine until token budget says otherwise.
- **Don't forget the reward-hacking safeguards.** Cursor's blog mentions models learning to decompile bytecode to reconstruct deleted APIs. Plan for adversarial reward hacking from day 1.
- **Don't skip RLVR ground-truth.** The trace-replay channel is *additional signal*, not a replacement for "tests pass."

## Sources

All five research notes:
- `~/wiki/research/post-training-framework/01-composer-2.5.md` (Cursor recipe deep-dive)
- `~/wiki/research/post-training-framework/02-diloco-family.md` (DiLoCo / OpenDiLoCo / PRIME-RL / INTELLECT-2)
- `~/wiki/research/post-training-framework/03-monarch-torchforge-openenv.md` (Meta's stack)
- `~/wiki/research/post-training-framework/04-verl-trl.md` (algorithm libraries)
- `~/wiki/research/post-training-framework/05-trace-replay-distillation.md` (your novelty assessment)

Each was authored by a different model family (Gemini 3.1 Pro, DeepSeek V4 Pro, GPT-5, Sonnet 4.6, Kimi K2-Thinking) for cross-family signal. Convergent findings across reports:
- **GRPO+DAPO is the consensus algorithm** (3/4 reports, the 4th doesn't compare)
- **PRIME-RL is the most production-ready decentralized substrate** (2 reports independently)
- **OpenEnv is the env-format winner** (3 reports converge)
- **Trace-replay-with-N-teachers is genuinely under-explored** (the trace-replay report's primary finding)

## Next-step decision

Three paths from here:

1. **Spike v0.0** — `skill_view('spike')` then build the smallest possible "GRPO + trace-replay-DPO" comparison on Qwen3-7B. ~1 week. Cheapest signal on whether the novelty actually adds value.
2. **Plan first** — `skill_view('writing-plans')` then write a full implementation plan as a markdown plan doc with phases / subagent assignments. ~2 hours. Useful if you want to dispatch this as a kanban-orchestrator job.
3. **Deeper research first** — there are several open questions above (hint generator, trace data source). Could dispatch another scatter to nail those down before any code.

My recommendation is **(1) Spike v0.0**, because the trace-replay-distillation idea is the highest-novelty piece and the cheapest to falsify. If trace-replay-DPO doesn't beat plain-GRPO on a 7B model with 100 traces and 3 teachers, the framework still has value (Composer recipe + PRIME-RL + OpenEnv), but the novel claim is dead and we should reorient. If it works, you publish.