baladithyab

Wave 4: data collator + loss composition smoke (38/38 tests pass)

157cdba 11 days ago

Composer-Replication Framework: HF model → Composer-2.5-class agentic coder via decentralized RL

Status: Research synthesis (2026-05-25). Pre-spike. No code yet.

Goal: Build a framework that takes any HuggingFace base model and RL-post-trains it to Composer-2.5 quality on agentic coding (or any agentic domain), using decentralized DiLoCo-shape compute, Meta's Monarch/Forge orchestration, an OpenEnv environment registry, VeRL/TRL algorithm primitives, and a novel trace-replay multi-teacher distillation signal.

Underlying research: see ~/wiki/research/post-training-framework/{01..05}*.md (5 deep-dives, ~2000-2500 words each, by 5 different model families: Gemini 3.1 Pro / DeepSeek V4 Pro / GPT-5 / Sonnet 4.6 / Kimi K2-Thinking).

TL;DR

| Component | Decision | Rationale |
| --- | --- | --- |
| Base model | HF MoE (Kimi K2.5, DeepSeek-V3.2, Qwen3-Max-MoE) OR dense (Qwen3-32B, Llama-3-70B) | Composer-style requires MLA+MoE for fast/cheap serving; dense is simpler for v0.1 |
| Algorithm core | GRPO + DAPO patches + Composer-style on-policy distillation hint loss | DAPO solves GRPO's length/std biases; Composer's hint-loss is the secret sauce |
| Training framework | PRIME-RL (Prime Intellect) as substrate; TRL for algorithm correctness; borrow VeRL's 3D-HybridEngine patterns | PRIME-RL ships the orchestrator/trainer/inference split + decentralized story; TRL has cleanest GRPO+OpenEnv; VeRL's reshard logic is the production reference |
| Distributed sync | PRIME-RL's vLLM↔FSDP2 weight broadcast (SHARDCAST) for v0.1; bolt on Streaming DiLoCo outer loop only when scaling beyond one cluster | DiLoCo isn't useful when training fits one node. Add it when going multi-DC. |
| Environments | OpenEnv + verifiers (Hub-hosted) with Cursor-style "Anyrun" sandboxes | OpenEnv is the emerging standard; HF + Meta backing; MCP tool-calling RFC landing |
| Reward signal | Three-channel: (1) RLVR (tests pass), (2) Composer hint-distill = SDPO/OPSD (single-model, hint-conditioned self-teacher), (3) Trace-replay multi-teacher PRM (your novel idea: N external teachers) | Composer's hint-distill is published as SDPO (arXiv:2601.20802 + code); we lift it for v0.1. Channel (3) is genuinely novel and stacks on top. They are TWO different mechanisms, not competing implementations; see `docs/COMPOSER_RECIPE_MAPPING.md` for the precise distinction. |
| Trace-replay novelty | Genuinely under-explored. Closest precedent: rStar-Math (single-teacher MCTS counterfactuals). Multi-teacher frozen-trace replay is open territory | Worth publishing if it works |
| Orchestration | Monarch (when it matures) or Ray (today) for the actor mesh; OpenEnv for the env contract | Forge has been "development-paused"; borrow patterns, don't depend on it |
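
To make the "GRPO + DAPO patches" row concrete, here is a minimal dependency-free sketch of the scalar math involved. It is an illustration, not the TRL or PRIME-RL implementation: the function names are hypothetical, the population standard deviation is one possible normalization choice, and the clip bounds are illustrative defaults.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward against its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; a trainer may use another estimator
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_group(rewards):
    """DAPO-style dynamic sampling: drop groups whose rewards are all identical,
    because every advantage is zero and the group contributes no gradient."""
    return max(rewards) != min(rewards)


def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped objective with decoupled lower/upper clip bounds.
    ratio is pi_new(token) / pi_old(token); the bounds are illustrative values."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    group = [1.0, 0.0, 0.0, 1.0]  # e.g. tests-pass rewards for 4 rollouts of one prompt
    if keep_group(group):
        print(group_relative_advantages(group))
```

In a real trainer these would be tensor operations over token-level log-probabilities; the scalar form is only meant to show which quantities each patch touches.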

What Composer 2.5 actually is, and what we're trying to replicate

From 01-composer-2.5.md:

Base: Moonshot's Kimi K2.5 — 1T total / 32B active MoE, MLA attention, DeepSeek-V3-derived, MuonClip optimizer, 256K native ctx.
85% of total compute is post-training. Pretraining is just the cheap starting point.
The recipe (5 stages):
1. Continued pretraining on heavily code-weighted data. Lower pretraining loss → better downstream RL.
2. Synthetic data at scale — 25× more synthetic tasks vs Composer 2. The headline trick: "Feature Deletion" — take a repo with passing tests, delete features, force the agent to reconstruct them. Tests are the verifiable reward.
3. Realistic environment RL — async sandboxes (their "Anyrun" system) with the exact same tool harness the model uses in production. Train on terse, ambiguous prompts requiring multi-file edits.
4. 🔑 Targeted RL with textual feedback (on-policy distillation). When a 100K-token rollout has a localized error (wrong tool name, style violation), Cursor:
  - Generates a text hint correcting the error
  - Inserts the hint at the error turn
  - Runs forward pass with hint → "Teacher" logits
  - Runs forward pass without hint → "Student" logits
  - Applies KL divergence loss to pull Student toward Teacher only at that turn
  - This sidesteps the credit-assignment nightmare of long-horizon scalar rewards
5. Sharded Muon + Dual Mesh HSDP — separate sharding meshes for expert vs non-expert weights, optimized for Blackwell.
Result: ~69% Terminal-Bench 2.0 (parity with GPT-5.5), $0.50/$2.50 per 1M input/output (5-10× cheaper than peers).

Replicating this means cloning stages 1-4; stage 5 is just MLOps. Stage 4, the hint-distillation trick, is the least obvious and probably the most important.
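
The stage 4 loss can be sketched as a masked per-token KL between the hint-conditioned ("Teacher") and hint-free ("Student") forward passes. This is a minimal sketch under stated assumptions: the source does not say which KL direction or reduction Cursor uses, so the forward KL(teacher || student) and the mean over masked tokens are assumptions, and `hint_distill_loss` and `error_mask` are hypothetical names. Plain Python lists stand in for logit tensors.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def hint_distill_loss(teacher_logits, student_logits, error_mask):
    """Mean per-token KL restricted to tokens inside the flagged error turn.

    teacher_logits: per-token logit vectors from the forward pass WITH the hint.
    student_logits: per-token logit vectors from the forward pass WITHOUT the hint.
    error_mask:     0/1 flags, 1 = token belongs to the turn where the error occurred.
    """
    terms = [
        kl_divergence(t, s)
        for t, s, m in zip(teacher_logits, student_logits, error_mask)
        if m
    ]
    return sum(terms) / len(terms) if terms else 0.0


if __name__ == "__main__":
    teacher = [[0.0, 0.0], [2.0, 0.0]]
    student = [[math.log(3), 0.0], [2.0, 0.0]]
    print(hint_distill_loss(teacher, student, [1, 0]))
```

In an actual trainer the teacher logits would be treated as constants (no gradient), so only the hint-free pass is updated.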

How the 5 component pieces fit together

For the rigorous integration architecture — exact extension points in TRL (GRPOTrainer._compute_loss subclass), VeRL (@register_adv_est + DataProto), the OPSD loss generalized_jsd_loss lifted from siyan-zhao/OPSD, and the per-channel sequence diagrams — see docs/INTEGRATION_ARCHITECTURE.md. A working code skeleton with 38 passing unit tests verifying the SDPO loss math, the trace-replay DPO-pair extraction, the data collator, and an end-to-end 5-step gradient run that decreases loss with all 3 channels active is at spikes/005-integrated-trainer-skeleton/.

The high-level topology:

                    ┌───────────────────────────────────────────┐
                    │           OpenEnv Environment Hub         │
                    │  (HF Hub, Docker images, MCP tool-calling)│
                    │  - Anyrun-style code sandbox              │
                    │  - SWE-Gym, SWE-Bench-Verified envs       │
                    │  - "Feature Deletion" auto-grader env     │
                    └────────────────┬──────────────────────────┘
                                     │ rollouts (verifiers protocol)
                                     ▼
        ┌────────────────────────────────────────────────────────────┐
        │                    ORCHESTRATOR (CPU)                      │
        │  - Schedules rollouts across inference workers            │
        │  - Assembles training batches                             │
        │  - Routes hint-distillation pairs (Composer-style)        │
        │  - Routes trace-replay teacher queries (NOVEL)            │
        │  - Built on Monarch (future) or Ray (today)               │
        └────┬──────────────────────────┬──────────────────────────┬─┘
             │ rollout requests         │ training batches         │ teacher queries
             ▼                          ▼                          ▼
   ┌─────────────────────┐   ┌────────────────────┐   ┌────────────────────────┐
   │  INFERENCE POOL     │   │  TRAINER (GPU)     │   │  TEACHER POOL          │
   │  (vLLM / SGLang)    │   │  - FSDP2 sharded   │   │  - Frozen N teachers   │
   │  - Student policy   │   │  - GRPO + DAPO     │   │  - HF Inference,       │
   │  - Auto-resharded   │   │  - +Hint distill   │   │    OpenRouter, vLLM    │
   │    via SHARDCAST    │   │    KL loss         │   │  - Diverse families    │
   │  - Async tool waits │   │  - +PRM/DPO from   │   │    (Anthropic / OpenAI │
   │    don't block GPU  │   │    trace-replay    │   │     / DeepSeek / Qwen) │
   └─────────────────────┘   └────────────────────┘   └────────────────────────┘
                                     │
                                     │ pseudo-gradients (every H steps)
                                     ▼
                    ┌────────────────────────────────┐
                    │  OUTER LOOP (DiLoCo, optional) │
                    │  - Only when training spans    │
                    │    multiple clusters / DCs     │
                    │  - Streaming variant for       │
                    │    bandwidth-limited links     │
                    └────────────────────────────────┘

Why this stack

PRIME-RL is the right substrate (02-diloco-family.md). It's the only framework that already implements the orchestrator/trainer/inference split for RL and has a proven decentralized story (INTELLECT-2: a 32B QwQ-based model trained globally). Its verifiers library is the same env contract we'd want anyway. Its GRPO + AIPO importance-sampling correction handles the inevitable train↔inference logprob drift.

TRL provides the cleanest algorithm reference (04-verl-trl.md). GRPOTrainer, OnlineDPOTrainer, and the new OpenEnv integration (Oct 2025) are well-tested. We'd lift the loss math from TRL but run on PRIME-RL's distributed substrate.

VeRL's 3D-HybridEngine is the production benchmark for resharding between training-FSDP and inference-TP layouts. PRIME-RL does this too but VeRL has more battle-testing at 70B+. We borrow the resharding pattern, not the framework.

Monarch + OpenEnv is the future bet; Ray + verifiers is what works today (03-monarch-torchforge-openenv.md). Forge is "development-paused" per Meta's banner, as they consolidate on TorchTitan, so don't build on Forge directly. Monarch (the actor mesh) and OpenEnv (the env standard) are alive and well. v0.1 of our framework uses Ray + verifiers (PRIME-RL's stack); v0.2 swaps in Monarch + OpenEnv when those mature.

DiLoCo is dormant infra until we scale beyond one cluster. Original DiLoCo / Streaming DiLoCo / OpenDiLoCo all assume an outer loop across data centers. INTELLECT-2 used DiLoCo-shape sync between geographically distributed inference workers, but the actual trainer is still single-cluster FSDP2. We'd add Streaming DiLoCo only when:

Training compute exceeds one cluster, OR
We're recruiting volunteer compute (INTELLECT-1 model)

For v0.1: skip DiLoCo. Single-cluster PRIME-RL. The token budget is the bottleneck, not the trainer.
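
For reference when the outer loop does become necessary, the "pseudo-gradients (every H steps)" arrow in the topology can be sketched as follows. This is a simplified illustration on flat lists of floats, not PRIME-RL or OpenDiLoCo code: `outer_step` is a hypothetical name, the Nesterov update follows the common "buffer then look-ahead" formulation, and the learning rate and momentum defaults are illustrative.

```python
def outer_step(global_params, worker_params, velocity, outer_lr=0.7, momentum=0.9):
    """One DiLoCo-style outer update.

    global_params: parameters every worker started from at the last sync.
    worker_params: one parameter list per worker, after H local (inner) steps.
    velocity:      outer-optimizer momentum buffer, same length as global_params.
    Returns (new_global_params, new_velocity).
    """
    n_workers = len(worker_params)
    new_params, new_velocity = [], []
    for j, theta in enumerate(global_params):
        # Pseudo-gradient: average displacement of the workers from the shared start.
        delta = sum(theta - w[j] for w in worker_params) / n_workers
        # Nesterov momentum: update the buffer, then take a look-ahead step.
        v = momentum * velocity[j] + delta
        step = delta + momentum * v
        new_params.append(theta - outer_lr * step)
        new_velocity.append(v)
    return new_params, new_velocity


if __name__ == "__main__":
    params, vel = outer_step([1.0, -2.0], [[0.5, -1.0], [0.7, -1.4]], [0.0, 0.0])
    print(params, vel)
```

With `outer_lr=1.0` and `momentum=0.0` the update reduces to plain parameter averaging across workers, which is a quick sanity check for any implementation.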

Your trace-replay distillation idea: where it fits

From 05-trace-replay-distillation.md:

No published work systematically replays each step of frozen agentic traces with multiple teachers to harvest step-level supervision. While rStar uses MCTS for counterfactual evaluation and multi-teacher distillation exists, the frozen-trace replay mechanism is new territory.

The closest published precedents:

| Work | What they do | What you'd add |
| --- | --- | --- |
| rStar / rStar-Math (Microsoft) | MCTS at training time, single teacher branches at each step | Replay pre-existing traces, multiple teachers, no MCTS at training time |
| Math-Shepherd / OmegaPRM | Process reward models from rollout-and-check | Step-level teacher disagreement as the reward signal |
| Magpie / OpenThoughts | Synthetic data from one strong teacher | Per-step distillation from N teachers on real traces |
| MoA (Mixture of Agents) | Multi-teacher response-level aggregation | Per-step (sub-response) aggregation |

The novel claim:

Take agentic traces (yours, or SWE-Gym, OpenHands, Cursor session exports if you can get them).
At each step t, replay the exact same state with N frozen teachers.
Get N candidate action_t distributions.
Use disagreement / agreement as a per-step reward signal for the student model.
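
The four steps above can be sketched as a small replay loop. This is a simplified illustration, not the code in spikes/: each teacher is modeled as a hypothetical callable that returns a single proposed action string (rather than a full distribution), and `replay_step`, `to_dpo_pair`, and the `min_agreement` threshold are assumptions introduced for the example.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

# Hypothetical teacher interface: serialized agent state in, proposed action out.
Teacher = Callable[[str], str]


def replay_step(state: str, student_action: str, teachers: List[Teacher]) -> Dict:
    """Replay one frozen trace step with every teacher and summarize agreement."""
    proposals = [teacher(state) for teacher in teachers]
    consensus_action, votes = Counter(proposals).most_common(1)[0]
    return {
        "state": state,
        "student_action": student_action,
        "consensus_action": consensus_action,
        "agreement": votes / len(proposals),  # fraction of teachers backing the top action
    }


def to_dpo_pair(step: Dict, min_agreement: float = 2 / 3) -> Optional[Dict]:
    """Emit a preference pair only when teachers agree with each other
    and disagree with what the student actually did at this step."""
    if step["agreement"] < min_agreement:
        return None  # teachers are split: no reliable signal
    if step["consensus_action"] == step["student_action"]:
        return None  # student already matches the consensus
    return {
        "prompt": step["state"],
        "chosen": step["consensus_action"],
        "rejected": step["student_action"],
    }


if __name__ == "__main__":
    teachers = [lambda s: "edit_file", lambda s: "edit_file", lambda s: "grep"]
    step = replay_step("<state at step t>", "grep", teachers)
    print(to_dpo_pair(step))
```

Exact string matching between actions is the crudest possible equivalence test; a real pipeline would likely need to normalize tool calls or compare them semantically before counting votes.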

This stacks well with Composer's hint-distillation, but the two are different mechanisms, not competing implementations of the same idea. Composer's hint-distill (= SDPO / OPSD, arXiv:2601.20802 + arXiv:2601.18734, code at github.com/siyan-zhao/OPSD) uses a single model as both teacher and student; the teacher is just "the model with a hint inserted into context." Trace-replay-distill uses N external pretrained models as teachers. Together:

Composer's hint-loss = same-model self-teacher with hint context pulls the student at error sites (~1 extra forward pass; cheap, no API calls)
Trace-replay-loss = N external teachers pull the student at all sites, or only at high-uncertainty sites with VOI gating (~$0.30/trace with gating, per spike 001)

The two are complementary: both give per-step KL signals that bypass the long-horizon credit assignment problem, but they tap different supervision sources. v0.1 of the framework runs both simultaneously. See docs/COMPOSER_RECIPE_MAPPING.md for the precise mathematical distinction and the implementation-handle table.

Cost mitigation (the report does this analysis well):

VOI gating (only query teachers when student entropy is high) → 60-80% savings
Tiered teachers (cheap teacher first, escalate on disagreement) → 2-3× savings
Combined: ~$3/trace instead of ~$64/trace at the 1000-step / 8-teacher baseline
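
A rough sketch of the gating rule and the cost arithmetic follows. The entropy threshold, the function names, and the assumption that the two savings multiply independently are all illustrative, not taken from the report. Under that independence assumption the quoted ranges give roughly $4.3 to $12.8 per trace from the $64 baseline, so the ~$3 figure would need savings slightly beyond the stated upper bounds or an interaction this simple model does not capture.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of the student's next-action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_query_teachers(student_probs, threshold=1.0):
    """VOI gate: only pay for teacher calls when the student is uncertain.
    The threshold is a hypothetical tuning knob, not a value from the report."""
    return entropy(student_probs) >= threshold


def gated_cost(baseline_cost, voi_savings, tier_factor):
    """Per-trace cost after VOI gating and teacher tiering.
    Assumes the two reductions are independent and simply multiply."""
    return baseline_cost * (1.0 - voi_savings) / tier_factor


if __name__ == "__main__":
    print(should_query_teachers([0.25, 0.25, 0.25, 0.25]))  # uncertain student
    print(should_query_teachers([0.97, 0.01, 0.01, 0.01]))  # confident student
    print(gated_cost(64.0, 0.8, 3.0), gated_cost(64.0, 0.6, 2.0))
```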

Reward shape options (also in the report):

Plurality vote (binary, simple)
Weighted consensus
DPO preference pairs ← recommended for v0.1: avoids reward model
Variance-weighted (uncertainty-aware)
Trained PRM ← recommended for production: amortizes cost
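
For the DPO option, the per-pair loss is the standard DPO objective applied to step-level pairs. The sketch below works on summed log-probabilities as plain floats; the `beta` default is illustrative, and how log-probabilities are aggregated over a step's tokens is left as an assumption.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of the chosen or rejected
    action under the trainable policy or the frozen reference model.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) == softplus(-m), computed in an overflow-safe form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # Policy identical to the reference: the loss starts at log(2).
    print(dpo_loss(-3.0, -4.0, -3.0, -4.0))
    # Policy has shifted probability toward the chosen action: loss drops.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

Because the loss only needs log-probabilities of two candidate actions, it avoids training a separate reward model, which matches the rationale given for recommending it in v0.1.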

Proposed phase plan

v0.0 — proof of concept (1-2 weeks)

Goal: Prove the trace-replay-distillation channel adds signal on top of plain GRPO.

Pick smallest viable base: Qwen3-7B or Qwen3-Coder-7B
Use TRL's GRPOTrainer directly, no decentralization yet
Environment: a single OpenEnv-compatible task (start with swe-bench-lite via verifiers, or stand up the "Feature Deletion" env on a small repo)
Trace source: 100 student rollouts, frozen as JSON
Replay each step with N=3 teachers (Claude Opus 4.7, GPT-5, DeepSeek-V4-Pro via OpenRouter — this is what you have)
Reward channel: DPO pairs from teacher-disagreement at step level
A/B comparison: plain GRPO vs GRPO + trace-replay-DPO. Measure: SWE-bench-lite pass rate, train wallclock, teacher token cost.
Skip Composer hint-distill and DiLoCo for now — those are v0.1+.

v0.1 — Composer-style recipe (1-2 months)

Goal: All three reward channels (RLVR, hint-distill, trace-replay), plus the OpenEnv environment.

Migrate to PRIME-RL substrate: orchestrator + FSDP2 trainer + vLLM inference
Build the "Feature Deletion" env as a first-class OpenEnv-compatible environment (this is genuinely useful as a public artifact)
Implement the hint-distillation loss: error detector → text hint generator → KL distill at error turns
Bake in trace-replay-DPO as the third channel
Scale base to Qwen3-32B or Qwen3-Coder-30B-A3B (MoE)
Single cluster, no DiLoCo
Target: match Cursor's ~50% SWE-bench-multilingual at 32B scale

v0.2 — decentralized scaling (3-6 months)

Goal: Run the v0.1 recipe across multiple clusters / volunteer compute.

Add Streaming DiLoCo outer loop for trainer-side multi-cluster sync
Add SHARDCAST for inference-pool weight broadcast across DCs
Add TOPLOC-style verifiable inference if running with untrusted workers
Migrate orchestration from Ray to Monarch when Monarch's K8s story matures
Migrate environment hosting from inline-Docker to OpenEnv Hub
Target: re-run v0.1 recipe but with 2-3 geographic clusters or 4-6 volunteer pods

Open questions I'd want answered before starting

Hint generator architecture — Cursor never says how their text hints are generated. Templates? Smaller model? Same model with introspection prompt? This is the biggest reproducibility gap. Probably worth a separate spike.
Trace data source — Do you have your own agent traces to replay (e.g., from your dogfood / kanban-orchestrator runs)? Or do we synthesize from public datasets (SWE-Gym, OpenHands)? Quality of replay signal depends heavily on this.
Teacher diversity vs cost — Is N=3 (Anthropic + OpenAI + DeepSeek) sufficient, or do we need N=8 (add Google, xAI, Qwen, Kimi, MiniMax)? Probably try N=3 in v0.0 and ablate.
Hardware target for v0.1 — single 8×H100 node? 2× 8×H100? What's available to you? This decides whether the Megatron-LM path matters or FSDP2 is fine.
MoE vs dense — Composer's whole serving-cost story depends on MoE (1T total / 32B active). Going MoE adds expert sharding complexity. Dense Qwen3-32B might be the saner v0.1 target.

What we should NOT do

Don't build on TorchForge. Meta paused it. Lift patterns, not dependencies.
Don't try to replicate Composer's exact training mix. ~85% of their compute is post-training; you don't have that budget. Replicate the recipe shape, not the scale.
Don't add DiLoCo before you need it. Single-cluster training is fine until token budget says otherwise.
Don't forget the reward-hacking safeguards. Cursor's blog mentions models learning to decompile bytecode to reconstruct deleted APIs. Plan for adversarial reward hacking from day 1.
Don't skip RLVR ground-truth. The trace-replay channel is additional signal, not a replacement for "tests pass."

Sources

All five research notes:

~/wiki/research/post-training-framework/01-composer-2.5.md (Cursor recipe deep-dive)
~/wiki/research/post-training-framework/02-diloco-family.md (DiLoCo / OpenDiLoCo / PRIME-RL / INTELLECT-2)
~/wiki/research/post-training-framework/03-monarch-torchforge-openenv.md (Meta's stack)
~/wiki/research/post-training-framework/04-verl-trl.md (algorithm libraries)
~/wiki/research/post-training-framework/05-trace-replay-distillation.md (your novelty assessment)

Each was authored by a different model family (Gemini 3.1 Pro, DeepSeek V4 Pro, GPT-5, Sonnet 4.6, Kimi K2-Thinking) for cross-family signal. Convergent findings across reports:

GRPO+DAPO is the consensus algorithm (3/4 reports, the 4th doesn't compare)
PRIME-RL is the most production-ready decentralized substrate (2 reports independently)
OpenEnv is the env-format winner (3 reports converge)
Trace-replay-with-N-teachers is genuinely under-explored (the trace-replay report's primary finding)

Next-step decision

Three paths from here:

1. Spike v0.0: skill_view('spike'), then build the smallest possible "GRPO + trace-replay-DPO" comparison on Qwen3-7B. ~1 week. Cheapest signal on whether the novelty actually adds value.
2. Plan first: skill_view('writing-plans'), then write a full implementation plan as a markdown plan doc with phases / subagent assignments. ~2 hours. Useful if you want to dispatch this as a kanban-orchestrator job.
3. Deeper research first: there are several open questions above (hint generator, trace data source). Could dispatch another scatter to nail those down before any code.

My recommendation is (1) Spike v0.0, because the trace-replay-distillation idea is the highest-novelty piece and the cheapest to falsify. If trace-replay-DPO doesn't beat plain-GRPO on a 7B model with 100 traces and 3 teachers, the framework still has value (Composer recipe + PRIME-RL + OpenEnv), but the novel claim is dead and we should reorient. If it works, you publish.