
baladithyab

Wave 4: data collator + loss composition smoke (38/38 tests pass)


	# Composer-Replication Framework: HF model → Composer-2.5-class agentic coder via decentralized RL

	> Status: Research synthesis (2026-05-25). Pre-spike. No code yet.
	> Goal: Build a framework that takes any HuggingFace base model and RL-post-trains it to Composer-2.5 quality on agentic coding (or any agentic domain), using decentralized DiLoCo-shape compute, Meta's Monarch/Forge orchestration, an OpenEnv environment registry, VeRL/TRL algorithm primitives, and a novel trace-replay multi-teacher distillation signal.
	> Underlying research: see `~/wiki/research/post-training-framework/{01..05}*.md` (5 deep-dives, ~2000-2500 words each, by 5 different model families: Gemini 3.1 Pro / DeepSeek V4 Pro / GPT-5 / Sonnet 4.6 / Kimi K2-Thinking).

	## TL;DR

	| Component | Decision | Rationale |
	|---|---|---|
	| Base model | HF MoE (Kimi K2.5, DeepSeek-V3.2, Qwen3-Max-MoE) OR dense (Qwen3-32B, Llama-3-70B) | Composer-style requires MLA+MoE for fast/cheap serving; dense is simpler for v0.1 |
	| Algorithm core | GRPO + DAPO patches + Composer-style on-policy distillation hint loss | DAPO solves GRPO's length/std biases; Composer's hint-loss is the secret sauce |
	| Training framework | PRIME-RL (Prime Intellect) as substrate; TRL for algorithm correctness; borrow VeRL's 3D-HybridEngine patterns | PRIME-RL ships the orchestrator/trainer/inference split + decentralized story; TRL has cleanest GRPO+OpenEnv; VeRL's reshard logic is the production reference |
	| Distributed sync | PRIME-RL's vLLM↔FSDP2 weight broadcast (SHARDCAST) for v0.1; bolt on Streaming DiLoCo outer loop only when scaling beyond one cluster | DiLoCo isn't useful when training fits one node. Add it when going multi-DC. |
	| Environments | OpenEnv + verifiers (Hub-hosted) with Cursor-style "Anyrun" sandboxes | OpenEnv is the emerging standard; HF + Meta backing; MCP tool-calling RFC landing |
	| Reward signal | Three-channel: (1) RLVR (tests pass), (2) Composer hint-distill = SDPO/OPSD (single-model, hint-conditioned self-teacher), (3) Trace-replay multi-teacher PRM (your novel idea: N external teachers) | Composer's hint-distill is published as SDPO ([arXiv:2601.20802](https://arxiv.org/abs/2601.20802) + [code](https://github.com/siyan-zhao/OPSD)); we lift it for v0.1. Channel (3) is genuinely novel and stacks on top. They are TWO different mechanisms, not competing implementations; see `docs/COMPOSER_RECIPE_MAPPING.md` for the precise distinction. |
	| Trace-replay novelty | Genuinely under-explored. Closest precedent: rStar-Math (single-teacher MCTS counterfactuals). Multi-teacher frozen-trace replay is open territory | Worth publishing if it works |
	| Orchestration | Monarch (when it matures) or Ray (today) for the actor mesh; OpenEnv for the env contract | Forge has been "development-paused": borrow patterns, don't depend on it |

	## What Composer 2.5 actually is, and what we're trying to replicate

	From `01-composer-2.5.md`:

	- Base: Moonshot's Kimi K2.5 — 1T total / 32B active MoE, MLA attention, DeepSeek-V3-derived, MuonClip optimizer, 256K native ctx.
	- 85% of total compute is post-training. Pretraining is just the cheap starting point.
	- The recipe (5 stages):
	  1. Continued pretraining on heavily code-weighted data. Lower pretraining loss → better downstream RL.
	  2. Synthetic data at scale: 25× more synthetic tasks vs Composer 2. The headline trick is "Feature Deletion": take a repo with passing tests, delete features, force the agent to reconstruct them. Tests are the verifiable reward.
	  3. Realistic environment RL: async sandboxes (their "Anyrun" system) with the exact same tool harness the model uses in production. Train on terse, ambiguous prompts requiring multi-file edits.
	  4. 🔑 Targeted RL with textual feedback (on-policy distillation). When a 100K-token rollout has a localized error (wrong tool name, style violation), Cursor:
	     - Generates a text hint correcting the error
	     - Inserts the hint at the error turn
	     - Runs forward pass with hint → "Teacher" logits
	     - Runs forward pass without hint → "Student" logits
	     - Applies KL divergence loss to pull Student toward Teacher only at that turn
	     - This sidesteps the credit-assignment nightmare of long-horizon scalar rewards
	  5. Sharded Muon + Dual Mesh HSDP: separate sharding meshes for expert vs non-expert weights, optimized for Blackwell.
	- Result: ~69% Terminal-Bench 2.0 (parity with GPT-5.5), $0.50/$2.50 per 1M input/output (5-10× cheaper than peers).

	Replicating this means cloning stages 1-4; stage 5 is just MLOps. Stage 4, the hint-distillation trick, is the least obvious and probably the most important.
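The stage-4 steps can be sketched in a few lines. This is a minimal illustration, not Cursor's or OPSD's implementation: it assumes a forward KL(teacher ‖ student), whereas the published OPSD code uses a generalized JSD, and it assumes the per-position logits for the hinted ("teacher") and unhinted ("student") passes are already available as plain lists. The names `hint_distill_loss` and `error_mask` are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position (assumed direction)."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

def hint_distill_loss(teacher_logits, student_logits, error_mask):
    """Mean KL over positions inside the error turn only (error_mask == 1).

    Positions outside the error turn contribute nothing, which is what keeps
    the signal local instead of spreading it over the whole rollout.
    """
    total, count = 0.0, 0
    for t, s, m in zip(teacher_logits, student_logits, error_mask):
        if m:
            total += kl_divergence(t, s)
            count += 1
    return total / count if count else 0.0

# Two positions, vocabulary of 3; only position 0 belongs to the error turn.
teacher = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss_at_error_turn = hint_distill_loss(teacher, student, [1, 0])
loss_elsewhere = hint_distill_loss(teacher, student, [0, 1])
```

In a real trainer the teacher pass would run without gradients, so only the student distribution moves.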

	## How the 5 component pieces fit together

	For the rigorous integration architecture — exact extension points in TRL (`GRPOTrainer._compute_loss` subclass), VeRL (`@register_adv_est` + `DataProto`), the OPSD loss `generalized_jsd_loss` lifted from `siyan-zhao/OPSD`, and the per-channel sequence diagrams — see [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md). A working code skeleton with 38 passing unit tests verifying the SDPO loss math, the trace-replay DPO-pair extraction, the data collator, and an end-to-end 5-step gradient run that decreases loss with all 3 channels active is at [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/).

	The high-level topology:

	```
	┌───────────────────────────────────────────┐
	│ OpenEnv Environment Hub │
	│ (HF Hub, Docker images, MCP tool-calling)│
	│ - Anyrun-style code sandbox │
	│ - SWE-Gym, SWE-Bench-Verified envs │
	│ - "Feature Deletion" auto-grader env │
	└────────────────┬──────────────────────────┘
	│ rollouts (verifiers protocol)
	▼
	┌────────────────────────────────────────────────────────────┐
	│ ORCHESTRATOR (CPU) │
	│ - Schedules rollouts across inference workers │
	│ - Assembles training batches │
	│ - Routes hint-distillation pairs (Composer-style) │
	│ - Routes trace-replay teacher queries (NOVEL) │
	│ - Built on Monarch (future) or Ray (today) │
	└────┬──────────────────────────┬──────────────────────────┬─┘
	│ rollout requests │ training batches │ teacher queries
	▼ ▼ ▼
	┌─────────────────────┐ ┌────────────────────┐ ┌────────────────────────┐
	│ INFERENCE POOL │ │ TRAINER (GPU) │ │ TEACHER POOL │
	│ (vLLM / SGLang) │ │ - FSDP2 sharded │ │ - Frozen N teachers │
	│ - Student policy │ │ - GRPO + DAPO │ │ - HF Inference, │
	│ - Auto-resharded │ │ - +Hint distill │ │ OpenRouter, vLLM │
	│ via SHARDCAST │ │ KL loss │ │ - Diverse families │
	│ - Async tool waits │ │ - +PRM/DPO from │ │ (Anthropic / OpenAI │
	│ don't block GPU │ │ trace-replay │ │ / DeepSeek / Qwen) │
	└─────────────────────┘ └────────────────────┘ └────────────────────────┘
	│
	│ pseudo-gradients (every H steps)
	▼
	┌────────────────────────────────┐
	│ OUTER LOOP (DiLoCo, optional) │
	│ - Only when training spans │
	│ multiple clusters / DCs │
	│ - Streaming variant for │
	│ bandwidth-limited links │
	└────────────────────────────────┘
	```

	### Why this stack

	PRIME-RL is the right substrate (`02-diloco-family.md`). It's the only framework that already implements the orchestrator/trainer/inference split for RL with a proven decentralized story (INTELLECT-2: a 32B model built on QwQ, trained across globally distributed compute). Their `verifiers` library is the same env contract we'd want anyway. Their GRPO + AIPO importance-sampling correction handles the inevitable train↔inference logprob drift.

	TRL provides the cleanest algorithm reference (`04-verl-trl.md`). `GRPOTrainer`, `OnlineDPOTrainer`, and the new OpenEnv integration (Oct 2025) are well-tested. We'd lift the loss math from TRL but run on PRIME-RL's distributed substrate.
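For orientation, the core of the GRPO loss math is the group-relative advantage: each rollout's reward is normalized against the other rollouts sampled for the same prompt. The sketch below is illustrative rather than TRL's code. It assumes the population standard deviation (implementations differ on the estimator), and `has_learning_signal` is a hypothetical helper in the spirit of DAPO's dynamic sampling, which drops prompt groups whose rollouts all received the same reward.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for the rollouts of one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def has_learning_signal(rewards):
    """False when every rollout got the same reward, so all advantages are zero."""
    return pstdev(rewards) > 0.0

# Four rollouts of one prompt, binary "tests pass" reward.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Groups where every rollout passes or every rollout fails carry no gradient under this normalization, which is why filtering them before batching is a common patch.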

	VeRL's 3D-HybridEngine is the production benchmark for resharding between training-FSDP and inference-TP layouts. PRIME-RL does this too but VeRL has more battle-testing at 70B+. We borrow the resharding pattern, not the framework.

	Monarch + OpenEnv is the future bet; Ray + verifiers is today (`03-monarch-torchforge-openenv.md`). Forge is "development-paused" per Meta's banner, as they consolidate on TorchTitan. Don't build on Forge directly. But Monarch (the actor mesh) and OpenEnv (the env standard) are alive and well. v0.1 of our framework uses Ray + verifiers (PRIME-RL's stack); v0.2 swaps in Monarch + OpenEnv when those mature.

	DiLoCo is dormant infra until we scale beyond one cluster. Original DiLoCo / Streaming DiLoCo / OpenDiLoCo all assume an outer loop across data centers. INTELLECT-2 used DiLoCo-shape sync between geographically distributed inference workers, but the actual trainer is still single-cluster FSDP2. We'd add Streaming DiLoCo only when:
	- Training compute exceeds one cluster, OR
	- We're recruiting volunteer compute (INTELLECT-1 model)

	For v0.1: skip DiLoCo. Single-cluster PRIME-RL. The token budget is the bottleneck, not the trainer.
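When the outer loop is eventually needed, its shape is simple. The sketch below shows one DiLoCo-style outer step on flat parameter lists: each worker runs H local steps on its own, the displacement from the shared weights is averaged into a pseudo-gradient, and an outer optimizer with Nesterov momentum applies it. The function name and default hyperparameters are illustrative assumptions, and the streaming variant (syncing parameter fragments on a schedule) is not shown.

```python
def outer_step(global_params, worker_params, momentum, outer_lr=0.7, beta=0.9):
    """One outer update after every worker has finished H inner steps.

    global_params: shared weights the workers started from (list of floats)
    worker_params: one list of floats per worker, after local training
    momentum:      outer momentum buffer, same length as global_params
    """
    n = len(worker_params)
    new_params, new_momentum = [], []
    for i, theta in enumerate(global_params):
        # Pseudo-gradient: average displacement of the workers from theta.
        delta = sum(theta - w[i] for w in worker_params) / n
        v = beta * momentum[i] + delta
        # Nesterov look-ahead step.
        step = delta + beta * v
        new_params.append(theta - outer_lr * step)
        new_momentum.append(v)
    return new_params, new_momentum

# With no momentum and outer_lr=1 the update reduces to parameter averaging.
averaged, _ = outer_step(
    [1.0, 2.0], [[0.0, 2.0], [0.5, 4.0]], [0.0, 0.0], outer_lr=1.0, beta=0.0
)
```

Only the pseudo-gradient crosses the slow link, once every H steps, which is the whole bandwidth argument for adding this loop only at multi-cluster scale.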

	## Your trace-replay distillation idea: where it fits

	From `05-trace-replay-distillation.md`:

	> No published work systematically replays each step of frozen agentic traces with multiple teachers to harvest step-level supervision. While rStar uses MCTS for counterfactual evaluation and multi-teacher distillation exists, the frozen-trace replay mechanism is new territory.

	The closest published precedents:

	| Work | What they do | What you'd add |
	|---|---|---|
	| rStar / rStar-Math (Microsoft) | MCTS at training time, single teacher branches at each step | Replay pre-existing traces, multiple teachers, no MCTS at training time |
	| Math-Shepherd / OmegaPRM | Process reward models from rollout-and-check | Step-level teacher disagreement as the reward signal |
	| Magpie / OpenThoughts | Synthetic data from one strong teacher | Per-step distillation from N teachers on real traces |
	| MoA (Mixture of Agents) | Multi-teacher response-level aggregation | Per-step (sub-response) aggregation |

	The novel claim:
	1. Take agentic traces (yours, or SWE-Gym, OpenHands, Cursor session exports if you can get them).
	2. At each step `t`, replay the exact same state with N frozen teachers.
	3. Get N candidate `action_t` distributions.
	4. Use disagreement / agreement as a per-step reward signal for the student model.
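The four steps above can be sketched as a replay loop. This is a simplified illustration under stated assumptions: each teacher returns a single discrete action rather than a distribution, actions are compared by exact string match, and the reward is the fraction of teachers that agree with the student. The teachers here are stub callables; in practice they would wrap API or vLLM calls.

```python
def step_agreement_reward(student_action, teacher_actions):
    """Fraction of teachers whose proposed action matches the student's."""
    if not teacher_actions:
        raise ValueError("need at least one teacher action")
    matches = sum(1 for a in teacher_actions if a == student_action)
    return matches / len(teacher_actions)

def replay_trace(trace, teachers):
    """Replay every frozen step with every teacher.

    trace:    list of (state, student_action) pairs from a recorded rollout
    teachers: list of callables mapping state -> proposed action
    Returns one reward per step.
    """
    rewards = []
    for state, student_action in trace:
        proposals = [teacher(state) for teacher in teachers]
        rewards.append(step_agreement_reward(student_action, proposals))
    return rewards

# Stub teachers for illustration only.
teachers = [lambda s: "ls", lambda s: "ls", lambda s: "cat README.md"]
trace = [("repo root, no context yet", "ls"), ("repo root, listing seen", "rm -rf .")]
step_rewards = replay_trace(trace, teachers)
```

Because the trace is frozen, the teachers never need the environment: the state at step `t` is read back from the recording, not re-simulated.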

	This stacks beautifully with Composer's hint-distillation — but they are TWO DIFFERENT MECHANISMS, not competing implementations of the same idea. Composer's hint-distill (= SDPO / OPSD, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802) + [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), code at [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)) uses a single model as both teacher and student, with the teacher just being "the model with a hint inserted into context." Trace-replay-distill uses N external pretrained models as teachers. Together:

	- Composer's hint-loss = same-model self-teacher with hint context pulls student at error sites (~1 extra forward pass / cheap, no API)
	- Trace-replay-loss = N external teachers pull student at all sites (or high-uncertainty sites with VOI gating; ~$0.30/trace with gating per spike 001)

	These are complementary, not competing. Both give per-step KL signals that bypass the long-horizon credit assignment problem, but they tap different supervision sources. v0.1 of the framework runs both simultaneously. See [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) for the precise mathematical distinction and the implementation-handle table.

	Cost mitigation (the report does this analysis well):
	- VOI gating (only query teachers when student entropy is high) → 60-80% savings
	- Tiered teachers (cheap teacher first, escalate on disagreement) → 2-3× savings
	- Combined: ~$3/trace instead of ~$64/trace at the 1000-step / 8-teacher baseline
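VOI gating, as described in the first bullet, amounts to an entropy threshold on the student's next-action distribution. A minimal sketch follows; the threshold value (in nats) and the function names are assumptions for illustration, not figures from the report.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_query_teachers(student_probs, threshold=1.0):
    """Query the teacher pool only when the student is uncertain."""
    return entropy(student_probs) >= threshold

def queried_fraction(step_distributions, threshold=1.0):
    """Share of steps in a trace that would trigger teacher queries."""
    queried = sum(should_query_teachers(p, threshold) for p in step_distributions)
    return queried / len(step_distributions)

confident_step = [0.97, 0.01, 0.01, 0.01]   # entropy around 0.17 nats
uncertain_step = [0.25, 0.25, 0.25, 0.25]   # entropy = ln 4, about 1.39 nats
fraction = queried_fraction([confident_step, uncertain_step])
```

Teacher spend scales with `queried_fraction`, so the threshold is the knob that trades supervision coverage against cost.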

	Reward shape options (also in the report):
	1. Plurality vote (binary, simple)
	2. Weighted consensus
	3. DPO preference pairs ← recommended for v0.1: avoids reward model
	4. Variance-weighted (uncertainty-aware)
	5. Trained PRM ← recommended for production: amortizes cost
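Option 3 can be illustrated with a small extraction rule: when enough teachers agree on an action that differs from the student's, emit a preference pair with the consensus as "chosen" and the student's action as "rejected". This is a sketch under assumptions, not the spike's actual extraction logic: exact-match comparison, a hypothetical `min_votes` threshold, and the `prompt` / `chosen` / `rejected` record layout commonly used for DPO datasets.

```python
from collections import Counter

def dpo_pair_for_step(state, student_action, teacher_actions, min_votes=2):
    """Return a preference pair for one step, or None if there is no usable signal."""
    if not teacher_actions:
        return None
    votes = Counter(teacher_actions)
    consensus, count = votes.most_common(1)[0]
    # No pair when the teachers are split or already agree with the student.
    if count < min_votes or consensus == student_action:
        return None
    return {"prompt": state, "chosen": consensus, "rejected": student_action}

pair = dpo_pair_for_step(
    state="tests failing in utils/parse.py",
    student_action="rewrite the test",
    teacher_actions=["open utils/parse.py", "open utils/parse.py", "run pytest -x"],
)
```

No reward model is involved: the pairs feed a DPO-style loss directly, which is the reason this option is the cheapest to stand up for v0.1.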

	## Proposed phase plan

	### v0.0 — proof of concept (1-2 weeks)

	Goal: Prove the trace-replay-distillation channel adds signal on top of plain GRPO.

	- Pick smallest viable base: Qwen3-7B or Qwen3-Coder-7B
	- Use TRL's `GRPOTrainer` directly, no decentralization yet
	- Environment: a single OpenEnv-compatible task (start with `swe-bench-lite` via verifiers, or stand up the "Feature Deletion" env on a small repo)
	- Trace source: 100 student rollouts, frozen as JSON
	- Replay each step with N=3 teachers (Claude Opus 4.7, GPT-5, DeepSeek-V4-Pro via OpenRouter — this is what you have)
	- Reward channel: DPO pairs from teacher-disagreement at step level
	- A/B comparison: plain GRPO vs GRPO + trace-replay-DPO. Measure: SWE-bench-lite pass rate, train wallclock, teacher token cost.
	- Skip Composer hint-distill and DiLoCo for now — those are v0.1+.
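"Frozen as JSON" needs a schema that preserves exactly what each teacher must see at replay time. One possible shape is sketched below; every field name is a hypothetical choice for illustration, and a real schema would likely also carry tool outputs and the tokenizer-level prompt.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List

@dataclass
class TraceStep:
    index: int    # position of the step within the rollout
    state: str    # full context shown to the policy at this step
    action: str   # what the student actually did

@dataclass
class FrozenTrace:
    task_id: str
    steps: List[TraceStep] = field(default_factory=list)
    passed_tests: bool = False   # RLVR outcome of the original rollout

def dump_trace(trace):
    """Serialize a trace to a JSON string."""
    return json.dumps(asdict(trace))

def load_trace(text):
    """Rebuild a FrozenTrace from its JSON string."""
    raw = json.loads(text)
    steps = [TraceStep(**s) for s in raw["steps"]]
    return FrozenTrace(raw["task_id"], steps, raw["passed_tests"])

original = FrozenTrace(
    task_id="example-task-001",
    steps=[TraceStep(0, "issue text + repo tree", "open src/app.py")],
    passed_tests=False,
)
restored = load_trace(dump_trace(original))
```

Storing the full `state` per step is wasteful on disk but makes replay stateless, which matters when the teachers are remote APIs.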

	### v0.1 — Composer-style recipe (1-2 months)

	Goal: All three reward channels (RLVR, hint-distill, trace-replay), plus the OpenEnv environment.

	- Migrate to PRIME-RL substrate: orchestrator + FSDP2 trainer + vLLM inference
	- Build the "Feature Deletion" env as a first-class OpenEnv-compatible environment (this is genuinely useful as a public artifact)
	- Implement the hint-distillation loss: error detector → text hint generator → KL distill at error turns
	- Bake in trace-replay-DPO as the third channel
	- Scale base to Qwen3-32B or Qwen3-Coder-30B-A3B (MoE)
	- Single cluster, no DiLoCo
	- Target: match Cursor's ~50% SWE-bench-multilingual at 32B scale
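For the "Feature Deletion" environment, the verifiable reward reduces to comparing test results before deletion and after the agent's reconstruction. A minimal sketch, assuming test outcomes are available as a mapping from test id to pass/fail; the function name and the `partial_credit` switch are hypothetical, and sandboxing or anti-reward-hacking checks are out of scope here.

```python
def feature_deletion_reward(baseline_passing, results_after_agent, partial_credit=False):
    """Reward for reconstructing deleted features.

    baseline_passing:    test ids that passed before the features were deleted
    results_after_agent: dict of test id -> bool after the agent's edits
    """
    if not baseline_passing:
        raise ValueError("baseline must contain at least one passing test")
    # A test missing from the results counts as a failure.
    restored = sum(1 for t in baseline_passing if results_after_agent.get(t, False))
    if partial_credit:
        return restored / len(baseline_passing)
    return 1.0 if restored == len(baseline_passing) else 0.0

baseline = ["test_parse", "test_render", "test_export"]
after = {"test_parse": True, "test_render": True, "test_export": False}
strict_reward = feature_deletion_reward(baseline, after)
shaped_reward = feature_deletion_reward(baseline, after, partial_credit=True)
```

The strict variant is harder to game; the partial-credit variant gives denser signal early in training. Which one to use is an open design choice.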

	### v0.2 — decentralized scaling (3-6 months)

	Goal: Run the v0.1 recipe across multiple clusters / volunteer compute.

	- Add Streaming DiLoCo outer loop for trainer-side multi-cluster sync
	- Add SHARDCAST for inference-pool weight broadcast across DCs
	- Add TOPLOC-style verifiable inference if running with untrusted workers
	- Migrate orchestration from Ray to Monarch when Monarch's K8s story matures
	- Migrate environment hosting from inline-Docker to OpenEnv Hub
	- Target: re-run v0.1 recipe but with 2-3 geographic clusters or 4-6 volunteer pods

	## Open questions I'd want answered before starting

	1. Hint generator architecture — Cursor never says how their text hints are generated. Templates? Smaller model? Same model with introspection prompt? This is the biggest reproducibility gap. Probably worth a separate spike.
	2. Trace data source — Do you have your own agent traces to replay (e.g., from your dogfood / kanban-orchestrator runs)? Or do we synthesize from public datasets (SWE-Gym, OpenHands)? Quality of replay signal depends heavily on this.
	3. Teacher diversity vs cost — Is N=3 (Anthropic + OpenAI + DeepSeek) sufficient, or do we need N=8 (add Google, xAI, Qwen, Kimi, MiniMax)? Probably try N=3 in v0.0 and ablate.
	4. Hardware target for v0.1 — single 8×H100 node? 2× 8×H100? What's available to you? This decides whether the Megatron-LM path matters or FSDP2 is fine.
	5. MoE vs dense — Composer's whole serving-cost story depends on MoE (1T total / 32B active). Going MoE adds expert sharding complexity. Dense Qwen3-32B might be the saner v0.1 target.

	## What we should NOT do

	- Don't build on TorchForge. Meta paused it. Lift patterns, not dependencies.
	- Don't try to replicate Composer's exact training mix. ~85% of their compute is post-training; you don't have that budget. Replicate the recipe shape, not the scale.
	- Don't add DiLoCo before you need it. Single-cluster training is fine until token budget says otherwise.
	- Don't forget the reward-hacking safeguards. Cursor's blog mentions models learning to decompile bytecode to reconstruct deleted APIs. Plan for adversarial reward hacking from day 1.
	- Don't skip RLVR ground-truth. The trace-replay channel is additional signal, not a replacement for "tests pass."

	## Sources

	All five research notes:
	- `~/wiki/research/post-training-framework/01-composer-2.5.md` (Cursor recipe deep-dive)
	- `~/wiki/research/post-training-framework/02-diloco-family.md` (DiLoCo / OpenDiLoCo / PRIME-RL / INTELLECT-2)
	- `~/wiki/research/post-training-framework/03-monarch-torchforge-openenv.md` (Meta's stack)
	- `~/wiki/research/post-training-framework/04-verl-trl.md` (algorithm libraries)
	- `~/wiki/research/post-training-framework/05-trace-replay-distillation.md` (your novelty assessment)

	Each was authored by a different model family (Gemini 3.1 Pro, DeepSeek V4 Pro, GPT-5, Sonnet 4.6, Kimi K2-Thinking) for cross-family signal. Convergent findings across reports:
	- GRPO+DAPO is the consensus algorithm (3/4 reports, the 4th doesn't compare)
	- PRIME-RL is the most production-ready decentralized substrate (2 reports independently)
	- OpenEnv is the env-format winner (3 reports converge)
	- Trace-replay-with-N-teachers is genuinely under-explored (the trace-replay report's primary finding)

	## Next-step decision

	Three paths from here:

	1. Spike v0.0 — `skill_view('spike')` then build the smallest possible "GRPO + trace-replay-DPO" comparison on Qwen3-7B. ~1 week. Cheapest signal on whether the novelty actually adds value.
	2. Plan first — `skill_view('writing-plans')` then write a full implementation plan as a markdown plan doc with phases / subagent assignments. ~2 hours. Useful if you want to dispatch this as a kanban-orchestrator job.
	3. Deeper research first — there are several open questions above (hint generator, trace data source). Could dispatch another scatter to nail those down before any code.

	My recommendation is (1) Spike v0.0, because the trace-replay-distillation idea is the highest-novelty piece and the cheapest to falsify. If trace-replay-DPO doesn't beat plain-GRPO on a 7B model with 100 traces and 3 teachers, the framework still has value (Composer recipe + PRIME-RL + OpenEnv), but the novel claim is dead and we should reorient. If it works, you publish.