| # Composer-Replication Framework: HF model → Composer-2.5-class agentic coder via decentralized RL | |
| > **Status:** Research synthesis (2026-05-25). Pre-spike. No code yet. | |
| > **Goal:** Build a framework that takes any HuggingFace base model and RL-post-trains it to Composer-2.5 quality on agentic coding (or any agentic domain), using decentralized DiLoCo-shape compute, Meta's Monarch/Forge orchestration, an OpenEnv environment registry, VeRL/TRL algorithm primitives, and a novel **trace-replay multi-teacher distillation** signal. | |
| > **Underlying research:** see `~/wiki/research/post-training-framework/{01..05}*.md` (5 deep-dives, ~2000-2500 words each, by 5 different model families: Gemini 3.1 Pro / DeepSeek V4 Pro / GPT-5 / Sonnet 4.6 / Kimi K2-Thinking). | |
| ## TL;DR | |
| | Component | Decision | Rationale | | |
| |---|---|---| | |
| | **Base model** | HF MoE (Kimi K2.5, DeepSeek-V3.2, Qwen3-Max-MoE) OR dense (Qwen3-32B, Llama-3-70B) | Composer-style serving relies on MLA+MoE to stay fast and cheap; dense is simpler for v0.1 | |
| | **Algorithm core** | GRPO + DAPO patches + Composer-style **on-policy distillation hint loss** | DAPO solves GRPO's length/std biases; Composer's hint-loss is the secret sauce | | |
| | **Training framework** | **PRIME-RL** (Prime Intellect) as substrate; **TRL** for algorithm correctness; borrow **VeRL's 3D-HybridEngine** patterns | PRIME-RL ships the orchestrator/trainer/inference split + decentralized story; TRL has cleanest GRPO+OpenEnv; VeRL's reshard logic is the production reference | | |
| | **Distributed sync** | **PRIME-RL's vLLM↔FSDP2 weight broadcast (SHARDCAST)** for v0.1; bolt on **Streaming DiLoCo** outer loop only when scaling beyond one cluster | DiLoCo isn't useful when training fits one cluster. Add it when going multi-DC. | |
| | **Environments** | **OpenEnv + verifiers (Hub-hosted)** with Cursor-style "Anyrun" sandboxes | OpenEnv is the emerging standard; HF + Meta backing; MCP tool-calling RFC landing | | |
| | **Reward signal** | Three-channel: (1) RLVR (tests pass), (2) **Composer hint-distill = SDPO/OPSD** (single-model, hint-conditioned self-teacher), (3) **Trace-replay multi-teacher PRM** (your novel idea — N external teachers) | Composer's hint-distill is published as SDPO ([arXiv:2601.20802](https://arxiv.org/abs/2601.20802) + [code](https://github.com/siyan-zhao/OPSD)); we lift it for v0.1. Channel (3) is genuinely novel and stacks on top. **They are TWO different mechanisms**, not competing implementations — see `docs/COMPOSER_RECIPE_MAPPING.md` for the precise distinction. | | |
| | **Trace-replay novelty** | Genuinely under-explored. Closest precedent: rStar-Math (single-teacher MCTS counterfactuals). Multi-teacher *frozen-trace replay* is open territory | Worth publishing if it works | | |
| | **Orchestration** | Monarch (when it matures) or Ray (today) for the actor mesh; **OpenEnv** for the env contract | Forge has been "development-paused" — borrow patterns, don't depend on it | | |
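To make the "GRPO + DAPO patches" row concrete, here is a minimal sketch of two pieces: GRPO's group-relative advantage and DAPO's dynamic-sampling filter. The function names, the `eps` constant, and the toy reward groups are illustrative assumptions, not TRL or PRIME-RL APIs; the other DAPO changes (such as token-level loss aggregation and asymmetric clipping) are not shown.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward against its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_group(rewards):
    """DAPO-style dynamic sampling: drop groups whose rewards are all identical.

    Every advantage in such a group is zero, so it contributes no gradient.
    """
    return len(set(rewards)) > 1


# Hypothetical rollout groups: one prompt, several sampled completions each.
groups = {
    "task-a": [1.0, 0.0, 0.0, 1.0],  # mixed outcomes: informative
    "task-b": [0.0, 0.0, 0.0, 0.0],  # all failed: filtered out
}
kept = {name: group_advantages(r) for name, r in groups.items() if keep_group(r)}
```

With binary "tests pass" rewards, the filter matters most on tasks that are far too easy or far too hard, where whole groups collapse to a single reward value.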
| ## What Composer 2.5 actually is, and what we're trying to replicate | |
| From `01-composer-2.5.md`: | |
| - **Base:** Moonshot's Kimi K2.5 — 1T total / 32B active MoE, MLA attention, DeepSeek-V3-derived, MuonClip optimizer, 256K native ctx. | |
| - **85% of total compute is post-training.** Pretraining is just the cheap starting point. | |
| - **The recipe (5 stages):** | |
| 1. **Continued pretraining** on heavily code-weighted data. Lower pretraining loss → better downstream RL. | |
| 2. **Synthetic data at scale** — 25× more synthetic tasks vs Composer 2. The headline trick: **"Feature Deletion"** — take a repo with passing tests, delete features, force the agent to reconstruct them. Tests are the verifiable reward. | |
| 3. **Realistic environment RL** — async sandboxes (their "Anyrun" system) with the *exact same tool harness* the model uses in production. Train on terse, ambiguous prompts requiring multi-file edits. | |
| 4. **🔑 Targeted RL with textual feedback (on-policy distillation).** When a 100K-token rollout has a localized error (wrong tool name, style violation), Cursor: | |
| - Generates a text hint correcting the error | |
| - Inserts the hint at the error turn | |
| - Runs forward pass with hint → "Teacher" logits | |
| - Runs forward pass without hint → "Student" logits | |
| - Applies KL divergence loss to pull Student toward Teacher *only at that turn* | |
| - This sidesteps the credit-assignment nightmare of long-horizon scalar rewards | |
| 5. **Sharded Muon + Dual Mesh HSDP** — separate sharding meshes for expert vs non-expert weights, optimized for Blackwell. | |
| - **Result:** ~69% Terminal-Bench 2.0 (parity with GPT-5.5), $0.50/$2.50 per 1M input/output (5-10× cheaper than peers). | |
| **Replicating this means cloning stages 1-4. Stage 5 is just MLOps.** And stage 4, the hint-distillation trick, is the *least obvious* and probably the most important. | |
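The hint-distillation step in item 4 of the recipe can be sketched in plain Python. This is a toy illustration under stated assumptions: logits are small lists rather than tensors, the KL direction is KL(teacher || student) (the source only says the student is pulled toward the teacher), and `hint_distill_loss` and `error_turn_mask` are hypothetical names. In real training the teacher logits would be treated as constants (no gradient).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def hint_distill_loss(teacher_logits, student_logits, error_turn_mask):
    """Mean per-token KL over positions that belong to the error turn only."""
    terms = [
        token_kl(t, s)
        for t, s, m in zip(teacher_logits, student_logits, error_turn_mask)
        if m
    ]
    return sum(terms) / len(terms) if terms else 0.0


# Three token positions, vocabulary of size 3. Only position 1 is in the error turn.
teacher = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # forward pass with hint
student = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 0.0, 0.0]]  # forward pass without hint
mask = [0, 1, 0]
loss = hint_distill_loss(teacher, student, mask)
```

The student and teacher also differ at position 2, but the mask keeps that position out of the loss. That locality is the point of the method: supervision lands only where the hint applies.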
| ## How the 5 component pieces fit together | |
| For the **rigorous integration architecture** — exact extension points in TRL (`GRPOTrainer._compute_loss` subclass), VeRL (`@register_adv_est` + `DataProto`), the OPSD loss `generalized_jsd_loss` lifted from `siyan-zhao/OPSD`, and the per-channel sequence diagrams — see [`docs/INTEGRATION_ARCHITECTURE.md`](docs/INTEGRATION_ARCHITECTURE.md). A working code skeleton with **38 passing unit tests** verifying the SDPO loss math, the trace-replay DPO-pair extraction, the data collator, and an end-to-end 5-step gradient run that decreases loss with all 3 channels active is at [`spikes/005-integrated-trainer-skeleton/`](spikes/005-integrated-trainer-skeleton/). | |
| The high-level topology: | |
| ``` | |
| ┌───────────────────────────────────────────┐ | |
| │ OpenEnv Environment Hub │ | |
| │ (HF Hub, Docker images, MCP tool-calling)│ | |
| │ - Anyrun-style code sandbox │ | |
| │ - SWE-Gym, SWE-Bench-Verified envs │ | |
| │ - "Feature Deletion" auto-grader env │ | |
| └────────────────┬──────────────────────────┘ | |
| │ rollouts (verifiers protocol) | |
| ▼ | |
| ┌────────────────────────────────────────────────────────────┐ | |
| │ ORCHESTRATOR (CPU) │ | |
| │ - Schedules rollouts across inference workers │ | |
| │ - Assembles training batches │ | |
| │ - Routes hint-distillation pairs (Composer-style) │ | |
| │ - Routes trace-replay teacher queries (NOVEL) │ | |
| │ - Built on Monarch (future) or Ray (today) │ | |
| └────┬──────────────────────────┬──────────────────────────┬─┘ | |
| │ rollout requests │ training batches │ teacher queries | |
| ▼ ▼ ▼ | |
| ┌─────────────────────┐ ┌────────────────────┐ ┌────────────────────────┐ | |
| │ INFERENCE POOL │ │ TRAINER (GPU) │ │ TEACHER POOL │ | |
| │ (vLLM / SGLang) │ │ - FSDP2 sharded │ │ - Frozen N teachers │ | |
| │ - Student policy │ │ - GRPO + DAPO │ │ - HF Inference, │ | |
| │ - Auto-resharded │ │ - +Hint distill │ │ OpenRouter, vLLM │ | |
| │ via SHARDCAST │ │ KL loss │ │ - Diverse families │ | |
| │ - Async tool waits │ │ - +PRM/DPO from │ │ (Anthropic / OpenAI │ | |
| │ don't block GPU │ │ trace-replay │ │ / DeepSeek / Qwen) │ | |
| └─────────────────────┘ └────────────────────┘ └────────────────────────┘ | |
| │ | |
| │ pseudo-gradients (every H steps) | |
| ▼ | |
| ┌────────────────────────────────┐ | |
| │ OUTER LOOP (DiLoCo, optional) │ | |
| │ - Only when training spans │ | |
| │ multiple clusters / DCs │ | |
| │ - Streaming variant for │ | |
| │ bandwidth-limited links │ | |
| └────────────────────────────────┘ | |
| ``` | |
| ### Why this stack | |
| **PRIME-RL is the right substrate** (`02-diloco-family.md`). It's the only framework that already implements the orchestrator/trainer/inference split *for RL* with a proven decentralized track record (INTELLECT-2: a 32B QwQ-based model trained across globally distributed compute). Their `verifiers` library is the same env contract we'd want anyway. Their GRPO + AIPO importance-sampling correction handles the inevitable train↔inference logprob drift. | |
| **TRL provides the cleanest algorithm reference** (`04-verl-trl.md`). `GRPOTrainer`, `OnlineDPOTrainer`, and the new OpenEnv integration (Oct 2025) are well-tested. We'd lift the *loss math* from TRL but run on PRIME-RL's distributed substrate. | |
| **VeRL's 3D-HybridEngine is the production benchmark** for resharding between training-FSDP and inference-TP layouts. PRIME-RL does this too but VeRL has more battle-testing at 70B+. We borrow the resharding pattern, not the framework. | |
| **Monarch + OpenEnv is the future bet, Ray + verifiers is today** (`03-monarch-torchforge-openenv.md`). Forge is "development-paused" per Meta's banner — they're consolidating on TorchTitan. Don't build on Forge directly. But Monarch (the actor mesh) and OpenEnv (the env standard) are alive and well. v0.1 of our framework uses Ray + verifiers (PRIME-RL's stack); v0.2 swaps in Monarch + OpenEnv when those mature. | |
| **DiLoCo is dormant infra until we scale beyond one cluster.** Original DiLoCo / Streaming DiLoCo / OpenDiLoCo all assume an outer loop *across data centers*. INTELLECT-2 used DiLoCo-shape sync between geographically distributed inference workers, but the actual *trainer* is still single-cluster FSDP2. We'd add Streaming DiLoCo only when: | |
| - Training compute exceeds one cluster, OR | |
| - We're recruiting volunteer compute (INTELLECT-1 model) | |
| For v0.1: skip DiLoCo. Single-cluster PRIME-RL. The token budget is the bottleneck, not the trainer. | |
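For reference when the outer loop does become necessary, a DiLoCo-style outer step is small. The sketch below assumes parameters are flat lists of floats and uses Nesterov-momentum SGD as the outer optimizer; `diloco_outer_step` is a hypothetical name, and the `lr` and `mu` defaults are illustrative rather than tuned values.

```python
def diloco_outer_step(global_params, worker_params, momentum_buf, lr=0.7, mu=0.9):
    """One outer update after each worker has run H local (inner) steps.

    global_params: parameters every worker started the round from
    worker_params: one parameter list per worker, after its inner steps
    momentum_buf:  outer-optimizer momentum state, same length as global_params
    """
    n = len(worker_params)
    new_params, new_buf = [], []
    for i, theta in enumerate(global_params):
        avg_local = sum(w[i] for w in worker_params) / n
        pseudo_grad = theta - avg_local           # the "pseudo-gradient"
        buf = mu * momentum_buf[i] + pseudo_grad  # momentum accumulation
        step = pseudo_grad + mu * buf             # Nesterov look-ahead
        new_params.append(theta - lr * step)
        new_buf.append(buf)
    return new_params, new_buf


# Two workers that drifted in different directions from the same starting point.
start = [1.0, 2.0]
workers = [[0.0, 2.0], [1.0, 4.0]]
params, buf = diloco_outer_step(start, workers, momentum_buf=[0.0, 0.0])
```

Only the pseudo-gradient crosses the slow link, once every H inner steps, which is why this only pays off when the trainer itself spans clusters.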
| ## Your trace-replay distillation idea: where it fits | |
| From `05-trace-replay-distillation.md`: | |
| > No published work systematically replays each step of frozen agentic traces with multiple teachers to harvest step-level supervision. While rStar uses MCTS for counterfactual evaluation and multi-teacher distillation exists, the **frozen-trace replay mechanism** is new territory. | |
| **The closest published precedents:** | |
| | Work | What they do | What you'd add | | |
| |---|---|---| | |
| | **rStar / rStar-Math** (Microsoft) | MCTS at training time, single teacher branches at each step | Replay pre-existing traces, *multiple* teachers, no MCTS at training time | | |
| | **Math-Shepherd / OmegaPRM** | Process reward models from rollout-and-check | Step-level *teacher disagreement* as the reward signal | | |
| | **Magpie / OpenThoughts** | Synthetic data from one strong teacher | Per-step distillation from N teachers on real traces | | |
| | **MoA (Mixture of Agents)** | Multi-teacher *response-level* aggregation | Per-step (sub-response) aggregation | | |
| **The novel claim:** | |
| 1. Take agentic traces (yours, or SWE-Gym, OpenHands, Cursor session exports if you can get them). | |
| 2. At each step `t`, replay the *exact same state* with N frozen teachers. | |
| 3. Get N candidate `action_t` distributions. | |
| 4. Use disagreement / agreement as a **per-step reward signal** for the student model. | |
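The four steps above reduce to a small scoring function once teacher outputs are collected. This sketch assumes each action has already been canonicalized to a comparable string (for example a tool name plus normalized arguments); that canonicalization is the hard part and is not shown. Function names are hypothetical.

```python
from collections import Counter


def step_agreement_reward(student_action, teacher_actions):
    """Fraction of teachers whose proposed action matches the student's."""
    if not teacher_actions:
        return 0.0
    votes = Counter(teacher_actions)
    return votes[student_action] / len(teacher_actions)


def teacher_consensus(teacher_actions):
    """Return (plurality action, its share of the teacher votes)."""
    action, count = Counter(teacher_actions).most_common(1)[0]
    return action, count / len(teacher_actions)


# One frozen step replayed with N=3 teachers.
teachers = ["edit_file", "run_tests", "edit_file"]
reward = step_agreement_reward("edit_file", teachers)
consensus_action, consensus_share = teacher_consensus(teachers)
```

A low consensus share means the teachers disagree among themselves, so a step like that is probably better skipped than scored.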
| **This stacks beautifully with Composer's hint-distillation — but they are TWO DIFFERENT MECHANISMS, not competing implementations of the same idea.** Composer's hint-distill (= SDPO / OPSD, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802) + [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), code at [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)) uses **a single model** as both teacher and student, with the teacher just being "the model with a hint inserted into context." Trace-replay-distill uses **N external pretrained models** as teachers. Together: | |
| - Composer's hint-loss = **same-model self-teacher with hint context** pulls student at error sites (~1 extra forward pass / cheap, no API) | |
| - Trace-replay-loss = **N external teachers** pull student at all sites (or high-uncertainty sites with VOI gating; ~$0.30/trace with gating per spike 001) | |
| These are **complementary**, not competing. Both give per-step KL signals that bypass the long-horizon credit assignment problem, but they tap different supervision sources. v0.1 of the framework runs both simultaneously. See [`docs/COMPOSER_RECIPE_MAPPING.md`](docs/COMPOSER_RECIPE_MAPPING.md) for the precise mathematical distinction and the implementation-handle table. | |
| **Cost mitigation** (analyzed in `05-trace-replay-distillation.md`): | |
| - VOI gating (only query teachers when student entropy is high) → 60-80% savings | |
| - Tiered teachers (cheap teacher first, escalate on disagreement) → 2-3× savings | |
| - Combined: ~$3/trace instead of ~$64/trace at the 1000-step / 8-teacher baseline | |
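A quick sanity check on these numbers, plus a minimal entropy gate. The sketch assumes the two savings compose multiplicatively, which the report may not assume; the threshold value and function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (nats) of the student's next-action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_query_teachers(student_probs, threshold=1.0):
    """VOI gate: only pay for teacher calls when the student is uncertain."""
    return entropy(student_probs) >= threshold


def gated_cost(baseline, voi_savings, tier_factor):
    """Per-trace cost if VOI savings and tiering compose multiplicatively."""
    return baseline * (1.0 - voi_savings) / tier_factor


baseline = 64.0                                # 1000 steps x 8 teachers
per_call = baseline / (1000 * 8)               # about $0.008 per teacher call
optimistic = gated_cost(baseline, 0.80, 3.0)   # roughly $4.27 per trace
pessimistic = gated_cost(baseline, 0.60, 2.0)  # roughly $12.80 per trace
```

Under that assumption the stated ranges give roughly $4 to $13 per trace. Reaching ~$3 implies about a 21× reduction, so either gating would need to be somewhat more aggressive than 80% or the savings would have to combine differently.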
| **Reward shape options** (also in the report): | |
| 1. Plurality vote (binary, simple) | |
| 2. Weighted consensus | |
| 3. **DPO preference pairs** ← recommended for v0.1: avoids reward model | |
| 4. Variance-weighted (uncertainty-aware) | |
| 5. **Trained PRM** ← recommended for production: amortizes cost | |
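Option 3 can be sketched as a filter over replayed steps: wherever the teachers reach a clear plurality that differs from the student's action, emit a preference pair. The trace schema (`state`, `student_action`, `teacher_actions`) and the `min_share` threshold are assumptions for illustration; the `prompt` / `chosen` / `rejected` keys follow the common preference-dataset convention.

```python
from collections import Counter


def extract_dpo_pairs(trace, min_share=0.6):
    """Build preference pairs from steps where teachers agree and the student differs."""
    pairs = []
    for step in trace:
        teachers = step["teacher_actions"]
        action, count = Counter(teachers).most_common(1)[0]
        share = count / len(teachers)
        # Skip steps with weak consensus or where the student already agrees.
        if share >= min_share and action != step["student_action"]:
            pairs.append(
                {
                    "prompt": step["state"],
                    "chosen": action,
                    "rejected": step["student_action"],
                }
            )
    return pairs


# A hypothetical three-step frozen trace.
trace = [
    {"state": "s0", "student_action": "run_tests",
     "teacher_actions": ["edit_file", "edit_file", "run_tests"]},
    {"state": "s1", "student_action": "edit_file",
     "teacher_actions": ["edit_file", "edit_file", "edit_file"]},
    {"state": "s2", "student_action": "git_commit",
     "teacher_actions": ["read_file", "run_tests", "edit_file"]},
]
pairs = extract_dpo_pairs(trace)
```

Only the first step yields a pair here: the second has nothing to correct, and the third has no teacher consensus to trust.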
| ## Proposed phase plan | |
| ### v0.0 — proof of concept (1-2 weeks) | |
| **Goal:** Prove the trace-replay-distillation channel adds signal on top of plain GRPO. | |
| - Pick smallest viable base: Qwen3-7B or Qwen3-Coder-7B | |
| - Use TRL's `GRPOTrainer` directly, no decentralization yet | |
| - Environment: a single OpenEnv-compatible task (start with `swe-bench-lite` via verifiers, or stand up the "Feature Deletion" env on a small repo) | |
| - Trace source: 100 student rollouts, frozen as JSON | |
| - Replay each step with N=3 teachers (Claude Opus 4.7, GPT-5, DeepSeek-V4-Pro via OpenRouter — this is what you have) | |
| - Reward channel: DPO pairs from teacher-disagreement at step level | |
| - **A/B comparison:** plain GRPO vs GRPO + trace-replay-DPO. Measure: SWE-bench-lite pass rate, train wallclock, teacher token cost. | |
| - Skip Composer hint-distill and DiLoCo for now — those are v0.1+. | |
| ### v0.1 — Composer-style recipe (1-2 months) | |
| **Goal:** All three reward channels (RLVR, hint-distill, trace-replay), plus the OpenEnv environment. | |
| - Migrate to PRIME-RL substrate: orchestrator + FSDP2 trainer + vLLM inference | |
| - Build the **"Feature Deletion" env** as a first-class OpenEnv-compatible environment (this is genuinely useful as a public artifact) | |
| - Implement the **hint-distillation loss**: error detector → text hint generator → KL distill at error turns | |
| - Bake in **trace-replay-DPO** as the third channel | |
| - Scale base to Qwen3-32B or Qwen3-Coder-30B-A3B (MoE) | |
| - Single cluster, no DiLoCo | |
| - Target: match Cursor's ~50% SWE-bench-multilingual at 32B scale | |
| ### v0.2 — decentralized scaling (3-6 months) | |
| **Goal:** Run the v0.1 recipe across multiple clusters / volunteer compute. | |
| - Add Streaming DiLoCo outer loop for trainer-side multi-cluster sync | |
| - Add SHARDCAST for inference-pool weight broadcast across DCs | |
| - Add TOPLOC-style verifiable inference if running with untrusted workers | |
| - Migrate orchestration from Ray to Monarch when Monarch's K8s story matures | |
| - Migrate environment hosting from inline-Docker to OpenEnv Hub | |
| - Target: re-run v0.1 recipe but with 2-3 geographic clusters or 4-6 volunteer pods | |
| ## Open questions I'd want answered before starting | |
| 1. **Hint generator architecture** — Cursor never says how their text hints are generated. Templates? Smaller model? Same model with introspection prompt? This is the biggest reproducibility gap. Probably worth a separate spike. | |
| 2. **Trace data source** — Do you have your own agent traces to replay (e.g., from your dogfood / kanban-orchestrator runs)? Or do we synthesize from public datasets (SWE-Gym, OpenHands)? Quality of replay signal depends heavily on this. | |
| 3. **Teacher diversity vs cost** — Is N=3 (Anthropic + OpenAI + DeepSeek) sufficient, or do we need N=8 (add Google, xAI, Qwen, Kimi, MiniMax)? Probably try N=3 in v0.0 and ablate. | |
| 4. **Hardware target for v0.1** — single 8×H100 node? 2× 8×H100? What's available to you? This decides whether the Megatron-LM path matters or FSDP2 is fine. | |
| 5. **MoE vs dense** — Composer's whole serving-cost story depends on MoE (1T total / 32B active). Going MoE adds expert sharding complexity. Dense Qwen3-32B might be the saner v0.1 target. | |
| ## What we should NOT do | |
| - **Don't build on TorchForge.** Meta paused it. Lift patterns, not dependencies. | |
| - **Don't try to replicate Composer's exact training mix.** ~85% of their compute is post-training; you don't have that budget. Replicate the *recipe shape*, not the scale. | |
| - **Don't add DiLoCo before you need it.** Single-cluster training is fine until token budget says otherwise. | |
| - **Don't forget the reward-hacking safeguards.** Cursor's blog mentions models learning to decompile bytecode to reconstruct deleted APIs. Plan for adversarial reward hacking from day 1. | |
| - **Don't skip RLVR ground-truth.** The trace-replay channel is *additional signal*, not a replacement for "tests pass." | |
| ## Sources | |
| All five research notes: | |
| - `~/wiki/research/post-training-framework/01-composer-2.5.md` (Cursor recipe deep-dive) | |
| - `~/wiki/research/post-training-framework/02-diloco-family.md` (DiLoCo / OpenDiLoCo / PRIME-RL / INTELLECT-2) | |
| - `~/wiki/research/post-training-framework/03-monarch-torchforge-openenv.md` (Meta's stack) | |
| - `~/wiki/research/post-training-framework/04-verl-trl.md` (algorithm libraries) | |
| - `~/wiki/research/post-training-framework/05-trace-replay-distillation.md` (your novelty assessment) | |
| Each was authored by a different model family (Gemini 3.1 Pro, DeepSeek V4 Pro, GPT-5, Sonnet 4.6, Kimi K2-Thinking) for cross-family signal. Convergent findings across reports: | |
| - **GRPO+DAPO is the consensus algorithm** (3/4 reports, the 4th doesn't compare) | |
| - **PRIME-RL is the most production-ready decentralized substrate** (2 reports independently) | |
| - **OpenEnv is the env-format winner** (3 reports converge) | |
| - **Trace-replay-with-N-teachers is genuinely under-explored** (the trace-replay report's primary finding) | |
| ## Next-step decision | |
| Three paths from here: | |
| 1. **Spike v0.0** — `skill_view('spike')` then build the smallest possible "GRPO + trace-replay-DPO" comparison on Qwen3-7B. ~1 week. Cheapest signal on whether the novelty actually adds value. | |
| 2. **Plan first** — `skill_view('writing-plans')` then write a full implementation plan as a markdown plan doc with phases / subagent assignments. ~2 hours. Useful if you want to dispatch this as a kanban-orchestrator job. | |
| 3. **Deeper research first** — there are several open questions above (hint generator, trace data source). Could dispatch another scatter to nail those down before any code. | |
| My recommendation is **(1) Spike v0.0**, because the trace-replay-distillation idea is the highest-novelty piece and the cheapest to falsify. If trace-replay-DPO doesn't beat plain-GRPO on a 7B model with 100 traces and 3 teachers, the framework still has value (Composer recipe + PRIME-RL + OpenEnv), but the novel claim is dead and we should reorient. If it works, you publish. | |