# Methodology: Composer 2.5 Replication Framework Research

This document records *how* the research synthesis in this repo was produced, so
the methodology is reproducible and the cross-family verification claim is
auditable.
## Research dispatch

On 2026-05-25, five parallel research subagents were dispatched via the
[`delegate_task`](https://hermes-agent.nousresearch.com/) parallel-research
pattern, one per topic. Each was given:

- A specific research scope (one of: Composer 2.5 internals; DiLoCo family;
  Monarch / TorchForge / OpenEnv; VeRL / TRL; trace-replay distillation
  novelty assessment).
- An explicit instruction to write findings to a known path
  (`~/wiki/research/post-training-framework/0X-<topic>.md`).
- A target depth of ~2000–2500 words.
- A web-research toolset (Tavily, Exa, AWS docs, MCP doc readers).

Each subagent ran independently, with no cross-agent communication and no shared
intermediate state. All five received the same brief structure, but each was
**routed to a different LLM family** (five in total) for cross-family signal:

| File | Author model | Rationale |
|---|---|---|
| `research/01-composer-2.5.md` | `google/gemini-3.1-pro-preview` | Long-context grounded research is Gemini's strong suit |
| `research/02-diloco-family.md` | `deepseek/deepseek-v4-pro` | Strong on distributed-systems and pretraining literature |
| `research/03-monarch-torchforge-openenv.md` | `openai/gpt-5` | Best at reading framework / SDK source code |
| `research/04-verl-trl.md` | `anthropic/claude-sonnet-4.6` | Best at algorithmic precision (loss math, importance sampling) |
| `research/05-trace-replay-distillation.md` | `moonshotai/kimi-k2-thinking` | Strong at novelty assessment and prior-art discovery |

All routes were **verified post-hoc** via the per-task `model` field returned
in each delegated agent's session metadata. Because every report was confirmed
to come from its intended model, the synthesis draws on five model families
rather than on a single model's biases.
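
The post-hoc route check can be sketched in a few lines of Python. This is a minimal illustration, not the tooling actually used: the route table is copied from the table above, while `check_routes`, `model_family`, and the shape of the `served` mapping (report path to the `model` string read from session metadata) are assumptions made for the example.

```python
# Intended routes, copied from the table above (report file -> model ID).
INTENDED_ROUTES = {
    "research/01-composer-2.5.md": "google/gemini-3.1-pro-preview",
    "research/02-diloco-family.md": "deepseek/deepseek-v4-pro",
    "research/03-monarch-torchforge-openenv.md": "openai/gpt-5",
    "research/04-verl-trl.md": "anthropic/claude-sonnet-4.6",
    "research/05-trace-replay-distillation.md": "moonshotai/kimi-k2-thinking",
}


def model_family(model_id: str) -> str:
    """Return the vendor prefix of a gateway-style ID such as "vendor/model"."""
    return model_id.split("/", 1)[0]


def check_routes(intended: dict[str, str], served: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means every route held.

    `served` is a hypothetical mapping built by the caller from each
    session's `model` metadata field.
    """
    problems = []
    for report, want in intended.items():
        got = served.get(report)
        if got is None:
            problems.append(f"{report}: no session metadata found")
        elif got != want:
            problems.append(f"{report}: intended {want}, served {got}")
    # Cross-family signal also requires that no two reports share a family.
    families = [model_family(m) for m in served.values()]
    if len(set(families)) != len(families):
        problems.append("two or more reports were served by the same model family")
    return problems
```

A silent gateway fallback (for example, report 04 being served by `openai/gpt-5`) would surface twice here: once as a route mismatch and once as a duplicated family.
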
## Synthesis

The master synthesis (`framework/composer-replication-framework.md`) was
produced by reading all five reports in full and reconciling:

- **Convergent claims** (≥2 independent reports agree) → promoted to
  framework-level decisions in the TL;DR table.
- **Divergent claims** (reports recommend different stacks for the same
  layer) → noted explicitly with "use X today, switch to Y when Z" rationale
  rather than picking one arbitrarily.
- **Single-source claims** (only one report makes the claim) → kept but
  flagged as "single-source - may be model bias" where consequential.
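
The three reconciliation buckets above amount to a small classification rule. The sketch below is one way to express it, assuming each extracted claim is recorded as a `(report_id, topic, position)` tuple; that record shape and the `classify_claims` name are hypothetical, since the actual reconciliation was done by reading the reports.

```python
from collections import defaultdict


def classify_claims(claims):
    """Bucket topics as "convergent", "divergent", or "single-source".

    `claims` is an iterable of (report_id, topic, position) tuples.
    - divergent: reports take different positions on the same topic
    - convergent: one position, backed by at least two distinct reports
    - single-source: one position, backed by exactly one report
    Any disagreement marks the topic divergent, even if most reports agree.
    """
    by_topic = defaultdict(lambda: defaultdict(set))
    for report_id, topic, position in claims:
        by_topic[topic][position].add(report_id)

    buckets = {}
    for topic, positions in by_topic.items():
        if len(positions) > 1:
            buckets[topic] = "divergent"
        else:
            (reports,) = positions.values()
            buckets[topic] = "convergent" if len(reports) >= 2 else "single-source"
    return buckets
```

For instance, three tuples recording reports 04, 02, and 03 all taking the position `"GRPO+DAPO"` on the topic `"rl-algorithm"` would put that topic in the convergent bucket.
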

Convergent findings (verified across reports):

- **GRPO+DAPO is the consensus algorithm.** Reports 04 (TRL/VeRL deep-dive),
  02 (PRIME-RL section), and 03 (Forge algorithm catalog) all converge on
  GRPO with DAPO patches as the production default for long-horizon agentic
  RL.
- **PRIME-RL is the most production-ready decentralized substrate.** Reports
  02 and 04 independently cite INTELLECT-2 (a 32B QwQ-based model trained
  across globally distributed compute) as the only production-scale
  decentralized RL run to date.
- **OpenEnv is the env-format winner.** Reports 03 (Meta's stack), 04 (TRL's
  Oct 2025 OpenEnv integration), and 05 (env-substrate analysis) all
  converge on OpenEnv + verifiers as the emerging standard.
- **Trace-replay multi-teacher is genuinely under-explored.** This is Report
  05's primary finding. It is corroborated only indirectly: none of the other
  four reports (which surveyed the algorithm and framework literature widely)
  mentions per-step multi-teacher distillation as an existing technique.
## Sources

The synthesis cites primary sources inline. Major primary sources include:

- **Cursor blog**: <https://cursor.com/blog/composer-2-5> (the Composer 2.5
  release post that motivated the whole project).
- **Moonshot K2 paper**: <https://arxiv.org/abs/2507.20534> (Kimi K2 base
  model, the predecessor to K2.5).
- **DeepMind DiLoCo paper**: <https://arxiv.org/abs/2311.08105>; **Streaming
  DiLoCo**: <https://arxiv.org/abs/2501.18512>.
- **Prime Intellect INTELLECT-2 announcement**: <https://www.primeintellect.ai/blog/intellect-2>.
- **VeRL paper**: <https://arxiv.org/abs/2409.19256>.
- **Hugging Face TRL**: <https://github.com/huggingface/trl>.
- **Microsoft rStar**: <https://arxiv.org/abs/2408.06195>; **rStar-Math**:
  <https://arxiv.org/abs/2501.04519>.
- **Meta OpenEnv**: <https://github.com/meta-pytorch/openenv>.
- **Meta Monarch**: <https://github.com/meta-pytorch/monarch>.

The five research notes link to many more secondary sources (blog posts,
Twitter threads, individual repo READMEs). Those are auxiliary context, not
primary evidence.
## Limitations

- **No primary-source access to Cursor's training pipeline.** Composer 2.5's
  exact recipe is reconstructed from public statements; details like the
  text-hint generator architecture remain unverifiable. The biggest known
  gap is flagged in `framework/composer-replication-framework.md` § "Open
  questions."
- **Pre-spike speculation.** The TL;DR table's stack picks are
  literature-backed but not yet empirically validated on this codebase. The
  v0.0 spike will produce the first empirical result.
- **Single-snapshot research.** All five reports were produced on
  2026-05-25. The field moves fast: TorchForge may un-pause, OpenEnv may
  fork, PRIME-RL may consolidate. Re-run the dispatch every 6 months.
## Reproducibility

If you want to reproduce this research dispatch (or extend it with new
topics), the pattern is:

1. Use the `delegate_task` parallel-research pattern (or any equivalent: one
   subagent per topic, all running in parallel, all writing to known paths).
2. **Route different topics to different model families** explicitly. This
   is the cross-family signal, and it requires a multi-model gateway like
   OpenRouter or your local equivalent.
3. Give each subagent a web-research toolset (Tavily, Exa, AWS docs, etc.)
   and ~10 min wall-clock budget.
4. After all reports return, verify each one's served `model` matches the
   intended route (per the route-fidelity discipline).
5. Read all reports in full (do not skim) and reconcile in a master synthesis
   doc that explicitly flags convergent vs single-source claims.
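
Steps 1, 2, and 4 can be sketched with the Python standard library. This is a rough outline under stated assumptions, not the `delegate_task` API: `Brief`, `dispatch`, and the caller-supplied `run_subagent` function (assumed to return session metadata as a dict with a `"model"` key) are all hypothetical names, and the ~10 min wall-clock budget from step 3 is left to whatever runner sits behind `run_subagent`.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Brief:
    topic: str
    model: str        # gateway model ID this topic should be routed to
    output_path: str  # known path the subagent is told to write to


def dispatch(briefs, run_subagent):
    """Run one subagent per brief in parallel and record route fidelity.

    Returns {output_path: (served_model, route_ok)}. `run_subagent(brief)`
    is a hypothetical callable that blocks until the subagent finishes and
    returns its session metadata as a dict containing a "model" key.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(briefs))) as pool:
        # Submit every brief before waiting, so the subagents overlap in time.
        futures = [(b, pool.submit(run_subagent, b)) for b in briefs]
        report = {}
        for brief, future in futures:
            served = future.result()["model"]
            report[brief.output_path] = (served, served == brief.model)
    return report
```

Threads are enough here because each worker mostly waits on a remote model call. Any entry whose `route_ok` is `False` should be re-run or excluded before the reports are reconciled in step 5.
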

This pattern generalizes beyond this project: it applies to any meaty
literature-review task where a single model's perspective is suspect.