Methodology — Composer 2.5 Replication Framework Research
This document records how the research synthesis in this repo was produced, so the methodology is reproducible and the cross-family verification claim is auditable.
Research dispatch
On 2026-05-25, five parallel research subagents were dispatched via the `delegate_task` parallel-research pattern, one per topic. Each was given:
- A specific research scope (one of: Composer 2.5 internals; DiLoCo family; Monarch / TorchForge / OpenEnv; VeRL / TRL; trace-replay distillation novelty assessment).
- An explicit instruction to write findings to a known path (`~/wiki/research/post-training-framework/0X-<topic>.md`).
- ~2000–2500 word target depth.
- Web-research toolset (Tavily, Exa, AWS docs, MCP doc readers).
Each subagent ran independently, with no cross-agent communication and no shared intermediate state. The briefs followed a uniform template, but each topic was routed to a different LLM family for cross-family signal:
| File | Author model | Rationale |
|---|---|---|
| research/01-composer-2.5.md | google/gemini-3.1-pro-preview | Long-context grounded research is Gemini's strong suit |
| research/02-diloco-family.md | deepseek/deepseek-v4-pro | Strong on distributed-systems and pretraining literature |
| research/03-monarch-torchforge-openenv.md | openai/gpt-5 | Best at reading framework / SDK source code |
| research/04-verl-trl.md | anthropic/claude-sonnet-4.6 | Best at algorithmic precision (loss math, importance sampling) |
| research/05-trace-replay-distillation.md | moonshotai/kimi-k2-thinking | Strong at novelty assessment and prior-art discovery |
All routes were verified post hoc via the per-task model field returned in each delegated agent's session metadata. This confirms that the five reports really came from five different model families, so the synthesis does not rest on a single model's biases.
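The post-hoc route check described above could be sketched as follows. This is a minimal illustration, not the tooling used in this repo: it assumes the session metadata has already been collected into a plain dict mapping each report path to the served model string, and the helper names (`verify_routes`, `distinct_families`) are hypothetical.

```python
# Intended routes, copied from the table above.
INTENDED_ROUTES = {
    "research/01-composer-2.5.md": "google/gemini-3.1-pro-preview",
    "research/02-diloco-family.md": "deepseek/deepseek-v4-pro",
    "research/03-monarch-torchforge-openenv.md": "openai/gpt-5",
    "research/04-verl-trl.md": "anthropic/claude-sonnet-4.6",
    "research/05-trace-replay-distillation.md": "moonshotai/kimi-k2-thinking",
}


def family(model_id: str) -> str:
    """Vendor prefix of a gateway-style id, e.g. 'openai/gpt-5' gives 'openai'."""
    return model_id.split("/", 1)[0]


def verify_routes(intended: dict[str, str], served: dict[str, str]) -> list[str]:
    """Return human-readable mismatches; an empty list means every route held.

    `served` is an assumed shape: report path -> model string taken from
    the delegated agent's session metadata.
    """
    problems = []
    for path, want in intended.items():
        got = served.get(path)
        if got is None:
            problems.append(f"{path}: no served model recorded")
        elif got != want:
            problems.append(f"{path}: intended {want}, served {got}")
    return problems


def distinct_families(served: dict[str, str]) -> set[str]:
    """Set of model families that actually authored a report."""
    return {family(model) for model in served.values()}
```

Checking `distinct_families` in addition to exact matches guards against a silent fallback in which several topics end up served by the same family.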
Synthesis
The master synthesis (framework/composer-replication-framework.md) was
produced by reading all five reports in full and reconciling:
- Convergent claims (≥2 independent reports agree) → promoted to framework-level decisions in the TL;DR table.
- Divergent claims (reports recommend different stacks for the same layer) → noted explicitly with "use X today, switch to Y when Z" rationale rather than picking one arbitrarily.
- Single-source claims (only one report makes the claim) → kept but flagged as "single-source — may be model bias" where consequential.
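The three reconciliation rules above can be expressed as a small classifier. This is a sketch under assumptions: the synthesis was done by reading, not by code, and the `Claim` record and its fields are hypothetical. A claim is treated as divergent whenever at least one report recommends a conflicting choice for the same layer.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Claim:
    """One claim extracted from the research notes (hypothetical record)."""
    text: str
    supporting_reports: set[str]                      # e.g. {"02", "04"}
    conflicting_reports: set[str] = field(default_factory=set)


def classify(claim: Claim) -> str:
    """Apply the reconciliation rules: divergent, convergent, or single-source."""
    if not claim.supporting_reports:
        raise ValueError("a claim needs at least one supporting report")
    if claim.conflicting_reports:
        # Reports disagree: record a "use X today, switch to Y when Z" rationale.
        return "divergent"
    if len(claim.supporting_reports) >= 2:
        # At least two independent reports agree: promote to the TL;DR table.
        return "convergent"
    # Only one report makes the claim: keep it, but flag possible model bias.
    return "single-source"
```

For example, a claim supported by reports 02, 03, and 04 with no conflicts classifies as convergent, while a claim made only by report 05 classifies as single-source.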
Convergent findings (verified across reports):
- GRPO+DAPO is the consensus algorithm. Reports 04 (TRL/VeRL deep-dive), 02 (PRIME-RL section), and 03 (Forge algorithm catalog) all converge on GRPO with DAPO patches as the production default for long-horizon agentic RL.
- PRIME-RL is the most production-ready decentralized substrate. Reports 02 and 04 independently cite INTELLECT-2 (32B QwQ trained globally distributed) as the only production-scale decentralized RL run to date.
- OpenEnv is the env-format winner. Reports 03 (Meta's stack), 04 (TRL's Oct 2025 OpenEnv integration), and 05 (env-substrate analysis) all converge on OpenEnv + verifiers as the emerging standard.
- Trace-replay multi-teacher is genuinely under-explored. Report 05's primary finding, corroborated by the fact that none of the other 4 reports (which surveyed the algorithm and framework literature widely) mention per-step multi-teacher distillation as an existing technique.
Sources
The synthesis cites primary sources inline. Major primary sources include:
- Cursor blog: https://cursor.com/blog/composer-2-5 (the Composer 2.5 release post that motivated the whole project).
- Moonshot K2 paper: https://arxiv.org/abs/2507.20534 (Kimi K2 base model, the predecessor to K2.5).
- DeepMind DiLoCo paper: https://arxiv.org/abs/2311.08105; Streaming DiLoCo: https://arxiv.org/abs/2501.18512.
- Prime Intellect INTELLECT-2 announcement: https://www.primeintellect.ai/blog/intellect-2.
- VeRL paper: https://arxiv.org/abs/2409.19256.
- HuggingFace TRL: https://github.com/huggingface/trl.
- Microsoft rStar / rStar-Math: https://arxiv.org/abs/2408.06195.
- Meta OpenEnv: https://github.com/meta-pytorch/openenv.
- Meta Monarch: https://github.com/meta-pytorch/monarch.
The five research notes link to many more secondary sources (blog posts, twitter threads, individual repo READMEs). Those are auxiliary context, not primary evidence.
Limitations
- No primary-source access to Cursor's training pipeline. Composer 2.5's exact recipe is reconstructed from public statements; details like the text-hint generator architecture remain unverifiable. The biggest known gap is flagged in `framework/composer-replication-framework.md` § "Open questions."
- Pre-spike speculation. The TL;DR table's stack picks are literature-backed but not yet empirically validated on this codebase. The v0.0 spike will produce the first empirical result.
- Single-snapshot research. All five reports were produced on 2026-05-25. The field moves fast — TorchForge may un-pause, OpenEnv may fork, PRIME-RL may consolidate. Re-run the dispatch every 6 months.
Reproducibility
If you want to reproduce this research dispatch (or extend it with new topics), the pattern is:
- Use the `delegate_task` parallel-research pattern (or any equivalent: one subagent per topic, all running in parallel, all writing to known paths).
- Route different topics to different model families explicitly. This is the cross-family signal, and it requires a multi-model gateway like OpenRouter or your local equivalent.
- Give each subagent a web-research toolset (Tavily, Exa, AWS docs, etc.) and ~10 min wall-clock budget.
- After all reports return, verify each one's served `model` matches the intended route (per the route-fidelity discipline).
- Read all reports in full (do not skim) and reconcile in a master synthesis doc that explicitly flags convergent vs single-source claims.
This pattern generalizes beyond this project; it's the same approach used for any meaty literature-review task where a single model's perspective is suspect.
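The dispatch steps listed above can be sketched with the standard library alone. This is an illustration under assumptions, not the actual `delegate_task` implementation: `dispatch` and `fake_subagent` are hypothetical names, and the subagent runner is injected as a plain callable so that any gateway client can be plugged in.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# topic -> model family route, mirroring the routing table in this document.
ROUTES = {
    "01-composer-2.5": "google/gemini-3.1-pro-preview",
    "02-diloco-family": "deepseek/deepseek-v4-pro",
    "03-monarch-torchforge-openenv": "openai/gpt-5",
    "04-verl-trl": "anthropic/claude-sonnet-4.6",
    "05-trace-replay-distillation": "moonshotai/kimi-k2-thinking",
}


def dispatch(
    routes: dict[str, str],
    run_subagent: Callable[[str, str, str], dict],
    out_dir: str = "research",
    wait_seconds: float = 600.0,
) -> dict[str, dict]:
    """Run one subagent per topic in parallel and collect each result.

    `run_subagent(topic, model, out_path)` is an assumed interface. Note that
    `wait_seconds` only bounds how long we wait for each result; it does not
    kill a subagent that overruns its budget.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(routes))) as pool:
        futures = {
            topic: pool.submit(run_subagent, topic, model, f"{out_dir}/{topic}.md")
            for topic, model in routes.items()
        }
        return {
            topic: fut.result(timeout=wait_seconds)
            for topic, fut in futures.items()
        }


def fake_subagent(topic: str, model: str, out_path: str) -> dict:
    """Placeholder runner: a real one would call the gateway and write the report."""
    return {"topic": topic, "model": model, "path": out_path}
```

Calling `dispatch(ROUTES, fake_subagent)` returns one metadata dict per topic; the `model` entry in each is what the route-fidelity step would compare against the intended route.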