# Methodology — Composer 2.5 Replication Framework Research

This document records *how* the research synthesis in this repo was produced, so
the methodology is reproducible and the cross-family verification claim is
auditable.

## Research dispatch

On 2026-05-25, five parallel research subagents were dispatched via the
[`delegate_task`](https://hermes-agent.nousresearch.com/) parallel-research
pattern, one per topic. Each was given:

- A specific research scope (one of: Composer 2.5 internals; DiLoCo family;
  Monarch / TorchForge / OpenEnv; VeRL / TRL; trace-replay distillation
  novelty assessment).
- An explicit instruction to write findings to a known path
  (`~/wiki/research/post-training-framework/0X-<topic>.md`).
- A target depth of roughly 2,000–2,500 words.
- A web-research toolset (Tavily, Exa, AWS docs, MCP doc readers).
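
The brief handed to each subagent can be sketched as a small data structure. This is a minimal illustration, not the actual dispatch code: the `ResearchBrief` class, its field names, and the prompt wording in `render` are assumptions, while the output directory, topic scopes, word target, and tool names come from the list above and the file slugs from the routing table in this document.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Directory from the path pattern quoted in the brief.
OUTPUT_DIR = PurePosixPath("~/wiki/research/post-training-framework")
TOOLS = ("Tavily", "Exa", "AWS docs", "MCP doc readers")


@dataclass(frozen=True)
class ResearchBrief:
    """Hypothetical container for one subagent's instructions."""
    index: int             # 1-based; rendered as the two-digit file prefix
    slug: str              # topic slug used in the output file name
    scope: str             # research scope handed to the subagent
    min_words: int = 2000
    max_words: int = 2500

    @property
    def output_path(self) -> str:
        return str(OUTPUT_DIR / f"{self.index:02d}-{self.slug}.md")

    def render(self) -> str:
        """Render the brief as plain prompt text (wording is illustrative)."""
        return "\n".join([
            f"Research scope: {self.scope}",
            f"Write findings to: {self.output_path}",
            f"Target depth: {self.min_words}-{self.max_words} words",
            f"Tools: {', '.join(TOOLS)}",
        ])


TOPICS = [
    ("composer-2.5", "Composer 2.5 internals"),
    ("diloco-family", "DiLoCo family"),
    ("monarch-torchforge-openenv", "Monarch / TorchForge / OpenEnv"),
    ("verl-trl", "VeRL / TRL"),
    ("trace-replay-distillation", "Trace-replay distillation novelty assessment"),
]
BRIEFS = [
    ResearchBrief(i, slug, scope)
    for i, (slug, scope) in enumerate(TOPICS, start=1)
]

if __name__ == "__main__":
    print(BRIEFS[0].render())
```

Keeping the output path as a derived property means the file name cannot drift from the topic index, which matters when the synthesis step later looks for reports at known paths.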

Each subagent ran independently: no cross-agent communication, no shared
intermediate state. All five received the same brief structure, but each was
**routed to a different LLM family** for cross-family signal:

| File | Author model | Rationale |
|---|---|---|
| `research/01-composer-2.5.md` | `google/gemini-3.1-pro-preview` | Long-context grounded research is Gemini's strong suit |
| `research/02-diloco-family.md` | `deepseek/deepseek-v4-pro` | Strong on distributed-systems and pretraining literature |
| `research/03-monarch-torchforge-openenv.md` | `openai/gpt-5` | Best at reading framework / SDK source code |
| `research/04-verl-trl.md` | `anthropic/claude-sonnet-4.6` | Best at algorithmic precision (loss math, importance sampling) |
| `research/05-trace-replay-distillation.md` | `moonshotai/kimi-k2-thinking` | Strong at novelty assessment and prior-art discovery |

All routes were **verified post-hoc** via the per-task `model` field returned
in the delegated agent's session metadata, so each report is known to come
from its intended family and the synthesis does not rest on a single model's
biases.
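
A post-hoc route check of this kind might look like the sketch below. The routing table is copied from above; the shape of the session metadata (a dict keyed by report path, each entry carrying a `model` string) and the function names `verify_routes` and `distinct_families` are assumptions made for illustration, since the actual metadata format is not shown here.

```python
# Intended routes, copied from the routing table.
INTENDED_ROUTES = {
    "research/01-composer-2.5.md": "google/gemini-3.1-pro-preview",
    "research/02-diloco-family.md": "deepseek/deepseek-v4-pro",
    "research/03-monarch-torchforge-openenv.md": "openai/gpt-5",
    "research/04-verl-trl.md": "anthropic/claude-sonnet-4.6",
    "research/05-trace-replay-distillation.md": "moonshotai/kimi-k2-thinking",
}


def verify_routes(intended, session_metadata):
    """Return {path: (intended, served)} for every route that did not match.

    `session_metadata` is assumed to map each report path to a dict with a
    "model" key; a missing entry is reported with served=None.
    """
    mismatches = {}
    for path, want in intended.items():
        served = session_metadata.get(path, {}).get("model")
        if served != want:
            mismatches[path] = (want, served)
    return mismatches


def distinct_families(routes):
    """Provider prefixes (the part before the first '/') across all routes."""
    return {model.split("/", 1)[0] for model in routes.values()}


if __name__ == "__main__":
    # Simulated metadata in which every task was served by its intended model.
    metadata = {path: {"model": model} for path, model in INTENDED_ROUTES.items()}
    print(verify_routes(INTENDED_ROUTES, metadata))
    print(sorted(distinct_families(INTENDED_ROUTES)))
```

An empty result from `verify_routes` is the condition the cross-family claim depends on; any non-empty result means a report should be re-dispatched or flagged before synthesis.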

## Synthesis

The master synthesis (`framework/composer-replication-framework.md`) was
produced by reading all five reports in full and reconciling:

- **Convergent claims** (≥2 independent reports agree) → promoted to
  framework-level decisions in the TL;DR table.
- **Divergent claims** (reports recommend different stacks for the same
  layer) → noted explicitly with "use X today, switch to Y when Z" rationale
  rather than picking one arbitrarily.
- **Single-source claims** (only one report makes the claim) → kept but
  flagged as "single-source — may be model bias" where consequential.
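
The three-way rule above can be expressed as a small classifier. This is a sketch under stated assumptions: the reconciliation in this repo was done by reading the reports, not by running code, and the claim tuple format `(report, layer, recommendation)`, the `classify` function, and the choice to let a divergence take precedence over agreement are all illustrative. The `orchestration` layer and its `stack-A` / `stack-B` values are hypothetical placeholders.

```python
from collections import defaultdict


def classify(claims):
    """Label each layer as convergent, divergent, or single-source.

    `claims` is an iterable of (report_id, layer, recommendation) tuples.
    Assumed precedence: any disagreement within a layer marks it divergent,
    even if one recommendation has two or more supporters.
    """
    by_layer = defaultdict(lambda: defaultdict(set))
    for report, layer, recommendation in claims:
        by_layer[layer][recommendation].add(report)

    labels = {}
    for layer, recs in by_layer.items():
        if len(recs) > 1:
            labels[layer] = "divergent"
        elif len(next(iter(recs.values()))) >= 2:
            labels[layer] = "convergent"
        else:
            labels[layer] = "single-source"
    return labels


CLAIMS = [
    # Taken from the convergent findings listed in this document.
    ("04", "algorithm", "GRPO+DAPO"),
    ("02", "algorithm", "GRPO+DAPO"),
    ("03", "algorithm", "GRPO+DAPO"),
    ("03", "env-format", "OpenEnv"),
    ("04", "env-format", "OpenEnv"),
    ("05", "env-format", "OpenEnv"),
    # Only Report 05 makes this claim directly.
    ("05", "distillation", "trace-replay multi-teacher"),
    # Hypothetical disagreement, included only to exercise the divergent path.
    ("02", "orchestration", "stack-A"),
    ("03", "orchestration", "stack-B"),
]

if __name__ == "__main__":
    for layer, label in classify(CLAIMS).items():
        print(f"{layer}: {label}")
```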

Convergent findings (verified across reports):

- **GRPO+DAPO is the consensus algorithm.** Reports 04 (TRL/VeRL deep-dive),
  02 (PRIME-RL section), and 03 (Forge algorithm catalog) all converge on
  GRPO with DAPO patches as the production default for long-horizon agentic
  RL.
- **PRIME-RL is the most production-ready decentralized substrate.** Reports
  02 and 04 independently cite INTELLECT-2 (32B QwQ trained globally
  distributed) as the only production-scale decentralized RL run to date.
- **OpenEnv is the env-format winner.** Reports 03 (Meta's stack), 04 (TRL's
  Oct 2025 OpenEnv integration), and 05 (env-substrate analysis) all
  converge on OpenEnv + verifiers as the emerging standard.
- **Trace-replay multi-teacher is genuinely under-explored.** This is Report
  05's primary finding. It is corroborated indirectly: none of the other four
  reports (which surveyed the algorithm and framework literature widely)
  mentions per-step multi-teacher distillation as an existing technique.

## Sources

The synthesis cites primary sources inline. Major primary sources include:

- **Cursor blog**: <https://cursor.com/blog/composer-2-5> (the Composer 2.5
  release post that motivated the whole project).
- **Moonshot K2 paper**: <https://arxiv.org/abs/2507.20534> (Kimi K2 base
  model, the predecessor to K2.5).
- **DeepMind DiLoCo paper**: <https://arxiv.org/abs/2311.08105>; **Streaming
  DiLoCo**: <https://arxiv.org/abs/2501.18512>.
- **Prime Intellect INTELLECT-2 announcement**: <https://www.primeintellect.ai/blog/intellect-2>.
- **VeRL paper**: <https://arxiv.org/abs/2409.19256>.
- **HuggingFace TRL**: <https://github.com/huggingface/trl>.
- **Microsoft rStar**: <https://arxiv.org/abs/2408.06195>; **rStar-Math**:
  <https://arxiv.org/abs/2501.04519>.
- **Meta OpenEnv**: <https://github.com/meta-pytorch/openenv>.
- **Meta Monarch**: <https://github.com/meta-pytorch/monarch>.

The five research notes link to many more secondary sources (blog posts,
Twitter threads, individual repo READMEs). Those are auxiliary context, not
primary evidence.

## Limitations

- **No primary-source access to Cursor's training pipeline.** Composer 2.5's
  exact recipe is reconstructed from public statements; details like the
  text-hint generator architecture remain unverifiable. The biggest known
  gap is flagged in `framework/composer-replication-framework.md` § "Open
  questions."
- **Pre-spike speculation.** The TL;DR table's stack picks are
  literature-backed but not yet empirically validated on this codebase. The
  v0.0 spike will produce the first empirical result.
- **Single-snapshot research.** All five reports were produced on
  2026-05-25. The field moves fast — TorchForge may un-pause, OpenEnv may
  fork, PRIME-RL may consolidate. Re-run the dispatch every 6 months.

## Reproducibility

If you want to reproduce this research dispatch (or extend it with new
topics), the pattern is:

1. Use the `delegate_task` parallel-research pattern (or any equivalent: one
   subagent per topic, all running in parallel, all writing to known paths).
2. **Route different topics to different model families** explicitly — this
   is the cross-family signal, and it requires a multi-model gateway like
   OpenRouter or your local equivalent.
3. Give each subagent a web-research toolset (Tavily, Exa, AWS docs, etc.)
   and ~10 min wall-clock budget.
4. After all reports return, verify that each one's served `model` matches
   the intended route, as in the post-hoc check under "Research dispatch".
5. Read all reports in full (do not skim) and reconcile in a master synthesis
   doc that explicitly flags convergent vs single-source claims.
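
Steps 1 through 4 can be sketched with the standard library alone. This is a rough illustration, not the `delegate_task` API: `run_subagent` is a hypothetical stub standing in for a real delegated task, the route names are placeholders, and the result shape (a dict with `model` and `report` keys) is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

WALL_CLOCK_BUDGET_S = 10 * 60  # ~10 min per subagent, as in step 3


def run_subagent(topic, model):
    """Hypothetical stub for one delegated research task.

    A real runner would call a multi-model gateway and return the model that
    actually served the request alongside the report text.
    """
    return {"topic": topic, "model": model, "report": f"# {topic}\n(stub report)"}


def dispatch(routes, runner=run_subagent, budget_s=WALL_CLOCK_BUDGET_S):
    """Run one subagent per topic in parallel, then check route fidelity."""
    with ThreadPoolExecutor(max_workers=len(routes)) as pool:
        futures = {
            topic: pool.submit(runner, topic, model)
            for topic, model in routes.items()
        }
        # result() waits up to budget_s for each task and raises TimeoutError
        # past that; it does not cancel the still-running worker thread.
        results = {
            topic: future.result(timeout=budget_s)
            for topic, future in futures.items()
        }

    # Step 4: the served model must match the intended route.
    wrong = sorted(t for t, r in results.items() if r["model"] != routes[t])
    if wrong:
        raise RuntimeError(f"route mismatch for: {', '.join(wrong)}")
    return results


# Placeholder routes; substitute real topic names and gateway model IDs.
ROUTES = {
    "topic-a": "family-a/model-x",
    "topic-b": "family-b/model-y",
    "topic-c": "family-c/model-z",
}

if __name__ == "__main__":
    for topic, result in dispatch(ROUTES).items():
        print(topic, "->", result["model"])
```

Failing loudly on a route mismatch, instead of logging and continuing, keeps a silently re-routed report from entering the synthesis as if it were an independent family.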

This pattern generalizes beyond this project: it suits any substantial
literature review where a single model's perspective is suspect.