	# Methodology — Composer 2.5 Replication Framework Research

	This document records how the research synthesis in this repo was produced, so
	the methodology is reproducible and the cross-family verification claim is
	auditable.

	## Research dispatch

	On 2026-05-25, five parallel research subagents were dispatched via the
	[`delegate_task`](https://hermes-agent.nousresearch.com/) parallel-research
	pattern, one per topic. Each was given:

	- A specific research scope (one of: Composer 2.5 internals; DiLoCo family;
	Monarch / TorchForge / OpenEnv; VeRL / TRL; trace-replay distillation
	novelty assessment).
	- An explicit instruction to write findings to a known path
	(`~/wiki/research/post-training-framework/0X-<topic>.md`).
	- ~2000–2500 word target depth.
	- Web-research toolset (Tavily, Exa, AWS docs, MCP doc readers).
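
The per-subagent brief above can be captured as a small config record. This is a minimal sketch only: `ResearchTask`, its field names, the tool labels, and the `build_tasks` helper are hypothetical and are not part of `delegate_task`. The topic slugs mirror the report filenames used in this repo.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Output directory as stated in the brief; "~" is left unexpanded on purpose.
OUTPUT_DIR = PurePosixPath("~/wiki/research/post-training-framework")

# Topic slugs mirror the report filenames (01-... through 05-...).
TOPIC_SLUGS = (
    "composer-2.5",
    "diloco-family",
    "monarch-torchforge-openenv",
    "verl-trl",
    "trace-replay-distillation",
)


@dataclass(frozen=True)
class ResearchTask:
    """Hypothetical record of what one subagent is given."""
    topic: str
    output_path: str
    min_words: int = 2000
    max_words: int = 2500
    # Placeholder labels for the web-research toolset, not real tool IDs.
    tools: tuple = ("tavily", "exa", "aws-docs", "mcp-doc-readers")


def build_tasks(slugs=TOPIC_SLUGS):
    """Create one task per topic, numbering output files 01, 02, ..."""
    return [
        ResearchTask(
            topic=slug,
            output_path=str(OUTPUT_DIR / f"{i:02d}-{slug}.md"),
        )
        for i, slug in enumerate(slugs, start=1)
    ]
```

Keeping the output path inside the task record means the reader of the reports does not have to guess where each subagent wrote its findings.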

	Each subagent ran independently, with no cross-agent communication and no
	shared intermediate state. All five received the same brief structure, but
	each topic was **routed to a different LLM family** for cross-family signal:

	| File | Author model | Rationale |
	|---|---|---|
	| `research/01-composer-2.5.md` | `google/gemini-3.1-pro-preview` | Long-context grounded research is Gemini's strong suit |
	| `research/02-diloco-family.md` | `deepseek/deepseek-v4-pro` | Strong on distributed-systems and pretraining literature |
	| `research/03-monarch-torchforge-openenv.md` | `openai/gpt-5` | Best at reading framework / SDK source code |
	| `research/04-verl-trl.md` | `anthropic/claude-sonnet-4.6` | Best at algorithmic precision (loss math, importance sampling) |
	| `research/05-trace-replay-distillation.md` | `moonshotai/kimi-k2-thinking` | Strong at novelty assessment and prior-art discovery |

	All routes were verified post-hoc via the per-task `model` field returned
	in the delegated agent's session metadata, so the claim that the synthesis
	draws on five model families rather than a single model's biases is auditable.
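
The post-hoc route check can be sketched as a comparison between the intended routes and the served `model` values. The shape of the session metadata is an assumption here (a dict with a `model` key per report), and `verify_routes` is a hypothetical helper, not part of any real tool.

```python
# Intended routes, copied from the routing table in this section.
INTENDED_ROUTES = {
    "research/01-composer-2.5.md": "google/gemini-3.1-pro-preview",
    "research/02-diloco-family.md": "deepseek/deepseek-v4-pro",
    "research/03-monarch-torchforge-openenv.md": "openai/gpt-5",
    "research/04-verl-trl.md": "anthropic/claude-sonnet-4.6",
    "research/05-trace-replay-distillation.md": "moonshotai/kimi-k2-thinking",
}


def verify_routes(intended, served_metadata):
    """Return (report, intended_model, served_model) for every mismatch.

    `served_metadata` maps report path -> session metadata dict. Only the
    `model` key is read; a missing report or key shows up as None.
    """
    mismatches = []
    for report, intended_model in intended.items():
        served_model = served_metadata.get(report, {}).get("model")
        if served_model != intended_model:
            mismatches.append((report, intended_model, served_model))
    return mismatches
```

An empty list means every report was served by the intended model. Any entry means the cross-family claim does not hold for that report, and the report should be re-run or flagged.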

	## Synthesis

	The master synthesis (`framework/composer-replication-framework.md`) was
	produced by reading all five reports in full and reconciling:

	- Convergent claims (≥2 independent reports agree) → promoted to
	framework-level decisions in the TL;DR table.
	- Divergent claims (reports recommend different stacks for the same
	layer) → noted explicitly with "use X today, switch to Y when Z" rationale
	rather than picking one arbitrarily.
	- Single-source claims (only one report makes the claim) → kept but
	flagged as "single-source — may be model bias" where consequential.
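
The three reconciliation rules can be expressed as a small classifier. This is a simplified sketch, not the procedure actually used: it assumes claims have already been normalized by hand into `(report_id, layer, recommendation)` tuples, whereas in practice matching claims across reports requires reading them. The function name and the layer labels are hypothetical.

```python
from collections import defaultdict


def classify_claims(claims):
    """Label each layer as convergent, divergent, or single-source.

    claims: iterable of (report_id, layer, recommendation) tuples.
    Returns {layer: (label, {recommendation: sorted report ids})}.
    """
    by_layer = defaultdict(lambda: defaultdict(set))
    for report_id, layer, recommendation in claims:
        by_layer[layer][recommendation].add(report_id)

    result = {}
    for layer, recs in by_layer.items():
        if len(recs) > 1:
            # Reports recommend different things for the same layer.
            label = "divergent"
        else:
            supporters = next(iter(recs.values()))
            # At least two independent reports agree, else single-source.
            label = "convergent" if len(supporters) >= 2 else "single-source"
        result[layer] = (label, {r: sorted(ids) for r, ids in recs.items()})
    return result
```

For example, feeding in `("04", "rl-algorithm", "GRPO+DAPO")`, `("02", "rl-algorithm", "GRPO+DAPO")`, and `("03", "rl-algorithm", "GRPO+DAPO")` labels the `rl-algorithm` layer as convergent, while a claim made only by report 05 is labeled single-source.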

	Convergent findings (verified across reports):

	- GRPO+DAPO is the consensus algorithm. Reports 04 (TRL/VeRL deep-dive),
	02 (PRIME-RL section), and 03 (Forge algorithm catalog) all converge on
	GRPO with DAPO patches as the production default for long-horizon agentic
	RL.
	- PRIME-RL is the most production-ready decentralized substrate. Reports
	02 and 04 independently cite INTELLECT-2 (a 32B model based on QwQ, trained
	on globally distributed compute) as the only production-scale decentralized
	RL run to date.
	- OpenEnv is the env-format winner. Reports 03 (Meta's stack), 04 (TRL's
	Oct 2025 OpenEnv integration), and 05 (env-substrate analysis) all
	converge on OpenEnv + verifiers as the emerging standard.
	- Trace-replay multi-teacher distillation appears genuinely under-explored.
	This is Report 05's primary finding. It is corroborated only indirectly:
	none of the other four reports (which surveyed the algorithm and framework
	literature widely) mentions per-step multi-teacher distillation as an
	existing technique.

	## Sources

	The synthesis cites primary sources inline. Major primary sources include:

	- Cursor blog: <https://cursor.com/blog/composer-2-5> (the Composer 2.5
	release post that motivated the whole project).
	- Moonshot K2 paper: <https://arxiv.org/abs/2507.20534> (Kimi K2 base
	model, the predecessor to K2.5).
	- DeepMind DiLoCo paper: <https://arxiv.org/abs/2311.08105>; **Streaming
	DiLoCo**: <https://arxiv.org/abs/2501.18512>.
	- Prime Intellect INTELLECT-2 announcement: <https://www.primeintellect.ai/blog/intellect-2>.
	- VeRL paper: <https://arxiv.org/abs/2409.19256>.
	- HuggingFace TRL: <https://github.com/huggingface/trl>.
	- Microsoft rStar / rStar-Math: <https://arxiv.org/abs/2408.06195>.
	- Meta OpenEnv: <https://github.com/meta-pytorch/openenv>.
	- Meta Monarch: <https://github.com/meta-pytorch/monarch>.

	The five research notes link to many more secondary sources (blog posts,
	Twitter threads, individual repo READMEs). Those are auxiliary context, not
	primary evidence.

	## Limitations

	- No primary-source access to Cursor's training pipeline. Composer 2.5's
	exact recipe is reconstructed from public statements; details like the
	text-hint generator architecture remain unverifiable. The biggest known
	gap is flagged in `framework/composer-replication-framework.md` § "Open
	questions."
	- Pre-spike speculation. The TL;DR table's stack picks are
	literature-backed but not yet empirically validated on this codebase. The
	v0.0 spike will produce the first empirical result.
	- Single-snapshot research. All five reports were produced on
	2026-05-25. The field moves fast — TorchForge may un-pause, OpenEnv may
	fork, PRIME-RL may consolidate. Re-run the dispatch every 6 months.

	## Reproducibility

	If you want to reproduce this research dispatch (or extend it with new
	topics), the pattern is:

	1. Use the `delegate_task` parallel-research pattern (or any equivalent: one
	subagent per topic, all running in parallel, all writing to known paths).
	2. Route different topics to different model families explicitly — this
	is the cross-family signal, and it requires a multi-model gateway like
	OpenRouter or your local equivalent.
	3. Give each subagent a web-research toolset (Tavily, Exa, AWS docs, etc.)
	and ~10 min wall-clock budget.
	4. After all reports return, verify that each one's served `model` matches
	the intended route, as described under "Research dispatch".
	5. Read all reports in full (do not skim) and reconcile in a master synthesis
	doc that explicitly flags convergent vs single-source claims.
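
Steps 1 and 2 above can be sketched with a thread pool. This is an illustration under stated assumptions: `run_subagent` is a hypothetical stand-in that only echoes its inputs, where a real version would call a multi-model gateway and write the report to its known path. The `dispatch` helper and the result fields are also hypothetical, and the wall-clock budget from step 3 is not enforced here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(topic, model):
    """Hypothetical stand-in for a real subagent call.

    Returns fake session metadata that echoes the requested model so the
    dispatch flow can be exercised without network access.
    """
    return {
        "topic": topic,
        "model": model,
        "report_path": f"research/{topic}.md",
    }


def dispatch(routes, runner=run_subagent, max_workers=5):
    """Run one subagent per topic in parallel; return results keyed by topic.

    routes: dict mapping topic slug -> model identifier.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every topic first so all subagents run concurrently.
        futures = {
            topic: pool.submit(runner, topic, model)
            for topic, model in routes.items()
        }
        # Collect results; an exception in any subagent is re-raised here.
        return {topic: future.result() for topic, future in futures.items()}
```

Passing `runner` as a parameter keeps the dispatch logic independent of any particular gateway, so the same function works with a real client or a test double.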

	This pattern generalizes beyond this project: it suits any substantial
	literature-review task where a single model's perspective is suspect.