composer-replication-framework / docs /INTEGRATION_RECIPES.md

Wave 16: install ergonomics + gradient evidence + SDPO end-to-end example

c0a5ab7 10 days ago

41.7 kB

	# INTEGRATION_RECIPES.md — Wiring the 3-channel composer loss into your RL stack

	> Status: Wave 14 release reference. Supersedes the historical
	> [`docs/INTEGRATION_ARCHITECTURE.md`](INTEGRATION_ARCHITECTURE.md) (Recipes
	> A–D), which is retained as background reading for the original
	> mechanism-level diagrams.
	>
	> Companion docs:
	> - [`docs/USER_GUIDE.md`](USER_GUIDE.md) — narrative walk-through, sections 1–8
	> - [`docs/API_REFERENCE.md`](API_REFERENCE.md) — exact kwarg signatures
	> - [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md) — error → fix index
	> - [`docs/V3_SUBSTRATE_COVERAGE.md`](V3_SUBSTRATE_COVERAGE.md) — what each
	> substrate covers
	> - [`docs/adrs/ADR-006-rl-frameworks.md`](adrs/ADR-006-rl-frameworks.md) —
	> why these five recipes and not others

	This document is the canonical answer to **"how do I plug the 3-channel
	composer loss into framework X?"** for the five frameworks the project
	supports as of Wave 14:

	1. [TRL `GRPOTrainer` subclass](#recipe-1--trl-grpotrainer-subclass)
	2. [VeRL custom `adv_estimator` + DataProto extension](#recipe-2--verl-custom-adv_estimator--dataproto-extension)
	3. [PRIME-RL custom-loss config](#recipe-3--prime-rl-customlossconfig)
	4. [Serverless Decoupled DiLoCo (Modal / HF Jobs / SageMaker)](#recipe-4--serverless-decoupled-diloco)
	5. [Monarch actor mesh (TorchForge-style topology)](#recipe-5--monarch-actor-mesh)

	Each recipe follows the same seven-part template:

	1. When to use it — decision criteria.
	2. Install command — which optional extras of `composer-replication`.
	3. Minimum-viable Python script — copy-pasteable, ≤ 60 lines.
	4. Decoupled DiLoCo wiring — how `ServerlessExecutor` +
	`ObjectStoreAllReduce` + `MockManager` layer on top.
	5. Distillation-loss wiring — how to switch DPO → SimPO and add TAID
	via `compose_loss(..., dpo_variant=..., sdpo_wrapper=...)` or the
	recipe's own loss-config field.
	6. Cost ballpark — GPU $/hr + API spend, sourced from
	[`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md`](research/DILOCO_SERVERLESS_RECONNAISSANCE.md).
	7. Known limitations as of Wave 14.

	A cross-recipe [comparison matrix](#comparison-matrix) closes the doc.

	## TL;DR — the unified loss

	For any of the five recipes, the v0.1 trainer step computes:

	```
	total_loss = grpo_loss
	+ α * sdpo_kl_loss (channel 2 — Composer hint-distill;
	optional TAID or Entropy-OPD wrapper)
	+ β * trace_replay_loss (channel 3 — N-teacher DPO;
	switchable to SimPO)
	```

	This is implemented once, in
	[`composer_replication/loss.py::compose_loss`](../composer_replication/loss.py),
	and re-used by every recipe via the kwargs documented in
	[`API_REFERENCE.md`](API_REFERENCE.md). The full signature — including
	all ADR-007 channel-2/3 knobs (`dpo_variant`, `sdpo_wrapper`, `taid_t`,
	`simpo_beta`/`simpo_gamma`, `entropy_opd_h_max`, …) — is the
	single source of truth in
	[API_REFERENCE.md § `compose_loss`](API_REFERENCE.md#compose_loss).
	The conceptual call shape is just:

	```python
	compose_loss(model, inputs, **kwargs) # see API_REFERENCE.md#compose_loss for full signature
	```

	All five recipes below either call `compose_loss` directly or call a
	thin per-framework adapter that forwards these kwargs unchanged. Each
	recipe's §5 Distillation-loss wiring documents the kwargs *that
	recipe* uses by default and why; refer back to API_REFERENCE.md for
	defaults, types, and which kwargs are mutually exclusive.

	---

	## Recipe 1 — TRL `GRPOTrainer` subclass

	### 1. When to use it

	This is the default v0.0/v0.1 path and the one we recommend for
	~99% of users today. Pick TRL when:

	- Your model fits on ≤ 32 GPUs (typically ≤ 70B-param FSDP).
	- You already have a HuggingFace `model` + `tokenizer` + `datasets` flow.
	- You want minimum integration cost — `ComposerReplicationTrainer` is a
	single subclass override of `_compute_loss` over `trl.GRPOTrainer`,
	no Ray, no actor mesh.
	- You're doing single-host (one node, possibly multi-GPU FSDP) training.

	Don't pick TRL when you need >100 B-param scale, when you must async-decouple
	tool calls from the GPU loop, or when a Ray cluster is already in your stack
	(in which case Recipe 2 is cheaper).

	### 2. Install command

	```bash
	pip install -e ".[train,replaysim]"
	```

	The `train` extra pulls `trl>=0.12`, `peft`, `accelerate`, and `datasets`.
	The `replaysim` extra pulls `data-juicer` for CPU-side DPO normalization
	(channel 3 cleaning step). Add `[serverless]` if you also want Decoupled
	DiLoCo (see step 4).

	### 3. Minimum-viable Python script

	```python
	# train_trl.py — minimum viable Recipe 1
	from datasets import load_dataset
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from composer_replication import ComposerReplicationTrainer

	MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct" # swap for 7B once it works
	model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
	tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
	dataset = load_dataset("trl-lib/tldr", split="train[:512]")

	def reward_length(completions, **_):
	return [-abs(len(c) - 64) for c in completions]

	trainer = ComposerReplicationTrainer(
	model = model,
	processing_class = tokenizer,
	reward_funcs = [reward_length],
	train_dataset = dataset,
	# Composer extras (defaults shown):
	alpha_sdpo = 0.1,
	beta_replay = 0.05,
	sdpo_jsd_beta = 0.5,
	sdpo_temperature = 1.0,
	sdpo_token_clip = None,
	replay_dpo_beta = 0.1,
	)
	trainer.train()
	```

	Channels 2 and 3 auto-disable per step when their inputs aren't
	present in the batch (e.g. batches with no error sites get
	`sdpo_kl=0`). Set `alpha_sdpo=0` / `beta_replay=0` to disable globally
	for ablations.

	### 4. Decoupled DiLoCo wiring

	`ComposerReplicationTrainer` is a single-process trainer. To run N
	replicas of it under Decoupled DiLoCo, layer the serverless stack on the
	outside: each replica runs the script above; `MockManager` stands in for
	`torchft.Manager` on the inner loop and `ObjectStoreAllReduce` runs the
	outer-loop pseudo-gradient exchange:

	```python
	# diloco_replica.py — what each of the N replicas runs
	import os
	from composer_replication.diloco import make_diloco_outer_loop
	from composer_replication.diloco.serverless import (
	LocalProcessExecutor, ObjectStoreAllReduce, MockManager,
	)

	rendezvous = ObjectStoreAllReduce(
	uri = "s3://my-bucket/diloco-runs/run42/",
	world_size = 4,
	rank = int(os.environ["REPLICA_RANK"]),
	)
	manager = MockManager(allreduce=rendezvous)
	# trainer.optimizer is the inner optimizer; the outer is built here:
	outer = make_diloco_outer_loop(
	inner_optimizer = trainer.optimizer,
	manager = manager,
	sync_every_h = 500,
	)
	trainer.add_callback(outer.callback()) # syncs every H inner steps
	trainer.train()
	```

	The driver process spins these up with any `ServerlessExecutor`:

	```python
	# Wave 14: ModalExecutor / HFJobsExecutor are skeletons (raise NotImplementedError);
	# use LocalProcessExecutor for testing. Swap once the cloud backends land.
	executor = LocalProcessExecutor()
	handles = executor.launch_replicas(
	n_replicas = 4,
	entrypoint = "diloco_replica.py",
	entrypoint_args = {"rendezvous": rendezvous.uri,
	"rank_env": "REPLICA_RANK"},
	)
	result = executor.collect(handles, timeout=3600)
	```

	### 5. Distillation-loss wiring

	`ComposerReplicationTrainer` exposes the new ADR-007 channels via the
	shared `compose_loss` kwargs — pass them through `**kwargs` on the
	trainer and they're forwarded to `compose_loss`:

	```python
	trainer = ComposerReplicationTrainer(
	model = model, processing_class = tokenizer,
	reward_funcs = [reward_length], train_dataset = dataset,
	# SimPO instead of DPO for channel 3:
	dpo_variant = "simpo",
	simpo_beta = 2.0,
	simpo_gamma = 1.0,
	# TAID for channel 2 (SakanaAI port; logit-space mix + forward-KL):
	sdpo_wrapper = "taid",
	taid_t = 0.4, # current TAID coeff in [0, 1];
	# drive from TAIDScheduler if you want
	# the paper's adaptive scheme
	)
	```

	Or, equivalently, drop `entropy_opd` in for `taid` if you want
	per-token entropy-gated forward/reverse KL instead of the
	linear-blend interpolation. SimPO does not require reference
	log-probs (channel 3 batches with `dpo_chosen_ref_logprobs` /
	`dpo_rejected_ref_logprobs` set are silently ignored).

	### 6. Cost ballpark

	- GPU: single host, `g5.12xlarge` ($5.67/hr) or RunPod 4×A100-80GB
	(~$5–9/hr) gets you Qwen2.5-7B at moderate throughput. For Qwen2.5-72B
	you'll want 2–4× H100 — `p5.48xlarge` (~$98/hr on AWS, ~$25–30/hr on
	Lambda Cloud / RunPod community).
	- API: channel 3 teacher replay via OpenRouter — verified
	~$0.98/trace at 50 steps × 3 teachers (spike 001). For a 100-trace
	curriculum that's ~$100 in teacher tokens.
	- Storage: negligible until you turn on DiLoCo (then see Recipe 4).

	### 7. Known limitations as of Wave 14

	- Tool calls block the GPU. TRL's rollout is synchronous; long
	tool-call latency idles the trainer. Async-decouple via Recipe 2/3/5
	if this matters.
	- No native multi-node. TRL is single-process; multi-host scaling is
	via Decoupled DiLoCo (Recipe 4) on top, not via TRL itself.
	- vLLM weight sync is co-located — no resharding between FSDP and TP.
	At 70B+ this becomes the bottleneck and you should move to Recipe 2.
	- `reward_funcs` must be Python callables that return `list[float]`;
	shell-out reward graders need a wrapper.

	---

	## Recipe 2 — VeRL custom `adv_estimator` + DataProto extension

	### 1. When to use it

	Pick VeRL when:

	- You need >70B-param scale or >32-GPU multi-host, and a Ray cluster
	is acceptable in your stack.
	- You're already using or willing to adopt 3D-HybridEngine for
	efficient FSDP↔TP weight resharding (verified ~5× weight-sync speed-up
	vs co-located vLLM at 70B+).
	- You need async multi-turn rollouts where tool-call latency must not
	block the GPU loop. VeRL's `AsyncServer` + `AgentLoop` is the
	best-in-class option here.
	- You want extension points the framework's authors expect third
	parties to use — the `@register_adv_est("...")` decorator and the
	`DataProto` extension contract are first-class APIs.

	Don't pick VeRL if you're <7B-param or single-host (overkill —
	Recipe 1's Trainer subclass is one file, not a Ray cluster).

	### 2. Install command

	```bash
	pip install -e ".[replaysim]"
	pip install verl # not packaged as an extra; pinned at >=0.3
	# Optional, for the Composer adapter:
	pip install -e ".[serverless]" # for Decoupled DiLoCo on top
	```

	The framework's verl adapter lives at
	`composer_replication.recipes.verl` (currently shape-only — see
	[Limitations](#7-known-limitations-as-of-wave-14-2) below).

	### 3. Minimum-viable Python script

	VeRL's actual entry point is a Hydra/YAML config + `verl.trainer.main_ppo`
	CLI; the pythonic surface looks like this:

	```python
	# train_verl.py — minimum viable Recipe 2 sketch
	from verl.trainer.ppo import core_algos
	from verl.trainer.ppo.ray_trainer import RayPPOTrainer
	from composer_replication.loss import compose_loss

	@core_algos.register_adv_est("grpo_composer")
	def composer_advantage(data, **kwargs):
	"""Custom adv-estimator that adds SDPO + DPO channels to GRPO.

	Reads three extra DataProto keys (populated by the data prep step):
	- data.batch["sdpo_teacher_logits"] (channel 2)
	- data.non_tensor_batch["teacher_actions"] (channel 3)
	and returns the standard (advantages, returns) tuple plus a stashed
	composer-loss term consumed by the critic worker.
	"""
	advantages, returns = core_algos.compute_grpo_outcome_advantage(data, **kwargs)
	composer_term = compose_loss(
	model = kwargs["actor_module"],
	inputs = data.batch,
	alpha_sdpo = 0.1,
	beta_replay = 0.05,
	dpo_variant = "dpo",
	sdpo_wrapper = "none",
	)
	data.meta_info["composer_loss"] = composer_term
	return advantages, returns

	# Then in your YAML:
	# algorithm:
	# adv_estimator: grpo_composer
	# and run: python -m verl.trainer.main_ppo --config-name composer_grpo
	```

	The full driver wires `RayPPOTrainer` against your config; consult VeRL's
	own quickstart for the Ray-cluster boilerplate. The composer-specific
	piece is just the registered estimator above.

	### 4. Decoupled DiLoCo wiring

	VeRL's actor workers run in Ray; DiLoCo replicates the whole VeRL job.
	Each "replica" is one Ray cluster running Recipe 2 end-to-end; the outer
	loop is independent of Ray and just exchanges pseudo-gradients via the
	object store between Ray-job invocations:

	```python
	from composer_replication.diloco.serverless import (
	LocalProcessExecutor, ObjectStoreAllReduce,
	)

	rendezvous = ObjectStoreAllReduce(
	uri = "s3://verl-diloco/run/",
	world_size = 4,
	)
	executor = LocalProcessExecutor() # Wave 14: ModalExecutor is a skeleton (raises NotImplementedError) — keep LocalProcessExecutor for now
	handles = executor.launch_replicas(
	n_replicas = 4,
	entrypoint = "verl.trainer.main_ppo",
	entrypoint_args = {
	"+algorithm.adv_estimator": "grpo_composer",
	"+algorithm.diloco.rendezvous": rendezvous.uri,
	"+algorithm.diloco.sync_every_h": 500,
	},
	)
	executor.collect(handles, timeout=24 * 3600)
	```

	The Ray cluster inside each replica handles intra-replica scaling
	(FSDP / TP / vLLM); the object-store exchange handles cross-replica
	sync. Bandwidth is identical to Recipe 1 (~2 GB / 30 min per replica
	for a 7B-param model in bf16) and well within S3 free-tier.

	### 5. Distillation-loss wiring

	The custom `adv_estimator` from step 3 already calls `compose_loss`;
	flip the kwargs there to switch DPO → SimPO or add TAID:

	```python
	composer_term = compose_loss(
	model = kwargs["actor_module"],
	inputs = data.batch,
	alpha_sdpo = 0.1,
	beta_replay = 0.05,
	dpo_variant = "simpo", # ← SimPO swap
	simpo_beta = 2.0,
	simpo_gamma = 1.0,
	sdpo_wrapper = "taid", # ← TAID wrap
	taid_schedule_step = data.meta_info.get("global_step", 0),
	taid_total_steps = 10_000,
	)
	```

	VeRL's `data.meta_info` carries the global step automatically, which is
	exactly what TAID's interpolation schedule needs. Channel 2 batches
	without `student_init_logits` / `student_init_input_ids` are auto-skipped
	(returns 0 for that step).

	### 6. Cost ballpark

	- GPU: 8× H100 (`p5.48xlarge` ~$98/hr on AWS, ~$25/hr on Lambda or
	RunPod community) is the entry point for 70B-class. Expect 32–256
	H100 for full 671B (matches DeepSeek's reported VeRL config).
	- API: same ~$0.98/trace as Recipe 1 (channel 3 is a Python helper,
	not a VeRL primitive — costs are framework-independent).
	- Ray cluster overhead: head node + redis + dashboard adds ~1
	CPU-instance ($0.10–0.50/hr) per cluster, negligible at GPU scale.

	### 7. Known limitations as of Wave 14

	- `composer_replication.recipes.verl` is shape-only. The decorator
	registration and DataProto extension are documented but not yet shipped
	as a runnable adapter — Wave 14 release exposes the contract, not the
	glue. Expect this to land in a v0.2 follow-up spike.
	- Ray dependency. Adds a heavyweight runtime; debugging
	cross-actor crashes can be painful. Use VeRL's `--debug` mode early.
	- Custom-`adv_estimator` LOC: writing your own takes ~50–150 LOC
	including DataProto plumbing. Not a one-liner.
	- No first-class TAID hook in VeRL itself — we route TAID through
	the meta_info channel; this works but means you can't use VeRL's
	built-in checkpoint-replay tooling without re-stamping `taid_schedule_step`
	on each replay.

	---

	## Recipe 3 — PRIME-RL `CustomLossConfig`

	### 1. When to use it

	Pick PRIME-RL when:

	- You're operating in the PRIME-Intellect / decentralized training
	universe and want INTELLECT-style scaling on a long-horizon training
	run.
	- You need DPPO importance-ratio masking (the rationale most users
	arrive with) — PRIME-RL's headline contribution is the
	out-of-band-token mask (not clip) on `log_ratio = trainer_lp -
	inference_lp`, with defaults `low=-4.0, high=4.0`.
	- You want a first-class custom-loss surface: PRIME-RL ships
	`CustomLossConfig` that takes an importable Python function and a
	`LossInputs` struct exposing exactly the tensors we need
	(`trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`,
	`advantages`, `loss_mask`). No fork, no Trainer subclass, no monkey-patch.
	- You have access to multi-node infrastructure that PRIME-RL's
	trainer/inference/orchestrator split is designed for.

	Don't pick PRIME-RL if you need full vocab logits (channel 2 SDPO
	requires logits not log-probs — see Limitations).

	### 2. Install command

	```bash
	pip install -e ".[prime-rl,replaysim]"
	# pulls prime-rl>=0.5
	```

	### 3. Minimum-viable Python script

	PRIME-RL drives via YAML config; the only Python you write is the
	custom-loss function (already shipped at
	`composer_replication/recipes/prime_rl/composer_loss.py`). Wire it in:

	```yaml
	# prime_rl_config.yaml — point at the framework's adapter
	loss:
	custom:
	import_path: composer_replication.recipes.prime_rl.composer_loss:loss_fn
	kwargs:
	alpha_sdpo: 0.0 # channel 2 deferred in v0 (see below)
	beta_dpo: 0.0 # channel 3 emits a warning if non-zero
	dppo_mask_high: 4.0 # PRIME-RL DPPO mask bounds
	dppo_mask_low: -4.0
	epsilon: 1.0e-6

	trainer:
	model: Qwen/Qwen2.5-7B-Instruct
	... # standard PRIME-RL fields
	```

	The shipped `loss_fn` signature is fixed by PRIME-RL's contract:

	```python
	def loss_fn(
	inputs: LossInputs,
	*,
	alpha_sdpo: float = 0.0,
	beta_dpo: float = 0.0,
	dppo_mask_high: float = 4.0,
	dppo_mask_low: float = -4.0,
	epsilon: float = 1e-6,
	) -> torch.Tensor:
	log_ratio = inputs.trainer_logprobs - inputs.inference_logprobs
	dppo_invalid = (log_ratio > dppo_mask_high) \| (log_ratio < dppo_mask_low)
	keep_mask = inputs.loss_mask & ~dppo_invalid
	grpo = -(inputs.advantages * inputs.trainer_logprobs * keep_mask).sum() \
	/ keep_mask.sum().clamp_min(epsilon)
	if alpha_sdpo != 0.0:
	raise NotImplementedError(
	"Channel 2 SDPO requires full-vocab logits; PRIME-RL v0.5 "
	"exposes only log-probs. Deferred to v0.2."
	)
	if beta_dpo != 0.0:
	import warnings; warnings.warn(
	"Channel 3 trace-replay DPO is out-of-scope for PRIME-RL recipe v0",
	stacklevel=2,
	)
	return grpo
	```

	Shape note (caught in the Wave 13 cross-model review): PRIME-RL
	calls the loss function once per sample; tensors are 1-D `(seq,)`,
	not batched `(B, T)`. The 10 unit tests in
	`composer_replication/recipes/prime_rl/tests/test_composer_loss.py`
	cover this plus DPPO mask edges.

	### 4. Decoupled DiLoCo wiring

	PRIME-RL was designed for decentralized training and ships its own
	weight-sync primitives. Stack DiLoCo on top via the
	`ServerlessExecutor` Protocol — each replica runs an independent
	PRIME-RL job pointing at the same `composer_loss:loss_fn`:

	```python
	from composer_replication.diloco.serverless import (
	LocalProcessExecutor, ObjectStoreAllReduce,
	)

	rendezvous = ObjectStoreAllReduce(
	uri = "s3://prime-rl-diloco/run/",
	world_size = 4,
	)
	# Wave 14: ModalExecutor is a skeleton (raises NotImplementedError until v0.x).
	# Use LocalProcessExecutor for the inner-replica wiring; swap to the cloud
	# executor once it lands. The DiLoCo + rendezvous code below is identical.
	executor = LocalProcessExecutor()
	handles = executor.launch_replicas(
	n_replicas = 4,
	entrypoint = "prime_rl.cli:main",
	entrypoint_args = {
	"config": "prime_rl_config.yaml",
	"+diloco.rendezvous": rendezvous.uri,
	"+diloco.sync_every_h": 500,
	},
	)
	executor.collect(handles, timeout=24 * 3600)
	```

	Note PRIME-RL's own multi-node story (the trainer / inference /
	orchestrator split) is orthogonal to Decoupled DiLoCo: PRIME-RL
	multi-node = single replica scaled across many GPUs; DiLoCo = N
	independent replicas synchronizing via object store. Combine both for
	"big PRIME-RL job × N replicas".

	### 5. Distillation-loss wiring

	Channel 2 (SDPO + TAID + Entropy-OPD) is deferred in v0 because
	PRIME-RL's `LossInputs` exposes log-probs not full vocab logits. The
	SimPO swap on channel 3 is also gated by the same shape constraint, but
	DPPO-clip itself doesn't change. To get TAID/SimPO into a PRIME-RL job
	today you must:

	1. Switch to Recipe 1 or 2 for the SFT/distill phase.
	2. Use PRIME-RL only for the on-policy GRPO+DPPO phase.

	The v0.2 plan (per ADR-007) is to extend `LossInputs` with a
	`teacher_logits` field; the loss adapter is already shape-ready.

	### 6. Cost ballpark

	- GPU: similar profile to Recipe 2 — 8–32 H100 typical, scales to
	hundreds for INTELLECT-class runs. Lambda Cloud or RunPod community
	H100 community pricing (~$2–4/hr per H100) is most cost-effective.
	- API: channel 3 is gated, so the only OpenRouter spend is from the
	offline data-prep spike (using the verifier harness in Recipe 1 to
	pre-bake DPO pairs), not from the training loop itself. Order of
	magnitude: $50–500 for a curriculum-bake one-time, then $0/run.
	- Network: PRIME-RL's own decentralized weight sync uses substantial
	bandwidth between training replicas (one of its design constraints);
	this is separate from the Decoupled DiLoCo bandwidth and shows up
	as a ceiling on cross-region replica placement.

	### 7. Known limitations as of Wave 14

	- Channel 2 deferred — see step 5. `alpha_sdpo > 0` raises
	`NotImplementedError`.
	- Channel 3 emits a warning if `beta_dpo != 0`; trace-replay DPO
	pairs must be folded into the training data (offline) rather than
	the loss (online) until v0.2.
	- PRIME-RL ≥ 0.5 required. Earlier versions don't ship
	`CustomLossConfig`.
	- Smoke test deferred. Per `prime_rl_recipe.md`, the runtime smoke
	test requires a CUDA box + `prime-rl >= 0.5` install and is gated
	to a follow-up spike. The 10 unit tests run cleanly without GPU.
	- DPPO defaults are PRIME-RL's, not ours. We pin `low=-4.0,
	high=4.0` to match. If you change them, you're now diverging from
	PRIME-RL's example configs.

	---

	## Recipe 4 — Serverless Decoupled DiLoCo

	### 1. When to use it

	Pick Decoupled DiLoCo when:

	- You have N independent training replicas that should sync
	occasionally but can't (or shouldn't) cross-talk on every step.
	- The cost or operational burden of an always-on multi-node cluster is
	unacceptable, but you're happy paying for 4× independent **serverless
	jobs**.
	- Your inner trainer is one of Recipes 1–3 — DiLoCo wraps any inner
	optimizer; it's purely outer-loop.
	- You need failure isolation: if one replica crashes, the others
	keep training; on restart it picks up from the last outer round.

	DiLoCo's design rests on two abstractions (per ADR-005):

	1. `ServerlessExecutor` Protocol — uniform interface for spinning up
	N replicas across cloud backends (Modal / HF Jobs / SageMaker / k8s).
	2. `ObjectStoreAllReduce` — fsspec-backed pseudo-gradient exchange
	that replaces the in-process `torchft.Manager.allreduce` call.

	The communication pattern is `S3 PutObject + N GetObjects` once per
	inner-H steps, matching DiLoCo paper §3.2 (arXiv:2311.08105). For
	1B-param bf16 that's ~2 GB / 30 min per replica — well within S3
	free-tier.

	### 2. Install command

	```bash
	pip install -e ".[diloco,serverless]"
	# also one of the inner-trainer extras:
	pip install -e ".[train]" # if the inner trainer is Recipe 1
	# OR pip install verl # if the inner trainer is Recipe 2
	# OR pip install -e ".[prime-rl]" # if the inner trainer is Recipe 3
	```

	### 3. Minimum-viable Python script

	This pattern is independent of the inner trainer — pick any of Recipes
	1/2/3 and wrap it with a `ServerlessExecutor`. The replica entrypoint
	runs the inner trainer; the driver launches N of them and waits.

	```python
	# diloco_driver.py — driver that launches N replicas
	from composer_replication.diloco.serverless import (
	LocalProcessExecutor, # for dev — runs replicas as local subprocesses
	ObjectStoreAllReduce,
	)

	rendezvous = ObjectStoreAllReduce(
	uri = "s3://my-bucket/diloco-runs/run42/", # or file:// for local
	world_size = 4,
	)
	executor = LocalProcessExecutor() # Wave 14: ModalExecutor skeleton raises NotImplementedError; swap once cloud backend lands
	handles = executor.launch_replicas(
	n_replicas = 4,
	entrypoint = "diloco_replica.py", # (script below)
	entrypoint_args = {
	"rendezvous": rendezvous.uri,
	"rank_env": "REPLICA_RANK",
	},
	)
	result = executor.collect(handles, timeout=3600)
	print({h.replica_id: h.exit_code for h in result})
	```

	```python
	# diloco_replica.py — runs inside each replica
	import os
	from composer_replication.diloco import make_diloco_outer_loop
	from composer_replication.diloco.serverless import (
	ObjectStoreAllReduce, MockManager,
	)

	# Build inner trainer (Recipe 1 example):
	from train_trl import trainer

	rendezvous = ObjectStoreAllReduce(
	uri = os.environ["DILOCO_RENDEZVOUS"],
	world_size = 4,
	rank = int(os.environ["REPLICA_RANK"]),
	)
	manager = MockManager(allreduce=rendezvous)
	outer = make_diloco_outer_loop(
	inner_optimizer = trainer.optimizer,
	manager = manager,
	sync_every_h = 500,
	)
	trainer.add_callback(outer.callback())
	trainer.train()
	```

	### 4. Decoupled DiLoCo wiring

	This recipe is the DiLoCo wiring — see step 3. The available
	executor adapters are:

	\| Executor \| Status \| Use case \|
	\|---------------------------\|-------------------------------\|--------------------------------------\|
	\| `LocalProcessExecutor` \| Production-ready \| Dev loop — N subprocesses on one box \|
	\| `ModalExecutor` \| Skeleton (modal-client gated) \| Modal cloud, $/sec billing \|
	\| `HFJobsExecutor` \| Skeleton (hf-hub gated) \| HuggingFace Jobs, transformer-shop \|
	\| `SageMakerExecutor` \| Roadmap (post-v0.2) \| AWS, warm-pool ~10s cold start \|
	\| `K8sExecutor` \| Roadmap \| KubeRay / Volcano gang scheduling \|

	Cross-cloud replica placement (e.g. 2× Modal + 2× HF Jobs) is supported
	in principle — they all read/write the same S3 / GCS / HF rendezvous —
	but treat as experimental.

	### 5. Distillation-loss wiring

	DiLoCo is loss-agnostic — it operates purely on inner-optimizer state.
	Whichever inner trainer you're running (Recipe 1, 2, or 3) handles
	distillation kwargs as documented in that recipe's step 5. The only
	DiLoCo-specific knob worth knowing: TAID's `taid_schedule_step` is a
	global counter, but each replica increments it independently. If you
	care about replicas all reading the same α at outer-sync time, set
	`taid_schedule_step = trainer.state.global_step + replica_offset` and
	let the outer-loop sync average them out.

	### 6. Cost ballpark

	Pulled from
	[`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md`](research/DILOCO_SERVERLESS_RECONNAISSANCE.md):

	\| Backend \| A100-80GB $/hr \| H100 $/hr \| Cold-start \| Notes \|
	\|---------------\|----------------\|-----------\|------------\|------------------------------------------\|
	\| Modal \| $1.39/sec → 4× ≈ $20/hr per A100 \| ~$8/hr per H100 \| 1–60s warm, 60–120s first-run \| $/sec billing; no minimum \|
	\| AWS SageMaker \| $4.10/A100·hr \| $12.29/hr \| 2–5 min cold, ~10s warm pool \| Min 60min on warm pool \|
	\| GCP Vertex \| $3.67/A100·hr \| $11/hr \| 2–6 min cold \| 30–50% premium over raw GPU \|
	\| Azure ML \| ~$3.67/A100·hr \| ~$12.25/hr \| 3–8 min cold \| Use curated env to cut cold-start \|
	\| RunPod \| $1.19/hr (community), $2.17 (secure) \| $1.99/hr (community), $4.18 (secure) \| seconds \| No federation; same-DC only \|
	\| HF Jobs \| comparable to Modal \| ~$8–12/hr \| 30–90s \| Best DX for HF-shop \|

	Object-store cost. ~$0.02/GB-month for S3 standard, ~$0/free-tier.
	Pseudo-gradients are ~2 GB per replica per outer round; for a 24-hour
	4-replica run at H=500 that's ~50 outer rounds × 2 GB × 4 replicas = ~400
	GB written. Free-tier blows through fast — budget $10–20 in storage.

	### 7. Known limitations as of Wave 14

	- `ModalExecutor` and `HFJobsExecutor` are skeletons. They check
	`import modal` / `import huggingface_hub` at adapter init time and
	raise; the actual `launch_replicas` is shape-only until the relevant
	spike lands. Use `LocalProcessExecutor` for dev.
	- `ObjectStoreAllReduce(world_size=1)` must passthrough cleanly —
	the unit test `test_object_store_allreduce_world_size_1_passthrough`
	is the regression guard. Don't override unless you've read it.
	- Rank validation is mandatory. Tests assert
	`ObjectStoreAllReduce(rank=N, world_size=N)` raises (rank must be
	`< world_size`); silent corruption otherwise.
	- *`MockManager` is not* feature-complete.** It implements the
	`Manager.allreduce` surface that DiLoCo's outer-loop needs, but
	not the full `torchft.Manager` API (no fault-tolerance, no
	membership protocol). Don't use it as a drop-in for live torchft.
	- No native heterogeneous compute — all replicas are assumed to
	have the same compute shape. Mixed A100+H100 placements work but
	the slow replica gates outer-loop progress.

	---

	## Recipe 5 — Monarch actor mesh

	### 1. When to use it

	Pick Monarch when:

	- You're at TorchForge-style topology scale: trainer / generator /
	rewarder / N-teachers all want to be independent, asynchronously
	scheduled, fault-tolerant actors on a typed mesh.
	- You want heterogeneous executor support — different actors run
	in different clouds (e.g. `TrainerActor` on Modal A100s,
	`GeneratorActor` on dedicated H100s, `TeacherPoolActor` as 0-GPU CPU
	pods on k8s).
	- You need hot-swap of actor implementations — replace
	"OpenRouter teachers" with "local vLLM teachers" by changing one
	Monarch binding, no trainer code change.
	- You're prepared to track upstream Monarch (v0.4.1 stable, v0.5
	dev daily); the API is moving and v0 of this recipe is intentionally
	deferred per ADR-006.

	Don't pick Monarch in Wave 14 unless you're explicitly scoping a
	v0.2+ pilot. The framework ships skeleton actors that fail-fast on
	instantiation; this is a reference-pattern reading exercise, not a
	production target.

	### 2. Install command

	```bash
	pip install -e ".[prime-rl,monarch]"
	# pulls monarch>=0.4.1 plus the PRIME-RL trainer used inside actors
	```

	### 3. Minimum-viable Python script

	The framework ships skeleton actor definitions at
	`composer_replication/recipes/monarch/actors.py`; they raise
	`NotImplementedError` on instantiation in Wave 14. The shape of the
	final answer:

	```python
	# monarch_train.py — what v0.2+ usage will look like
	from monarch import Actor, mesh, endpoint
	from composer_replication.recipes.monarch.actors import (
	TrainerActor, GeneratorActor, RewarderActor, TeacherPoolActor,
	)

	# Topology
	trainers = mesh.spawn(TrainerActor, n=4, gpu="A100")
	generator = mesh.spawn(GeneratorActor, n=1, gpu="A100")
	rewarder = mesh.spawn(RewarderActor, n=1, gpu=None)
	teachers = mesh.spawn(TeacherPoolActor, n=1, gpu=None)

	# Wire endpoints
	async def outer_step(batch_id: int):
	prompts = await trainers[0].sample_prompts.call(batch_id)
	rollouts = await generator.rollout.call(prompts)
	rewards = await rewarder.score.call(rollouts)
	teacher_acts = await teachers.replay.call([
	{"state": r["state"]} for r in rollouts
	])
	await trainers.train_outer_step.call(
	batch_id, rollouts=rollouts, rewards=rewards,
	teacher_actions=teacher_acts,
	)

	# Run
	import asyncio
	for batch_id in range(1000):
	asyncio.run(outer_step(batch_id))
	```

	The Composer 3-channel loss lives inside `TrainerActor.train_outer_step`,
	which calls `compose_loss(...)` exactly as Recipe 1 does. The
	orchestration changes; the loss math doesn't.

	### 4. Decoupled DiLoCo wiring

	Monarch + Decoupled DiLoCo compose naturally: each `TrainerActor` is a
	DiLoCo replica, and Monarch's supervision tree handles the failure
	recovery that ADR-005 lists as a DiLoCo design constraint. The wire-up
	is identical to Recipe 4's `LocalProcessExecutor` pattern, just running
	inside Monarch instead of `subprocess`:

	```python
	from composer_replication.diloco.serverless import (
	ObjectStoreAllReduce, MockManager,
	)

	class TrainerActor(Actor):
	def __init__(self, rendezvous_uri: str, rank: int, world_size: int):
	self.rendezvous = ObjectStoreAllReduce(
	uri=rendezvous_uri, rank=rank, world_size=world_size,
	)
	self.manager = MockManager(allreduce=self.rendezvous)
	# ... build inner ComposerReplicationTrainer ...

	@endpoint
	async def train_outer_step(self, batch_id: int, **kw):
	# Inner H steps locally, then sync via self.rendezvous
	...
	```

	The "object store" is the cross-actor synchronization point that
	doesn't go through Monarch's RDMA data plane — by design, slow
	syncs (S3) and fast syncs (RDMA for in-actor weight broadcast) live on
	different planes.

	### 5. Distillation-loss wiring

	Monarch sees the loss as opaque: it lives inside `TrainerActor` and
	takes the same `compose_loss` kwargs as Recipe 1. The mesh-level
	benefit is swap-by-binding: you can replace `TeacherPoolActor`
	("OpenRouter") with a `LocalVLLMTeacherActor` to switch the
	supplier of teacher log-probs without touching the loss config.

	```python
	# Original binding — channel 3 via OpenRouter
	teachers = mesh.spawn(TeacherPoolActor, n=1, gpu=None)

	# Swap binding — channel 3 via local vLLM
	teachers = mesh.spawn(LocalVLLMTeacherActor, n=1, gpu="A100",
	model_id="Qwen/Qwen2.5-72B-Instruct")

	# Trainer config unchanged:
	trainer.compose_loss_kwargs = dict(
	dpo_variant = "simpo", # same as before
	sdpo_wrapper = "taid",
	taid_schedule_step = batch_id,
	taid_total_steps = 10_000,
	)
	```

	### 6. Cost ballpark

	In Wave 14: $0 (skeleton fails fast; no compute used). Projected for v0.2+:

	- Mesh overhead: Monarch's coordination plane is light — typically
	<1% of total compute even at 4-actor scale. The dominant cost is
	whatever the actors run.
	- Heterogeneous placement is the cost lever: e.g. a 4-trainer mesh
	with `TeacherPoolActor` on 0-GPU CPU pods can cut total $/hr by
	~10–20% vs forcing all actors onto GPU nodes.
	- Cluster bring-up: Monarch v0.5's Slurm backend is stable; k8s
	backend is dev-track; bare-metal SSH backend is documented.

	### 7. Known limitations as of Wave 14

	- Skeleton only, fails fast. Importing `actors.py` is fine;
	instantiating `TrainerActor(...)` raises `NotImplementedError("v0
	skeleton; deferred to v0.2 per ADR-006")`. By design.
	- Upstream Monarch API is moving. v0.4.1 stable + v0.5 dev daily
	means breaking changes are expected. Pin to a Monarch hash if you
	prototype.
	- TorchForge is paused. Per its own repo banner — don't take
	TorchForge's recipes as production patterns. Monarch alone is
	active; Forge as a layered framework is reference reading.
	- Open question (deferred): does Monarch v0.5's Slurm backend
	hand-shake cleanly with HF Jobs lifecycle? See
	`monarch_actor_layout.md` for the open-questions list.
	- Open question (deferred): can `TrainerActor` host
	`ComposerReplicationTrainer` unmodified, or does it need a
	`step_init` / `step_compute` split for Monarch's async actor model?

	---

	## Comparison matrix

	\| Dimension \| Recipe 1 — TRL \| Recipe 2 — VeRL \| Recipe 3 — PRIME-RL \| Recipe 4 — Serverless DiLoCo \| Recipe 5 — Monarch \|
	\|------------------------------------\|-----------------------------\|----------------------------------\|-----------------------------------\|------------------------------------\|-------------------------------------\|
	\| Maturity (Wave 14) \| Production-ready \| Production-ready (adapter shape-only) \| Recipe ready, runtime smoke deferred \| `LocalProcessExecutor` ready; cloud adapters skeleton \| Skeleton only; v0.2+ scope \|
	\| Supports DAPO / GRPO \| GRPO ✅; DAPO via TRL master \| GRPO ✅; DAPO ✅ (built-in) \| GRPO+DPPO ✅ (DAPO mask is the headline) \| Inherits from inner trainer \| Inherits from inner trainer \|
	\| Custom-loss extension cost (LOC) \| ~30 LOC (subclass override) \| ~50–150 LOC (registered estimator) \| ~20 LOC (single Python fn) \| 0 (transparent wrapper) \| ~30 LOC (loss inside actor) \|
	\| OpenEnv-compatible \| ✅ (HF datasets layer) \| ✅ (DataProto extension) \| ✅ (rollout JSONL contract) \| ✅ (orthogonal) \| ✅ (RewarderActor binding) \|
	\| Native multi-node \| ❌ (single-host FSDP only) \| ✅ (Ray cluster + 3D-HybridEngine) \| ✅ (trainer/inference/orchestrator split) \| ✅ (the whole point) \| ✅ (mesh of actors) \|
	\| Native Decoupled DiLoCo \| ❌ — wrap with Recipe 4 \| ❌ — wrap with Recipe 4 \| ❌ — wrap with Recipe 4 \| ✅ (this is it) \| ✅ (compose with Recipe 4 inside actor) \|
	\| License \| Apache 2.0 (TRL) \| Apache 2.0 (VeRL) \| Apache 2.0 (PRIME-RL) \| Apache 2.0 (this repo) \| BSD-3 (Monarch) \|
	\| Our recommendation (Wave 14) \| Default for ≤ 70B / single-host \| Pick at >70B if Ray is acceptable \| Pick if PRIME-Intellect / DPPO mask is required \| Stack on top of 1/2/3 for N replicas \| Reference pattern only — revisit v0.2 \|

	---

	## Cross-recipe checklist

	Regardless of which recipe you pick, these invariants are tested across
	the 115-test suite (post-Wave-15) and should be true of your wired-up system:

	- `alpha_sdpo=0` must reproduce the channel-1-only baseline
	bit-exact (`test_compose_loss_integration.py`).
	- `beta_replay=0` must reproduce the no-channel-3 baseline
	bit-exact.
	- `sdpo_wrapper="taid"` without `taid_schedule_step` must `ValueError`
	at first step (`test_compose_loss_integration.py`).
	- `sdpo_wrapper="taid"` at `taid_schedule_step / taid_total_steps = 0`
	must ignore the teacher signal (`test_taid_loss_alpha_zero_ignores_teacher`).
	- `sdpo_wrapper="taid"` at `taid_schedule_step / taid_total_steps = 1`
	must equal plain SDPO (`test_taid_blended_logits_endpoints`).
	- `dpo_variant="simpo"` must be differentiable through the
	`loss-of-sigmoid` path (`test_simpo_loss_differentiable`).
	- `sdpo_wrapper="entropy_opd"` must zero out when student ≡ teacher
	(`test_entropy_aware_opd_zero_when_distributions_match`).
	- `ObjectStoreAllReduce(world_size=1)` must passthrough cleanly
	(`test_object_store_allreduce_world_size_1_passthrough`).

	If any of these fail in your wired-up system, run the corresponding
	unit test to localize: most break because a kwarg got dropped at the
	adapter boundary, not because the loss math is wrong.

	---

	## Picking a recipe — decision flow

	1. Piloting Monarch (v0.2+)? → Recipe 5.
	2. Else, need >70B / multi-host? → Recipe 2 (VeRL) if Ray is OK,
	Recipe 3 (PRIME-RL) if you're in the PRIME-Intellect / DPPO universe,
	otherwise wait for Recipe 5.
	3. Else → Recipe 1 (TRL) is the v0.0/v0.1 default.
	4. At any of 1–3, need N independent replicas / failure isolation?
	→ Stack Recipe 4 (Decoupled DiLoCo) on top.

	---

	## Pointers to source

	- Loss core: [`composer_replication/loss.py`](../composer_replication/loss.py)
	- TRL trainer: [`composer_replication/trainer/composer_trainer.py`](../composer_replication/trainer/composer_trainer.py)
	- PRIME-RL adapter:
	[`composer_replication/recipes/prime_rl/composer_loss.py`](../composer_replication/recipes/prime_rl/composer_loss.py),
	recipe doc:
	[`composer_replication/recipes/prime_rl/prime_rl_recipe.md`](../composer_replication/recipes/prime_rl/prime_rl_recipe.md)
	- Monarch skeleton:
	[`composer_replication/recipes/monarch/actors.py`](../composer_replication/recipes/monarch/actors.py),
	layout doc:
	[`composer_replication/recipes/monarch/monarch_actor_layout.md`](../composer_replication/recipes/monarch/monarch_actor_layout.md)
	- Serverless DiLoCo:
	[`composer_replication/diloco/serverless/`](../composer_replication/diloco/serverless/)
	- VeRL adapter (shape-only): `composer_replication/recipes/verl/`
	- ADRs:
	[`docs/adrs/ADR-005-serverless-diloco.md`](adrs/ADR-005-serverless-diloco.md),
	[`docs/adrs/ADR-006-rl-frameworks.md`](adrs/ADR-006-rl-frameworks.md),
	[`docs/adrs/ADR-007-distillation-losses.md`](adrs/ADR-007-distillation-losses.md)

	---

	File path: `/mnt/e/CS/HF/composer-replication-framework/docs/INTEGRATION_RECIPES.md`