# INTEGRATION_RECIPES.md — Wiring the 3-channel composer loss into your RL stack

> **Status:** Wave 14 release reference. Supersedes the historical
> [`docs/INTEGRATION_ARCHITECTURE.md`](INTEGRATION_ARCHITECTURE.md) (Recipes
> A–D), which is retained as background reading for the original
> mechanism-level diagrams.
>
> **Companion docs:**
> - [`docs/USER_GUIDE.md`](USER_GUIDE.md) — narrative walk-through, sections 1–8
> - [`docs/API_REFERENCE.md`](API_REFERENCE.md) — exact kwarg signatures
> - [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md) — error → fix index
> - [`docs/V3_SUBSTRATE_COVERAGE.md`](V3_SUBSTRATE_COVERAGE.md) — what each
>   substrate covers
> - [`docs/adrs/ADR-006-rl-frameworks.md`](adrs/ADR-006-rl-frameworks.md) —
>   why these five recipes and not others

This document is the canonical answer to **"how do I plug the 3-channel
composer loss into framework X?"** for the five frameworks the project
supports as of Wave 14:

1. [TRL `GRPOTrainer` subclass](#recipe-1--trl-grpotrainer-subclass)
2. [VeRL custom `adv_estimator` + DataProto extension](#recipe-2--verl-custom-adv_estimator--dataproto-extension)
3. [PRIME-RL custom-loss config](#recipe-3--prime-rl-customlossconfig)
4. [Serverless Decoupled DiLoCo (Modal / HF Jobs / SageMaker)](#recipe-4--serverless-decoupled-diloco)
5. [Monarch actor mesh (TorchForge-style topology)](#recipe-5--monarch-actor-mesh)

Each recipe follows the same seven-part template:

1. **When to use it** — decision criteria.
2. **Install command** — which optional extras of `composer-replication`.
3. **Minimum-viable Python script** — copy-pasteable, ≤ 60 lines.
4. **Decoupled DiLoCo wiring** — how `ServerlessExecutor` +
   `ObjectStoreAllReduce` + `MockManager` layer on top.
5. **Distillation-loss wiring** — how to switch DPO → SimPO and add TAID
   via `compose_loss(..., dpo_variant=..., sdpo_wrapper=...)` or the
   recipe's own loss-config field.
6. **Cost ballpark** — GPU $/hr + API spend, sourced from
   [`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md`](research/DILOCO_SERVERLESS_RECONNAISSANCE.md).
7. **Known limitations as of Wave 14**.

A cross-recipe [comparison matrix](#comparison-matrix) closes the doc.

## TL;DR — the unified loss

For any of the five recipes, the v0.1 trainer step computes:

```
total_loss = grpo_loss
           + α * sdpo_kl_loss        (channel 2 — Composer hint-distill;
                                      optional TAID or Entropy-OPD wrapper)
           + β * trace_replay_loss   (channel 3 — N-teacher DPO;
                                      switchable to SimPO)
```

This is implemented once, in
[`composer_replication/loss.py::compose_loss`](../composer_replication/loss.py),
and re-used by every recipe via the kwargs documented in
[`API_REFERENCE.md`](API_REFERENCE.md). The single source of truth for the
full signature, including all ADR-007 channel-2/3 knobs (`dpo_variant`,
`sdpo_wrapper`, `taid_t`, `simpo_beta`/`simpo_gamma`, `entropy_opd_h_max`, …),
is [API_REFERENCE.md § `compose_loss`](API_REFERENCE.md#compose_loss).
The conceptual call shape is just:

```python
compose_loss(model, inputs, **kwargs)  # see API_REFERENCE.md#compose_loss for full signature
```

All five recipes below either call `compose_loss` directly or call a
thin per-framework adapter that forwards these kwargs unchanged. Each
recipe's **§5 Distillation-loss wiring** documents the kwargs *that
recipe* uses by default and why; refer back to API_REFERENCE.md for
defaults, types, and which kwargs are mutually exclusive.

---

## Recipe 1 — TRL `GRPOTrainer` subclass

### 1. When to use it

This is the **default v0.0/v0.1 path** and the one we recommend for
~99% of users today. Pick TRL when:

- Your model fits on ≤ 32 GPUs (typically ≤ 70B-param FSDP).
- You already have a HuggingFace `model` + `tokenizer` + `datasets` flow.
- You want minimum integration cost — `ComposerReplicationTrainer` is a
  single subclass override of `_compute_loss` over `trl.GRPOTrainer`,
  no Ray, no actor mesh.
- You're doing single-host (one node, possibly multi-GPU FSDP) training.

Don't pick TRL when you need >70B-param scale, when you must async-decouple
tool calls from the GPU loop, or when a Ray cluster is already in your stack
(in which case Recipe 2 is cheaper).

### 2. Install command

```bash
pip install -e ".[train,replaysim]"
```

The `train` extra pulls `trl>=0.12`, `peft`, `accelerate`, and `datasets`.
The `replaysim` extra pulls `data-juicer` for CPU-side DPO normalization
(channel 3 cleaning step). Add `[serverless]` if you also want Decoupled
DiLoCo (see step 4).

### 3. Minimum-viable Python script

```python
# train_trl.py — minimum viable Recipe 1
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from composer_replication import ComposerReplicationTrainer

MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"  # swap for 7B once it works
model     = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
dataset   = load_dataset("trl-lib/tldr", split="train[:512]")

def reward_length(completions, **_):
    return [float(-abs(len(c) - 64)) for c in completions]  # list[float], as TRL expects

trainer = ComposerReplicationTrainer(
    model         = model,
    processing_class = tokenizer,
    reward_funcs  = [reward_length],
    train_dataset = dataset,
    # Composer extras (defaults shown):
    alpha_sdpo       = 0.1,
    beta_replay      = 0.05,
    sdpo_jsd_beta    = 0.5,
    sdpo_temperature = 1.0,
    sdpo_token_clip  = None,
    replay_dpo_beta  = 0.1,
)
if __name__ == "__main__":  # lets diloco_replica.py import `trainer` without starting a run
    trainer.train()
```

Channels 2 and 3 **auto-disable per step** when their inputs aren't
present in the batch (e.g. batches with no error sites get
`sdpo_kl=0`). Set `alpha_sdpo=0` / `beta_replay=0` to disable globally
for ablations.
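The per-step auto-disable rule can be pictured with a small stand-alone sketch. `combine_channels` is a hypothetical helper written for this illustration, not the body of `compose_loss`; it only assumes that a channel with no inputs in the batch is represented as `None`.

```python
from typing import Optional


def combine_channels(
    grpo_loss: float,
    sdpo_kl_loss: Optional[float],
    trace_replay_loss: Optional[float],
    alpha_sdpo: float = 0.1,
    beta_replay: float = 0.05,
) -> float:
    """Weighted sum of the three channels; a missing channel contributes 0."""
    total = grpo_loss
    # Channel 2 is skipped when the batch carries no inputs for it or alpha is 0.
    if sdpo_kl_loss is not None and alpha_sdpo != 0.0:
        total += alpha_sdpo * sdpo_kl_loss
    # Channel 3 follows the same rule with beta.
    if trace_replay_loss is not None and beta_replay != 0.0:
        total += beta_replay * trace_replay_loss
    return total


full = combine_channels(1.0, 2.0, 4.0)                # all three channels active
no_error_sites = combine_channels(1.0, None, 4.0)     # channel 2 absent this step
ablation = combine_channels(1.0, 2.0, 4.0, alpha_sdpo=0.0, beta_replay=0.0)
```

With both weights at zero the result collapses to plain GRPO, which is the ablation baseline.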

### 4. Decoupled DiLoCo wiring

`ComposerReplicationTrainer` is a single-process trainer. To run N
replicas of it under Decoupled DiLoCo, layer the serverless stack on the
outside: each replica runs the script above; `MockManager` stands in for
`torchft.Manager` on the inner loop and `ObjectStoreAllReduce` runs the
outer-loop pseudo-gradient exchange:

```python
# diloco_replica.py — what each of the N replicas runs
import os
from composer_replication.diloco import make_diloco_outer_loop
from composer_replication.diloco.serverless import (
    ObjectStoreAllReduce, MockManager,
)
from train_trl import trainer  # inner trainer from step 3; the import must not start training

rendezvous = ObjectStoreAllReduce(
    uri        = "s3://my-bucket/diloco-runs/run42/",
    world_size = 4,
    rank       = int(os.environ["REPLICA_RANK"]),
)
manager = MockManager(allreduce=rendezvous)
# trainer.optimizer is the *inner* optimizer; the outer is built here:
outer = make_diloco_outer_loop(
    inner_optimizer = trainer.optimizer,
    manager         = manager,
    sync_every_h    = 500,
)
trainer.add_callback(outer.callback())   # syncs every H inner steps
trainer.train()
```

The driver process spins these up with any `ServerlessExecutor`:

```python
# Wave 14: ModalExecutor / HFJobsExecutor are skeletons (raise NotImplementedError);
# use LocalProcessExecutor for testing. Swap once the cloud backends land.
executor = LocalProcessExecutor()
handles  = executor.launch_replicas(
    n_replicas      = 4,
    entrypoint      = "diloco_replica.py",
    entrypoint_args = {"rendezvous": rendezvous.uri,
                       "rank_env":   "REPLICA_RANK"},
)
result = executor.collect(handles, timeout=3600)
```

### 5. Distillation-loss wiring

`ComposerReplicationTrainer` exposes the new ADR-007 channels via the
shared `compose_loss` kwargs — pass them through `**kwargs` on the
trainer and they're forwarded to `compose_loss`:

```python
trainer = ComposerReplicationTrainer(
    model = model, processing_class = tokenizer,
    reward_funcs = [reward_length], train_dataset = dataset,
    # SimPO instead of DPO for channel 3:
    dpo_variant      = "simpo",
    simpo_beta       = 2.0,
    simpo_gamma      = 1.0,
    # TAID for channel 2 (SakanaAI port; logit-space mix + forward-KL):
    sdpo_wrapper       = "taid",
    taid_t             = 0.4,        # current TAID coeff in [0, 1];
                                     # drive from TAIDScheduler if you want
                                     # the paper's adaptive scheme
)
```

Alternatively, swap `taid` for `entropy_opd` if you want per-token
entropy-gated forward/reverse KL instead of the linear-blend
interpolation. SimPO does **not** require reference log-probs: any
`dpo_chosen_ref_logprobs` / `dpo_rejected_ref_logprobs` set on a
channel 3 batch are silently ignored.
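To see why SimPO can drop the reference log-probs, here is a minimal scalar sketch of the two published objectives on summed sequence log-probs. These are the textbook DPO and SimPO formulas, not the project's `compose_loss` code; the function names and arguments are hypothetical, and the defaults simply mirror the values used in this recipe.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def dpo_loss(chosen_lp, rejected_lp, chosen_ref_lp, rejected_ref_lp, beta=0.1):
    """DPO: the margin is measured relative to a frozen reference policy."""
    margin = beta * ((chosen_lp - chosen_ref_lp) - (rejected_lp - rejected_ref_lp))
    return neg_log_sigmoid(margin)


def simpo_loss(chosen_lp, rejected_lp, chosen_len, rejected_len, beta=2.0, gamma=1.0):
    """SimPO: length-normalised policy log-probs and a target margin gamma;
    no reference log-probs appear anywhere in the objective."""
    margin = beta * (chosen_lp / chosen_len - rejected_lp / rejected_len) - gamma
    return neg_log_sigmoid(margin)


dpo = dpo_loss(-10.0, -30.0, chosen_ref_lp=-12.0, rejected_ref_lp=-28.0)
simpo = simpo_loss(-10.0, -30.0, chosen_len=10, rejected_len=10)
```

The SimPO call needs only the policy's own log-probs and the sequence lengths, which is why reference fields on the batch have nothing to feed.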

### 6. Cost ballpark

- **GPU**: single host, `g5.12xlarge` ($5.67/hr) or RunPod 4×A100-80GB
  (~$5–9/hr) gets you Qwen2.5-7B at moderate throughput. For Qwen2.5-72B
  you'll want 2–4× H100; the closest AWS instance is the 8× H100
  `p5.48xlarge` (~$98/hr), versus ~$25–30/hr for a comparable node on
  Lambda Cloud / RunPod community.
- **API**: channel 3 teacher replay via OpenRouter — verified
  ~$0.98/trace at 50 steps × 3 teachers (spike 001). For a 100-trace
  curriculum that's ~$100 in teacher tokens.
- **Storage**: negligible until you turn on DiLoCo (then see Recipe 4).
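
As a rough budgeting aid, the GPU and API figures above can be combined in a few lines. This is a back-of-envelope sketch: `estimate_run_cost` is a hypothetical helper, and the 24-hour run length is an assumption chosen for the example, not a measured number.

```python
def estimate_run_cost(gpu_hourly_usd, gpu_hours, n_traces, usd_per_trace=0.98):
    """Return GPU, API, and total spend in USD for one training run."""
    gpu = gpu_hourly_usd * gpu_hours
    api = n_traces * usd_per_trace  # ~$0.98/trace at 50 steps x 3 teachers
    return {"gpu_usd": gpu, "api_usd": api, "total_usd": gpu + api}


# Assumed scenario: 24 h on one g5.12xlarge with a 100-trace curriculum.
budget = estimate_run_cost(gpu_hourly_usd=5.67, gpu_hours=24, n_traces=100)
```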

### 7. Known limitations as of Wave 14

- **Tool calls block the GPU.** TRL's rollout is synchronous; long
  tool-call latency idles the trainer. Async-decouple via Recipe 2/3/5
  if this matters.
- **No native multi-node.** This recipe runs TRL on a single host; multi-host
  scaling is via Decoupled DiLoCo (Recipe 4) on top, not via TRL itself.
- **vLLM weight sync is co-located** — no resharding between FSDP and TP.
  At 70B+ this becomes the bottleneck and you should move to Recipe 2.
- **`reward_funcs` must be Python callables** that return `list[float]`;
  shell-out reward graders need a wrapper.

---

## Recipe 2 — VeRL custom `adv_estimator` + DataProto extension

### 1. When to use it

Pick VeRL when:

- You need >70B-param scale or >32-GPU multi-host, *and* a Ray cluster
  is acceptable in your stack.
- You're already using or willing to adopt **3D-HybridEngine** for
  efficient FSDP↔TP weight resharding (verified ~5× weight-sync speed-up
  vs co-located vLLM at 70B+).
- You need async multi-turn rollouts where tool-call latency must not
  block the GPU loop. VeRL's `AsyncServer` + `AgentLoop` is the
  best-in-class option here.
- You want extension points the framework's authors *expect* third
  parties to use — the `@register_adv_est("...")` decorator and the
  `DataProto` extension contract are first-class APIs.

Don't pick VeRL if you're <7B-param or single-host (overkill —
Recipe 1's Trainer subclass is one file, not a Ray cluster).

### 2. Install command

```bash
pip install -e ".[replaysim]"
pip install "verl>=0.3"                  # not packaged as an extra; pin it yourself
# Optional, for Decoupled DiLoCo on top:
pip install -e ".[serverless]"
```

The framework's verl adapter lives at
`composer_replication.recipes.verl` (currently shape-only; see
[Limitations](#7-known-limitations-as-of-wave-14-1) below).

### 3. Minimum-viable Python script

VeRL's actual entry point is a Hydra/YAML config + `verl.trainer.main_ppo`
CLI; the pythonic surface looks like this:

```python
# train_verl.py — minimum viable Recipe 2 sketch
from verl.trainer.ppo import core_algos
from verl.trainer.ppo.ray_trainer import RayPPOTrainer
from composer_replication.loss import compose_loss

@core_algos.register_adv_est("grpo_composer")
def composer_advantage(data, **kwargs):
    """Custom adv-estimator that adds SDPO + DPO channels to GRPO.

    Reads two extra DataProto keys (populated by the data prep step):
      - data.batch["sdpo_teacher_logits"]    (channel 2)
      - data.non_tensor_batch["teacher_actions"]  (channel 3)
    and returns the standard (advantages, returns) tuple plus a stashed
    composer-loss term consumed by the critic worker.
    """
    advantages, returns = core_algos.compute_grpo_outcome_advantage(data, **kwargs)
    composer_term = compose_loss(
        model        = kwargs["actor_module"],
        inputs       = data.batch,
        alpha_sdpo   = 0.1,
        beta_replay  = 0.05,
        dpo_variant  = "dpo",
        sdpo_wrapper = "none",
    )
    data.meta_info["composer_loss"] = composer_term
    return advantages, returns

# Then in your YAML:
#   algorithm:
#     adv_estimator: grpo_composer
# and run: python -m verl.trainer.main_ppo --config-name composer_grpo
```

The full driver wires `RayPPOTrainer` against your config; consult VeRL's
own quickstart for the Ray-cluster boilerplate. The composer-specific
piece is just the registered estimator above.
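
For readers new to the pattern, the way a decorator-registered estimator is later selected by a config string can be shown without VeRL installed. Everything below is a toy: `toy_register_adv_est`, `ADV_ESTIMATORS`, and `grpo_toy` are hypothetical names, and the estimator only mean-centres rewards (GRPO additionally scales by the group standard deviation).

```python
from typing import Callable, Dict, List

ADV_ESTIMATORS: Dict[str, Callable[[List[float]], List[float]]] = {}


def toy_register_adv_est(name: str):
    """Toy registry decorator; stores the function under `name`."""
    def decorator(fn):
        if name in ADV_ESTIMATORS:
            raise ValueError(f"estimator {name!r} already registered")
        ADV_ESTIMATORS[name] = fn
        return fn
    return decorator


@toy_register_adv_est("grpo_toy")
def grpo_toy(rewards: List[float]) -> List[float]:
    """Group-relative advantages: reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# The YAML key `algorithm.adv_estimator` plays the role of this lookup string.
config = {"algorithm": {"adv_estimator": "grpo_toy"}}
estimator = ADV_ESTIMATORS[config["algorithm"]["adv_estimator"]]
advantages = estimator([1.0, 0.0, 1.0, 0.0])
```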

### 4. Decoupled DiLoCo wiring

VeRL's actor workers run in Ray; DiLoCo replicates the **whole VeRL job**.
Each "replica" is one Ray cluster running Recipe 2 end-to-end; the outer
loop is independent of Ray and just exchanges pseudo-gradients via the
object store between Ray-job invocations:

```python
from composer_replication.diloco.serverless import (
    LocalProcessExecutor, ObjectStoreAllReduce,
)

rendezvous = ObjectStoreAllReduce(
    uri        = "s3://verl-diloco/run/",
    world_size = 4,
)
executor = LocalProcessExecutor()        # Wave 14: ModalExecutor is a skeleton (raises NotImplementedError) — keep LocalProcessExecutor for now
handles  = executor.launch_replicas(
    n_replicas      = 4,
    entrypoint      = "verl.trainer.main_ppo",
    entrypoint_args = {
        "+algorithm.adv_estimator":      "grpo_composer",
        "+algorithm.diloco.rendezvous":  rendezvous.uri,
        "+algorithm.diloco.sync_every_h": 500,
    },
)
executor.collect(handles, timeout=24 * 3600)
```

The Ray cluster inside each replica handles intra-replica scaling
(FSDP / TP / vLLM); the object-store exchange handles cross-replica
sync. Per-round traffic scales with parameter count at 2 bytes per
parameter in bf16: ~2 GB per replica for a 1B-param model, ~14 GB for
7B. See Recipe 4, step 6 for the resulting storage cost.

### 5. Distillation-loss wiring

The custom `adv_estimator` from step 3 already calls `compose_loss`;
flip the kwargs there to switch DPO → SimPO or add TAID:

```python
composer_term = compose_loss(
    model        = kwargs["actor_module"],
    inputs       = data.batch,
    alpha_sdpo   = 0.1,
    beta_replay  = 0.05,
    dpo_variant  = "simpo",         # ← SimPO swap
    simpo_beta   = 2.0,
    simpo_gamma  = 1.0,
    sdpo_wrapper       = "taid",    # ← TAID wrap
    taid_schedule_step = data.meta_info.get("global_step", 0),
    taid_total_steps   = 10_000,
)
```

VeRL's `data.meta_info` carries the global step automatically, which is
exactly what TAID's interpolation schedule needs. Channel 2 batches
without `student_init_logits` / `student_init_input_ids` are auto-skipped
(returns 0 for that step).
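
The step-driven interpolation can be sketched for a single token position in plain Python. This follows the "logit-space mix + forward-KL" description above; the linear schedule is a simplifying assumption (the TAID paper uses an adaptive update), and `taid_t_linear` / `taid_forward_kl` are hypothetical names, not the adapter's API.

```python
import math


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def taid_t_linear(step, total_steps, t_start=0.0, t_end=1.0):
    """Assumed linear schedule: t moves from t_start to t_end over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start + (t_end - t_start) * frac


def taid_forward_kl(student_logits, teacher_logits, t):
    """KL(p_t || q_student), where p_t mixes student and teacher logits.
    In a real trainer the student logits inside the mix are detached."""
    mixed = [(1.0 - t) * s + t * q for s, q in zip(student_logits, teacher_logits)]
    log_p = log_softmax(mixed)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


student, teacher = [1.0, 0.0, -1.0], [0.0, 2.0, 0.0]
t_mid = taid_t_linear(step=5_000, total_steps=10_000)
kl_start = taid_forward_kl(student, teacher, t=0.0)  # target equals the student
kl_end = taid_forward_kl(student, teacher, t=1.0)    # target equals the teacher
```

At `t = 0` the intermediate target coincides with the student and the loss vanishes; as `t` grows the target moves toward the teacher.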

### 6. Cost ballpark

- **GPU**: 8× H100 (`p5.48xlarge` ~$98/hr on AWS, ~$25/hr on Lambda or
  RunPod community) is the entry point for 70B-class. Expect 32–256
  H100 for full 671B (matches DeepSeek's reported VeRL config).
- **API**: same ~$0.98/trace as Recipe 1 (channel 3 is a Python helper,
  not a VeRL primitive — costs are framework-independent).
- **Ray cluster overhead**: head node + redis + dashboard adds ~1
  CPU-instance ($0.10–0.50/hr) per cluster, negligible at GPU scale.

### 7. Known limitations as of Wave 14

- **`composer_replication.recipes.verl` is shape-only.** The decorator
  registration and DataProto extension are documented but not yet shipped
  as a runnable adapter — Wave 14 release exposes the *contract*, not the
  glue. Expect this to land in a v0.2 follow-up spike.
- **Ray dependency.** Adds a heavyweight runtime; debugging
  cross-actor crashes can be painful. Use VeRL's `--debug` mode early.
- **Custom-`adv_estimator` LOC**: writing your own takes ~50–150 LOC
  including DataProto plumbing. Not a one-liner.
- **No first-class TAID hook in VeRL itself** — we route TAID through
  the meta_info channel; this works but means you can't use VeRL's
  built-in checkpoint-replay tooling without re-stamping `taid_schedule_step`
  on each replay.

---

## Recipe 3 — PRIME-RL `CustomLossConfig`

### 1. When to use it

Pick PRIME-RL when:

- You're operating in the **PRIME-Intellect / decentralized training**
  universe and want INTELLECT-style scaling on a long-horizon training
  run.
- You need **DPPO importance-ratio masking** (the rationale most users
  arrive with) — PRIME-RL's headline contribution is the
  out-of-band-token *mask* (not clip) on `log_ratio = trainer_lp -
  inference_lp`, with defaults `low=-4.0, high=4.0`.
- You want a **first-class custom-loss surface**: PRIME-RL ships
  `CustomLossConfig` that takes an importable Python function and a
  `LossInputs` struct exposing exactly the tensors we need
  (`trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`,
  `advantages`, `loss_mask`). No fork, no Trainer subclass, no monkey-patch.
- You have access to multi-node infrastructure that PRIME-RL's
  trainer/inference/orchestrator split is designed for.

Don't pick PRIME-RL if you need full vocab logits (channel 2 SDPO
requires logits not log-probs — see Limitations).

### 2. Install command

```bash
pip install -e ".[prime-rl,replaysim]"
# pulls prime-rl>=0.5
```

### 3. Minimum-viable Python script

PRIME-RL drives via YAML config; the only Python you write is the
custom-loss function (already shipped at
`composer_replication/recipes/prime_rl/composer_loss.py`). Wire it in:

```yaml
# prime_rl_config.yaml — point at the framework's adapter
loss:
  custom:
    import_path: composer_replication.recipes.prime_rl.composer_loss:loss_fn
    kwargs:
      alpha_sdpo:     0.0       # channel 2 deferred in v0 (see below)
      beta_dpo:       0.0       # channel 3 emits a warning if non-zero
      dppo_mask_high: 4.0       # PRIME-RL DPPO mask bounds
      dppo_mask_low: -4.0
      epsilon:        1.0e-6

trainer:
  model: Qwen/Qwen2.5-7B-Instruct
  ...                           # standard PRIME-RL fields
```

The shipped `loss_fn` signature is fixed by PRIME-RL's contract:

```python
def loss_fn(
    inputs: LossInputs,
    *,
    alpha_sdpo: float = 0.0,
    beta_dpo:   float = 0.0,
    dppo_mask_high: float = 4.0,
    dppo_mask_low:  float = -4.0,
    epsilon:        float = 1e-6,
) -> torch.Tensor:
    log_ratio    = inputs.trainer_logprobs - inputs.inference_logprobs
    dppo_invalid = (log_ratio > dppo_mask_high) | (log_ratio < dppo_mask_low)
    keep_mask    = inputs.loss_mask & ~dppo_invalid
    grpo = -(inputs.advantages * inputs.trainer_logprobs * keep_mask).sum() \
            / keep_mask.sum().clamp_min(epsilon)
    if alpha_sdpo != 0.0:
        raise NotImplementedError(
            "Channel 2 SDPO requires full-vocab logits; PRIME-RL v0.5 "
            "exposes only log-probs. Deferred to v0.2."
        )
    if beta_dpo != 0.0:
        import warnings; warnings.warn(
            "Channel 3 trace-replay DPO is out-of-scope for PRIME-RL recipe v0",
            stacklevel=2,
        )
    return grpo
```

**Shape note** (caught in the Wave 13 cross-model review): PRIME-RL
calls the loss function **once per sample**; tensors are 1-D `(seq,)`,
*not* batched `(B, T)`. The 10 unit tests in
`composer_replication/recipes/prime_rl/tests/test_composer_loss.py`
cover this plus DPPO mask edges.
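
The masking behaviour on a single 1-D sample can be checked by hand with a dependency-free mirror of the GRPO term. This is an illustrative re-statement using Python lists instead of tensors; `dppo_masked_grpo` is a hypothetical name, not the shipped function.

```python
def dppo_masked_grpo(trainer_lp, inference_lp, advantages, loss_mask,
                     low=-4.0, high=4.0, epsilon=1e-6):
    """GRPO term over one sample; out-of-band tokens are dropped, not clipped."""
    total, kept = 0.0, 0
    for t_lp, i_lp, adv, keep in zip(trainer_lp, inference_lp, advantages, loss_mask):
        log_ratio = t_lp - i_lp
        if not keep or log_ratio > high or log_ratio < low:
            continue  # masked token: contributes nothing to numerator or count
        total += adv * t_lp
        kept += 1
    return -total / max(kept, epsilon)


# Three tokens in one sample; the middle token has log_ratio = 5.0 > high.
loss = dppo_masked_grpo(
    trainer_lp=[-1.0, -2.0, -0.5],
    inference_lp=[-1.1, -7.0, -0.4],
    advantages=[1.0, 1.0, 1.0],
    loss_mask=[True, True, True],
)
```

Because the middle token is removed rather than clamped, the average runs over the two surviving tokens only.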

### 4. Decoupled DiLoCo wiring

PRIME-RL was designed for decentralized training and ships its own
weight-sync primitives. Stack DiLoCo on top via the
`ServerlessExecutor` Protocol — each replica runs an independent
PRIME-RL job pointing at the same `composer_loss:loss_fn`:

```python
from composer_replication.diloco.serverless import (
    LocalProcessExecutor, ObjectStoreAllReduce,
)

rendezvous = ObjectStoreAllReduce(
    uri        = "s3://prime-rl-diloco/run/",
    world_size = 4,
)
# Wave 14: ModalExecutor is a skeleton (raises NotImplementedError until v0.x).
# Use LocalProcessExecutor for the inner-replica wiring; swap to the cloud
# executor once it lands. The DiLoCo + rendezvous code below is identical.
executor = LocalProcessExecutor()
handles  = executor.launch_replicas(
    n_replicas      = 4,
    entrypoint      = "prime_rl.cli:main",
    entrypoint_args = {
        "config":               "prime_rl_config.yaml",
        "+diloco.rendezvous":   rendezvous.uri,
        "+diloco.sync_every_h": 500,
    },
)
executor.collect(handles, timeout=24 * 3600)
```

Note PRIME-RL's own multi-node story (the trainer / inference /
orchestrator split) is **orthogonal** to Decoupled DiLoCo: PRIME-RL
multi-node = single replica scaled across many GPUs; DiLoCo = N
independent replicas synchronizing via object store. Combine both for
"big PRIME-RL job × N replicas".

### 5. Distillation-loss wiring

Channel 2 (SDPO + TAID + Entropy-OPD) is **deferred** in v0 because
PRIME-RL's `LossInputs` exposes log-probs, not full-vocab logits. The
SimPO swap on channel 3 is gated by the same shape constraint; the DPPO
mask itself is unaffected. To get TAID/SimPO into a PRIME-RL job
today you must:

1. Switch to Recipe 1 or 2 for the SFT/distill phase.
2. Use PRIME-RL only for the on-policy GRPO+DPPO phase.

The v0.2 plan (per ADR-007) is to extend `LossInputs` with a
`teacher_logits` field; the loss adapter is already shape-ready.

### 6. Cost ballpark

- **GPU**: similar profile to Recipe 2: 8–32 H100 typical, scaling to
  hundreds for INTELLECT-class runs. Lambda Cloud or RunPod community
  pricing (~$2–4/hr per H100) is the most cost-effective.
- **API**: channel 3 is gated, so the only OpenRouter spend is from the
  *offline data-prep* spike (using the verifier harness in Recipe 1 to
  pre-bake DPO pairs), not from the training loop itself. Order of
  magnitude: $50–500 for a one-time curriculum bake, then $0/run.
- **Network**: PRIME-RL's own decentralized weight sync uses substantial
  bandwidth between training replicas (one of its design constraints);
  this is *separate* from the Decoupled DiLoCo bandwidth and shows up
  as a ceiling on cross-region replica placement.

### 7. Known limitations as of Wave 14

- **Channel 2 deferred** (see step 5). Any non-zero `alpha_sdpo` raises
  `NotImplementedError`.
- **Channel 3 emits a warning** if `beta_dpo != 0`; trace-replay DPO
  pairs must be folded into the *training data* (offline) rather than
  the *loss* (online) until v0.2.
- **PRIME-RL ≥ 0.5 required.** Earlier versions don't ship
  `CustomLossConfig`.
- **Smoke test deferred.** Per `prime_rl_recipe.md`, the runtime smoke
  test requires a CUDA box + `prime-rl >= 0.5` install and is gated
  to a follow-up spike. The 10 unit tests run cleanly without GPU.
- **DPPO defaults are PRIME-RL's, not ours.** We pin `low=-4.0,
  high=4.0` to match. If you change them, you're now diverging from
  PRIME-RL's example configs.

---

## Recipe 4 — Serverless Decoupled DiLoCo

### 1. When to use it

Pick Decoupled DiLoCo when:

- You have **N independent training replicas** that should sync
  occasionally but can't (or shouldn't) cross-talk on every step.
- The cost or operational burden of an always-on multi-node cluster is
  unacceptable, but you're happy paying for N independent **serverless
  jobs**.
- Your inner trainer is one of Recipes 1–3 — DiLoCo wraps any inner
  optimizer; it's *purely outer-loop*.
- You need **failure isolation**: if one replica crashes, the others
  keep training; on restart it picks up from the last outer round.

DiLoCo's design rests on two abstractions (per ADR-005):

1. **`ServerlessExecutor` Protocol** — uniform interface for spinning up
   N replicas across cloud backends (Modal / HF Jobs / SageMaker / k8s).
2. **`ObjectStoreAllReduce`** — fsspec-backed pseudo-gradient exchange
   that replaces the in-process `torchft.Manager.allreduce` call.

The communication pattern is `S3 PutObject + N GetObjects` once every
H inner steps, matching DiLoCo paper §3.2 (arXiv:2311.08105). For a
1B-param model in bf16 that is ~2 GB per replica per outer round
(roughly every 30 min); see step 6 for how this adds up against the S3
free tier.
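
The payload figure is plain arithmetic: parameter count times bytes per parameter. A small sketch, assuming uncompressed pseudo-gradients and a 30-minute outer round (both assumptions; the helper names are hypothetical):

```python
def pseudo_gradient_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Size of one replica's pseudo-gradient upload in GB (bf16 = 2 bytes)."""
    return n_params * bytes_per_param / 1e9


def average_mb_per_s(payload_gb: float, round_minutes: float = 30.0) -> float:
    """Average upload rate if one payload is written per outer round."""
    return payload_gb * 1e3 / (round_minutes * 60.0)


gb_1b = pseudo_gradient_gb(1e9)    # 1B params in bf16
gb_7b = pseudo_gradient_gb(7e9)    # 7B params in bf16
rate_1b = average_mb_per_s(gb_1b)  # sustained average, not burst bandwidth
```

The sustained rate is only about 1 MB/s for a 1B-param model, which is why an object store is fast enough for the outer loop.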

### 2. Install command

```bash
pip install -e ".[diloco,serverless]"
# also one of the inner-trainer extras:
pip install -e ".[train]"        # if the inner trainer is Recipe 1
# OR pip install verl            # if the inner trainer is Recipe 2
# OR pip install -e ".[prime-rl]" # if the inner trainer is Recipe 3
```

### 3. Minimum-viable Python script

This pattern is independent of the inner trainer — pick any of Recipes
1/2/3 and wrap it with a `ServerlessExecutor`. The replica entrypoint
runs the inner trainer; the driver launches N of them and waits.

```python
# diloco_driver.py — driver that launches N replicas
from composer_replication.diloco.serverless import (
    LocalProcessExecutor,         # for dev — runs replicas as local subprocesses
    ObjectStoreAllReduce,
)

rendezvous = ObjectStoreAllReduce(
    uri        = "s3://my-bucket/diloco-runs/run42/",  # or file:// for local
    world_size = 4,
)
executor = LocalProcessExecutor()                       # Wave 14: ModalExecutor skeleton raises NotImplementedError; swap once cloud backend lands
handles  = executor.launch_replicas(
    n_replicas      = 4,
    entrypoint      = "diloco_replica.py",              # (script below)
    entrypoint_args = {
        "rendezvous": rendezvous.uri,
        "rank_env":   "REPLICA_RANK",
    },
)
result = executor.collect(handles, timeout=3600)
print({h.replica_id: h.exit_code for h in result})
```

```python
# diloco_replica.py — runs inside each replica
import os
from composer_replication.diloco import make_diloco_outer_loop
from composer_replication.diloco.serverless import (
    ObjectStoreAllReduce, MockManager,
)

# Build inner trainer (Recipe 1 example):
from train_trl import trainer

rendezvous = ObjectStoreAllReduce(
    uri        = os.environ["DILOCO_RENDEZVOUS"],
    world_size = 4,
    rank       = int(os.environ["REPLICA_RANK"]),
)
manager = MockManager(allreduce=rendezvous)
outer = make_diloco_outer_loop(
    inner_optimizer = trainer.optimizer,
    manager         = manager,
    sync_every_h    = 500,
)
trainer.add_callback(outer.callback())
trainer.train()
```
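
Conceptually, each outer sync averages the replicas' pseudo-gradients and applies an outer optimizer step. The sketch below follows the DiLoCo paper's description (pseudo-gradient = previous global weights minus local weights, Nesterov-momentum outer step) on plain Python lists. It is an assumption about what `make_diloco_outer_loop` does internally, not its code, and the `outer_lr` / `momentum` defaults are illustrative.

```python
def diloco_outer_step(global_params, replica_params, momentum_buf,
                      outer_lr=0.7, momentum=0.9):
    """One outer round: average pseudo-gradients, then a Nesterov SGD step."""
    n = len(replica_params)
    # Pseudo-gradient per replica is (global - local); average over replicas.
    avg_delta = [
        sum(g - local[i] for local in replica_params) / n
        for i, g in enumerate(global_params)
    ]
    new_buf = [momentum * b + d for b, d in zip(momentum_buf, avg_delta)]
    # Nesterov look-ahead: gradient plus momentum times the updated buffer.
    step = [d + momentum * b for d, b in zip(avg_delta, new_buf)]
    new_global = [g - outer_lr * s for g, s in zip(global_params, step)]
    return new_global, new_buf


# Sanity check: with outer_lr=1 and no momentum the step is plain averaging.
averaged, _ = diloco_outer_step(
    global_params=[1.0, 2.0],
    replica_params=[[0.0, 2.0], [1.0, 4.0]],
    momentum_buf=[0.0, 0.0],
    outer_lr=1.0,
    momentum=0.0,
)
```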

### 4. Decoupled DiLoCo wiring

This recipe **is** the DiLoCo wiring — see step 3. The available
executor adapters are:

| Executor                  | Status                        | Use case                             |
|---------------------------|-------------------------------|--------------------------------------|
| `LocalProcessExecutor`    | Production-ready              | Dev loop — N subprocesses on one box |
| `ModalExecutor`           | Skeleton (modal-client gated) | Modal cloud, $/sec billing           |
| `HFJobsExecutor`          | Skeleton (hf-hub gated)       | HuggingFace Jobs, transformer-shop   |
| `SageMakerExecutor`       | Roadmap (post-v0.2)           | AWS, warm-pool ~10s cold start       |
| `K8sExecutor`             | Roadmap                       | KubeRay / Volcano gang scheduling    |

Cross-cloud replica placement (e.g. 2× Modal + 2× HF Jobs) is supported
in principle, since every replica reads and writes the same S3 / GCS / HF
rendezvous, but treat it as experimental.

### 5. Distillation-loss wiring

DiLoCo is loss-agnostic: it operates purely on inner-optimizer state.
Whichever inner trainer you're running (Recipe 1, 2, or 3) handles
distillation kwargs as documented in that recipe's step 5. The only
DiLoCo-specific knob worth knowing: TAID's `taid_schedule_step` is a
*global* counter, but each replica increments it independently. If you
care about all replicas reading the same TAID coefficient at outer-sync
time, set `taid_schedule_step = trainer.state.global_step + replica_offset`
and let the outer-loop sync average them out.

### 6. Cost ballpark

Pulled from
[`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md`](research/DILOCO_SERVERLESS_RECONNAISSANCE.md):

| Backend       | A100-80GB $/hr                       | H100 $/hr                            | Cold start                    | Notes                             |
|---------------|--------------------------------------|--------------------------------------|-------------------------------|-----------------------------------|
| Modal         | $1.39/sec → 4× ≈ $20/hr per A100     | ~$8/hr                               | 1–60s warm, 60–120s first-run | $/sec billing; no minimum         |
| AWS SageMaker | $4.10/hr                             | $12.29/hr                            | 2–5 min cold, ~10s warm pool  | Min 60min on warm pool            |
| GCP Vertex    | $3.67/hr                             | $11/hr                               | 2–6 min cold                  | 30–50% premium over raw GPU       |
| Azure ML      | ~$3.67/hr                            | ~$12.25/hr                           | 3–8 min cold                  | Use curated env to cut cold-start |
| RunPod        | $1.19/hr (community), $2.17 (secure) | $1.99/hr (community), $4.18 (secure) | seconds                       | No federation; same-DC only       |
| HF Jobs       | comparable to Modal                  | ~$8–12/hr                            | 30–90s                        | Best DX for HF-shop               |

**Object-store cost.** S3 Standard is ~$0.02/GB-month beyond the free
tier. Pseudo-gradients are ~2 GB per replica per outer round; a 24-hour
4-replica run at H=500 is ~50 outer rounds × 2 GB × 4 replicas = ~400 GB
written. That exhausts the free tier quickly, so budget $10–20 for storage.
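
The storage estimate can be reproduced with a short calculation. It is a sketch under stated assumptions: one outer round every 30 minutes, every object retained for a full month (an upper bound), and request charges ignored. `storage_estimate` is a hypothetical helper.

```python
def storage_estimate(run_hours, minutes_per_round, gb_per_round, n_replicas,
                     usd_per_gb_month=0.02):
    """Return (outer rounds, GB written, USD per month if nothing is deleted)."""
    rounds = int(run_hours * 60 // minutes_per_round)
    gb_written = rounds * gb_per_round * n_replicas
    return rounds, gb_written, gb_written * usd_per_gb_month


rounds, gb_written, usd = storage_estimate(
    run_hours=24, minutes_per_round=30, gb_per_round=2, n_replicas=4,
)
```

The exact count is 48 rounds and 384 GB, which the text rounds to ~50 rounds and ~400 GB. Deleting rounds that every replica has already consumed would keep the retained volume far below this bound.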

### 7. Known limitations as of Wave 14

- **`ModalExecutor` and `HFJobsExecutor` are skeletons.** They check
  `import modal` / `import huggingface_hub` at *adapter init* time and
  raise; the actual `launch_replicas` is shape-only until the relevant
  spike lands. Use `LocalProcessExecutor` for dev.
- **`ObjectStoreAllReduce(world_size=1)`** must passthrough cleanly —
  the unit test `test_object_store_allreduce_world_size_1_passthrough`
  is the regression guard. Don't override unless you've read it.
- **Rank validation is mandatory.** Tests assert
  `ObjectStoreAllReduce(rank=N, world_size=N)` raises (rank must be
  `< world_size`); silent corruption otherwise.
- **`MockManager` is *not* feature-complete.** It implements the
  `Manager.allreduce` surface that DiLoCo's outer-loop needs, but
  not the full `torchft.Manager` API (no fault-tolerance, no
  membership protocol). Don't use it as a drop-in for live torchft.
- **No native heterogeneous compute** — all replicas are assumed to
  have the same compute shape. Mixed A100+H100 placements work but
  the slow replica gates outer-loop progress.
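
The exchange pattern and the two guard rails above (rank validation and the `world_size=1` passthrough) can be demonstrated with a file-backed toy. `ToyObjectStoreAllReduce` is a hypothetical stand-in that writes JSON to a local directory; it is not the project's `ObjectStoreAllReduce`, and a real implementation would poll until all N objects exist rather than assume they are already written.

```python
import json
import tempfile
from pathlib import Path
from typing import List


class ToyObjectStoreAllReduce:
    """One put per rank, N gets per reduce, mean over ranks."""

    def __init__(self, root: str, world_size: int, rank: int):
        if not 0 <= rank < world_size:
            raise ValueError(f"rank={rank} must satisfy 0 <= rank < world_size={world_size}")
        self.root, self.world_size, self.rank = Path(root), world_size, rank

    def _key(self, round_id: int, rank: int) -> Path:
        return self.root / f"round-{round_id:06d}" / f"rank-{rank}.json"

    def put(self, round_id: int, values: List[float]) -> None:
        key = self._key(round_id, self.rank)
        key.parent.mkdir(parents=True, exist_ok=True)
        key.write_text(json.dumps(values))

    def reduce(self, round_id: int) -> List[float]:
        shards = [json.loads(self._key(round_id, r).read_text())
                  for r in range(self.world_size)]
        return [sum(col) / self.world_size for col in zip(*shards)]

    def allreduce(self, round_id: int, values: List[float]) -> List[float]:
        if self.world_size == 1:
            return list(values)  # passthrough: nothing to exchange
        self.put(round_id, values)
        return self.reduce(round_id)


with tempfile.TemporaryDirectory() as root:
    r0 = ToyObjectStoreAllReduce(root, world_size=2, rank=0)
    r1 = ToyObjectStoreAllReduce(root, world_size=2, rank=1)
    r0.put(0, [1.0, 2.0])
    r1.put(0, [3.0, 4.0])
    averaged = r0.reduce(0)
```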

---

## Recipe 5 — Monarch actor mesh

### 1. When to use it

Pick Monarch when:

- You're at **TorchForge-style topology scale**: trainer / generator /
  rewarder / N-teachers all want to be independent, asynchronously
  scheduled, fault-tolerant actors on a typed mesh.
- You want **heterogeneous executor support** — different actors run
  in different clouds (e.g. `TrainerActor` on Modal A100s,
  `GeneratorActor` on dedicated H100s, `TeacherPoolActor` as 0-GPU CPU
  pods on k8s).
- You need **hot-swap of actor implementations** — replace
  "OpenRouter teachers" with "local vLLM teachers" by changing one
  Monarch binding, no trainer code change.
- You're prepared to track **upstream Monarch** (v0.4.1 stable, v0.5
  dev daily); the API is moving and v0 of this recipe is intentionally
  deferred per ADR-006.

Don't pick Monarch in Wave 14 unless you're explicitly scoping a
v0.2+ pilot. The framework ships *skeleton* actors that fail-fast on
instantiation; this is a reference-pattern reading exercise, not a
production target.

### 2. Install command

```bash
pip install -e ".[prime-rl,monarch]"
# pulls monarch>=0.4.1 plus the PRIME-RL trainer used inside actors
```

### 3. Minimum-viable Python script

The framework ships skeleton actor definitions at
`composer_replication/recipes/monarch/actors.py`; they raise
`NotImplementedError` on instantiation in Wave 14. The shape of the
final answer:

```python
# monarch_train.py: what v0.2+ usage will look like
import asyncio

from monarch import Actor, mesh, endpoint
from composer_replication.recipes.monarch.actors import (
    TrainerActor, GeneratorActor, RewarderActor, TeacherPoolActor,
)

# Topology
trainers   = mesh.spawn(TrainerActor, n=4, gpu="A100")
generator  = mesh.spawn(GeneratorActor, n=1, gpu="A100")
rewarder   = mesh.spawn(RewarderActor, n=1, gpu=None)
teachers   = mesh.spawn(TeacherPoolActor, n=1, gpu=None)

# Wire endpoints: one outer step = sample, roll out, score, replay, train
async def outer_step(batch_id: int):
    prompts      = await trainers[0].sample_prompts.call(batch_id)
    rollouts     = await generator.rollout.call(prompts)
    rewards      = await rewarder.score.call(rollouts)
    teacher_acts = await teachers.replay.call([
        {"state": r["state"]} for r in rollouts
    ])
    await trainers.train_outer_step.call(
        batch_id, rollouts=rollouts, rewards=rewards,
        teacher_actions=teacher_acts,
    )

# Run: a single event loop drives every outer step, instead of
# creating and tearing down a fresh loop per batch
async def main(num_batches: int = 1000):
    for batch_id in range(num_batches):
        await outer_step(batch_id)

asyncio.run(main())
```

The Composer 3-channel loss lives inside `TrainerActor.train_outer_step`,
which calls `compose_loss(...)` exactly as Recipe 1 does. The
*orchestration* changes; the *loss math* doesn't.

### 4. Decoupled DiLoCo wiring

Monarch + Decoupled DiLoCo compose naturally: each `TrainerActor` is a
DiLoCo replica, and Monarch's supervision tree handles the failure
recovery that ADR-005 lists as a DiLoCo design constraint. The wire-up
is identical to Recipe 4's `LocalProcessExecutor` pattern, just running
inside Monarch instead of `subprocess`:

```python
from monarch import Actor, endpoint  # same import style as monarch_train.py
from composer_replication.diloco.serverless import (
    ObjectStoreAllReduce, MockManager,
)

class TrainerActor(Actor):
    def __init__(self, rendezvous_uri: str, rank: int, world_size: int):
        self.rendezvous = ObjectStoreAllReduce(
            uri=rendezvous_uri, rank=rank, world_size=world_size,
        )
        self.manager = MockManager(allreduce=self.rendezvous)
        # ... build inner ComposerReplicationTrainer ...

    @endpoint
    async def train_outer_step(self, batch_id: int, **kw):
        # Inner H steps locally, then sync via self.rendezvous
        ...
```

The object store is the cross-actor synchronization point, and by
design it does *not* go through Monarch's RDMA data plane: slow,
infrequent outer syncs go through object storage (e.g. S3), while fast
syncs (RDMA for in-actor weight broadcast) stay on a separate plane.
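To make the slow-plane idea concrete, here is a minimal sketch of an object-store-style mean all-reduce. It is not the implementation of `ObjectStoreAllReduce`: the class name `FileStoreAllReduce`, its methods, and the key layout are hypothetical, and a local directory stands in for the bucket. Each replica publishes its values under a per-step key, then polls until every rank has published and averages element-wise.

```python
import json
import tempfile
import time
from pathlib import Path


class FileStoreAllReduce:
    """Hypothetical stand-in: a directory plays the role of the object store."""

    def __init__(self, root: Path, rank: int, world_size: int):
        self.root, self.rank, self.world_size = Path(root), rank, world_size

    def publish(self, step: int, values: list[float]) -> None:
        # Write under a temporary name, then rename, so that readers
        # never observe a partially written shard.
        step_dir = self.root / f"step-{step}"
        step_dir.mkdir(parents=True, exist_ok=True)
        tmp = step_dir / f"rank-{self.rank}.tmp"
        tmp.write_text(json.dumps(values))
        tmp.rename(step_dir / f"rank-{self.rank}.json")

    def gather_mean(self, step: int, timeout_s: float = 5.0) -> list[float]:
        # Poll until all ranks have published, then average element-wise.
        step_dir = self.root / f"step-{step}"
        deadline = time.monotonic() + timeout_s
        while True:
            shards = sorted(step_dir.glob("rank-*.json"))
            if len(shards) == self.world_size:
                break
            if time.monotonic() > deadline:
                raise TimeoutError(
                    f"step {step}: {len(shards)}/{self.world_size} ranks published"
                )
            time.sleep(0.05)
        rows = [json.loads(p.read_text()) for p in shards]
        return [sum(col) / len(rows) for col in zip(*rows)]


def demo() -> list[float]:
    # Two replicas publish for step 0; either one can then read the mean.
    with tempfile.TemporaryDirectory() as root:
        replicas = [FileStoreAllReduce(Path(root), r, 2) for r in range(2)]
        replicas[0].publish(0, [1.0, 2.0])
        replicas[1].publish(0, [3.0, 6.0])
        return replicas[0].gather_mean(0)
```

Because the only shared state is the store, a replica that crashes and restarts can rejoin at the next step key without any live connection to its peers, which is the property that lets this plane tolerate slow, high-latency storage.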

### 5. Distillation-loss wiring

Monarch sees the loss as opaque: it lives inside `TrainerActor` and
takes the same `compose_loss` kwargs as Recipe 1. The mesh-level
benefit is **swap-by-binding**: you can replace `TeacherPoolActor`
("OpenRouter") with a `LocalVLLMTeacherActor` to switch the
*supplier* of teacher log-probs without touching the loss config.

```python
# Original binding — channel 3 via OpenRouter
teachers = mesh.spawn(TeacherPoolActor, n=1, gpu=None)

# Swap binding — channel 3 via local vLLM
teachers = mesh.spawn(LocalVLLMTeacherActor, n=1, gpu="A100",
                     model_id="Qwen/Qwen2.5-72B-Instruct")

# Trainer config unchanged (batch_id is the current outer-step index):
trainer.compose_loss_kwargs = dict(
    dpo_variant      = "simpo",      # same as before
    sdpo_wrapper     = "taid",
    taid_schedule_step = batch_id,
    taid_total_steps   = 10_000,
)
```
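The swap works because trainer-side code depends only on the teacher's endpoint contract, not on who implements it. A framework-free sketch of that idea follows; `TeacherSupplier`, `RemoteTeacher`, `LocalTeacher`, and `collect_teacher_logprobs` are hypothetical names for illustration, the returned numbers are placeholders, and plain method calls stand in for Monarch endpoint calls.

```python
from typing import Protocol


class TeacherSupplier(Protocol):
    """The only contract the trainer side relies on."""

    def replay(self, states: list[str]) -> list[list[float]]: ...


class RemoteTeacher:
    """Hypothetical stand-in for an API-backed teacher pool."""

    def replay(self, states: list[str]) -> list[list[float]]:
        return [[-0.1, -2.3] for _ in states]  # placeholder log-probs


class LocalTeacher:
    """Hypothetical stand-in for a locally hosted teacher."""

    def replay(self, states: list[str]) -> list[list[float]]:
        return [[-0.2, -1.7] for _ in states]  # placeholder log-probs


def collect_teacher_logprobs(
    teachers: TeacherSupplier, rollouts: list[dict]
) -> list[list[float]]:
    # Identical call site regardless of which supplier is bound.
    return teachers.replay([r["state"] for r in rollouts])


rollouts = [{"state": "s0"}, {"state": "s1"}]
remote_out = collect_teacher_logprobs(RemoteTeacher(), rollouts)
local_out = collect_teacher_logprobs(LocalTeacher(), rollouts)
```

Switching suppliers changes the values that come back but not the shape or the call site, so the loss configuration downstream stays untouched.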

### 6. Cost ballpark

In Wave 14: $0 (skeleton fails fast; no compute used). Projected for v0.2+:

- **Mesh overhead**: Monarch's coordination plane is light — typically
  <1% of total compute even at 4-actor scale. The dominant cost is
  whatever the actors run.
- **Heterogeneous placement** is the cost lever: e.g. a 4-trainer mesh
  with `TeacherPoolActor` on 0-GPU CPU pods can cut total $/hr by
  ~10–20% vs forcing all actors onto GPU nodes.
- **Cluster bring-up**: Monarch v0.5's Slurm backend is stable; k8s
  backend is dev-track; bare-metal SSH backend is documented.
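The heterogeneous-placement lever is simple arithmetic, sketched below. The hourly rates are illustrative placeholders, not sourced prices, and the actor counts assume the topology from the script above (4 trainers, 1 generator, 1 rewarder, 1 teacher pool) with the rewarder on a CPU pod in both layouts.

```python
def mesh_cost_per_hour(
    gpu_actors: int, cpu_actors: int, gpu_rate: float, cpu_rate: float
) -> float:
    """Total $/hr for a mesh with the given actor placement."""
    return gpu_actors * gpu_rate + cpu_actors * cpu_rate


def placement_saving(gpu_rate: float, cpu_rate: float) -> float:
    """Fractional saving from moving the teacher pool off a GPU node."""
    # Baseline: trainers, generator, and teacher pool all on GPU nodes.
    all_gpu = mesh_cost_per_hour(6, 1, gpu_rate, cpu_rate)
    # Heterogeneous: teacher pool joins the rewarder on CPU pods.
    hetero = mesh_cost_per_hour(5, 2, gpu_rate, cpu_rate)
    return (all_gpu - hetero) / all_gpu


# Placeholder rates: $2.00/hr per GPU node, $0.20/hr per CPU pod.
saving = placement_saving(gpu_rate=2.00, cpu_rate=0.20)
```

With these placeholder rates the saving comes out a little under 15%. The fraction shrinks as more GPU actors are added and grows as the CPU/GPU price gap widens.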

### 7. Known limitations as of Wave 14

- **Skeleton only, fails fast.** Importing `actors.py` is fine;
  instantiating `TrainerActor(...)` raises `NotImplementedError("v0
  skeleton; deferred to v0.2 per ADR-006")`. By design.
- **Upstream Monarch API is moving.** v0.4.1 stable + v0.5 dev daily
  means breaking changes are expected. Pin to a Monarch hash if you
  prototype.
- **TorchForge is paused.** Per its own repo banner — don't take
  TorchForge's recipes as production patterns. Monarch alone is
  active; Forge as a layered framework is reference reading.
- **Open question (deferred):** does Monarch v0.5's Slurm backend
  hand-shake cleanly with HF Jobs lifecycle? See
  `monarch_actor_layout.md` for the open-questions list.
- **Open question (deferred):** can `TrainerActor` host
  `ComposerReplicationTrainer` unmodified, or does it need a
  `step_init` / `step_compute` split for Monarch's async actor model?

---

## Comparison matrix

| Dimension                          | Recipe 1 — TRL              | Recipe 2 — VeRL                  | Recipe 3 — PRIME-RL               | Recipe 4 — Serverless DiLoCo       | Recipe 5 — Monarch                  |
|------------------------------------|-----------------------------|----------------------------------|-----------------------------------|------------------------------------|-------------------------------------|
| **Maturity (Wave 14)**             | Production-ready            | Production-ready (adapter shape-only) | Recipe ready, runtime smoke deferred | `LocalProcessExecutor` ready; cloud adapters skeleton | Skeleton only; v0.2+ scope        |
| **Supports DAPO / GRPO**           | GRPO ✅; DAPO via TRL master | GRPO ✅; DAPO ✅ (built-in)       | GRPO+DPPO ✅ (DAPO mask is the headline) | Inherits from inner trainer       | Inherits from inner trainer         |
| **Custom-loss extension cost (LOC)** | ~30 LOC (subclass override) | ~50–150 LOC (registered estimator) | ~20 LOC (single Python fn)        | 0 (transparent wrapper)           | ~30 LOC (loss inside actor)         |
| **OpenEnv-compatible**             | ✅ (HF datasets layer)       | ✅ (DataProto extension)          | ✅ (rollout JSONL contract)        | ✅ (orthogonal)                    | ✅ (RewarderActor binding)          |
| **Native multi-node**              | ❌ (single-host FSDP only)   | ✅ (Ray cluster + 3D-HybridEngine) | ✅ (trainer/inference/orchestrator split) | ✅ (the *whole point*)              | ✅ (mesh of actors)                  |
| **Native Decoupled DiLoCo**        | ❌ — wrap with Recipe 4      | ❌ — wrap with Recipe 4           | ❌ — wrap with Recipe 4            | ✅ (this *is* it)                  | ✅ (compose with Recipe 4 inside actor) |
| **License**                        | Apache 2.0 (TRL)            | Apache 2.0 (VeRL)                | Apache 2.0 (PRIME-RL)             | Apache 2.0 (this repo)             | BSD-3 (Monarch)                     |
| **Our recommendation (Wave 14)**   | **Default for ≤ 70B / single-host** | Pick at >70B *if* Ray is acceptable | Pick if PRIME-Intellect / DPPO mask is required | Stack on top of 1/2/3 for N replicas | Reference pattern only — revisit v0.2 |

---

## Cross-recipe checklist

Regardless of which recipe you pick, these invariants are tested across
the 115-test suite (post-Wave-15) and should be true of your wired-up system:

- **`alpha_sdpo=0`** must reproduce the channel-1-only baseline
  bit-exact (`test_compose_loss_integration.py`).
- **`beta_replay=0`** must reproduce the no-channel-3 baseline
  bit-exact.
- **`sdpo_wrapper="taid"` without `taid_schedule_step`** must raise
  `ValueError` at the first step (`test_compose_loss_integration.py`).
- **`sdpo_wrapper="taid"` at `taid_schedule_step / taid_total_steps = 0`**
  must ignore the teacher signal (`test_taid_loss_alpha_zero_ignores_teacher`).
- **`sdpo_wrapper="taid"` at `taid_schedule_step / taid_total_steps = 1`**
  must equal plain SDPO (`test_taid_blended_logits_endpoints`).
- **`dpo_variant="simpo"`** must be differentiable through the
  `log-sigmoid` path (`test_simpo_loss_differentiable`).
- **`sdpo_wrapper="entropy_opd"`** must zero out when student ≡ teacher
  (`test_entropy_aware_opd_zero_when_distributions_match`).
- **`ObjectStoreAllReduce(world_size=1)`** must pass through cleanly
  (`test_object_store_allreduce_world_size_1_passthrough`).

If any of these fail in your wired-up system, run the corresponding
unit test to localize the fault. Most failures occur because a kwarg
was dropped at the adapter boundary, not because the loss math is wrong.
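The two TAID endpoint invariants can be reproduced in isolation. The sketch below assumes the target distribution is built by linearly blending student and teacher logits with `alpha = step / total_steps`, which is what the test name `test_taid_blended_logits_endpoints` suggests; the actual schedule and blending in `loss.py` may differ, and `taid_target` and `kl` are hypothetical helper names.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def taid_target(
    student_logits: list[float],
    teacher_logits: list[float],
    step: int,
    total_steps: int,
) -> list[float]:
    # Assumed linear schedule: alpha moves from 0 (student) to 1 (teacher).
    alpha = step / total_steps
    blended = [
        (1 - alpha) * s + alpha * t
        for s, t in zip(student_logits, teacher_logits)
    ]
    return softmax(blended)


def kl(p: list[float], q: list[float]) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


student = [2.0, 0.5, -1.0]
teacher = [0.1, 1.5, 0.3]

start = taid_target(student, teacher, step=0, total_steps=100)
end = taid_target(student, teacher, step=100, total_steps=100)

# alpha = 0: the target is the student itself, so the distillation
# term vanishes and the teacher has no influence.
loss_at_start = kl(start, softmax(student))
```

At `alpha = 1` the target coincides with the teacher distribution, so the wrapped loss reduces to whatever the unwrapped distillation term computes. If your adapter passes a stale or constant `taid_schedule_step`, one of these endpoints will be violated.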

---

## Picking a recipe — decision flow

1. **Piloting Monarch (v0.2+)?** → Recipe 5.
2. **Else, need >70B / multi-host?** → Recipe 2 (VeRL) if Ray is OK,
   Recipe 3 (PRIME-RL) if you're in the PRIME-Intellect / DPPO universe,
   otherwise wait for Recipe 5.
3. **Else** → Recipe 1 (TRL) is the v0.0/v0.1 default.
4. **At any of 1–3, need N independent replicas / failure isolation?**
   → Stack Recipe 4 (Decoupled DiLoCo) on top.
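The same flow can be written as a small function, which is handy for encoding the choice in a launcher script. This is a sketch: `pick_recipe` and its flag names are hypothetical, and checking the Ray condition before the PRIME-Intellect condition is an assumption taken from the order in step 2.

```python
def pick_recipe(
    *,
    piloting_monarch: bool,
    needs_multi_host: bool,
    ray_ok: bool,
    prime_intellect_stack: bool,
    needs_replicas: bool,
) -> list[str]:
    """Return the base recipe, plus Recipe 4 when replicas are needed."""
    if piloting_monarch:                      # step 1
        base = "Recipe 5 (Monarch)"
    elif needs_multi_host:                    # step 2
        if ray_ok:
            base = "Recipe 2 (VeRL)"
        elif prime_intellect_stack:
            base = "Recipe 3 (PRIME-RL)"
        else:
            base = "wait for Recipe 5"
    else:                                     # step 3
        base = "Recipe 1 (TRL)"

    picks = [base]
    # Step 4: DiLoCo stacks on top of any concrete base recipe.
    if needs_replicas and base != "wait for Recipe 5":
        picks.append("Recipe 4 (Decoupled DiLoCo)")
    return picks


default_choice = pick_recipe(
    piloting_monarch=False,
    needs_multi_host=False,
    ray_ok=False,
    prime_intellect_stack=False,
    needs_replicas=True,
)
```

For a single-host run that needs failure isolation, `default_choice` pairs Recipe 1 with Recipe 4, matching steps 3 and 4 above.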

---

## Pointers to source

- Loss core: [`composer_replication/loss.py`](../composer_replication/loss.py)
- TRL trainer: [`composer_replication/trainer/composer_trainer.py`](../composer_replication/trainer/composer_trainer.py)
- PRIME-RL adapter:
  [`composer_replication/recipes/prime_rl/composer_loss.py`](../composer_replication/recipes/prime_rl/composer_loss.py),
  recipe doc:
  [`composer_replication/recipes/prime_rl/prime_rl_recipe.md`](../composer_replication/recipes/prime_rl/prime_rl_recipe.md)
- Monarch skeleton:
  [`composer_replication/recipes/monarch/actors.py`](../composer_replication/recipes/monarch/actors.py),
  layout doc:
  [`composer_replication/recipes/monarch/monarch_actor_layout.md`](../composer_replication/recipes/monarch/monarch_actor_layout.md)
- Serverless DiLoCo:
  [`composer_replication/diloco/serverless/`](../composer_replication/diloco/serverless/)
- VeRL adapter (shape-only):
  [`composer_replication/recipes/verl/`](../composer_replication/recipes/verl/)
- ADRs:
  [`docs/adrs/ADR-005-serverless-diloco.md`](adrs/ADR-005-serverless-diloco.md),
  [`docs/adrs/ADR-006-rl-frameworks.md`](adrs/ADR-006-rl-frameworks.md),
  [`docs/adrs/ADR-007-distillation-losses.md`](adrs/ADR-007-distillation-losses.md)
