# INTEGRATION_RECIPES.md — Wiring the 3-channel composer loss into your RL stack

> **Status:** Wave 14 release reference. Supersedes the historical
> [`docs/INTEGRATION_ARCHITECTURE.md`](INTEGRATION_ARCHITECTURE.md) (Recipes
> A–D), which is retained as background reading for the original
> mechanism-level diagrams.
>
> **Companion docs:**
> - [`docs/USER_GUIDE.md`](USER_GUIDE.md) — narrative walk-through, sections 1–8
> - [`docs/API_REFERENCE.md`](API_REFERENCE.md) — exact kwarg signatures
> - [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md) — error → fix index
> - [`docs/V3_SUBSTRATE_COVERAGE.md`](V3_SUBSTRATE_COVERAGE.md) — what each
>   substrate covers
> - [`docs/adrs/ADR-006-rl-frameworks.md`](adrs/ADR-006-rl-frameworks.md) —
>   why these five recipes and not others

This document is the canonical answer to **"how do I plug the 3-channel
composer loss into framework X?"** for the five frameworks the project
supports as of Wave 14:

1. [TRL `GRPOTrainer` subclass](#recipe-1--trl-grpotrainer-subclass)
2. [VeRL custom `adv_estimator` + DataProto extension](#recipe-2--verl-custom-adv_estimator--dataproto-extension)
3. [PRIME-RL custom-loss config](#recipe-3--prime-rl-customlossconfig)
4. [Serverless Decoupled DiLoCo (Modal / HF Jobs / SageMaker)](#recipe-4--serverless-decoupled-diloco)
5. [Monarch actor mesh (TorchForge-style topology)](#recipe-5--monarch-actor-mesh)

Each recipe follows the same seven-part template:

1. **When to use it** — decision criteria.
2. **Install command** — which optional extras of `composer-replication`.
3. **Minimum-viable Python script** — copy-pasteable, ≤ 60 lines.
4. **Decoupled DiLoCo wiring** — how `ServerlessExecutor` +
   `ObjectStoreAllReduce` + `MockManager` layer on top.
5. **Distillation-loss wiring** — how to switch DPO → SimPO and add TAID
   via `compose_loss(..., dpo_variant=..., sdpo_wrapper=...)` or the
   recipe's own loss-config field.
6. **Cost ballpark** — GPU $/hr + API spend, sourced from
   [`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md`](research/DILOCO_SERVERLESS_RECONNAISSANCE.md).
7. **Known limitations as of Wave 14**.

A cross-recipe [comparison matrix](#comparison-matrix) closes the doc.

## TL;DR — the unified loss

For any of the five recipes, the v0.1 trainer step computes:

```
total_loss = grpo_loss
           + α * sdpo_kl_loss        (channel 2 — Composer hint-distill;
                                      optional TAID or Entropy-OPD wrapper)
           + β * trace_replay_loss   (channel 3 — N-teacher DPO;
                                      switchable to SimPO)
```

This is implemented once, in
[`composer_replication/loss.py::compose_loss`](../composer_replication/loss.py),
and re-used by every recipe via the kwargs documented in
[`API_REFERENCE.md`](API_REFERENCE.md). The full signature — including
all ADR-007 channel-2/3 knobs (`dpo_variant`, `sdpo_wrapper`, `taid_t`,
`simpo_beta`/`simpo_gamma`, `entropy_opd_h_max`, …) — is the
single source of truth in
[API_REFERENCE.md § `compose_loss`](API_REFERENCE.md#compose_loss).
The conceptual call shape is just:

```python
compose_loss(model, inputs, **kwargs)  # see API_REFERENCE.md#compose_loss for full signature
```

All five recipes below either call `compose_loss` directly or call a
thin per-framework adapter that forwards these kwargs unchanged. Each
recipe's **§5 Distillation-loss wiring** documents the kwargs *that
recipe* uses by default and why; refer back to API_REFERENCE.md for
defaults, types, and which kwargs are mutually exclusive.

---

## Recipe 1 — TRL `GRPOTrainer` subclass

### 1. When to use it

This is the **default v0.0/v0.1 path** and the one we recommend for
~99% of users today. Pick TRL when:

- Your model fits on ≤ 32 GPUs (typically ≤ 70B-param FSDP).
- You already have a HuggingFace `model` + `tokenizer` + `datasets` flow.
- You want minimum integration cost — `ComposerReplicationTrainer` is a
  single subclass override of `_compute_loss` over `trl.GRPOTrainer`,
  no Ray, no actor mesh.
- You're doing single-host (one node, possibly multi-GPU FSDP) training.

Don't pick TRL when you need >100B-param scale, when you must async-decouple
tool calls from the GPU loop, or when a Ray cluster is already in your stack
(in which case Recipe 2 is cheaper).

### 2. Install command

```bash
pip install -e ".[train,replaysim]"
```

The `train` extra pulls `trl>=0.12`, `peft`, `accelerate`, and `datasets`.
The `replaysim` extra pulls `data-juicer` for CPU-side DPO normalization
(channel 3 cleaning step). Add `[serverless]` if you also want Decoupled
DiLoCo (see step 4).

### 3. Minimum-viable Python script

```python
# train_trl.py — minimum viable Recipe 1
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from composer_replication import ComposerReplicationTrainer

MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"  # swap for 7B once it works
model     = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
dataset   = load_dataset("trl-lib/tldr", split="train[:512]")

def reward_length(completions, **_):
    return [-abs(len(c) - 64) for c in completions]

trainer = ComposerReplicationTrainer(
    model         = model,
    processing_class = tokenizer,
    reward_funcs  = [reward_length],
    train_dataset = dataset,
    # Composer extras (defaults shown):
    alpha_sdpo       = 0.1,
    beta_replay      = 0.05,
    sdpo_jsd_beta    = 0.5,
    sdpo_temperature = 1.0,
    sdpo_token_clip  = None,
    replay_dpo_beta  = 0.1,
)
if __name__ == "__main__":  # guard so other scripts can import `trainer` without starting a run
    trainer.train()
```

Channels 2 and 3 **auto-disable per step** when their inputs aren't
present in the batch (e.g. batches with no error sites get
`sdpo_kl=0`). Set `alpha_sdpo=0` / `beta_replay=0` to disable globally
for ablations.
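
The per-step auto-disable can be pictured as follows. This is a minimal sketch with plain floats, not the shipped `compose_loss`; the helper name `combine_channels` and its "`None` means the inputs were absent" convention are assumptions made for illustration.

```python
from typing import Optional


def combine_channels(
    grpo_loss: float,
    sdpo_kl_loss: Optional[float] = None,       # None: no channel-2 inputs in this batch
    trace_replay_loss: Optional[float] = None,  # None: no channel-3 inputs in this batch
    alpha_sdpo: float = 0.1,
    beta_replay: float = 0.05,
) -> float:
    """Weighted sum from the TL;DR; an absent channel contributes 0 for this step."""
    total = grpo_loss
    if sdpo_kl_loss is not None and alpha_sdpo != 0.0:
        total += alpha_sdpo * sdpo_kl_loss
    if trace_replay_loss is not None and beta_replay != 0.0:
        total += beta_replay * trace_replay_loss
    return total
```

A batch with no channel-2 or channel-3 inputs reduces to the plain GRPO loss, which is the same result as setting both weights to zero for that step.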

### 4. Decoupled DiLoCo wiring

`ComposerReplicationTrainer` is a single-process trainer. To run N
replicas of it under Decoupled DiLoCo, layer the serverless stack on the
outside: each replica runs the script above; `MockManager` stands in for
`torchft.Manager` on the inner loop and `ObjectStoreAllReduce` runs the
outer-loop pseudo-gradient exchange:

```python
# diloco_replica.py — what each of the N replicas runs
import os
from composer_replication.diloco import make_diloco_outer_loop
from composer_replication.diloco.serverless import (
    ObjectStoreAllReduce, MockManager,
)
from train_trl import trainer   # inner trainer built in step 3

rendezvous = ObjectStoreAllReduce(
    uri        = "s3://my-bucket/diloco-runs/run42/",
    world_size = 4,
    rank       = int(os.environ["REPLICA_RANK"]),
)
manager = MockManager(allreduce=rendezvous)
# trainer.optimizer is the *inner* optimizer; the outer is built here:
outer = make_diloco_outer_loop(
    inner_optimizer = trainer.optimizer,
    manager         = manager,
    sync_every_h    = 500,
)
trainer.add_callback(outer.callback())   # syncs every H inner steps
trainer.train()
```

The driver process spins these up with any `ServerlessExecutor`:

```python
# Wave 14: ModalExecutor / HFJobsExecutor are skeletons (raise NotImplementedError);
# use LocalProcessExecutor for testing. Swap once the cloud backends land.
executor = LocalProcessExecutor()
handles  = executor.launch_replicas(
    n_replicas      = 4,
    entrypoint      = "diloco_replica.py",
    entrypoint_args = {"rendezvous": rendezvous.uri,
                       "rank_env":   "REPLICA_RANK"},
)
result = executor.collect(handles, timeout=3600)
```

### 5. Distillation-loss wiring

`ComposerReplicationTrainer` exposes the new ADR-007 channels via the
shared `compose_loss` kwargs — pass them through `**kwargs` on the
trainer and they're forwarded to `compose_loss`:

```python
trainer = ComposerReplicationTrainer(
    model = model, processing_class = tokenizer,
    reward_funcs = [reward_length], train_dataset = dataset,
    # SimPO instead of DPO for channel 3:
    dpo_variant      = "simpo",
    simpo_beta       = 2.0,
    simpo_gamma      = 1.0,
    # TAID for channel 2 (SakanaAI port; logit-space mix + forward-KL):
    sdpo_wrapper       = "taid",
    taid_t             = 0.4,        # current TAID coeff in [0, 1];
                                     # drive from TAIDScheduler if you want
                                     # the paper's adaptive scheme
)
```

Alternatively, swap `entropy_opd` in for `taid` if you want
per-token entropy-gated forward/reverse KL instead of the
linear-blend interpolation. SimPO does **not** require reference
log-probs: if a channel 3 batch carries `dpo_chosen_ref_logprobs` /
`dpo_rejected_ref_logprobs`, those fields are silently ignored.
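
To see why SimPO can skip the reference log-probs, compare the two per-pair objectives as they are usually written in the DPO and SimPO papers. This is a scalar sketch under that assumption, not the project's channel-3 implementation; the function names are hypothetical, and the SimPO defaults simply mirror the `simpo_beta` / `simpo_gamma` values shown above.

```python
import math


def _neg_log_sigmoid(x: float) -> float:
    """-log(sigmoid(x)), computed stably as softplus(-x)."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_pair_loss(chosen_lp: float, rejected_lp: float,
                  chosen_ref_lp: float, rejected_ref_lp: float,
                  beta: float = 0.1) -> float:
    """DPO: the margin is measured relative to a frozen reference policy."""
    margin = (chosen_lp - chosen_ref_lp) - (rejected_lp - rejected_ref_lp)
    return _neg_log_sigmoid(beta * margin)


def simpo_pair_loss(chosen_lp: float, rejected_lp: float,
                    chosen_len: int, rejected_len: int,
                    beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO: length-normalised sequence log-probs and a target margin gamma.

    No reference-policy terms appear anywhere in the objective.
    """
    margin = beta * (chosen_lp / chosen_len - rejected_lp / rejected_len) - gamma
    return _neg_log_sigmoid(margin)
```

Here `chosen_lp` and `rejected_lp` are summed sequence log-probs under the current policy; the reference terms only enter the DPO branch.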

### 6. Cost ballpark

- **GPU**: single host, `g5.12xlarge` ($5.67/hr) or RunPod 4×A100-80GB
  (~$5–9/hr) gets you Qwen2.5-7B at moderate throughput. For Qwen2.5-72B
  you'll want at least 2–4× H100; an 8× H100 `p5.48xlarge` is ~$98/hr on
  AWS, ~$25–30/hr on Lambda Cloud / RunPod community.
- **API**: channel 3 teacher replay via OpenRouter — verified
  ~$0.98/trace at 50 steps × 3 teachers (spike 001). For a 100-trace
  curriculum that's ~$100 in teacher tokens.
- **Storage**: negligible until you turn on DiLoCo (then see Recipe 4).
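
The figures above combine into a quick estimate. The helper below is a hypothetical back-of-envelope calculator, not part of the package; the default per-trace cost is the ~$0.98 figure quoted above, and GPU hours are something you supply yourself.

```python
def estimate_run_cost(gpu_hours: float, gpu_rate_per_hr: float,
                      n_traces: int, cost_per_trace: float = 0.98) -> dict:
    """Split a run's cost into GPU rental and channel-3 teacher-API spend (USD)."""
    gpu = gpu_hours * gpu_rate_per_hr
    api = n_traces * cost_per_trace
    return {
        "gpu_usd": round(gpu, 2),
        "api_usd": round(api, 2),
        "total_usd": round(gpu + api, 2),
    }
```

For example, `estimate_run_cost(10, 5.67, 100)` models ten hours on a `g5.12xlarge` plus a 100-trace curriculum.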

### 7. Known limitations as of Wave 14

- **Tool calls block the GPU.** TRL's rollout is synchronous; long
  tool-call latency idles the trainer. Async-decouple via Recipe 2/3/5
  if this matters.
- **No native multi-node.** TRL is single-process; multi-host scaling is
  via Decoupled DiLoCo (Recipe 4) on top, not via TRL itself.
- **vLLM weight sync is co-located** — no resharding between FSDP and TP.
  At 70B+ this becomes the bottleneck and you should move to Recipe 2.
- **`reward_funcs` must be Python callables** that return `list[float]`;
  shell-out reward graders need a wrapper.
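
One way to write the wrapper mentioned in the last bullet is sketched below. It assumes a hypothetical grader protocol (completion text on stdin, a single float on stdout) and plain-string completions; `make_shell_reward` is an illustrative name, not a shipped helper.

```python
import subprocess
import sys
from typing import Callable, List, Sequence


def make_shell_reward(cmd: Sequence[str], timeout_s: float = 30.0,
                      failure_reward: float = 0.0) -> Callable[..., List[float]]:
    """Wrap an external grader command as a `reward_funcs`-style callable."""

    def reward(completions, **_) -> List[float]:
        scores = []
        for completion in completions:
            try:
                proc = subprocess.run(
                    list(cmd), input=completion, capture_output=True,
                    text=True, timeout=timeout_s, check=True,
                )
                scores.append(float(proc.stdout.strip()))
            except (subprocess.SubprocessError, ValueError, OSError):
                # Timeout, non-zero exit, unparsable output, or missing binary.
                scores.append(failure_reward)
        return scores

    return reward


# Example grader: a one-line Python process that scores by input length.
length_grader = make_shell_reward(
    [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
)
```

Each completion costs one process launch, so a slow grader stalls the synchronous rollout described in the first bullet; batching inside the grader is the usual mitigation.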

---

## Recipe 2 — VeRL custom `adv_estimator` + DataProto extension

### 1. When to use it

Pick VeRL when:

- You need >70B-param scale or >32-GPU multi-host, *and* a Ray cluster
  is acceptable in your stack.
- You're already using or willing to adopt **3D-HybridEngine** for
  efficient FSDP↔TP weight resharding (verified ~5× weight-sync speed-up
  vs co-located vLLM at 70B+).
- You need async multi-turn rollouts where tool-call latency must not
  block the GPU loop. VeRL's `AsyncServer` + `AgentLoop` is the
  best-in-class option here.
- You want extension points the framework's authors *expect* third
  parties to use — the `@register_adv_est("...")` decorator and the
  `DataProto` extension contract are first-class APIs.

Don't pick VeRL if you're <7B-param or single-host (overkill —
Recipe 1's Trainer subclass is one file, not a Ray cluster).

### 2. Install command

```bash
pip install -e ".[replaysim]"
pip install "verl>=0.3"                  # not packaged as an extra
# Optional, for Decoupled DiLoCo on top:
pip install -e ".[serverless]"
```

The framework's verl adapter lives at
`composer_replication.recipes.verl` (currently shape-only; see
[Limitations](#7-known-limitations-as-of-wave-14-1) below).

### 3. Minimum-viable Python script

VeRL's actual entry point is a Hydra/YAML config + `verl.trainer.main_ppo`
CLI; the pythonic surface looks like this:

```python
# train_verl.py — minimum viable Recipe 2 sketch
from verl.trainer.ppo import core_algos
from verl.trainer.ppo.ray_trainer import RayPPOTrainer
from composer_replication.loss import compose_loss

@core_algos.register_adv_est("grpo_composer")
def composer_advantage(data, **kwargs):
    """Custom adv-estimator that adds SDPO + DPO channels to GRPO.

    Reads extra DataProto keys (populated by the data prep step):
      - data.batch["sdpo_teacher_logits"]    (channel 2)
      - data.non_tensor_batch["teacher_actions"]  (channel 3)
    and returns the standard (advantages, returns) tuple plus a stashed
    composer-loss term consumed by the critic worker.
    """
    advantages, returns = core_algos.compute_grpo_outcome_advantage(data, **kwargs)
    composer_term = compose_loss(
        model        = kwargs["actor_module"],
        inputs       = data.batch,
        alpha_sdpo   = 0.1,
        beta_replay  = 0.05,
        dpo_variant  = "dpo",
        sdpo_wrapper = "none",
    )
    data.meta_info["composer_loss"] = composer_term
    return advantages, returns

# Then in your YAML:
#   algorithm:
#     adv_estimator: grpo_composer
# and run: python -m verl.trainer.main_ppo --config-name composer_grpo
```

The full driver wires `RayPPOTrainer` against your config; consult VeRL's
own quickstart for the Ray-cluster boilerplate. The composer-specific
piece is just the registered estimator above.

### 4. Decoupled DiLoCo wiring

VeRL's actor workers run in Ray; DiLoCo replicates the **whole VeRL job**.
Each "replica" is one Ray cluster running Recipe 2 end-to-end; the outer
loop is independent of Ray and just exchanges pseudo-gradients via the
object store between Ray-job invocations:

```python
from composer_replication.diloco.serverless import (
    LocalProcessExecutor, ObjectStoreAllReduce,
)

rendezvous = ObjectStoreAllReduce(
    uri        = "s3://verl-diloco/run/",
    world_size = 4,
)
executor = LocalProcessExecutor()        # Wave 14: ModalExecutor is a skeleton (raises NotImplementedError) — keep LocalProcessExecutor for now
handles  = executor.launch_replicas(
    n_replicas      = 4,
    entrypoint      = "verl.trainer.main_ppo",
    entrypoint_args = {
        "+algorithm.adv_estimator":      "grpo_composer",
        "+algorithm.diloco.rendezvous":  rendezvous.uri,
        "+algorithm.diloco.sync_every_h": 500,
    },
)
executor.collect(handles, timeout=24 * 3600)
```

The Ray cluster inside each replica handles intra-replica scaling
(FSDP / TP / vLLM); the object-store exchange handles cross-replica
sync. Per-replica bandwidth follows the Recipe 4 arithmetic (2 bytes per
parameter in bf16): ~2 GB per sync for a 1B-param model, so ~14 GB per
sync for a 7B-param model, once every H inner steps.

### 5. Distillation-loss wiring

The custom `adv_estimator` from step 3 already calls `compose_loss`;
flip the kwargs there to switch DPO → SimPO or add TAID:

```python
composer_term = compose_loss(
    model        = kwargs["actor_module"],
    inputs       = data.batch,
    alpha_sdpo   = 0.1,
    beta_replay  = 0.05,
    dpo_variant  = "simpo",         # ← SimPO swap
    simpo_beta   = 2.0,
    simpo_gamma  = 1.0,
    sdpo_wrapper       = "taid",    # ← TAID wrap
    taid_schedule_step = data.meta_info.get("global_step", 0),
    taid_total_steps   = 10_000,
)
```

VeRL's `data.meta_info` carries the global step automatically, which is
exactly what TAID's interpolation schedule needs. Channel 2 batches
without `student_init_logits` / `student_init_input_ids` are auto-skipped
(the channel contributes 0 for that step).
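
For intuition, here is a single-token sketch of the TAID idea as described in the TAID paper: mix student and teacher logits with coefficient `t`, then take the KL from that intermediate distribution to the student. The linear step-to-`t` mapping is an assumption for illustration (the paper uses an adaptive update), and none of these helper names are the project's API.

```python
import math
from typing import List, Sequence


def linear_taid_t(step: int, total_steps: int,
                  t_start: float = 0.0, t_end: float = 1.0) -> float:
    """Assumed linear schedule: t moves from t_start to t_end over total_steps."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return t_start + (t_end - t_start) * frac


def _softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def taid_token_loss(student_logits: Sequence[float],
                    teacher_logits: Sequence[float], t: float) -> float:
    """KL(p_t || q_student) at one position, with p_t a logit-space mix.

    In a real trainer the student logits inside the mix are detached, so
    gradients flow only through q_student.
    """
    mixed = [(1.0 - t) * s + t * te for s, te in zip(student_logits, teacher_logits)]
    p_t = _softmax(mixed)
    q = _softmax(student_logits)
    return sum(p * (math.log(p) - math.log(qi)) for p, qi in zip(p_t, q) if p > 0.0)
```

At `t = 0` the target equals the student and the loss vanishes; at `t = 1` it is the ordinary forward KL from teacher to student.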

### 6. Cost ballpark

- **GPU**: 8× H100 (`p5.48xlarge` ~$98/hr on AWS, ~$25/hr on Lambda or
  RunPod community) is the entry point for 70B-class. Expect 32–256
  H100 for full 671B (matches DeepSeek's reported VeRL config).
- **API**: same ~$0.98/trace as Recipe 1 (channel 3 is a Python helper,
  not a VeRL primitive — costs are framework-independent).
- **Ray cluster overhead**: head node + redis + dashboard adds ~1
  CPU-instance ($0.10–0.50/hr) per cluster, negligible at GPU scale.

### 7. Known limitations as of Wave 14

- **`composer_replication.recipes.verl` is shape-only.** The decorator
  registration and DataProto extension are documented but not yet shipped
  as a runnable adapter; the Wave 14 release exposes the *contract*, not
  the glue. Expect this to land in a v0.2 follow-up spike.
- **Ray dependency.** Adds a heavyweight runtime; debugging
  cross-actor crashes can be painful. Use VeRL's `--debug` mode early.
- **Custom-`adv_estimator` LOC**: writing your own takes ~50–150 LOC
  including DataProto plumbing. Not a one-liner.
- **No first-class TAID hook in VeRL itself** — we route TAID through
  the meta_info channel; this works but means you can't use VeRL's
  built-in checkpoint-replay tooling without re-stamping `taid_schedule_step`
  on each replay.

---

## Recipe 3 — PRIME-RL `CustomLossConfig`

### 1. When to use it

Pick PRIME-RL when:

- You're operating in the **PRIME-Intellect / decentralized training**
  universe and want INTELLECT-style scaling on a long-horizon training
  run.
- You need **DPPO importance-ratio masking** (the rationale most users
  arrive with) — PRIME-RL's headline contribution is the
  out-of-band-token *mask* (not clip) on `log_ratio = trainer_lp -
  inference_lp`, with defaults `low=-4.0, high=4.0`.
- You want a **first-class custom-loss surface**: PRIME-RL ships
  `CustomLossConfig` that takes an importable Python function and a
  `LossInputs` struct exposing exactly the tensors we need
  (`trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`,
  `advantages`, `loss_mask`). No fork, no Trainer subclass, no monkey-patch.
- You have access to multi-node infrastructure that PRIME-RL's
  trainer/inference/orchestrator split is designed for.

Don't pick PRIME-RL if you need full vocab logits (channel 2 SDPO
requires logits not log-probs — see Limitations).

### 2. Install command

```bash
pip install -e ".[prime-rl,replaysim]"
# pulls prime-rl>=0.5
```

### 3. Minimum-viable Python script

PRIME-RL drives via YAML config; the only Python you write is the
custom-loss function (already shipped at
`composer_replication/recipes/prime_rl/composer_loss.py`). Wire it in:

```yaml
# prime_rl_config.yaml — point at the framework's adapter
loss:
  custom:
    import_path: composer_replication.recipes.prime_rl.composer_loss:loss_fn
    kwargs:
      alpha_sdpo:     0.0       # channel 2 deferred in v0 (see below)
      beta_dpo:       0.0       # channel 3 emits a warning if non-zero
      dppo_mask_high: 4.0       # PRIME-RL DPPO mask bounds
      dppo_mask_low: -4.0
      epsilon:        1.0e-6

trainer:
  model: Qwen/Qwen2.5-7B-Instruct
  ...                           # standard PRIME-RL fields
```

The shipped `loss_fn` signature is fixed by PRIME-RL's contract:

```python
import warnings

import torch

# `LossInputs` is PRIME-RL's per-sample input struct; import it from your
# installed prime-rl (the import path is omitted here).

def loss_fn(
    inputs: LossInputs,
    *,
    alpha_sdpo: float = 0.0,
    beta_dpo:   float = 0.0,
    dppo_mask_high: float = 4.0,
    dppo_mask_low:  float = -4.0,
    epsilon:        float = 1e-6,
) -> torch.Tensor:
    # DPPO: mask (not clip) tokens whose trainer/inference log-ratio is out of band.
    log_ratio    = inputs.trainer_logprobs - inputs.inference_logprobs
    dppo_invalid = (log_ratio > dppo_mask_high) | (log_ratio < dppo_mask_low)
    keep_mask    = inputs.loss_mask & ~dppo_invalid
    # Policy-gradient term averaged over the kept tokens only.
    grpo = -(inputs.advantages * inputs.trainer_logprobs * keep_mask).sum() \
            / keep_mask.sum().clamp_min(epsilon)
    if alpha_sdpo != 0.0:
        raise NotImplementedError(
            "Channel 2 SDPO requires full-vocab logits; PRIME-RL v0.5 "
            "exposes only log-probs. Deferred to v0.2."
        )
    if beta_dpo != 0.0:
        warnings.warn(
            "Channel 3 trace-replay DPO is out-of-scope for PRIME-RL recipe v0",
            stacklevel=2,
        )
    return grpo
```

**Shape note** (caught in the Wave 13 cross-model review): PRIME-RL
calls the loss function **once per sample**; tensors are 1-D `(seq,)`,
*not* batched `(B, T)`. The 10 unit tests in
`composer_replication/recipes/prime_rl/tests/test_composer_loss.py`
cover this plus DPPO mask edges.
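
The per-sample, 1-D shape and the mask-not-clip behaviour can be checked without PyTorch. The sketch below mirrors the `loss_fn` logic on plain Python lists; the function names are hypothetical, and it is meant for reasoning about edge cases rather than as a replacement for the shipped adapter.

```python
from typing import List, Sequence


def dppo_keep_mask(trainer_lp: Sequence[float], inference_lp: Sequence[float],
                   loss_mask: Sequence[bool],
                   low: float = -4.0, high: float = 4.0) -> List[bool]:
    """Keep a token only if it is in the loss mask and its log-ratio is in [low, high]."""
    keep = []
    for t_lp, i_lp, m in zip(trainer_lp, inference_lp, loss_mask):
        log_ratio = t_lp - i_lp
        keep.append(bool(m) and low <= log_ratio <= high)
    return keep


def masked_pg_loss(advantages: Sequence[float], trainer_lp: Sequence[float],
                   keep: Sequence[bool], epsilon: float = 1e-6) -> float:
    """Negative advantage-weighted log-prob, averaged over the kept tokens."""
    num = sum(a * lp for a, lp, k in zip(advantages, trainer_lp, keep) if k)
    den = max(sum(keep), epsilon)
    return -num / den
```

Out-of-band tokens drop out of both the numerator and the denominator, which is what distinguishes masking from clipping: a clipped token would still contribute a bounded term.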

### 4. Decoupled DiLoCo wiring

PRIME-RL was designed for decentralized training and ships its own
weight-sync primitives. Stack DiLoCo on top via the
`ServerlessExecutor` Protocol — each replica runs an independent
PRIME-RL job pointing at the same `composer_loss:loss_fn`:

```python
from composer_replication.diloco.serverless import (
    LocalProcessExecutor, ObjectStoreAllReduce,
)

rendezvous = ObjectStoreAllReduce(
    uri        = "s3://prime-rl-diloco/run/",
    world_size = 4,
)
# Wave 14: ModalExecutor is a skeleton (raises NotImplementedError until v0.x).
# Use LocalProcessExecutor for the inner-replica wiring; swap to the cloud
# executor once it lands. The DiLoCo + rendezvous code below is identical.
executor = LocalProcessExecutor()
handles  = executor.launch_replicas(
    n_replicas      = 4,
    entrypoint      = "prime_rl.cli:main",
    entrypoint_args = {
        "config":               "prime_rl_config.yaml",
        "+diloco.rendezvous":   rendezvous.uri,
        "+diloco.sync_every_h": 500,
    },
)
executor.collect(handles, timeout=24 * 3600)
```

Note PRIME-RL's own multi-node story (the trainer / inference /
orchestrator split) is **orthogonal** to Decoupled DiLoCo: PRIME-RL
multi-node = single replica scaled across many GPUs; DiLoCo = N
independent replicas synchronizing via object store. Combine both for
"big PRIME-RL job × N replicas".

### 5. Distillation-loss wiring

Channel 2 (SDPO + TAID + Entropy-OPD) is **deferred** in v0 because
PRIME-RL's `LossInputs` exposes log-probs not full vocab logits. The
SimPO swap on channel 3 is also gated by the same shape constraint, but
the DPPO masking itself doesn't change. To get TAID/SimPO into a PRIME-RL
job today you must:

1. Switch to Recipe 1 or 2 for the SFT/distill phase.
2. Use PRIME-RL only for the on-policy GRPO+DPPO phase.

The v0.2 plan (per ADR-007) is to extend `LossInputs` with a
`teacher_logits` field; the loss adapter is already shape-ready.

### 6. Cost ballpark

- **GPU**: similar profile to Recipe 2: 8–32 H100 typical, scales to
  hundreds for INTELLECT-class runs. Lambda Cloud or RunPod community
  H100 pricing (~$2–4/hr per H100) is most cost-effective.
- **API**: channel 3 is gated, so the only OpenRouter spend is from the
  *offline data-prep* spike (using the verifier harness in Recipe 1 to
  pre-bake DPO pairs), not from the training loop itself. Order of
  magnitude: $50–500 for a curriculum-bake one-time, then $0/run.
- **Network**: PRIME-RL's own decentralized weight sync uses substantial
  bandwidth between training replicas (one of its design constraints);
  this is *separate* from the Decoupled DiLoCo bandwidth and shows up
  as a ceiling on cross-region replica placement.

### 7. Known limitations as of Wave 14

- **Channel 2 deferred** (see step 5). `alpha_sdpo != 0` raises
  `NotImplementedError`.
- **Channel 3 emits a warning** if `beta_dpo != 0`; trace-replay DPO
  pairs must be folded into the *training data* (offline) rather than
  the *loss* (online) until v0.2.
- **PRIME-RL ≥ 0.5 required.** Earlier versions don't ship
  `CustomLossConfig`.
- **Smoke test deferred.** Per `prime_rl_recipe.md`, the runtime smoke
  test requires a CUDA box + `prime-rl >= 0.5` install and is gated
  to a follow-up spike. The 10 unit tests run cleanly without GPU.
- **DPPO defaults are PRIME-RL's, not ours.** We pin `low=-4.0,
  high=4.0` to match. If you change them, you're now diverging from
  PRIME-RL's example configs.

---

## Recipe 4 — Serverless Decoupled DiLoCo

### 1. When to use it

Pick Decoupled DiLoCo when:

- You have **N independent training replicas** that should sync
  occasionally but can't (or shouldn't) cross-talk on every step.
- The cost or operational burden of an always-on multi-node cluster is
  unacceptable, but you're happy paying for 4× independent **serverless
  jobs**.
- Your inner trainer is one of Recipes 1–3 — DiLoCo wraps any inner
  optimizer; it's *purely outer-loop*.
- You need **failure isolation**: if one replica crashes, the others
  keep training; on restart it picks up from the last outer round.

DiLoCo's design rests on two abstractions (per ADR-005):

1. **`ServerlessExecutor` Protocol** — uniform interface for spinning up
   N replicas across cloud backends (Modal / HF Jobs / SageMaker / k8s).
2. **`ObjectStoreAllReduce`** — fsspec-backed pseudo-gradient exchange
   that replaces the in-process `torchft.Manager.allreduce` call.
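
The second abstraction can be illustrated with a toy exchange that uses a local directory in place of an object store. This is a sketch of the put-then-read-all pattern only; `DirAllReduce` and its methods are hypothetical names, not the `ObjectStoreAllReduce` API, and the pseudo-gradient is represented as a flat list of floats.

```python
import json
import os
import time
from pathlib import Path
from typing import List


class DirAllReduce:
    """Toy all-reduce: each rank writes one object per round, then reads all N."""

    def __init__(self, root: str, world_size: int, rank: int):
        self.root, self.world_size, self.rank = Path(root), world_size, rank

    def _key(self, round_idx: int, rank: int) -> Path:
        return self.root / f"round-{round_idx:06d}" / f"rank-{rank}.json"

    def put(self, round_idx: int, pseudo_grad: List[float]) -> None:
        path = self._key(round_idx, self.rank)
        path.parent.mkdir(parents=True, exist_ok=True)
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(pseudo_grad))
        os.replace(tmp, path)  # atomic rename: readers never see a partial file

    def get_mean(self, round_idx: int, timeout_s: float = 60.0,
                 poll_s: float = 0.05) -> List[float]:
        """Block until every rank has written, then return the element-wise mean."""
        paths = [self._key(round_idx, r) for r in range(self.world_size)]
        deadline = time.monotonic() + timeout_s
        while not all(p.exists() for p in paths):
            if time.monotonic() > deadline:
                raise TimeoutError(f"round {round_idx}: not all replicas reported")
            time.sleep(poll_s)
        grads = [json.loads(p.read_text()) for p in paths]
        return [sum(col) / self.world_size for col in zip(*grads)]
```

Each replica calls `put` with its own pseudo-gradient and then `get_mean` for the same round; the averaged result is what the outer optimizer would consume. Keying objects by round number is a design choice that keeps a restarted replica from reading stale data.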

The communication pattern is `S3 PutObject + N GetObjects` once every
H inner steps, matching DiLoCo paper §3.2 (arXiv:2311.08105). For a
1B-param model in bf16 (2 bytes per parameter) that's ~2 GB / 30 min
per replica, well within S3 free-tier.
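
The payload figure is simple arithmetic on parameter count and dtype width. The helper below is a hypothetical one-liner for sizing other models, assuming the pseudo-gradient is a dense, uncompressed tensor per parameter and 1 GB = 1e9 bytes.

```python
def sync_payload_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Size in GB of one replica's pseudo-gradient upload; bf16 is 2 bytes per parameter."""
    return n_params * bytes_per_param / 1e9
```

`sync_payload_gb(1e9)` gives the 1B-param case quoted above; pass `bytes_per_param=4` to size an fp32 exchange.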

### 2. Install command

```bash
pip install -e ".[diloco,serverless]"
# also one of the inner-trainer extras:
pip install -e ".[train]"        # if the inner trainer is Recipe 1
# OR pip install verl            # if the inner trainer is Recipe 2
# OR pip install -e ".[prime-rl]" # if the inner trainer is Recipe 3
```

### 3. Minimum-viable Python script

This pattern is independent of the inner trainer — pick any of Recipes
1/2/3 and wrap it with a `ServerlessExecutor`. The replica entrypoint
runs the inner trainer; the driver launches N of them and waits.

```python
# diloco_driver.py — driver that launches N replicas
from composer_replication.diloco.serverless import (
    LocalProcessExecutor,         # for dev — runs replicas as local subprocesses
    ObjectStoreAllReduce,
)

rendezvous = ObjectStoreAllReduce(
    uri        = "s3://my-bucket/diloco-runs/run42/",  # or file:// for local
    world_size = 4,
)
executor = LocalProcessExecutor()                       # Wave 14: ModalExecutor skeleton raises NotImplementedError; swap once cloud backend lands
handles  = executor.launch_replicas(
    n_replicas      = 4,
    entrypoint      = "diloco_replica.py",              # (script below)
    entrypoint_args = {
        "rendezvous": rendezvous.uri,
        "rank_env":   "REPLICA_RANK",
    },
)
result = executor.collect(handles, timeout=3600)
print({h.replica_id: h.exit_code for h in result})
```

```python
# diloco_replica.py — runs inside each replica
import os
from composer_replication.diloco import make_diloco_outer_loop
from composer_replication.diloco.serverless import (
    ObjectStoreAllReduce, MockManager,
)

# Build inner trainer (Recipe 1 example):
from train_trl import trainer

rendezvous = ObjectStoreAllReduce(
    uri        = os.environ["DILOCO_RENDEZVOUS"],
    world_size = 4,
    rank       = int(os.environ["REPLICA_RANK"]),
)
manager = MockManager(allreduce=rendezvous)
outer = make_diloco_outer_loop(
    inner_optimizer = trainer.optimizer,
    manager         = manager,
    sync_every_h    = 500,
)
trainer.add_callback(outer.callback())
trainer.train()
```
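
To make the launch-and-collect contract concrete without any cloud backend, here is a stdlib-only sketch of what a local executor could do: start N subprocesses, give each its rank through an environment variable, and gather exit codes. It is an illustration under stated assumptions; the function names are hypothetical and the real `LocalProcessExecutor` may be implemented differently.

```python
import os
import subprocess
import sys
from typing import Dict, List, Optional, Sequence


def launch_local_replicas(argv: Sequence[str], n_replicas: int,
                          rank_env: str = "REPLICA_RANK",
                          extra_env: Optional[Dict[str, str]] = None,
                          ) -> List[subprocess.Popen]:
    """Start n_replicas copies of `argv`, each with its rank in the environment."""
    procs = []
    for rank in range(n_replicas):
        env = {**os.environ, **(extra_env or {}), rank_env: str(rank)}
        procs.append(subprocess.Popen(list(argv), env=env))
    return procs


def collect_exit_codes(procs: Sequence[subprocess.Popen],
                       timeout_s: float = 3600.0) -> Dict[int, int]:
    """Wait for each replica in turn; one that overruns its timeout is killed."""
    codes = {}
    for rank, proc in enumerate(procs):
        try:
            codes[rank] = proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            proc.kill()
            codes[rank] = proc.wait()
    return codes


# Example: a stand-in replica that exits with its rank modulo 2.
DEMO_ARGV = [sys.executable, "-c",
             "import os, sys; sys.exit(int(os.environ['REPLICA_RANK']) % 2)"]
```

Because each replica is its own process, one crashing leaves the others running, which is the failure-isolation property listed under "When to use it". Note the timeout here applies per replica, not to the whole batch.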

### 4. Decoupled DiLoCo wiring

This recipe **is** the DiLoCo wiring — see step 3. The available
executor adapters are:

| Executor                  | Status                        | Use case                             |
|---------------------------|-------------------------------|--------------------------------------|
| `LocalProcessExecutor`    | Production-ready              | Dev loop — N subprocesses on one box |
| `ModalExecutor`           | Skeleton (modal-client gated) | Modal cloud, $/sec billing           |
| `HFJobsExecutor`          | Skeleton (hf-hub gated)       | HuggingFace Jobs, transformer-shop   |
| `SageMakerExecutor`       | Roadmap (post-v0.2)           | AWS, warm-pool ~10s cold start       |
| `K8sExecutor`             | Roadmap                       | KubeRay / Volcano gang scheduling    |

Cross-cloud replica placement (e.g. 2× Modal + 2× HF Jobs) is supported
in principle, since all replicas read and write the same S3 / GCS / HF
rendezvous, but treat it as experimental.
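
To make the adapter surface concrete, here is a minimal sketch of the two calls the driver in step 3 makes, plus a dependency-gated skeleton. The `Executor` protocol, `ReplicaHandle`, and `GatedSkeletonExecutor` are hypothetical names inferred from that driver script; the repo's real base class and signatures may differ.

```python
import importlib.util
from dataclasses import dataclass
from typing import Optional, Protocol, Sequence, runtime_checkable


@dataclass
class ReplicaHandle:
    """Hypothetical handle shape, inferred from the fields the driver reads."""
    replica_id: int
    exit_code: Optional[int] = None


@runtime_checkable
class Executor(Protocol):
    """The two methods the driver calls on any executor adapter."""

    def launch_replicas(self, n_replicas: int, entrypoint: str,
                        entrypoint_args: dict) -> Sequence[ReplicaHandle]: ...

    def collect(self, handles: Sequence[ReplicaHandle],
                timeout: float) -> Sequence[ReplicaHandle]: ...


class GatedSkeletonExecutor:
    """Skeleton adapter: checks its optional dependency at init, launches nothing."""

    def __init__(self, required_module: str):
        # Fail at construction time if the cloud client library is missing.
        if importlib.util.find_spec(required_module) is None:
            raise ImportError(f"{required_module!r} is required for this executor")
        self.required_module = required_module

    def launch_replicas(self, n_replicas, entrypoint, entrypoint_args):
        raise NotImplementedError("skeleton adapter: cloud backend not wired up yet")

    def collect(self, handles, timeout):
        raise NotImplementedError("skeleton adapter: cloud backend not wired up yet")
```

Keeping the surface this small is what lets the driver swap adapters by changing a single constructor call.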

### 5. Distillation-loss wiring

DiLoCo is loss-agnostic — it operates purely on inner-optimizer state.
Whichever inner trainer you're running (Recipe 1, 2, or 3) handles
distillation kwargs as documented in that recipe's step 5. The only
DiLoCo-specific knob worth knowing: TAID's `taid_schedule_step` is a
*global* counter, but each replica increments it independently. If you
care about replicas all reading the same α at outer-sync time, set
`taid_schedule_step = trainer.state.global_step + replica_offset` and
let the outer-loop sync average them out.

### 6. Cost ballpark

Pulled from
[`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md`](research/DILOCO_SERVERLESS_RECONNAISSANCE.md):

| Backend       | A100-80GB $/hr | H100 $/hr | Cold-start | Notes                                    |
|---------------|----------------|-----------|------------|------------------------------------------|
| Modal         | $1.39/sec → 4× ≈ $20/hr per A100 | ~$8/hr per H100  | 1–60s warm, 60–120s first-run | $/sec billing; no minimum |
| AWS SageMaker | $4.10/A100·hr  | $12.29/hr | 2–5 min cold, ~10s warm pool | Min 60min on warm pool |
| GCP Vertex    | $3.67/A100·hr  | $11/hr    | 2–6 min cold | 30–50% premium over raw GPU |
| Azure ML      | ~$3.67/A100·hr | ~$12.25/hr | 3–8 min cold | Use curated env to cut cold-start |
| RunPod        | $1.19/hr (community), $2.17 (secure) | $1.99/hr (community), $4.18 (secure) | seconds | No federation; same-DC only |
| HF Jobs       | comparable to Modal | ~$8–12/hr | 30–90s | Best DX for HF-shop |

**Object-store cost.** S3 Standard storage runs ~$0.02/GB-month; a free
tier costs nothing until its quota is used up. Pseudo-gradients are
~2 GB per replica per outer round; for a 24-hour 4-replica run at H=500
that's ~50 outer rounds × 2 GB × 4 replicas = ~400 GB written. That
exhausts a free tier quickly, so budget $10–20 in storage.
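
The volume estimate is simple enough to recompute for other run shapes. A minimal sketch of the arithmetic, using only the figures quoted above; the helper names are illustrative, and request and transfer charges are deliberately not modeled:

```python
def pseudo_gradient_volume_gb(outer_rounds, gb_per_replica_round, replicas):
    """Total data written to the object store over a run."""
    return outer_rounds * gb_per_replica_round * replicas


def storage_cost_usd(volume_gb, usd_per_gb_month, months_retained):
    """Storage-only cost if the whole volume is kept for months_retained."""
    return volume_gb * usd_per_gb_month * months_retained


volume = pseudo_gradient_volume_gb(outer_rounds=50, gb_per_replica_round=2, replicas=4)
cost = storage_cost_usd(volume, usd_per_gb_month=0.02, months_retained=1.0)
print(f"{volume} GB written, ~${cost:.2f} if retained for one month")
```

Deleting each round's shards once every replica has consumed them shrinks the retained volume to roughly one round (about 8 GB in this example), so retention policy matters more than the per-GB rate.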

### 7. Known limitations as of Wave 14

- **`ModalExecutor` and `HFJobsExecutor` are skeletons.** They check
  `import modal` / `import huggingface_hub` at *adapter init* time and
  raise; the actual `launch_replicas` is shape-only until the relevant
  spike lands. Use `LocalProcessExecutor` for dev.
- **`ObjectStoreAllReduce(world_size=1)`** must pass through cleanly:
  the unit test `test_object_store_allreduce_world_size_1_passthrough`
  is the regression guard. Don't override this behavior unless you've
  read the test.
- **Rank validation is mandatory.** Tests assert that
  `ObjectStoreAllReduce(rank=N, world_size=N)` raises (rank must be
  `< world_size`); without the check, an out-of-range rank would
  corrupt the reduction silently.
- **`MockManager` is *not* feature-complete.** It implements the
  `Manager.allreduce` surface that DiLoCo's outer-loop needs, but
  not the full `torchft.Manager` API (no fault-tolerance, no
  membership protocol). Don't use it as a drop-in for live torchft.
- **No native heterogeneous compute** — all replicas are assumed to
  have the same compute shape. Mixed A100+H100 placements work but
  the slow replica gates outer-loop progress.
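
The rank check and the single-replica passthrough are both easy to state in code. Below is a minimal sketch of a mean all-reduce over a shared local directory, a stand-in for a `file://` rendezvous. `DirAllReduce` and its method names are hypothetical and do not mirror `ObjectStoreAllReduce`'s actual interface.

```python
import json
import os
import time


class DirAllReduce:
    """Mean all-reduce over a shared directory (illustrative, not the repo class)."""

    def __init__(self, root, world_size, rank=0):
        if world_size < 1:
            raise ValueError(f"world_size must be >= 1, got {world_size}")
        if not 0 <= rank < world_size:
            # In this sketch an out-of-range rank would publish a shard nobody
            # reads and leave one expected slot empty, so reject it up front.
            raise ValueError(f"rank {rank} is outside [0, {world_size})")
        self.root, self.world_size, self.rank = root, world_size, rank

    def allreduce_mean(self, values, round_id, timeout=60.0, poll=0.05):
        if self.world_size == 1:
            return list(values)  # passthrough: no I/O, values unchanged
        round_dir = os.path.join(self.root, f"round-{round_id:06d}")
        os.makedirs(round_dir, exist_ok=True)
        paths = [os.path.join(round_dir, f"rank-{r}.json")
                 for r in range(self.world_size)]

        # Publish atomically so peers never read a half-written shard.
        tmp_path = paths[self.rank] + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump(list(values), fh)
        os.replace(tmp_path, paths[self.rank])

        # Poll until every rank has published, then average element-wise.
        deadline = time.monotonic() + timeout
        while not all(os.path.exists(p) for p in paths):
            if time.monotonic() > deadline:
                raise TimeoutError(f"round {round_id}: not all ranks reported")
            time.sleep(poll)
        shards = []
        for path in paths:
            with open(path) as fh:
                shards.append(json.load(fh))
        return [sum(column) / self.world_size for column in zip(*shards)]
```

The polling loop also makes the last limitation visible: every rank blocks until the slowest one has published its shard.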

---

## Recipe 5 — Monarch actor mesh

### 1. When to use it

Pick Monarch when:

- You're at **TorchForge-style topology scale**: trainer / generator /
  rewarder / N-teachers all want to be independent, asynchronously
  scheduled, fault-tolerant actors on a typed mesh.
- You want **heterogeneous executor support** — different actors run
  in different clouds (e.g. `TrainerActor` on Modal A100s,
  `GeneratorActor` on dedicated H100s, `TeacherPoolActor` as 0-GPU CPU
  pods on k8s).
- You need **hot-swap of actor implementations** — replace
  "OpenRouter teachers" with "local vLLM teachers" by changing one
  Monarch binding, no trainer code change.
- You're prepared to track **upstream Monarch** (v0.4.1 stable, v0.5
  dev daily); the API is moving and v0 of this recipe is intentionally
  deferred per ADR-006.

Don't pick Monarch in Wave 14 unless you're explicitly scoping a
v0.2+ pilot. The framework ships *skeleton* actors that fail fast on
instantiation; treat this recipe as a reference pattern to read, not a
production target.

### 2. Install command

```bash
pip install -e ".[prime-rl,monarch]"
# pulls monarch>=0.4.1 plus the PRIME-RL trainer used inside actors
```

### 3. Minimum-viable Python script

The framework ships skeleton actor definitions at
`composer_replication/recipes/monarch/actors.py`; they raise
`NotImplementedError` on instantiation in Wave 14. The shape of the
final answer:

```python
# monarch_train.py — what v0.2+ usage will look like
from monarch import Actor, mesh, endpoint
from composer_replication.recipes.monarch.actors import (
    TrainerActor, GeneratorActor, RewarderActor, TeacherPoolActor,
)

# Topology
trainers   = mesh.spawn(TrainerActor, n=4, gpu="A100")
generator  = mesh.spawn(GeneratorActor, n=1, gpu="A100")
rewarder   = mesh.spawn(RewarderActor, n=1, gpu=None)
teachers   = mesh.spawn(TeacherPoolActor, n=1, gpu=None)

# Wire endpoints
async def outer_step(batch_id: int):
    prompts     = await trainers[0].sample_prompts.call(batch_id)
    rollouts    = await generator.rollout.call(prompts)
    rewards     = await rewarder.score.call(rollouts)
    teacher_acts = await teachers.replay.call([
        {"state": r["state"]} for r in rollouts
    ])
    await trainers.train_outer_step.call(
        batch_id, rollouts=rollouts, rewards=rewards,
        teacher_actions=teacher_acts,
    )

# Run: drive every batch inside a single event loop
import asyncio

async def main():
    for batch_id in range(1000):
        await outer_step(batch_id)

asyncio.run(main())
```

The Composer 3-channel loss lives inside `TrainerActor.train_outer_step`,
which calls `compose_loss(...)` exactly as Recipe 1 does. The
*orchestration* changes; the *loss math* doesn't.
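
Because the skeleton actors raise on instantiation, the data flow above cannot be run as written in Wave 14. A minimal sketch with plain `asyncio` stand-ins shows the same call order; every stub name below is hypothetical and no Monarch API is used. Running scoring and teacher replay concurrently is an optional choice made here, on the assumption that both depend only on the rollouts.

```python
import asyncio


class StubTrainer:
    """Stands in for TrainerActor; records what each outer step received."""

    def __init__(self):
        self.steps = []

    async def sample_prompts(self, batch_id):
        return [f"prompt-{batch_id}-{i}" for i in range(2)]

    async def train_outer_step(self, batch_id, rollouts, rewards, teacher_actions):
        self.steps.append((batch_id, len(rollouts), len(rewards), len(teacher_actions)))


async def rollout(prompts):
    """Stands in for GeneratorActor.rollout."""
    return [{"state": p, "completion": p.upper()} for p in prompts]


async def score(rollouts):
    """Stands in for RewarderActor.score."""
    return [float(len(r["completion"])) for r in rollouts]


async def replay(states):
    """Stands in for TeacherPoolActor.replay."""
    return [{"state": s["state"], "action": "noop"} for s in states]


async def outer_step(trainer, batch_id):
    prompts = await trainer.sample_prompts(batch_id)
    rollouts = await rollout(prompts)
    # Rewards and teacher actions are independent of each other, so overlap them.
    rewards, teacher_acts = await asyncio.gather(
        score(rollouts),
        replay([{"state": r["state"]} for r in rollouts]),
    )
    await trainer.train_outer_step(
        batch_id, rollouts=rollouts, rewards=rewards, teacher_actions=teacher_acts,
    )


async def run(n_batches):
    trainer = StubTrainer()
    for batch_id in range(n_batches):
        await outer_step(trainer, batch_id)
    return trainer.steps
```

Calling `asyncio.run(run(3))` exercises three outer steps end to end, which is enough to catch a dropped or misnamed keyword before any actor exists.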

### 4. Decoupled DiLoCo wiring

Monarch + Decoupled DiLoCo compose naturally: each `TrainerActor` is a
DiLoCo replica, and Monarch's supervision tree handles the failure
recovery that ADR-005 lists as a DiLoCo design constraint. The wire-up
is identical to Recipe 4's `LocalProcessExecutor` pattern, just running
inside Monarch instead of `subprocess`:

```python
from composer_replication.diloco.serverless import (
    ObjectStoreAllReduce, MockManager,
)

class TrainerActor(Actor):
    def __init__(self, rendezvous_uri: str, rank: int, world_size: int):
        self.rendezvous = ObjectStoreAllReduce(
            uri=rendezvous_uri, rank=rank, world_size=world_size,
        )
        self.manager = MockManager(allreduce=self.rendezvous)
        # ... build inner ComposerReplicationTrainer ...

    @endpoint
    async def train_outer_step(self, batch_id: int, **kw):
        # Inner H steps locally, then sync via self.rendezvous
        ...
```

The "object store" is the cross-actor synchronization point that
*doesn't* go through Monarch's RDMA data plane — by design, slow
syncs (S3) and fast syncs (RDMA for in-actor weight broadcast) live on
different planes.

### 5. Distillation-loss wiring

Monarch sees the loss as opaque: it lives inside `TrainerActor` and
takes the same `compose_loss` kwargs as Recipe 1. The mesh-level
benefit is **swap-by-binding**: you can replace `TeacherPoolActor`
("OpenRouter") with a `LocalVLLMTeacherActor` to switch the
*supplier* of teacher log-probs without touching the loss config.

```python
# Original binding — channel 3 via OpenRouter
teachers = mesh.spawn(TeacherPoolActor, n=1, gpu=None)

# Swap binding — channel 3 via local vLLM
teachers = mesh.spawn(LocalVLLMTeacherActor, n=1, gpu="A100",
                     model_id="Qwen/Qwen2.5-72B-Instruct")

# Trainer config unchanged:
trainer.compose_loss_kwargs = dict(
    dpo_variant      = "simpo",      # same as before
    sdpo_wrapper     = "taid",
    taid_schedule_step = batch_id,
    taid_total_steps   = 10_000,
)
```

### 6. Cost ballpark

In Wave 14: $0 (skeleton fails fast; no compute used). Projected for v0.2+:

- **Mesh overhead**: Monarch's coordination plane is light — typically
  <1% of total compute even at 4-actor scale. The dominant cost is
  whatever the actors run.
- **Heterogeneous placement** is the cost lever: e.g. a 4-trainer mesh
  with `TeacherPoolActor` on 0-GPU CPU pods can cut total $/hr by
  ~10–20% vs forcing all actors onto GPU nodes.
- **Cluster bring-up**: Monarch v0.5's Slurm backend is stable; k8s
  backend is dev-track; bare-metal SSH backend is documented.

### 7. Known limitations as of Wave 14

- **Skeleton only, fails fast.** Importing `actors.py` is fine;
  instantiating `TrainerActor(...)` raises `NotImplementedError("v0
  skeleton; deferred to v0.2 per ADR-006")`. By design.
- **Upstream Monarch API is moving.** v0.4.1 stable + v0.5 dev daily
  means breaking changes are expected. Pin to a Monarch hash if you
  prototype.
- **TorchForge is paused.** Per its own repo banner — don't take
  TorchForge's recipes as production patterns. Monarch alone is
  active; Forge as a layered framework is reference reading.
- **Open question (deferred):** does Monarch v0.5's Slurm backend
  hand-shake cleanly with HF Jobs lifecycle? See
  `monarch_actor_layout.md` for the open-questions list.
- **Open question (deferred):** can `TrainerActor` host
  `ComposerReplicationTrainer` unmodified, or does it need a
  `step_init` / `step_compute` split for Monarch's async actor model?

---

## Comparison matrix

| Dimension                          | Recipe 1 — TRL              | Recipe 2 — VeRL                  | Recipe 3 — PRIME-RL               | Recipe 4 — Serverless DiLoCo       | Recipe 5 — Monarch                  |
|------------------------------------|-----------------------------|----------------------------------|-----------------------------------|------------------------------------|-------------------------------------|
| **Maturity (Wave 14)**             | Production-ready            | Production-ready (adapter shape-only) | Recipe ready, runtime smoke deferred | `LocalProcessExecutor` ready; cloud adapters skeleton | Skeleton only; v0.2+ scope        |
| **Supports DAPO / GRPO**           | GRPO ✅; DAPO via TRL master | GRPO ✅; DAPO ✅ (built-in)       | GRPO+DPPO ✅ (DAPO mask is the headline) | Inherits from inner trainer       | Inherits from inner trainer         |
| **Custom-loss extension cost (LOC)** | ~30 LOC (subclass override) | ~50–150 LOC (registered estimator) | ~20 LOC (single Python fn)        | 0 (transparent wrapper)           | ~30 LOC (loss inside actor)         |
| **OpenEnv-compatible**             | ✅ (HF datasets layer)       | ✅ (DataProto extension)          | ✅ (rollout JSONL contract)        | ✅ (orthogonal)                    | ✅ (RewarderActor binding)          |
| **Native multi-node**              | ❌ (single-host FSDP only)   | ✅ (Ray cluster + 3D-HybridEngine) | ✅ (trainer/inference/orchestrator split) | ✅ (the *whole point*)              | ✅ (mesh of actors)                  |
| **Native Decoupled DiLoCo**        | ❌ — wrap with Recipe 4      | ❌ — wrap with Recipe 4           | ❌ — wrap with Recipe 4            | ✅ (this *is* it)                  | ✅ (compose with Recipe 4 inside actor) |
| **License**                        | Apache 2.0 (TRL)            | Apache 2.0 (VeRL)                | Apache 2.0 (PRIME-RL)             | Apache 2.0 (this repo)             | BSD-3 (Monarch)                     |
| **Our recommendation (Wave 14)**   | **Default for ≤ 70B / single-host** | Pick at >70B *if* Ray is acceptable | Pick if PRIME-Intellect / DPPO mask is required | Stack on top of 1/2/3 for N replicas | Reference pattern only — revisit v0.2 |

---

## Cross-recipe checklist

Regardless of which recipe you pick, these invariants are tested across
the 115-test suite (post-Wave-15) and should be true of your wired-up system:

- **`alpha_sdpo=0`** must reproduce the channel-1-only baseline
  bit-exact (`test_compose_loss_integration.py`).
- **`beta_replay=0`** must reproduce the no-channel-3 baseline
  bit-exact.
- **`sdpo_wrapper="taid"` without `taid_schedule_step`** must raise
  `ValueError` at the first step (`test_compose_loss_integration.py`).
- **`sdpo_wrapper="taid"` at `taid_schedule_step / taid_total_steps = 0`**
  must ignore the teacher signal (`test_taid_loss_alpha_zero_ignores_teacher`).
- **`sdpo_wrapper="taid"` at `taid_schedule_step / taid_total_steps = 1`**
  must equal plain SDPO (`test_taid_blended_logits_endpoints`).
- **`dpo_variant="simpo"`** must be differentiable through the
  `log-sigmoid` path (`test_simpo_loss_differentiable`).
- **`sdpo_wrapper="entropy_opd"`** must zero out when student ≡ teacher
  (`test_entropy_aware_opd_zero_when_distributions_match`).
- **`ObjectStoreAllReduce(world_size=1)`** must pass through cleanly
  (`test_object_store_allreduce_world_size_1_passthrough`).

If any of these fail in your wired-up system, run the corresponding
unit test to localize: most break because a kwarg got dropped at the
adapter boundary, not because the loss math is wrong.
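
The two TAID endpoint invariants follow directly from interpolating logits. A minimal sketch, assuming a linear schedule `alpha = taid_schedule_step / taid_total_steps` and a target built from blended student and teacher logits; the function names are illustrative, and the repo's loss may differ in detail:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def taid_alpha(step, total_steps):
    """Assumed linear schedule, clamped to [0, 1]."""
    return min(max(step / total_steps, 0.0), 1.0)


def blended_target(student_logits, teacher_logits, alpha):
    """Target distribution from logits interpolated between student and teacher."""
    blended = [(1 - alpha) * s + alpha * t
               for s, t in zip(student_logits, teacher_logits)]
    return softmax(blended)


def kl(p, q):
    """KL(p || q) for two discrete distributions with matching support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

At `alpha = 0` the target is the student's own distribution, so the KL term is zero whatever the teacher says; at `alpha = 1` the target is the teacher distribution, i.e. ordinary distillation toward the teacher.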

---

## Picking a recipe — decision flow

1. **Piloting Monarch (v0.2+)?** → Recipe 5.
2. **Else, need >70B / multi-host?** → Recipe 2 (VeRL) if Ray is OK,
   Recipe 3 (PRIME-RL) if you're in the PRIME-Intellect / DPPO universe,
   otherwise wait for Recipe 5.
3. **Else** → Recipe 1 (TRL) is the v0.0/v0.1 default.
4. **At any of 1–3, need N independent replicas / failure isolation?**
   → Stack Recipe 4 (Decoupled DiLoCo) on top.
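
The same flow can be written as a small function, which is handy for encoding the choice in a project's setup checks. This is a sketch only: `pick_recipe` and its flags are hypothetical, and where step 2 leaves the priority open (Ray acceptable and PRIME-Intellect both true) it assumes the listed order wins.

```python
def pick_recipe(piloting_monarch, needs_multi_host, ray_ok=False,
                prime_intellect=False, needs_replicas=False):
    """Return the recipes to combine, base recipe first."""
    if piloting_monarch:                      # step 1
        base = "Recipe 5 (Monarch)"
    elif needs_multi_host:                    # step 2
        if ray_ok:
            base = "Recipe 2 (VeRL)"
        elif prime_intellect:
            base = "Recipe 3 (PRIME-RL)"
        else:
            base = "wait for Recipe 5"
    else:                                     # step 3
        base = "Recipe 1 (TRL)"

    stack = [base]
    if needs_replicas:                        # step 4
        stack.append("Recipe 4 (Serverless DiLoCo)")
    return stack
```

For example, a single-host project that wants failure isolation across replicas gets `["Recipe 1 (TRL)", "Recipe 4 (Serverless DiLoCo)"]`.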

---

## Pointers to source

- Loss core: [`composer_replication/loss.py`](../composer_replication/loss.py)
- TRL trainer: [`composer_replication/trainer/composer_trainer.py`](../composer_replication/trainer/composer_trainer.py)
- PRIME-RL adapter:
  [`composer_replication/recipes/prime_rl/composer_loss.py`](../composer_replication/recipes/prime_rl/composer_loss.py),
  recipe doc:
  [`composer_replication/recipes/prime_rl/prime_rl_recipe.md`](../composer_replication/recipes/prime_rl/prime_rl_recipe.md)
- Monarch skeleton:
  [`composer_replication/recipes/monarch/actors.py`](../composer_replication/recipes/monarch/actors.py),
  layout doc:
  [`composer_replication/recipes/monarch/monarch_actor_layout.md`](../composer_replication/recipes/monarch/monarch_actor_layout.md)
- Serverless DiLoCo:
  [`composer_replication/diloco/serverless/`](../composer_replication/diloco/serverless/)
- VeRL adapter (shape-only): `composer_replication/recipes/verl/`
- ADRs:
  [`docs/adrs/ADR-005-serverless-diloco.md`](adrs/ADR-005-serverless-diloco.md),
  [`docs/adrs/ADR-006-rl-frameworks.md`](adrs/ADR-006-rl-frameworks.md),
  [`docs/adrs/ADR-007-distillation-losses.md`](adrs/ADR-007-distillation-losses.md)
