| # INTEGRATION_RECIPES.md — Wiring the 3-channel composer loss into your RL stack | |
| > **Status:** Wave 14 release reference. Supersedes the historical | |
| > [`docs/INTEGRATION_ARCHITECTURE.md`](INTEGRATION_ARCHITECTURE.md) (Recipes | |
| > A–D), which is retained as background reading for the original | |
| > mechanism-level diagrams. | |
| > | |
| > **Companion docs:** | |
| > - [`docs/USER_GUIDE.md`](USER_GUIDE.md) — narrative walk-through, sections 1–8 | |
| > - [`docs/API_REFERENCE.md`](API_REFERENCE.md) — exact kwarg signatures | |
| > - [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md) — error → fix index | |
| > - [`docs/V3_SUBSTRATE_COVERAGE.md`](V3_SUBSTRATE_COVERAGE.md) — what each | |
| > substrate covers | |
| > - [`docs/adrs/ADR-006-rl-frameworks.md`](adrs/ADR-006-rl-frameworks.md) — | |
| > why these five recipes and not others | |
| This document is the canonical answer to **"how do I plug the 3-channel | |
| composer loss into framework X?"** for the five frameworks the project | |
| supports as of Wave 14: | |
| 1. [TRL `GRPOTrainer` subclass](#recipe-1--trl-grpotrainer-subclass) | |
| 2. [VeRL custom `adv_estimator` + DataProto extension](#recipe-2--verl-custom-adv_estimator--dataproto-extension) | |
| 3. [PRIME-RL custom-loss config](#recipe-3--prime-rl-customlossconfig) | |
| 4. [Serverless Decoupled DiLoCo (Modal / HF Jobs / SageMaker)](#recipe-4--serverless-decoupled-diloco) | |
| 5. [Monarch actor mesh (TorchForge-style topology)](#recipe-5--monarch-actor-mesh) | |
| Each recipe follows the same seven-part template: | |
| 1. **When to use it** — decision criteria. | |
| 2. **Install command** — which optional extras of `composer-replication`. | |
| 3. **Minimum-viable Python script** — copy-pasteable, ≤ 60 lines. | |
| 4. **Decoupled DiLoCo wiring** — how `ServerlessExecutor` + | |
| `ObjectStoreAllReduce` + `MockManager` layer on top. | |
| 5. **Distillation-loss wiring** — how to switch DPO → SimPO and add TAID | |
| via `compose_loss(..., dpo_variant=..., sdpo_wrapper=...)` or the | |
| recipe's own loss-config field. | |
| 6. **Cost ballpark** — GPU $/hr + API spend, sourced from | |
| [`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md`](research/DILOCO_SERVERLESS_RECONNAISSANCE.md). | |
| 7. **Known limitations as of Wave 14**. | |
| A cross-recipe [comparison matrix](#comparison-matrix) closes the doc. | |
| ## TL;DR — the unified loss | |
| For any of the five recipes, the v0.1 trainer step computes: | |
| ``` | |
| total_loss = grpo_loss | |
| + α * sdpo_kl_loss (channel 2 — Composer hint-distill; | |
| optional TAID or Entropy-OPD wrapper) | |
| + β * trace_replay_loss (channel 3 — N-teacher DPO; | |
| switchable to SimPO) | |
| ``` | |
| This is implemented once, in | |
| [`composer_replication/loss.py::compose_loss`](../composer_replication/loss.py), | |
| and re-used by every recipe via the kwargs documented in | |
| [`API_REFERENCE.md`](API_REFERENCE.md). The full signature — including | |
| all ADR-007 channel-2/3 knobs (`dpo_variant`, `sdpo_wrapper`, `taid_t`, | |
| `simpo_beta`/`simpo_gamma`, `entropy_opd_h_max`, …) — is the | |
| single source of truth in | |
| [API_REFERENCE.md § `compose_loss`](API_REFERENCE.md#compose_loss). | |
| The conceptual call shape is just: | |
| ```python | |
| compose_loss(model, inputs, **kwargs) # see API_REFERENCE.md#compose_loss for full signature | |
| ``` | |
| All five recipes below either call `compose_loss` directly or call a | |
| thin per-framework adapter that forwards these kwargs unchanged. Each | |
| recipe's **§5 Distillation-loss wiring** documents the kwargs *that | |
| recipe* uses by default and why; refer back to API_REFERENCE.md for | |
| defaults, types, and which kwargs are mutually exclusive. | |
| --- | |
| ## Recipe 1 — TRL `GRPOTrainer` subclass | |
| ### 1. When to use it | |
| This is the **default v0.0/v0.1 path** and the one we recommend for | |
| ~99% of users today. Pick TRL when: | |
| - Your model fits on ≤ 32 GPUs (typically ≤ 70B-param FSDP). | |
| - You already have a HuggingFace `model` + `tokenizer` + `datasets` flow. | |
| - You want minimum integration cost — `ComposerReplicationTrainer` is a | |
| single subclass override of `_compute_loss` over `trl.GRPOTrainer`, | |
| no Ray, no actor mesh. | |
| - You're doing single-host (one node, possibly multi-GPU FSDP) training. | |
| Don't pick TRL when you need >100B-param scale, when you must async-decouple | |
| tool calls from the GPU loop, or when a Ray cluster is already in your stack | |
| (in which case Recipe 2 is cheaper). | |
| ### 2. Install command | |
| ```bash | |
| pip install -e ".[train,replaysim]" | |
| ``` | |
| The `train` extra pulls `trl>=0.12`, `peft`, `accelerate`, and `datasets`. | |
| The `replaysim` extra pulls `data-juicer` for CPU-side DPO normalization | |
| (channel 3 cleaning step). Add `[serverless]` if you also want Decoupled | |
| DiLoCo (see step 4). | |
| ### 3. Minimum-viable Python script | |
| ```python | |
| # train_trl.py — minimum viable Recipe 1 | |
| from datasets import load_dataset | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| from composer_replication import ComposerReplicationTrainer | |
| MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct" # swap for 7B once it works | |
| model = AutoModelForCausalLM.from_pretrained(MODEL_ID) | |
| tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) | |
| dataset = load_dataset("trl-lib/tldr", split="train[:512]") | |
| def reward_length(completions, **_): | |
|     # Return floats: reward_funcs must yield list[float] (see step 7). | |
|     return [float(-abs(len(c) - 64)) for c in completions] | |
| trainer = ComposerReplicationTrainer( | |
| model = model, | |
| processing_class = tokenizer, | |
| reward_funcs = [reward_length], | |
| train_dataset = dataset, | |
| # Composer extras (defaults shown): | |
| alpha_sdpo = 0.1, | |
| beta_replay = 0.05, | |
| sdpo_jsd_beta = 0.5, | |
| sdpo_temperature = 1.0, | |
| sdpo_token_clip = None, | |
| replay_dpo_beta = 0.1, | |
| ) | |
| trainer.train() | |
| ``` | |
| Channels 2 and 3 **auto-disable per step** when their inputs aren't | |
| present in the batch (e.g. batches with no error sites get | |
| `sdpo_kl=0`). Set `alpha_sdpo=0` / `beta_replay=0` to disable globally | |
| for ablations. | |
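To make the weighting and the per-step auto-disable concrete, here is a minimal sketch of the arithmetic in plain Python. `combine_channels` is a hypothetical helper written for illustration, not the framework's `compose_loss`. It assumes a channel whose inputs are missing from the batch is represented as `None`, and it reuses the `alpha_sdpo` / `beta_replay` defaults shown above.

```python
from typing import Optional


def combine_channels(
    grpo_loss: float,
    sdpo_kl_loss: Optional[float] = None,       # channel 2; None = inputs absent this step
    trace_replay_loss: Optional[float] = None,  # channel 3; None = inputs absent this step
    alpha_sdpo: float = 0.1,
    beta_replay: float = 0.05,
) -> float:
    """Weighted sum of the three channels; absent or zero-weighted channels add 0."""
    total = grpo_loss
    if sdpo_kl_loss is not None and alpha_sdpo != 0.0:
        total += alpha_sdpo * sdpo_kl_loss
    if trace_replay_loss is not None and beta_replay != 0.0:
        total += beta_replay * trace_replay_loss
    return total


# A step where all three channels have inputs.
full = combine_channels(1.0, sdpo_kl_loss=2.0, trace_replay_loss=4.0)
# A step whose batch has no error sites, so channel 2 is skipped.
no_sdpo = combine_channels(1.0, trace_replay_loss=4.0)
# Global ablation: both extra channels switched off via their weights.
ablated = combine_channels(1.0, 2.0, 4.0, alpha_sdpo=0.0, beta_replay=0.0)
```

In the ablated case the result is exactly the GRPO term, which is the property you want when comparing against a plain `GRPOTrainer` baseline.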
| ### 4. Decoupled DiLoCo wiring | |
| `ComposerReplicationTrainer` is a single-process trainer. To run N | |
| replicas of it under Decoupled DiLoCo, layer the serverless stack on the | |
| outside: each replica runs the script above; `MockManager` stands in for | |
| `torchft.Manager` on the inner loop and `ObjectStoreAllReduce` runs the | |
| outer-loop pseudo-gradient exchange: | |
| ```python | |
| # diloco_replica.py: what each of the N replicas runs | |
| import os | |
| from composer_replication.diloco import make_diloco_outer_loop | |
| from composer_replication.diloco.serverless import ( | |
|     ObjectStoreAllReduce, MockManager, | |
| ) | |
| # `trainer` below is the ComposerReplicationTrainer built in step 3. | |
| rendezvous = ObjectStoreAllReduce( | |
| uri = "s3://my-bucket/diloco-runs/run42/", | |
| world_size = 4, | |
| rank = int(os.environ["REPLICA_RANK"]), | |
| ) | |
| manager = MockManager(allreduce=rendezvous) | |
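| # trainer.optimizer is the *inner* optimizer. HF Trainer creates it lazily, so | |
| # call trainer.create_optimizer() first if it is still None. The outer is built here: | |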
| outer = make_diloco_outer_loop( | |
| inner_optimizer = trainer.optimizer, | |
| manager = manager, | |
| sync_every_h = 500, | |
| ) | |
| trainer.add_callback(outer.callback()) # syncs every H inner steps | |
| trainer.train() | |
| ``` | |
| The driver process spins these up with any `ServerlessExecutor`: | |
| ```python | |
| # Wave 14: ModalExecutor / HFJobsExecutor are skeletons (raise NotImplementedError); | |
| # use LocalProcessExecutor for testing. Swap once the cloud backends land. | |
| executor = LocalProcessExecutor() | |
| handles = executor.launch_replicas( | |
| n_replicas = 4, | |
| entrypoint = "diloco_replica.py", | |
| entrypoint_args = {"rendezvous": rendezvous.uri, | |
| "rank_env": "REPLICA_RANK"}, | |
| ) | |
| result = executor.collect(handles, timeout=3600) | |
| ``` | |
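The driver pattern above boils down to "start N processes, give each its rank, wait for exit codes". The sketch below shows that shape with only the standard library. `TinyLocalExecutor` and `ReplicaHandle` are hypothetical stand-ins written for this example (they are not the framework's `LocalProcessExecutor`), and the `command` argument is an assumption made so the sketch can run any program.

```python
import os
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class ReplicaHandle:
    replica_id: int
    process: subprocess.Popen
    exit_code: Optional[int] = None


class TinyLocalExecutor:
    """Hypothetical stand-in: one local subprocess per replica."""

    def launch_replicas(self, n_replicas: int, command: Sequence[str],
                        rank_env: str = "REPLICA_RANK") -> List[ReplicaHandle]:
        handles = []
        for rank in range(n_replicas):
            # Each replica learns its rank from an environment variable.
            env = dict(os.environ, **{rank_env: str(rank)})
            proc = subprocess.Popen(list(command), env=env)
            handles.append(ReplicaHandle(replica_id=rank, process=proc))
        return handles

    def collect(self, handles: List[ReplicaHandle], timeout: float) -> List[ReplicaHandle]:
        # Note: the timeout is applied per replica here, not to the whole batch.
        for handle in handles:
            handle.exit_code = handle.process.wait(timeout=timeout)
        return handles


# Toy replica: exits with its own rank so the exit codes are easy to inspect.
toy_replica = [sys.executable, "-c",
               "import os, sys; sys.exit(int(os.environ['REPLICA_RANK']))"]
executor = TinyLocalExecutor()
result = executor.collect(executor.launch_replicas(2, toy_replica), timeout=60)
exit_codes = {h.replica_id: h.exit_code for h in result}
```

A real executor also has to handle log capture, restarts, and cleanup of replicas that outlive the timeout; none of that is modelled here.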
| ### 5. Distillation-loss wiring | |
| `ComposerReplicationTrainer` exposes the new ADR-007 channels via the | |
| shared `compose_loss` kwargs — pass them through `**kwargs` on the | |
| trainer and they're forwarded to `compose_loss`: | |
| ```python | |
| trainer = ComposerReplicationTrainer( | |
| model = model, processing_class = tokenizer, | |
| reward_funcs = [reward_length], train_dataset = dataset, | |
| # SimPO instead of DPO for channel 3: | |
| dpo_variant = "simpo", | |
| simpo_beta = 2.0, | |
| simpo_gamma = 1.0, | |
| # TAID for channel 2 (SakanaAI port; logit-space mix + forward-KL): | |
| sdpo_wrapper = "taid", | |
| taid_t = 0.4, # current TAID coeff in [0, 1]; | |
| # drive from TAIDScheduler if you want | |
| # the paper's adaptive scheme | |
| ) | |
| ``` | |
| Alternatively, set `sdpo_wrapper = "entropy_opd"` instead of `"taid"` if you want | |
| per-token entropy-gated forward/reverse KL rather than the | |
| linear-blend interpolation. SimPO does **not** require reference | |
| log-probs: any `dpo_chosen_ref_logprobs` / `dpo_rejected_ref_logprobs` | |
| fields set on channel 3 batches are silently ignored. | |
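The reason SimPO can drop the reference log-probs is visible in the two objectives side by side. The sketch below follows the published DPO and SimPO formulas for a single preference pair, using plain floats. The function names are hypothetical, the inputs are assumed to be sequence-level summed log-probs, and the framework's own implementation may batch, mask, or reduce differently.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x)), i.e. softplus(-x)."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(chosen_lp: float, rejected_lp: float,
             chosen_ref_lp: float, rejected_ref_lp: float,
             beta: float = 0.1) -> float:
    """DPO: the margin is measured relative to a frozen reference policy."""
    margin = (chosen_lp - chosen_ref_lp) - (rejected_lp - rejected_ref_lp)
    return neg_log_sigmoid(beta * margin)


def simpo_loss(chosen_lp: float, chosen_len: int,
               rejected_lp: float, rejected_len: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO: length-normalised log-probs plus a target margin, no reference model."""
    margin = beta * (chosen_lp / chosen_len - rejected_lp / rejected_len) - gamma
    return neg_log_sigmoid(margin)


# One preference pair: chosen has 8 tokens, rejected has 10.
pair_dpo = dpo_loss(-12.0, -30.0, chosen_ref_lp=-14.0, rejected_ref_lp=-28.0)
pair_simpo = simpo_loss(-12.0, 8, -30.0, 10)
```

The `beta` and `gamma` defaults mirror the `replay_dpo_beta`, `simpo_beta`, and `simpo_gamma` values used in this recipe. SimPO's `beta` is much larger than DPO's because it scales per-token averages rather than whole-sequence sums.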
| ### 6. Cost ballpark | |
| - **GPU**: single host, `g5.12xlarge` ($5.67/hr) or RunPod 4×A100-80GB | |
| (~$5–9/hr) gets you Qwen2.5-7B at moderate throughput. For Qwen2.5-72B | |
| you'll want at least 2–4× H100; the quoted prices are for an 8× H100 node such as | |
| `p5.48xlarge` (~$98/hr on AWS, ~$25–30/hr on Lambda Cloud / RunPod community). | |
| - **API**: channel 3 teacher replay via OpenRouter — verified | |
| ~$0.98/trace at 50 steps × 3 teachers (spike 001). For a 100-trace | |
| curriculum that's ~$100 in teacher tokens. | |
| - **Storage**: negligible until you turn on DiLoCo (then see Recipe 4). | |
| ### 7. Known limitations as of Wave 14 | |
| - **Tool calls block the GPU.** TRL's rollout is synchronous; long | |
| tool-call latency idles the trainer. Async-decouple via Recipe 2/3/5 | |
| if this matters. | |
| - **No native multi-node.** TRL is single-process; multi-host scaling is | |
| via Decoupled DiLoCo (Recipe 4) on top, not via TRL itself. | |
| - **vLLM weight sync is co-located** — no resharding between FSDP and TP. | |
| At 70B+ this becomes the bottleneck and you should move to Recipe 2. | |
| - **`reward_funcs` must be Python callables** that return `list[float]`; | |
| shell-out reward graders need a wrapper. | |
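For the last limitation, a shell-out grader can be adapted with a small closure. The sketch below is one possible wrapper, not something the framework ships. `make_shell_reward` is a hypothetical name, and the grader protocol (completion on stdin, one number on stdout) is an assumption you would replace with whatever your grader expects. It also assumes completions arrive as plain strings.

```python
import subprocess
import sys
from typing import Callable, List, Sequence


def make_shell_reward(cmd: Sequence[str], timeout_s: float = 30.0,
                      failure_reward: float = 0.0) -> Callable[..., List[float]]:
    """Wrap an external grader process as a reward callable.

    Assumed protocol: the completion is written to the grader's stdin and the
    grader prints a single number on stdout.
    """
    def reward(completions, **_) -> List[float]:
        scores = []
        for completion in completions:
            try:
                proc = subprocess.run(
                    list(cmd), input=completion, capture_output=True,
                    text=True, timeout=timeout_s, check=True,
                )
                scores.append(float(proc.stdout.strip()))
            except (subprocess.SubprocessError, ValueError, OSError):
                # Crash, timeout, or unparsable output: fall back to a fixed reward.
                scores.append(failure_reward)
        return scores
    return reward


# Stand-in grader: a Python one-liner that scores by character count.
length_grader = make_shell_reward(
    [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
)
scores = length_grader(["abc", "hello"])
```

Because the grader runs once per completion, a slow grader compounds the synchronous-rollout cost noted in the first bullet above; batching inside the grader is worth considering if that becomes visible.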
| --- | |
| ## Recipe 2 — VeRL custom `adv_estimator` + DataProto extension | |
| ### 1. When to use it | |
| Pick VeRL when: | |
| - You need >70B-param scale or >32-GPU multi-host, *and* a Ray cluster | |
| is acceptable in your stack. | |
| - You're already using or willing to adopt **3D-HybridEngine** for | |
| efficient FSDP↔TP weight resharding (verified ~5× weight-sync speed-up | |
| vs co-located vLLM at 70B+). | |
| - You need async multi-turn rollouts where tool-call latency must not | |
| block the GPU loop. VeRL's `AsyncServer` + `AgentLoop` is the | |
| best-in-class option here. | |
| - You want extension points the framework's authors *expect* third | |
| parties to use — the `@register_adv_est("...")` decorator and the | |
| `DataProto` extension contract are first-class APIs. | |
| Don't pick VeRL if you're <7B-param or single-host (overkill — | |
| Recipe 1's Trainer subclass is one file, not a Ray cluster). | |
| ### 2. Install command | |
| ```bash | |
| pip install -e ".[replaysim]" | |
| pip install "verl>=0.3" # not packaged as an extra; requires >=0.3 | |
| # Optional, for Decoupled DiLoCo on top (step 4): | |
| pip install -e ".[serverless]" | |
| ``` | |
| The framework's verl adapter lives at | |
| `composer_replication.recipes.verl` (currently shape-only — see | |
| [Limitations](#7-known-limitations-as-of-wave-14-1) below). | |
| ### 3. Minimum-viable Python script | |
| VeRL's actual entry point is a Hydra/YAML config + `verl.trainer.main_ppo` | |
| CLI; the pythonic surface looks like this: | |
| ```python | |
| # train_verl.py — minimum viable Recipe 2 sketch | |
| from verl.trainer.ppo import core_algos | |
| from verl.trainer.ppo.ray_trainer import RayPPOTrainer | |
| from composer_replication.loss import compose_loss | |
| @core_algos.register_adv_est("grpo_composer") | |
| def composer_advantage(data, **kwargs): | |
|     """Custom adv-estimator that adds SDPO + DPO channels to GRPO. | |
|     Reads two extra DataProto keys (populated by the data prep step): | |
|     - data.batch["sdpo_teacher_logits"] (channel 2) | |
|     - data.non_tensor_batch["teacher_actions"] (channel 3) | |
|     and returns the standard (advantages, returns) tuple plus a stashed | |
|     composer-loss term consumed by the critic worker. | |
|     """ | |
|     advantages, returns = core_algos.compute_grpo_outcome_advantage(data, **kwargs) | |
|     composer_term = compose_loss( | |
|         model = kwargs["actor_module"], | |
|         inputs = data.batch, | |
|         alpha_sdpo = 0.1, | |
|         beta_replay = 0.05, | |
|         dpo_variant = "dpo", | |
|         sdpo_wrapper = "none", | |
|     ) | |
|     data.meta_info["composer_loss"] = composer_term | |
|     return advantages, returns | |
| # Then in your YAML: | |
| # algorithm: | |
| # adv_estimator: grpo_composer | |
| # and run: python -m verl.trainer.main_ppo --config-name composer_grpo | |
| ``` | |
| The full driver wires `RayPPOTrainer` against your config; consult VeRL's | |
| own quickstart for the Ray-cluster boilerplate. The composer-specific | |
| piece is just the registered estimator above. | |
| ### 4. Decoupled DiLoCo wiring | |
| VeRL's actor workers run in Ray; DiLoCo replicates the **whole VeRL job**. | |
| Each "replica" is one Ray cluster running Recipe 2 end-to-end; the outer | |
| loop is independent of Ray and just exchanges pseudo-gradients via the | |
| object store between Ray-job invocations: | |
| ```python | |
| from composer_replication.diloco.serverless import ( | |
| LocalProcessExecutor, ObjectStoreAllReduce, | |
| ) | |
| rendezvous = ObjectStoreAllReduce( | |
| uri = "s3://verl-diloco/run/", | |
| world_size = 4, | |
| ) | |
| executor = LocalProcessExecutor() # Wave 14: ModalExecutor is a skeleton (raises NotImplementedError) — keep LocalProcessExecutor for now | |
| handles = executor.launch_replicas( | |
| n_replicas = 4, | |
| entrypoint = "verl.trainer.main_ppo", | |
| entrypoint_args = { | |
| "+algorithm.adv_estimator": "grpo_composer", | |
| "+algorithm.diloco.rendezvous": rendezvous.uri, | |
| "+algorithm.diloco.sync_every_h": 500, | |
| }, | |
| ) | |
| executor.collect(handles, timeout=24 * 3600) | |
| ``` | |
| The Ray cluster inside each replica handles intra-replica scaling | |
| (FSDP / TP / vLLM); the object-store exchange handles cross-replica | |
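| sync. Per-round bandwidth follows the same rule as in Recipe 4: one bf16 | |
| copy of the weights per replica per outer round, so ~2 GB for a 1B-param | |
| model and ~14 GB for a 7B-param model. See Recipe 4, step 6 for the storage budget. | |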
| ### 5. Distillation-loss wiring | |
| The custom `adv_estimator` from step 3 already calls `compose_loss`; | |
| flip the kwargs there to switch DPO → SimPO or add TAID: | |
| ```python | |
| composer_term = compose_loss( | |
| model = kwargs["actor_module"], | |
| inputs = data.batch, | |
| alpha_sdpo = 0.1, | |
| beta_replay = 0.05, | |
| dpo_variant = "simpo", # ← SimPO swap | |
| simpo_beta = 2.0, | |
| simpo_gamma = 1.0, | |
| sdpo_wrapper = "taid", # ← TAID wrap | |
| taid_schedule_step = data.meta_info.get("global_step", 0), | |
| taid_total_steps = 10_000, | |
| ) | |
| ``` | |
| VeRL's `data.meta_info` carries the global step automatically, which is | |
| exactly what TAID's interpolation schedule needs. Channel 2 batches | |
| without `student_init_logits` / `student_init_input_ids` are auto-skipped | |
| (returns 0 for that step). | |
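As a rough picture of what a step-driven TAID schedule does, the sketch below maps a global step to an interpolation coefficient and blends student and teacher logits with it. Both helpers are hypothetical. The linear ramp is a simplification (the TAID paper adapts the coefficient during training), and how `compose_loss` actually turns `taid_schedule_step` / `taid_total_steps` into a coefficient is an assumption here.

```python
from typing import List, Sequence


def linear_taid_t(step: int, total_steps: int,
                  t_start: float = 0.0, t_end: float = 1.0) -> float:
    """Linear ramp of the interpolation coefficient, clamped to [t_start, t_end]."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start + (t_end - t_start) * frac


def mix_logits(student_logits: Sequence[float],
               teacher_logits: Sequence[float], t: float) -> List[float]:
    """Logit-space blend: t=0 gives the student's logits, t=1 the teacher's."""
    return [(1.0 - t) * s + t * te for s, te in zip(student_logits, teacher_logits)]


t_mid = linear_taid_t(step=5_000, total_steps=10_000)
blended = mix_logits([0.0, 2.0], [4.0, 0.0], t=0.25)
```

Early in training the target stays close to the student's own distribution and then drifts toward the teacher, which is why a stale or reset step counter matters for checkpoint replay.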
| ### 6. Cost ballpark | |
| - **GPU**: 8× H100 (`p5.48xlarge` ~$98/hr on AWS, ~$25/hr on Lambda or | |
| RunPod community) is the entry point for 70B-class. Expect 32–256 | |
| H100 for full 671B (matches DeepSeek's reported VeRL config). | |
| - **API**: same ~$0.98/trace as Recipe 1 (channel 3 is a Python helper, | |
| not a VeRL primitive — costs are framework-independent). | |
| - **Ray cluster overhead**: head node + redis + dashboard adds ~1 | |
| CPU-instance ($0.10–0.50/hr) per cluster, negligible at GPU scale. | |
| ### 7. Known limitations as of Wave 14 | |
| - **`composer_replication.recipes.verl` is shape-only.** The decorator | |
| registration and DataProto extension are documented but not yet shipped | |
| as a runnable adapter — Wave 14 release exposes the *contract*, not the | |
| glue. Expect this to land in a v0.2 follow-up spike. | |
| - **Ray dependency.** Adds a heavyweight runtime; debugging | |
| cross-actor crashes can be painful. Use VeRL's `--debug` mode early. | |
| - **Custom-`adv_estimator` LOC**: writing your own takes ~50–150 LOC | |
| including DataProto plumbing. Not a one-liner. | |
| - **No first-class TAID hook in VeRL itself** — we route TAID through | |
| the meta_info channel; this works but means you can't use VeRL's | |
| built-in checkpoint-replay tooling without re-stamping `taid_schedule_step` | |
| on each replay. | |
| --- | |
| ## Recipe 3 — PRIME-RL `CustomLossConfig` | |
| ### 1. When to use it | |
| Pick PRIME-RL when: | |
| - You're operating in the **PRIME-Intellect / decentralized training** | |
| universe and want INTELLECT-style scaling on a long-horizon training | |
| run. | |
| - You need **DPPO importance-ratio masking** (the rationale most users | |
| arrive with) — PRIME-RL's headline contribution is the | |
| out-of-band-token *mask* (not clip) on `log_ratio = trainer_lp - | |
| inference_lp`, with defaults `low=-4.0, high=4.0`. | |
| - You want a **first-class custom-loss surface**: PRIME-RL ships | |
| `CustomLossConfig` that takes an importable Python function and a | |
| `LossInputs` struct exposing exactly the tensors we need | |
| (`trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`, | |
| `advantages`, `loss_mask`). No fork, no Trainer subclass, no monkey-patch. | |
| - You have access to multi-node infrastructure that PRIME-RL's | |
| trainer/inference/orchestrator split is designed for. | |
| Don't pick PRIME-RL if you need full vocab logits (channel 2 SDPO | |
| requires logits not log-probs — see Limitations). | |
| ### 2. Install command | |
| ```bash | |
| pip install -e ".[prime-rl,replaysim]" | |
| # pulls prime-rl>=0.5 | |
| ``` | |
| ### 3. Minimum-viable Python script | |
| PRIME-RL drives via YAML config; the only Python you write is the | |
| custom-loss function (already shipped at | |
| `composer_replication/recipes/prime_rl/composer_loss.py`). Wire it in: | |
| ```yaml | |
| # prime_rl_config.yaml — point at the framework's adapter | |
| loss: | |
| custom: | |
| import_path: composer_replication.recipes.prime_rl.composer_loss:loss_fn | |
| kwargs: | |
| alpha_sdpo: 0.0 # channel 2 deferred in v0 (see below) | |
| beta_dpo: 0.0 # channel 3 emits a warning if non-zero | |
| dppo_mask_high: 4.0 # PRIME-RL DPPO mask bounds | |
| dppo_mask_low: -4.0 | |
| epsilon: 1.0e-6 | |
| trainer: | |
| model: Qwen/Qwen2.5-7B-Instruct | |
| ... # standard PRIME-RL fields | |
| ``` | |
| The shipped `loss_fn` signature is fixed by PRIME-RL's contract: | |
| ```python | |
| def loss_fn( | |
|     inputs: LossInputs, | |
|     *, | |
|     alpha_sdpo: float = 0.0, | |
|     beta_dpo: float = 0.0, | |
|     dppo_mask_high: float = 4.0, | |
|     dppo_mask_low: float = -4.0, | |
|     epsilon: float = 1e-6, | |
| ) -> torch.Tensor: | |
|     log_ratio = inputs.trainer_logprobs - inputs.inference_logprobs | |
|     dppo_invalid = (log_ratio > dppo_mask_high) | (log_ratio < dppo_mask_low) | |
|     keep_mask = inputs.loss_mask & ~dppo_invalid | |
|     grpo = -(inputs.advantages * inputs.trainer_logprobs * keep_mask).sum() \ | |
|         / keep_mask.sum().clamp_min(epsilon) | |
|     if alpha_sdpo != 0.0: | |
|         raise NotImplementedError( | |
|             "Channel 2 SDPO requires full-vocab logits; PRIME-RL v0.5 " | |
|             "exposes only log-probs. Deferred to v0.2." | |
|         ) | |
|     if beta_dpo != 0.0: | |
|         import warnings; warnings.warn( | |
|             "Channel 3 trace-replay DPO is out-of-scope for PRIME-RL recipe v0", | |
|             stacklevel=2, | |
|         ) | |
|     return grpo | |
| ``` | |
| **Shape note** (caught in the Wave 13 cross-model review): PRIME-RL | |
| calls the loss function **once per sample**; tensors are 1-D `(seq,)`, | |
| *not* batched `(B, T)`. The 10 unit tests in | |
| `composer_replication/recipes/prime_rl/tests/test_composer_loss.py` | |
| cover this plus DPPO mask edges. | |
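To see the mask semantics on a single 1-D sample without installing PRIME-RL or PyTorch, the same logic can be written over plain Python lists. This is an illustrative mirror of the `loss_fn` shown above, with hypothetical helper names; it is not the shipped adapter.

```python
from typing import List, Sequence


def dppo_keep_mask(trainer_lp: Sequence[float], inference_lp: Sequence[float],
                   loss_mask: Sequence[bool],
                   low: float = -4.0, high: float = 4.0) -> List[bool]:
    """Keep a token only if it is in the loss mask and its log-ratio is inside [low, high]."""
    keep = []
    for t_lp, i_lp, in_loss in zip(trainer_lp, inference_lp, loss_mask):
        log_ratio = t_lp - i_lp
        keep.append(bool(in_loss) and low <= log_ratio <= high)
    return keep


def masked_pg_loss(advantages: Sequence[float], trainer_lp: Sequence[float],
                   keep: Sequence[bool], epsilon: float = 1e-6) -> float:
    """Mean of -advantage * logprob over kept tokens (tokens are masked, not clipped)."""
    numerator = sum(a * lp for a, lp, k in zip(advantages, trainer_lp, keep) if k)
    denominator = max(float(sum(keep)), epsilon)
    return -numerator / denominator


# One sample of three tokens: log-ratios are 0.2, 4.5 and -5.0.
trainer_lp = [-1.0, -0.5, -6.0]
inference_lp = [-1.2, -5.0, -1.0]
keep = dppo_keep_mask(trainer_lp, inference_lp, loss_mask=[True, True, True])
loss = masked_pg_loss([1.0, 1.0, 1.0], trainer_lp, keep)
```

Only the first token survives, so the two out-of-band tokens contribute nothing to the gradient rather than a clipped amount.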
| ### 4. Decoupled DiLoCo wiring | |
| PRIME-RL was designed for decentralized training and ships its own | |
| weight-sync primitives. Stack DiLoCo on top via the | |
| `ServerlessExecutor` Protocol — each replica runs an independent | |
| PRIME-RL job pointing at the same `composer_loss:loss_fn`: | |
| ```python | |
| from composer_replication.diloco.serverless import ( | |
| LocalProcessExecutor, ObjectStoreAllReduce, | |
| ) | |
| rendezvous = ObjectStoreAllReduce( | |
| uri = "s3://prime-rl-diloco/run/", | |
| world_size = 4, | |
| ) | |
| # Wave 14: ModalExecutor is a skeleton (raises NotImplementedError until v0.x). | |
| # Use LocalProcessExecutor for the inner-replica wiring; swap to the cloud | |
| # executor once it lands. The DiLoCo + rendezvous code below is identical. | |
| executor = LocalProcessExecutor() | |
| handles = executor.launch_replicas( | |
| n_replicas = 4, | |
| entrypoint = "prime_rl.cli:main", | |
| entrypoint_args = { | |
| "config": "prime_rl_config.yaml", | |
| "+diloco.rendezvous": rendezvous.uri, | |
| "+diloco.sync_every_h": 500, | |
| }, | |
| ) | |
| executor.collect(handles, timeout=24 * 3600) | |
| ``` | |
| Note PRIME-RL's own multi-node story (the trainer / inference / | |
| orchestrator split) is **orthogonal** to Decoupled DiLoCo: PRIME-RL | |
| multi-node = single replica scaled across many GPUs; DiLoCo = N | |
| independent replicas synchronizing via object store. Combine both for | |
| "big PRIME-RL job × N replicas". | |
| ### 5. Distillation-loss wiring | |
| Channel 2 (SDPO + TAID + Entropy-OPD) is **deferred** in v0 because | |
| PRIME-RL's `LossInputs` exposes log-probs not full vocab logits. The | |
| SimPO swap on channel 3 is also gated by the same shape constraint; the | |
| DPPO mask itself is unaffected. To get TAID/SimPO into a PRIME-RL job | |
| today you must: | |
| 1. Switch to Recipe 1 or 2 for the SFT/distill phase. | |
| 2. Use PRIME-RL only for the on-policy GRPO+DPPO phase. | |
| The v0.2 plan (per ADR-007) is to extend `LossInputs` with a | |
| `teacher_logits` field; the loss adapter is already shape-ready. | |
| ### 6. Cost ballpark | |
| - **GPU**: similar profile to Recipe 2 — 8–32 H100 typical, scales to | |
| hundreds for INTELLECT-class runs. Lambda Cloud or RunPod community | |
| H100 pricing (~$2–4/hr per H100) is the most cost-effective. | |
| - **API**: channel 3 is gated, so the only OpenRouter spend is from the | |
| *offline data-prep* spike (using the verifier harness in Recipe 1 to | |
| pre-bake DPO pairs), not from the training loop itself. Order of | |
| magnitude: $50–500 for a one-time curriculum bake, then $0/run. | |
| - **Network**: PRIME-RL's own decentralized weight sync uses substantial | |
| bandwidth between training replicas (one of its design constraints); | |
| this is *separate* from the Decoupled DiLoCo bandwidth and shows up | |
| as a ceiling on cross-region replica placement. | |
| ### 7. Known limitations as of Wave 14 | |
| - **Channel 2 deferred** (see step 5). Any non-zero `alpha_sdpo` raises | |
| `NotImplementedError`. | |
| - **Channel 3 emits a warning** if `beta_dpo != 0`; trace-replay DPO | |
| pairs must be folded into the *training data* (offline) rather than | |
| the *loss* (online) until v0.2. | |
| - **PRIME-RL ≥ 0.5 required.** Earlier versions don't ship | |
| `CustomLossConfig`. | |
| - **Smoke test deferred.** Per `prime_rl_recipe.md`, the runtime smoke | |
| test requires a CUDA box + `prime-rl >= 0.5` install and is gated | |
| to a follow-up spike. The 10 unit tests run cleanly without GPU. | |
| - **DPPO defaults are PRIME-RL's, not ours.** We pin `low=-4.0, | |
| high=4.0` to match. If you change them, you're now diverging from | |
| PRIME-RL's example configs. | |
| --- | |
| ## Recipe 4 — Serverless Decoupled DiLoCo | |
| ### 1. When to use it | |
| Pick Decoupled DiLoCo when: | |
| - You have **N independent training replicas** that should sync | |
| occasionally but can't (or shouldn't) cross-talk on every step. | |
| - The cost or operational burden of an always-on multi-node cluster is | |
| unacceptable, but you're happy paying for 4× independent **serverless | |
| jobs**. | |
| - Your inner trainer is one of Recipes 1–3 — DiLoCo wraps any inner | |
| optimizer; it's *purely outer-loop*. | |
| - You need **failure isolation**: if one replica crashes, the others | |
| keep training; on restart it picks up from the last outer round. | |
| DiLoCo's design rests on two abstractions (per ADR-005): | |
| 1. **`ServerlessExecutor` Protocol** — uniform interface for spinning up | |
| N replicas across cloud backends (Modal / HF Jobs / SageMaker / k8s). | |
| 2. **`ObjectStoreAllReduce`** — fsspec-backed pseudo-gradient exchange | |
| that replaces the in-process `torchft.Manager.allreduce` call. | |
| The communication pattern is `S3 PutObject + N GetObjects` once every | |
| H inner steps, matching DiLoCo paper §3.2 (arXiv:2311.08105). For | |
| 1B-param bf16 that's ~2 GB / 30 min per replica: a small per-round | |
| payload, although cumulative writes outgrow the S3 free tier on long runs (see step 6). | |
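The exchange itself is simple enough to sketch with a local directory standing in for the bucket. The code below is an illustration of the "one put, N gets, then average" pattern only. The function names and file layout are hypothetical (this is not `ObjectStoreAllReduce`), and the outer update uses plain SGD for brevity where the DiLoCo paper uses Nesterov momentum.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Sequence


def put_pseudo_grad(root: Path, round_idx: int, rank: int, delta: Sequence[float]) -> None:
    """One 'PutObject': publish this replica's pseudo-gradient for the round."""
    round_dir = root / f"round-{round_idx:06d}"
    round_dir.mkdir(parents=True, exist_ok=True)
    (round_dir / f"rank-{rank}.json").write_text(json.dumps(list(delta)))


def get_averaged_pseudo_grad(root: Path, round_idx: int, world_size: int) -> List[float]:
    """N 'GetObjects': fetch every replica's pseudo-gradient and average element-wise."""
    round_dir = root / f"round-{round_idx:06d}"
    deltas = [json.loads((round_dir / f"rank-{r}.json").read_text())
              for r in range(world_size)]
    return [sum(values) / world_size for values in zip(*deltas)]


def outer_sgd_step(theta_prev: Sequence[float], avg_delta: Sequence[float],
                   outer_lr: float = 1.0) -> List[float]:
    """Plain-SGD outer update on the averaged pseudo-gradient (theta_prev - theta_local)."""
    return [p - outer_lr * d for p, d in zip(theta_prev, avg_delta)]


# Two replicas start from the same weights and drift apart over H inner steps.
theta_prev = [1.0, 1.0]
local_params = {0: [0.5, 1.0], 1: [0.0, 2.0]}
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rank, theta_local in local_params.items():
        delta = [p - q for p, q in zip(theta_prev, theta_local)]
        put_pseudo_grad(root, round_idx=0, rank=rank, delta=delta)
    avg_delta = get_averaged_pseudo_grad(root, round_idx=0, world_size=2)
theta_next = outer_sgd_step(theta_prev, avg_delta)
```

With an outer learning rate of 1 and no momentum, the result is just the average of the replicas' weights. A real rendezvous also has to wait until all N objects for the round exist before reading, which this single-process sketch sidesteps by writing everything first.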
| ### 2. Install command | |
| ```bash | |
| pip install -e ".[diloco,serverless]" | |
| # also one of the inner-trainer extras: | |
| pip install -e ".[train]" # if the inner trainer is Recipe 1 | |
| # OR pip install verl # if the inner trainer is Recipe 2 | |
| # OR pip install -e ".[prime-rl]" # if the inner trainer is Recipe 3 | |
| ``` | |
| ### 3. Minimum-viable Python script | |
| This pattern is independent of the inner trainer — pick any of Recipes | |
| 1/2/3 and wrap it with a `ServerlessExecutor`. The replica entrypoint | |
| runs the inner trainer; the driver launches N of them and waits. | |
| ```python | |
| # diloco_driver.py — driver that launches N replicas | |
| from composer_replication.diloco.serverless import ( | |
| LocalProcessExecutor, # for dev — runs replicas as local subprocesses | |
| ObjectStoreAllReduce, | |
| ) | |
| rendezvous = ObjectStoreAllReduce( | |
| uri = "s3://my-bucket/diloco-runs/run42/", # or file:// for local | |
| world_size = 4, | |
| ) | |
| executor = LocalProcessExecutor() # Wave 14: ModalExecutor skeleton raises NotImplementedError; swap once cloud backend lands | |
| handles = executor.launch_replicas( | |
| n_replicas = 4, | |
| entrypoint = "diloco_replica.py", # (script below) | |
| entrypoint_args = { | |
| "rendezvous": rendezvous.uri, | |
| "rank_env": "REPLICA_RANK", | |
| }, | |
| ) | |
| result = executor.collect(handles, timeout=3600) | |
| print({h.replica_id: h.exit_code for h in result}) | |
| ``` | |
| ```python | |
| # diloco_replica.py — runs inside each replica | |
| import os | |
| from composer_replication.diloco import make_diloco_outer_loop | |
| from composer_replication.diloco.serverless import ( | |
| ObjectStoreAllReduce, MockManager, | |
| ) | |
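| # Build inner trainer (Recipe 1 example). train_trl.py must guard its | |
| # trainer.train() call with `if __name__ == "__main__":` so this import has no side effects: | |
| from train_trl import trainer | |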
| rendezvous = ObjectStoreAllReduce( | |
| uri = os.environ["DILOCO_RENDEZVOUS"], | |
| world_size = 4, | |
| rank = int(os.environ["REPLICA_RANK"]), | |
| ) | |
| manager = MockManager(allreduce=rendezvous) | |
| outer = make_diloco_outer_loop( | |
| inner_optimizer = trainer.optimizer, | |
| manager = manager, | |
| sync_every_h = 500, | |
| ) | |
| trainer.add_callback(outer.callback()) | |
| trainer.train() | |
| ``` | |
| ### 4. Decoupled DiLoCo wiring | |
| This recipe **is** the DiLoCo wiring — see step 3. The available | |
| executor adapters are: | |
| | Executor | Status | Use case | | |
| |---------------------------|-------------------------------|--------------------------------------| | |
| | `LocalProcessExecutor` | Production-ready | Dev loop — N subprocesses on one box | | |
| | `ModalExecutor` | Skeleton (modal-client gated) | Modal cloud, $/sec billing | | |
| | `HFJobsExecutor` | Skeleton (hf-hub gated) | HuggingFace Jobs, transformer-shop | | |
| | `SageMakerExecutor` | Roadmap (post-v0.2) | AWS, warm-pool ~10s cold start | | |
| | `K8sExecutor` | Roadmap | KubeRay / Volcano gang scheduling | | |
| Cross-cloud replica placement (e.g. 2× Modal + 2× HF Jobs) is supported | |
| in principle — they all read/write the same S3 / GCS / HF rendezvous — | |
| but treat as experimental. | |
| ### 5. Distillation-loss wiring | |
| DiLoCo is loss-agnostic — it operates purely on inner-optimizer state. | |
| Whichever inner trainer you're running (Recipe 1, 2, or 3) handles | |
| distillation kwargs as documented in that recipe's step 5. The only | |
| DiLoCo-specific knob worth knowing: TAID's `taid_schedule_step` is a | |
| *global* counter, but each replica increments it independently. If you | |
| care about replicas all reading the same TAID coefficient at outer-sync time, set | |
| `taid_schedule_step = trainer.state.global_step + replica_offset` and | |
| let the outer-loop sync average them out. | |
| ### 6. Cost ballpark | |
| Pulled from | |
| [`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md`](research/DILOCO_SERVERLESS_RECONNAISSANCE.md): | |
| | Backend | A100-80GB $/hr | H100 $/hr | Cold-start | Notes | | |
| |---------------|----------------|-----------|------------|------------------------------------------| | |
| | Modal | $0.00139/sec ≈ $5/hr per A100 (4× ≈ $20/hr) | ~$8/hr per H100 | 1–60s warm, 60–120s first-run | $/sec billing; no minimum | |
| | AWS SageMaker | $4.10/A100·hr | $12.29/hr | 2–5 min cold, ~10s warm pool | Min 60min on warm pool | | |
| | GCP Vertex | $3.67/A100·hr | $11/hr | 2–6 min cold | 30–50% premium over raw GPU | | |
| | Azure ML | ~$3.67/A100·hr | ~$12.25/hr | 3–8 min cold | Use curated env to cut cold-start | | |
| | RunPod | $1.19/hr (community), $2.17 (secure) | $1.99/hr (community), $4.18 (secure) | seconds | No federation; same-DC only | | |
| | HF Jobs | comparable to Modal | ~$8–12/hr | 30–90s | Best DX for HF-shop | | |
| **Object-store cost.** ~$0.02/GB-month for S3 standard, ~$0 while you stay inside the free tier. | |
| Pseudo-gradients are ~2 GB per replica per outer round; for a 24-hour | |
| 4-replica run at H=500 that's ~50 outer rounds × 2 GB × 4 replicas = ~400 | |
| GB written. That exceeds the free tier quickly, so budget $10–20 in storage. | |
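The storage estimate above is easy to rerun for other model sizes. The helper names below are hypothetical, and the inputs are the figures already quoted in this recipe: bf16 at 2 bytes per parameter, one outer round roughly every 30 minutes, and ~$0.02/GB-month.

```python
def pseudo_grad_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Size of one pseudo-gradient in GB (bf16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9


def outer_rounds(run_hours: float, minutes_per_round: float) -> int:
    """Number of complete outer rounds that fit in the run."""
    return int(run_hours * 60 // minutes_per_round)


def total_written_gb(n_params: float, n_replicas: int, run_hours: float,
                     minutes_per_round: float, bytes_per_param: int = 2) -> float:
    """Total GB written to the object store across all replicas and rounds."""
    per_round = pseudo_grad_gb(n_params, bytes_per_param) * n_replicas
    return per_round * outer_rounds(run_hours, minutes_per_round)


# 1B-param model, 4 replicas, 24-hour run, one outer round every ~30 minutes.
written_gb = total_written_gb(1e9, n_replicas=4, run_hours=24, minutes_per_round=30)
# Upper bound on monthly storage cost if every round's objects are retained.
monthly_storage_usd = written_gb * 0.02
```

The exact count is 48 rounds and 384 GB, which the text rounds to ~50 and ~400. Deleting objects from completed rounds keeps the retained footprint far below the total written; request and transfer charges are not modelled here.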
| ### 7. Known limitations as of Wave 14 | |
| - **`ModalExecutor` and `HFJobsExecutor` are skeletons.** They check | |
| `import modal` / `import huggingface_hub` at *adapter init* time and | |
| raise; the actual `launch_replicas` is shape-only until the relevant | |
| spike lands. Use `LocalProcessExecutor` for dev. | |
| - **`ObjectStoreAllReduce(world_size=1)`** must passthrough cleanly — | |
| the unit test `test_object_store_allreduce_world_size_1_passthrough` | |
| is the regression guard. Don't override unless you've read it. | |
| - **Rank validation is mandatory.** Tests assert | |
| `ObjectStoreAllReduce(rank=N, world_size=N)` raises (rank must be | |
| `< world_size`); silent corruption otherwise. | |
| - **`MockManager` is *not* feature-complete.** It implements the | |
| `Manager.allreduce` surface that DiLoCo's outer-loop needs, but | |
| not the full `torchft.Manager` API (no fault-tolerance, no | |
| membership protocol). Don't use it as a drop-in for live torchft. | |
| - **No native heterogeneous compute** — all replicas are assumed to | |
| have the same compute shape. Mixed A100+H100 placements work but | |
| the slow replica gates outer-loop progress. | |
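The two `ObjectStoreAllReduce` invariants listed above (rank bounds and the single-replica passthrough) can be illustrated with a toy stand-in. `ToyObjectStoreAllReduce` and its in-memory dict "store" are hypothetical and exist only for this sketch; the real class's constructor and storage layout may differ.

```python
from typing import Dict, List, Tuple


class ToyObjectStoreAllReduce:
    """In-memory mean all-reduce keyed by (round, rank). Illustration only."""

    def __init__(self, store: Dict[Tuple[int, int], List[float]],
                 rank: int, world_size: int):
        if world_size < 1:
            raise ValueError("world_size must be >= 1")
        if not 0 <= rank < world_size:
            # Reject out-of-range ranks up front instead of corrupting a sync.
            raise ValueError(f"rank {rank} must satisfy 0 <= rank < {world_size}")
        self.store, self.rank, self.world_size = store, rank, world_size

    def publish(self, round_id: int, grad: List[float]) -> None:
        self.store[(round_id, self.rank)] = list(grad)

    def reduce(self, round_id: int) -> List[float]:
        """Average every rank's vector for this round (KeyError if one is missing)."""
        parts = [self.store[(round_id, r)] for r in range(self.world_size)]
        return [sum(col) / self.world_size for col in zip(*parts)]


# Single replica: the mean of one vector is the vector itself.
solo = ToyObjectStoreAllReduce({}, rank=0, world_size=1)
solo.publish(0, [0.5, -1.0])
solo_result = solo.reduce(0)

# Two replicas sharing one store: element-wise mean.
shared: Dict[Tuple[int, int], List[float]] = {}
a = ToyObjectStoreAllReduce(shared, rank=0, world_size=2)
b = ToyObjectStoreAllReduce(shared, rank=1, world_size=2)
a.publish(0, [1.0, 3.0])
b.publish(0, [3.0, 5.0])
pair_result = a.reduce(0)
```

Validating in the constructor means a misconfigured replica fails before it writes anything, which is the property the rank test protects.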
| --- | |
| ## Recipe 5 — Monarch actor mesh | |
| ### 1. When to use it | |
| Pick Monarch when: | |
| - You're at **TorchForge-style topology scale**: trainer / generator / | |
| rewarder / N-teachers all want to be independent, asynchronously | |
| scheduled, fault-tolerant actors on a typed mesh. | |
| - You want **heterogeneous executor support** — different actors run | |
| in different clouds (e.g. `TrainerActor` on Modal A100s, | |
| `GeneratorActor` on dedicated H100s, `TeacherPoolActor` as 0-GPU CPU | |
| pods on k8s). | |
| - You need **hot-swap of actor implementations** — replace | |
| "OpenRouter teachers" with "local vLLM teachers" by changing one | |
| Monarch binding, no trainer code change. | |
| - You're prepared to track **upstream Monarch** (v0.4.1 stable, v0.5 | |
| dev daily); the API is moving and v0 of this recipe is intentionally | |
| deferred per ADR-006. | |
| Don't pick Monarch in Wave 14 unless you're explicitly scoping a | |
| v0.2+ pilot. The framework ships *skeleton* actors that fail fast on | |
| instantiation; treat this recipe as a reference pattern to read, not a | |
| production target. | |
| ### 2. Install command | |
| ```bash | |
| pip install -e ".[prime-rl,monarch]" | |
| # pulls monarch>=0.4.1 plus the PRIME-RL trainer used inside actors | |
| ``` | |
| ### 3. Minimum-viable Python script | |
| The framework ships skeleton actor definitions at | |
| `composer_replication/recipes/monarch/actors.py`; they raise | |
| `NotImplementedError` on instantiation in Wave 14. The shape of the | |
| final answer: | |
| ```python | |
| # monarch_train.py: what v0.2+ usage will look like | |
| import asyncio | |
| from monarch import Actor, mesh, endpoint | |
| from composer_replication.recipes.monarch.actors import ( | |
|     TrainerActor, GeneratorActor, RewarderActor, TeacherPoolActor, | |
| ) | |
| # Topology | |
| trainers = mesh.spawn(TrainerActor, n=4, gpu="A100") | |
| generator = mesh.spawn(GeneratorActor, n=1, gpu="A100") | |
| rewarder = mesh.spawn(RewarderActor, n=1, gpu=None) | |
| teachers = mesh.spawn(TeacherPoolActor, n=1, gpu=None) | |
| # Wire endpoints | |
| async def outer_step(batch_id: int): | |
|     prompts = await trainers[0].sample_prompts.call(batch_id) | |
|     rollouts = await generator.rollout.call(prompts) | |
|     rewards = await rewarder.score.call(rollouts) | |
|     teacher_acts = await teachers.replay.call([ | |
|         {"state": r["state"]} for r in rollouts | |
|     ]) | |
|     await trainers.train_outer_step.call( | |
|         batch_id, rollouts=rollouts, rewards=rewards, | |
|         teacher_actions=teacher_acts, | |
|     ) | |
| # Run all outer steps on one event loop rather than a new loop per batch | |
| async def main(): | |
|     for batch_id in range(1000): | |
|         await outer_step(batch_id) | |
| asyncio.run(main()) | |
| ``` | |
| The Composer 3-channel loss lives inside `TrainerActor.train_outer_step`, | |
| which calls `compose_loss(...)` exactly as Recipe 1 does. The | |
| *orchestration* changes; the *loss math* doesn't. | |
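Because the shipped actors raise on instantiation, the data flow of `outer_step` can still be exercised with plain `asyncio` stand-ins. Everything in this sketch (the `Stub*` classes and their return shapes) is hypothetical and uses no Monarch APIs; it shows only the order in which the four roles hand data to each other.

```python
import asyncio


class StubTrainer:
    def __init__(self):
        self.completed = []

    async def sample_prompts(self, batch_id):
        return [f"prompt-{batch_id}-{i}" for i in range(2)]

    async def train_outer_step(self, batch_id, rollouts, rewards, teacher_actions):
        # A real actor would compute the loss here; the stub only checks shapes.
        assert len(rollouts) == len(rewards) == len(teacher_actions)
        self.completed.append(batch_id)


class StubGenerator:
    async def rollout(self, prompts):
        return [{"state": p, "completion": p[::-1]} for p in prompts]


class StubRewarder:
    async def score(self, rollouts):
        return [float(len(r["completion"])) for r in rollouts]


class StubTeacherPool:
    async def replay(self, states):
        return [{"state": s["state"], "action": "noop"} for s in states]


async def outer_step(trainer, generator, rewarder, teachers, batch_id):
    prompts = await trainer.sample_prompts(batch_id)
    rollouts = await generator.rollout(prompts)
    rewards = await rewarder.score(rollouts)
    teacher_actions = await teachers.replay([{"state": r["state"]} for r in rollouts])
    await trainer.train_outer_step(batch_id, rollouts, rewards, teacher_actions)


async def main(num_batches=3):
    trainer = StubTrainer()
    generator, rewarder, teachers = StubGenerator(), StubRewarder(), StubTeacherPool()
    for batch_id in range(num_batches):
        await outer_step(trainer, generator, rewarder, teachers, batch_id)
    return trainer.completed


completed = asyncio.run(main())
```

Keeping the loop inside a single `main()` coroutine means one event loop serves every batch, which is also what long-lived actor handles would need.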
| ### 4. Decoupled DiLoCo wiring | |
| Monarch + Decoupled DiLoCo compose naturally: each `TrainerActor` is a | |
| DiLoCo replica, and Monarch's supervision tree handles the failure | |
| recovery that ADR-005 lists as a DiLoCo design constraint. The wire-up | |
| is identical to Recipe 4's `LocalProcessExecutor` pattern, just running | |
| inside Monarch instead of `subprocess`: | |
| ```python | |
| from monarch import Actor, endpoint  # same import path as the script in section 3 | |
| from composer_replication.diloco.serverless import ( | |
|     ObjectStoreAllReduce, MockManager, | |
| ) | |
| class TrainerActor(Actor): | |
|     def __init__(self, rendezvous_uri: str, rank: int, world_size: int): | |
|         # Cross-replica sync goes through the object store, not the mesh | |
|         self.rendezvous = ObjectStoreAllReduce( | |
|             uri=rendezvous_uri, rank=rank, world_size=world_size, | |
|         ) | |
|         self.manager = MockManager(allreduce=self.rendezvous) | |
|         # ... build inner ComposerReplicationTrainer ... | |
|     @endpoint | |
|     async def train_outer_step(self, batch_id: int, **kw): | |
|         # Inner H steps locally, then sync via self.rendezvous | |
|         ... | |
| ``` | |
| The "object store" is the cross-actor synchronization point that | |
| *doesn't* go through Monarch's RDMA data plane — by design, slow | |
| syncs (S3) and fast syncs (RDMA for in-actor weight broadcast) live on | |
| different planes. | |
| ### 5. Distillation-loss wiring | |
| Monarch sees the loss as opaque: it lives inside `TrainerActor` and | |
| takes the same `compose_loss` kwargs as Recipe 1. The mesh-level | |
| benefit is **swap-by-binding**: you can replace `TeacherPoolActor` | |
| ("OpenRouter") with a `LocalVLLMTeacherActor` to switch the | |
| *supplier* of teacher log-probs without touching the loss config. | |
| ```python | |
| # Original binding: channel 3 via OpenRouter | |
| teachers = mesh.spawn(TeacherPoolActor, n=1, gpu=None) | |
| # Swap binding: channel 3 via local vLLM | |
| teachers = mesh.spawn(LocalVLLMTeacherActor, n=1, gpu="A100", | |
|                       model_id="Qwen/Qwen2.5-72B-Instruct") | |
| # Trainer config unchanged: | |
| trainer.compose_loss_kwargs = dict( | |
|     dpo_variant="simpo",  # same as before | |
|     sdpo_wrapper="taid", | |
|     taid_schedule_step=batch_id, | |
|     taid_total_steps=10_000, | |
| ) | |
| ``` | |
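Swap-by-binding is ordinary dependency injection: the trainer depends on an interface, not on a particular supplier. A minimal sketch using `typing.Protocol` follows; `TeacherSupplier`, both stub suppliers, and the `log_probs` method name are assumptions made for illustration and are not the framework's actor API.

```python
from typing import List, Protocol


class TeacherSupplier(Protocol):
    def log_probs(self, states: List[str]) -> List[float]: ...


class RemoteTeacherStub:
    """Stands in for an API-backed teacher pool."""

    def log_probs(self, states: List[str]) -> List[float]:
        return [-1.0 for _ in states]


class LocalTeacherStub:
    """Stands in for a locally served teacher."""

    def log_probs(self, states: List[str]) -> List[float]:
        return [-0.5 for _ in states]


def channel3_inputs(teachers: TeacherSupplier, states: List[str], loss_kwargs: dict) -> dict:
    # The loss configuration passes through untouched; only the supplier varies.
    return {"teacher_log_probs": teachers.log_probs(states), **loss_kwargs}


loss_kwargs = {"dpo_variant": "simpo", "sdpo_wrapper": "taid"}
remote = channel3_inputs(RemoteTeacherStub(), ["s0", "s1"], loss_kwargs)
local = channel3_inputs(LocalTeacherStub(), ["s0", "s1"], loss_kwargs)
```

Both calls carry identical loss settings, so any difference in training behaviour can be attributed to the teacher supplier alone.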
| ### 6. Cost ballpark | |
| In Wave 14: $0 (skeleton fails fast; no compute used). Projected for v0.2+: | |
| - **Mesh overhead**: Monarch's coordination plane is light — typically | |
| <1% of total compute even at 4-actor scale. The dominant cost is | |
| whatever the actors run. | |
| - **Heterogeneous placement** is the cost lever: e.g. a 4-trainer mesh | |
| with `TeacherPoolActor` on 0-GPU CPU pods can cut total $/hr by | |
| ~10–20% vs forcing all actors onto GPU nodes. | |
| - **Cluster bring-up**: Monarch v0.5's Slurm backend is stable; k8s | |
| backend is dev-track; bare-metal SSH backend is documented. | |
| ### 7. Known limitations as of Wave 14 | |
| - **Skeleton only, fails fast.** Importing `actors.py` is fine; | |
| instantiating `TrainerActor(...)` raises `NotImplementedError("v0 | |
| skeleton; deferred to v0.2 per ADR-006")`. By design. | |
| - **Upstream Monarch API is moving.** With v0.4.1 stable and v0.5 dev | |
| builds published daily, breaking changes are expected. Pin to a specific | |
| Monarch commit hash if you prototype. | |
| - **TorchForge is paused.** Per its own repo banner — don't take | |
| TorchForge's recipes as production patterns. Monarch alone is | |
| active; Forge as a layered framework is reference reading. | |
| - **Open question (deferred):** does Monarch v0.5's Slurm backend | |
| hand-shake cleanly with HF Jobs lifecycle? See | |
| `monarch_actor_layout.md` for the open-questions list. | |
| - **Open question (deferred):** can `TrainerActor` host | |
| `ComposerReplicationTrainer` unmodified, or does it need a | |
| `step_init` / `step_compute` split for Monarch's async actor model? | |
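Calling code can guard against the fail-fast behaviour described in the first bullet. The sketch below uses a hypothetical `SkeletonTrainerActor` stand-in (not the class in `actors.py`) that raises the error message quoted above, plus an assumed helper that reports whether an actor class can be constructed yet.

```python
class SkeletonTrainerActor:
    """Stand-in that mimics the Wave 14 skeleton: construction always fails."""

    def __init__(self, *args, **kwargs):
        raise NotImplementedError("v0 skeleton; deferred to v0.2 per ADR-006")


def is_implemented(actor_cls, *args, **kwargs) -> bool:
    """Return True if the class can be constructed, False if it is a skeleton."""
    try:
        actor_cls(*args, **kwargs)
    except NotImplementedError:
        return False
    return True


available = is_implemented(SkeletonTrainerActor)
```

A check like this lets a launcher fall back to another recipe with a clear message instead of surfacing a traceback mid-run.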
| --- | |
| ## Comparison matrix | |
| | Dimension | Recipe 1 — TRL | Recipe 2 — VeRL | Recipe 3 — PRIME-RL | Recipe 4 — Serverless DiLoCo | Recipe 5 — Monarch | | |
| |------------------------------------|-----------------------------|----------------------------------|-----------------------------------|------------------------------------|-------------------------------------| | |
| | **Maturity (Wave 14)** | Production-ready | Production-ready (adapter shape-only) | Recipe ready, runtime smoke deferred | `LocalProcessExecutor` ready; cloud adapters skeleton | Skeleton only; v0.2+ scope | | |
| | **Supports DAPO / GRPO** | GRPO ✅; DAPO via TRL master | GRPO ✅; DAPO ✅ (built-in) | GRPO+DPPO ✅ (DPPO mask is the headline) | Inherits from inner trainer | Inherits from inner trainer | | |
| | **Custom-loss extension cost (LOC)** | ~30 LOC (subclass override) | ~50–150 LOC (registered estimator) | ~20 LOC (single Python fn) | 0 (transparent wrapper) | ~30 LOC (loss inside actor) | | |
| | **OpenEnv-compatible** | ✅ (HF datasets layer) | ✅ (DataProto extension) | ✅ (rollout JSONL contract) | ✅ (orthogonal) | ✅ (RewarderActor binding) | | |
| | **Native multi-node** | ❌ (single-host FSDP only) | ✅ (Ray cluster + 3D-HybridEngine) | ✅ (trainer/inference/orchestrator split) | ✅ (the *whole point*) | ✅ (mesh of actors) | | |
| | **Native Decoupled DiLoCo** | ❌ — wrap with Recipe 4 | ❌ — wrap with Recipe 4 | ❌ — wrap with Recipe 4 | ✅ (this *is* it) | ✅ (compose with Recipe 4 inside actor) | | |
| | **License** | Apache 2.0 (TRL) | Apache 2.0 (VeRL) | Apache 2.0 (PRIME-RL) | Apache 2.0 (this repo) | BSD-3 (Monarch) | | |
| | **Our recommendation (Wave 14)** | **Default for ≤ 70B / single-host** | Pick at >70B *if* Ray is acceptable | Pick if PRIME-Intellect / DPPO mask is required | Stack on top of 1/2/3 for N replicas | Reference pattern only — revisit v0.2 | | |
| --- | |
| ## Cross-recipe checklist | |
| Regardless of which recipe you pick, these invariants are tested across | |
| the 115-test suite (post-Wave-15) and should be true of your wired-up system: | |
| - **`alpha_sdpo=0`** must reproduce the channel-1-only baseline | |
| bit-exact (`test_compose_loss_integration.py`). | |
| - **`beta_replay=0`** must reproduce the no-channel-3 baseline | |
| bit-exact. | |
| - **`sdpo_wrapper="taid"` without `taid_schedule_step`** must raise `ValueError` | |
| at the first step (`test_compose_loss_integration.py`). | |
| - **`sdpo_wrapper="taid"` at `taid_schedule_step / taid_total_steps = 0`** | |
| must ignore the teacher signal (`test_taid_loss_alpha_zero_ignores_teacher`). | |
| - **`sdpo_wrapper="taid"` at `taid_schedule_step / taid_total_steps = 1`** | |
| must equal plain SDPO (`test_taid_blended_logits_endpoints`). | |
| - **`dpo_variant="simpo"`** must be differentiable through the | |
| `log-of-sigmoid` path (`test_simpo_loss_differentiable`). | |
| - **`sdpo_wrapper="entropy_opd"`** must zero out when student ≡ teacher | |
| (`test_entropy_aware_opd_zero_when_distributions_match`). | |
| - **`ObjectStoreAllReduce(world_size=1)`** must pass through cleanly | |
| (`test_object_store_allreduce_world_size_1_passthrough`). | |
| If any of these fail in your wired-up system, run the corresponding | |
| unit test to localize the fault. Most failures come from a kwarg dropped | |
| at the adapter boundary, not from wrong loss math. | |
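The two TAID endpoint invariants in the checklist can be checked by hand with a small pure-Python sketch. It assumes TAID's blended-logits formulation, where the target is `softmax((1 - t) * student + t * teacher)` with `t = taid_schedule_step / taid_total_steps`, and uses a plain teacher-to-student KL as a stand-in for the unwrapped loss. The framework's own implementation and schedule may differ, and all function names here are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def taid_target(student_logits, teacher_logits, t):
    """Blend logits; the student side is treated as a constant (no gradient)."""
    blended = [(1.0 - t) * s + t * g for s, g in zip(student_logits, teacher_logits)]
    return softmax(blended)


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


student = [2.0, 0.5, -1.0]
teacher = [0.1, 1.5, 0.3]

# t = 0: the target is the student's own distribution, so the teacher drops out.
loss_at_0 = kl(taid_target(student, teacher, 0.0), softmax(student))
# t = 1: the target is the teacher's distribution, i.e. un-interpolated distillation.
loss_at_1 = kl(taid_target(student, teacher, 1.0), softmax(student))
plain_kd = kl(softmax(teacher), softmax(student))
```

If your wired-up system shows a non-zero teacher contribution at step 0, the schedule kwargs are the first thing to inspect.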
| --- | |
| ## Picking a recipe — decision flow | |
| 1. **Piloting Monarch (v0.2+)?** → Recipe 5. | |
| 2. **Else, need >70B / multi-host?** → Recipe 2 (VeRL) if Ray is OK, | |
| Recipe 3 (PRIME-RL) if you're in the PRIME-Intellect / DPPO universe, | |
| otherwise wait for Recipe 5. | |
| 3. **Else** → Recipe 1 (TRL) is the v0.0/v0.1 default. | |
| 4. **At any of 1–3, need N independent replicas / failure isolation?** | |
| → Stack Recipe 4 (Decoupled DiLoCo) on top. | |
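The decision flow above is small enough to encode directly. In this sketch the function name and flag names are hypothetical, and step 2 is read in the order written (Ray first, then the PRIME-Intellect stack), which is an assumption where both apply.

```python
def pick_recipe(piloting_monarch=False, needs_multi_host=False, ray_ok=False,
                prime_intellect_stack=False, needs_replicas=False):
    """Encode the four-step decision flow; returns the recipes to stack, in order."""
    if piloting_monarch:
        base = "Recipe 5 (Monarch)"
    elif needs_multi_host:
        if ray_ok:
            base = "Recipe 2 (VeRL)"
        elif prime_intellect_stack:
            base = "Recipe 3 (PRIME-RL)"
        else:
            # No multi-host option fits yet, so there is nothing to stack on.
            return ["wait for Recipe 5"]
    else:
        base = "Recipe 1 (TRL)"

    stack = [base]
    if needs_replicas:
        stack.append("Recipe 4 (Decoupled DiLoCo)")
    return stack


default_choice = pick_recipe()
large_with_replicas = pick_recipe(needs_multi_host=True, ray_ok=True, needs_replicas=True)
```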
| --- | |
| ## Pointers to source | |
| - Loss core: [`composer_replication/loss.py`](../composer_replication/loss.py) | |
| - TRL trainer: [`composer_replication/trainer/composer_trainer.py`](../composer_replication/trainer/composer_trainer.py) | |
| - PRIME-RL adapter: | |
| [`composer_replication/recipes/prime_rl/composer_loss.py`](../composer_replication/recipes/prime_rl/composer_loss.py), | |
| recipe doc: | |
| [`composer_replication/recipes/prime_rl/prime_rl_recipe.md`](../composer_replication/recipes/prime_rl/prime_rl_recipe.md) | |
| - Monarch skeleton: | |
| [`composer_replication/recipes/monarch/actors.py`](../composer_replication/recipes/monarch/actors.py), | |
| layout doc: | |
| [`composer_replication/recipes/monarch/monarch_actor_layout.md`](../composer_replication/recipes/monarch/monarch_actor_layout.md) | |
| - Serverless DiLoCo: | |
| [`composer_replication/diloco/serverless/`](../composer_replication/diloco/serverless/) | |
| - VeRL adapter (shape-only): `composer_replication/recipes/verl/` | |
| - ADRs: | |
| [`docs/adrs/ADR-005-serverless-diloco.md`](adrs/ADR-005-serverless-diloco.md), | |
| [`docs/adrs/ADR-006-rl-frameworks.md`](adrs/ADR-006-rl-frameworks.md), | |
| [`docs/adrs/ADR-007-distillation-losses.md`](adrs/ADR-007-distillation-losses.md) | |