# Examples Index

Five CPU-runnable examples demonstrating the framework end-to-end on real HF causal LMs. They form a progression from simplest to most methodologically complete:

| # | Example | Trace source | Channels | Wall-clock | Closes |
|---|---|---|---|---|---|
| 1 | [`qwen_05b_quickstart/`](qwen_05b_quickstart/) | minimal toy | LM-CE only | ~30s | "does the package import + run at all" |
| 2 | [`gsm8k_grpo/`](gsm8k_grpo/) | hand-written GSM8K (100 rows) | GRPO with `alpha=beta=0` | ~60s | Plain-GRPO baseline reference |
| 3 | [`gsm8k_grpo_with_sdpo/`](gsm8k_grpo_with_sdpo/) | hand-written GSM8K (B=2) | GRPO + SDPO column | ~25s | SDPO column wiring on synthetic prompts |
| 4 | [`sdpo_with_real_traces/`](sdpo_with_real_traces/) | `ClaudeCodeIngester` reading a hand-authored session JSONL | GRPO + SDPO column | ~30s | **Partial V5**: ingestion path validated; wiring smoke (misaligned) |
| **5** | **[`sdpo_with_real_traces_production/`](sdpo_with_real_traces_production/)** | **`ClaudeCodeIngester` → adapter → `ComposerDataCollator`** (with-error fixture) | **GRPO + SDPO (production-aligned)** | **~2min** | **V5 closure**: full production pipeline with error-site detection + properly-aligned SDPO mask |

**Recommended walk-through order**: 1 → 2 → 3 → 4 → 5. Each builds on the previous in scope.

## Why five?

- **#1** verifies the package is installable and the loss composition works at all (no SDPO, no DPO; pure LM-CE on a toy model).
- **#2** uses the production `ComposerReplicationTrainer` (TRL `GRPOTrainer` subclass) on a real GSM8K dataset with a regex-extract reward. This is the recipe a new user copy-pastes to start.
- **#3** drops the TRL trainer wrapper and calls `compose_loss` directly on hand-crafted hint contexts. The simplest place to see "alpha_sdpo=0.5 changes the loss" with all the wiring visible.
- **#4** uses real ingested Claude Code session JSONL (via `ClaudeCodeIngester`) but builds the SDPO batch by hand. It demonstrates that the ingester works, but the SDPO mask covers misaligned content. Wiring smoke, not production-grade.
- **#5** is the production-grade sibling to #4: it adds the `claude_states_to_trace_examples` adapter and uses `ComposerDataCollator` to build properly-aligned SDPO batches with hint injection at actual error sites. **This is what you should copy for real training.**

## What every example asserts

Each `run.py` ends with a verification block that asserts:

- The targeted channel(s) actually fired (`sdpo_jsd > 0` when `alpha_sdpo > 0`)
- The composed loss isn't trivially equal to `lm_ce` alone
- Gradient norms are finite and non-zero at every step

If any assertion fails, the script prints which channel didn't fire and exits non-zero. This is the user's smoke test, not just a demo.

## Production training

For real training (GPU, larger models, longer rollouts), use `ComposerReplicationTrainer` directly with a `ComposerDataCollator` that emits SDPO + DPO columns, which is exactly the path example #5 demonstrates. See `docs/INTEGRATION_RECIPES.md` for the production wiring patterns.
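
As a rough illustration of the verification block described under "What every example asserts", the three checks could be sketched in plain Python as below. This is not the code from any `run.py`. The function names `check_run` and `exit_code`, the per-step metrics dict, and its `loss` and `grad_norm` keys are hypothetical; only `sdpo_jsd`, `lm_ce`, and `alpha_sdpo` are names taken from this page. The sketch also assumes the "not trivially equal to `lm_ce`" check applies only when the SDPO channel is weighted in.

```python
import math
import sys


def check_run(step_metrics, alpha_sdpo, tol=1e-8):
    """Return a list of failure messages; an empty list means the run passed.

    step_metrics is a list of dicts, one per training step, with the
    hypothetical keys "loss", "lm_ce", "sdpo_jsd", and "grad_norm".
    """
    failures = []
    for step, m in enumerate(step_metrics):
        if alpha_sdpo > 0:
            # Channel fired: the SDPO term must be strictly positive.
            # `not x > 0` also catches NaN, which compares False.
            if not m["sdpo_jsd"] > 0:
                failures.append(f"step {step}: sdpo channel did not fire")
            # Composition is non-trivial: total loss differs from lm_ce alone.
            if math.isclose(m["loss"], m["lm_ce"], rel_tol=0.0, abs_tol=tol):
                failures.append(f"step {step}: loss equals lm_ce")
        # Gradient norm must be finite and non-zero at every step.
        g = m["grad_norm"]
        if not (math.isfinite(g) and g > 0):
            failures.append(f"step {step}: bad grad norm ({g})")
    return failures


def exit_code(step_metrics, alpha_sdpo):
    """Print each failure to stderr and return a process exit status."""
    failures = check_run(step_metrics, alpha_sdpo)
    for msg in failures:
        print("FAIL:", msg, file=sys.stderr)
    return 1 if failures else 0
```

A script following this pattern would end with something like `sys.exit(exit_code(metrics, alpha_sdpo=0.5))`, so that a CI job or shell wrapper sees a non-zero status whenever a channel fails to fire.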