
Examples Index

Five CPU-runnable examples demonstrate the framework end to end on real Hugging Face causal LMs. They progress from the simplest to the most methodologically complete:

| # | Example | Trace source | Channels | Wall-clock | Closes |
|---|---------|--------------|----------|------------|--------|
| 1 | qwen_05b_quickstart/ | minimal toy | LM-CE only | ~30s | "does the package import + run at all" |
| 2 | gsm8k_grpo/ | hand-written GSM8K (100 rows) | GRPO with alpha=beta=0 | ~60s | Plain-GRPO baseline reference |
| 3 | gsm8k_grpo_with_sdpo/ | hand-written GSM8K (B=2) | GRPO + SDPO column | ~25s | SDPO column wiring on synthetic prompts |
| 4 | sdpo_with_real_traces/ | ClaudeCodeIngester reading a hand-authored session JSONL | GRPO + SDPO column | ~30s | Partial V5: ingestion path validated; wiring smoke (misaligned) |
| 5 | sdpo_with_real_traces_production/ | ClaudeCodeIngester → adapter → ComposerDataCollator (with-error fixture) | GRPO + SDPO (production-aligned) | ~2min | V5 closure: full production pipeline with error-site detection + properly-aligned SDPO mask |

Recommended walk-through order: 1 → 2 → 3 → 4 → 5. Each builds on the previous in scope.
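
The walk-through order can be scripted. The sketch below runs each example's run.py in sequence and stops at the first one that exits non-zero. The directory names come from the table above; the parent directory passed as `root` and the helper name `run_in_order` are assumptions for illustration, not part of the package.

```python
import subprocess
import sys
from pathlib import Path

# Directory names as listed in the examples table, in recommended order.
EXAMPLES = [
    "qwen_05b_quickstart",
    "gsm8k_grpo",
    "gsm8k_grpo_with_sdpo",
    "sdpo_with_real_traces",
    "sdpo_with_real_traces_production",
]


def run_in_order(root, names=EXAMPLES):
    """Run <root>/<name>/run.py for each name, in order.

    Returns the name of the first example whose run.py exits non-zero,
    or None if every example passes. `root` is an assumed location.
    """
    for name in names:
        script = Path(root) / name / "run.py"
        # Use the current interpreter so the examples see the same environment.
        result = subprocess.run([sys.executable, str(script)])
        if result.returncode != 0:
            return name
    return None
```

Calling `run_in_order("examples")` from the repository root (if that is where the example directories live) returns `None` when all five pass, or the name of the first example to fix.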

Why five?

  • #1 verifies the package is installable and the loss composition works at all (no SDPO, no DPO — pure LM-CE on a toy model).
  • #2 uses the production ComposerReplicationTrainer (a TRL GRPOTrainer subclass) on a hand-written GSM8K dataset (100 rows) with a regex-extract reward. This is the recipe a new user copy-pastes to start.
  • #3 drops the TRL trainer wrapper and calls compose_loss directly on hand-crafted hint contexts. The simplest place to see "alpha_sdpo=0.5 changes the loss" with all the wiring visible.
  • #4 uses real ingested Claude Code session JSONL (via ClaudeCodeIngester) but builds the SDPO batch by hand. It demonstrates that the ingester works, but the SDPO mask covers misaligned content, so it is a wiring smoke test, not production-grade.
  • #5 is the production-grade sibling to #4: adds the claude_states_to_trace_examples adapter and uses ComposerDataCollator to build properly-aligned SDPO batches with hint injection at actual error sites. This is what you should copy for real training.
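
To make "alpha_sdpo=0.5 changes the loss" concrete without loading a model, here is a toy sketch in plain Python. It rests on two assumptions that this index does not confirm: that sdpo_jsd is a Jensen-Shannon divergence between two next-token distributions, and that the channel is mixed into the total as a weighted sum. The function `toy_compose` is hypothetical and is not the package's compose_loss.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def toy_compose(lm_ce, sdpo_jsd, alpha_sdpo):
    """ASSUMPTION: weighted-sum composition, for illustration only."""
    return lm_ce + alpha_sdpo * sdpo_jsd


# Hypothetical next-token distributions over a 3-token vocabulary.
plain = [0.7, 0.2, 0.1]         # model without a hint in context
hint_aware = [0.4, 0.4, 0.2]    # same model with a hint in context

divergence = jsd(plain, hint_aware)
baseline = toy_compose(lm_ce=2.0, sdpo_jsd=divergence, alpha_sdpo=0.0)
with_sdpo = toy_compose(lm_ce=2.0, sdpo_jsd=divergence, alpha_sdpo=0.5)
print(baseline, with_sdpo)  # the second value is larger whenever the distributions differ
```

With alpha_sdpo=0 the total collapses to lm_ce, which is the case example #2 covers; any positive weight on a non-zero divergence moves the total away from it.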

What every example asserts

Each run.py ends with a verification block that asserts:

  • The targeted channel(s) actually fired (sdpo_jsd > 0 when alpha_sdpo > 0)
  • The composed loss isn't trivially equal to lm_ce alone
  • Gradient norms are finite and non-zero at every step

If any assertion fails, the script prints which channel didn't fire and exits non-zero. This makes each example a smoke test for the user, not just a demo.
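
A verification block of this kind could look roughly like the sketch below. It is not copied from any run.py: the per-step metrics dictionary, the keys `loss` and `grad_norm`, and the helper names are assumptions for illustration. Only `sdpo_jsd`, `lm_ce`, and `alpha_sdpo` are names taken from this document.

```python
import math


def check_step(metrics, alpha_sdpo):
    """Return failure messages for one training step (empty list if all pass)."""
    failures = []
    if alpha_sdpo > 0:
        # `not x > 0` also catches NaN, which compares False to everything.
        if not metrics["sdpo_jsd"] > 0:
            failures.append("sdpo channel did not fire (sdpo_jsd <= 0 or NaN)")
        if math.isclose(metrics["loss"], metrics["lm_ce"], rel_tol=1e-9):
            failures.append("composed loss is indistinguishable from lm_ce alone")
    grad_norm = metrics["grad_norm"]
    if not (math.isfinite(grad_norm) and grad_norm > 0):
        failures.append("gradient norm is zero or non-finite")
    return failures


def verify(history, alpha_sdpo):
    """Check every step; return a process exit code (0 = pass, 1 = fail)."""
    for step, metrics in enumerate(history):
        failures = check_step(metrics, alpha_sdpo)
        if failures:
            for message in failures:
                print(f"step {step}: {message}")
            return 1
    return 0
```

A script would end with `sys.exit(verify(history, alpha_sdpo))` (after `import sys`), so that a silent channel fails CI rather than passing unnoticed.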

Production training

For real training (GPU, larger models, longer rollouts), use ComposerReplicationTrainer directly with a ComposerDataCollator that emits SDPO + DPO columns — exactly the path example #5 demonstrates. See docs/INTEGRATION_RECIPES.md for the production wiring patterns.