Examples Index

These five CPU-runnable examples demonstrate the framework end-to-end on real HF causal LMs. They progress from the simplest to the most methodologically complete:

| # | Example | Trace source | Channels | Wall-clock | Closes |
|---|---------|--------------|----------|------------|--------|
| 1 | `qwen_05b_quickstart/` | minimal toy | LM-CE only | ~30s | "does the package import + run at all" |
| 2 | `gsm8k_grpo/` | hand-written GSM8K (100 rows) | GRPO with `alpha=beta=0` | ~60s | Plain-GRPO baseline reference |
| 3 | `gsm8k_grpo_with_sdpo/` | hand-written GSM8K (B=2) | GRPO + SDPO column | ~25s | SDPO column wiring on synthetic prompts |
| 4 | `sdpo_with_real_traces/` | `ClaudeCodeIngester` reading a hand-authored session JSONL | GRPO + SDPO column | ~30s | Partial V5: ingestion path validated; wiring smoke (misaligned) |
| 5 | `sdpo_with_real_traces_production/` | `ClaudeCodeIngester` → adapter → `ComposerDataCollator` (with-error fixture) | GRPO + SDPO (production-aligned) | ~2min | V5 closure: full production pipeline with error-site detection + properly-aligned SDPO mask |

Recommended walk-through order: 1 → 2 → 3 → 4 → 5. Each example extends the scope of the one before it.

Why five?

- #1 verifies that the package is installable and that the loss composition works at all (no SDPO, no DPO; pure LM-CE on a toy model).
- #2 uses the production `ComposerReplicationTrainer` (a TRL `GRPOTrainer` subclass) on a real GSM8K dataset with a regex-extract reward. This is the recipe a new user copy-pastes to start.
- #3 drops the TRL trainer wrapper and calls `compose_loss` directly on hand-crafted hint contexts. It is the simplest place to see "`alpha_sdpo=0.5` changes the loss" with all the wiring visible.
- #4 uses real ingested Claude Code session JSONL (via `ClaudeCodeIngester`) but builds the SDPO batch by hand. It demonstrates that the ingester works, but the SDPO mask it builds covers misaligned content. Treat it as a wiring smoke test, not production-grade.
- #5 is the production-grade sibling to #4: it adds the `claude_states_to_trace_examples` adapter and uses `ComposerDataCollator` to build properly-aligned SDPO batches with hint injection at actual error sites. This is what you should copy for real training.
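
To make the "`alpha_sdpo=0.5` changes the loss" observation from #3 concrete, here is a minimal, framework-free sketch. It is not the package's `compose_loss`: the function `toy_compose_loss`, the weighted-sum form, and the two hand-written distributions are all assumptions made for illustration. The only things taken from this README are the names `lm_ce`, `sdpo_jsd`, and `alpha_sdpo`; reading `sdpo_jsd` as a Jensen-Shannon divergence is itself an inference from the metric name.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 wherever a_i > 0.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def toy_compose_loss(lm_ce, sdpo_jsd, alpha_sdpo):
    """Hypothetical composition: LM cross-entropy plus a weighted SDPO term.

    ASSUMPTION: a plain weighted sum. The real compose_loss may differ.
    """
    return lm_ce + alpha_sdpo * sdpo_jsd


# Hand-written next-token distributions over a 3-token vocabulary (illustrative).
without_hint = [0.7, 0.2, 0.1]
with_hint = [0.4, 0.4, 0.2]

lm_ce = -math.log(without_hint[0])  # cross-entropy if token 0 is the target
sdpo_jsd = jsd(without_hint, with_hint)

baseline = toy_compose_loss(lm_ce, sdpo_jsd, alpha_sdpo=0.0)
with_sdpo = toy_compose_loss(lm_ce, sdpo_jsd, alpha_sdpo=0.5)

print(f"lm_ce={lm_ce:.4f} sdpo_jsd={sdpo_jsd:.4f}")
print(f"alpha_sdpo=0.0 -> {baseline:.4f}, alpha_sdpo=0.5 -> {with_sdpo:.4f}")
```

With `alpha_sdpo=0.0` the toy loss collapses to `lm_ce`; any positive weight on a non-zero divergence moves it, which is the effect example #3 is meant to expose on a real model.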

What every example asserts

Each `run.py` ends with a verification block that asserts:

- The targeted channel(s) actually fired (`sdpo_jsd > 0` when `alpha_sdpo > 0`)
- The composed loss isn't trivially equal to `lm_ce` alone
- Gradient norms are finite and non-zero at every step

If any assertion fails, the script exits with a non-zero status and prints which channel didn't fire. That makes each example a usable smoke test, not just a demo.
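
The checks above can be sketched as a small standalone helper. This is an illustration of the pattern only, not the code in any `run.py`: the function names `check_step`, `verify_run`, and `main`, the per-step metrics dictionary, and the keys `loss` and `grad_norm` are hypothetical. Only `sdpo_jsd`, `lm_ce`, and `alpha_sdpo` are names used in this README.

```python
import math
import sys


def check_step(step, metrics, alpha_sdpo):
    """Return a list of failure messages for one training step (empty if OK)."""
    failures = []
    # 1. The targeted channel fired. A NaN value also fails this comparison.
    if alpha_sdpo > 0 and not metrics["sdpo_jsd"] > 0:
        failures.append(f"step {step}: SDPO channel did not fire (sdpo_jsd <= 0)")
    # 2. The composed loss is not just lm_ce.
    if math.isclose(metrics["loss"], metrics["lm_ce"], rel_tol=1e-9):
        failures.append(f"step {step}: composed loss equals lm_ce")
    # 3. The gradient norm is finite and non-zero.
    grad_norm = metrics["grad_norm"]
    if not (math.isfinite(grad_norm) and grad_norm > 0):
        failures.append(f"step {step}: bad gradient norm ({grad_norm})")
    return failures


def verify_run(history, alpha_sdpo):
    """Collect failures across every logged step."""
    failures = []
    for step, metrics in enumerate(history):
        failures.extend(check_step(step, metrics, alpha_sdpo))
    return failures


def main(history, alpha_sdpo):
    """Print each failure to stderr and return a process exit code."""
    failures = verify_run(history, alpha_sdpo)
    for message in failures:
        print(f"FAILED: {message}", file=sys.stderr)
    return 1 if failures else 0


# Hypothetical two-step log; a real script would collect this while training.
history = [
    {"loss": 1.42, "lm_ce": 1.30, "sdpo_jsd": 0.24, "grad_norm": 0.91},
    {"loss": 1.35, "lm_ce": 1.25, "sdpo_jsd": 0.20, "grad_norm": 0.77},
]
exit_code = main(history, alpha_sdpo=0.5)
# A script would finish with: sys.exit(exit_code)
```

Collecting all failures before exiting, rather than stopping at the first `assert`, is a design choice of this sketch: it lets a single run report every channel that did not fire.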

Production training

For real training (GPU, larger models, longer rollouts), use `ComposerReplicationTrainer` directly with a `ComposerDataCollator` that emits SDPO + DPO columns, which is exactly the path example #5 demonstrates. See `docs/INTEGRATION_RECIPES.md` for the production wiring patterns.