"""composer_replication — Composer 2.5 Replication Framework.

A research-grade, open replication of Cursor Composer 2.5's training recipe:
take any Hugging Face model and continue training it with RL, using a 3-channel
loss that combines

    1. RLVR / GRPO (channel 1, via TRL)
    2. SDPO hint-distillation (channel 2, OPSD-based)
    3. Multi-teacher trace-replay DPO (channel 3, this framework's contribution)

with optional DiLoCo / Streaming DiLoCo outer-loop sync for distributed runs.

See https://huggingface.co/Codeseys/composer-replication-framework for the
full project README, design docs, ADRs, and verification spikes.

## Two API surfaces, on purpose

This package exposes BOTH a verification-harness API and a production-trainer
API. Use the right one for your purpose:

### Verification harness (small, easy to call, NOT for real training)

`compose_loss(model, batch, alpha_sdpo, beta_replay)` is a free function
that returns `LossComponents(lm_ce, sdpo_jsd, trace_replay_dpo, total)`.
It stubs the GRPO channel with LM cross-entropy on response tokens (the
limit GRPO converges to under deterministic rewards) so you can verify
the 3-channel composition wires together WITHOUT spinning up TRL's full
reward + advantage machinery.
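
As a rough illustration of what "composition" means here, the sketch below
combines three scalar channel values into a total. The weighted-sum formula
is an assumption inferred from the `alpha_sdpo` / `beta_replay` parameter
names, not a statement of what `compose_loss` actually computes, and
`SketchLossComponents` / `sketch_compose` are hypothetical names that use plain
floats instead of tensors.

```python
from typing import NamedTuple


class SketchLossComponents(NamedTuple):
    # Hypothetical stand-in mirroring the field names of LossComponents.
    lm_ce: float
    sdpo_jsd: float
    trace_replay_dpo: float
    total: float


def sketch_compose(
    lm_ce: float,
    sdpo_jsd: float,
    trace_replay_dpo: float,
    alpha_sdpo: float,
    beta_replay: float,
) -> SketchLossComponents:
    # ASSUMPTION: channels 2 and 3 are scaled by their weights and added
    # to channel 1. The real package may combine them differently.
    total = lm_ce + alpha_sdpo * sdpo_jsd + beta_replay * trace_replay_dpo
    return SketchLossComponents(lm_ce, sdpo_jsd, trace_replay_dpo, total)


parts = sketch_compose(2.0, 0.5, 0.7, alpha_sdpo=0.1, beta_replay=0.05)
```

Under this assumption, setting `alpha_sdpo=0` and `beta_replay=0` would leave
only the channel-1 term in `total`, which is a convenient way to isolate one
channel while debugging.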

`build_batch(tokenizer)` produces a real chat-template-formatted batch
with all keys `compose_loss` may consume.

Use these for:
- CPU smokes on real HF models (Spike 006 / Spike 002a-mini-gpu)
- Unit testing custom loss-composition variants
- Debugging gradient flow through one of the three channels
- Anything where you want to call backward() on a real model without
  spinning up TRL

### Production trainer (use for actual training runs)

`ComposerReplicationTrainer` is a `trl.GRPOTrainer` subclass that
overrides `_compute_loss(model, inputs)` to compose the same 3 channels
on top of TRL's real GRPO machinery. This is what you train models with.

Use this for:
- Real training runs on HF models with real rollouts + rewards
- Anything where the GRPO channel's policy-gradient signal matters
  (i.e., not a memorization smoke)

The verification harness's `compose_loss` is intentionally NOT a
drop-in replacement for `_compute_loss` — they target different
phases of the framework's lifecycle.

## Quickstart (verification-harness API)

    >>> from composer_replication import compose_loss, build_batch
    >>> from transformers import AutoModelForCausalLM, AutoTokenizer
    >>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
    >>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
    >>> batch = build_batch(tokenizer)
    >>> components = compose_loss(model, batch, alpha_sdpo=0.1, beta_replay=0.05)
    >>> components.total.backward()

See `examples/qwen_05b_quickstart/run.py` in the repo for a complete CPU
smoke (verification harness) and `spikes/002a-mini-gpu-smoke/run_gpu_smoke.py`
for a GPU smoke (verification harness, bf16, 50 steps).

For production-trainer usage, see `docs/INTEGRATION_ARCHITECTURE.md` Recipe A.
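
DiLoCo support is optional: when its import fails, this package exports
`make_diloco_outer_loop` as `None` and sets `_DILOCO_AVAILABLE` to `False`.
A caller may want to guard on that before use. The sketch below shows one way
to do so; `require_feature` is a hypothetical helper, and the values passed to
it are stand-ins for the real exports so the snippet runs without this package
installed.

```python
from typing import Callable, Optional


def require_feature(
    name: str, available: bool, factory: Optional[Callable]
) -> Callable:
    # Hypothetical helper: fail early with a readable message instead of
    # hitting "'NoneType' object is not callable" later in a training script.
    if not available or factory is None:
        raise ImportError(
            f"{name} is unavailable; install its optional dependency first"
        )
    return factory


# Stand-ins for _DILOCO_AVAILABLE and make_diloco_outer_loop as they would
# look after a failed optional import.
try:
    require_feature("DiLoCo", False, None)
    guard_message = ""
except ImportError as exc:
    guard_message = str(exc)
```

With the real package, the equivalent call would pass `_DILOCO_AVAILABLE` and
`make_diloco_outer_loop` in place of the stand-in values.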
"""
from __future__ import annotations

# Loss composition (Spike 006)
from composer_replication.loss import LossComponents, compose_loss
from composer_replication.batch import build_batch

# Trace ingestion (Spike 007)
from composer_replication.ingestion.claude_code import (
    SYSTEM_PROMPT,
    ClaudeCodeIngester,
    IngestionStats,
)

# OPSD / SDPO loss (verified extension from siyan-zhao/OPSD, MIT)
from composer_replication.opsd import generalized_jsd_loss

# Teacher replay (Spike 001 → trainer)
from composer_replication.teacher_replay import (
    DEFAULT_TEACHERS,
    DPOPair,
    TeacherCallResult,
    TeacherSpec,
    TraceState,
    extract_dpo_pairs,
    replay_trace,
)

# Trainer (Spike 005)
from composer_replication.trainer import ComposerReplicationTrainer

# DiLoCo (Spike 008) — optional, requires torchft
try:
    from composer_replication.diloco import make_diloco_outer_loop
    _DILOCO_AVAILABLE = True
except ImportError:
    _DILOCO_AVAILABLE = False
    make_diloco_outer_loop = None  # type: ignore[assignment]

__version__ = "0.1.0"

__all__ = [
    # Core loss
    "compose_loss",
    "LossComponents",
    "build_batch",
    "generalized_jsd_loss",
    # Trace ingestion
    "ClaudeCodeIngester",
    "IngestionStats",
    "SYSTEM_PROMPT",
    # Teacher replay
    "DEFAULT_TEACHERS",
    "DPOPair",
    "TeacherCallResult",
    "TeacherSpec",
    "TraceState",
    "extract_dpo_pairs",
    "replay_trace",
    # Trainer
    "ComposerReplicationTrainer",
    # DiLoCo (optional)
    "make_diloco_outer_loop",
    # Meta
    "_DILOCO_AVAILABLE",
    "__version__",
]