"""composer_replication: Composer 2.5 Replication Framework.
A research-grade, open replication of Cursor Composer 2.5's training recipe:
take any HuggingFace model, further-RL-train it using a 3-channel loss combining
1. RLVR / GRPO (channel 1, via TRL)
2. SDPO hint-distillation (channel 2, OPSD-based)
3. Multi-teacher trace-replay DPO (channel 3, this framework's contribution)
with optional DiLoCo / Streaming DiLoCo outer-loop sync for distributed runs.
See https://huggingface.co/Codeseys/composer-replication-framework for the
full project README, design docs, ADRs, and verification spikes.
## Two API surfaces, on purpose
This package exposes BOTH a verification-harness API and a production-trainer
API. Use the right one for your purpose:
### Verification harness (small, easy to call, NOT for real training)
`compose_loss(model, batch, alpha_sdpo, beta_replay)` is a free function
that returns `LossComponents(lm_ce, sdpo_jsd, trace_replay_dpo, total)`.
It stubs the GRPO channel with LM cross-entropy on response tokens (the
limit GRPO converges to under deterministic rewards) so you can verify
the 3-channel composition wires together WITHOUT spinning up TRL's full
reward + advantage machinery.
`build_batch(tokenizer)` produces a real chat-template-formatted batch
with all keys `compose_loss` may consume.
Use these for:
- CPU smokes on real HF models (Spike 006 / Spike 002a-mini-gpu)
- Unit testing custom loss-composition variants
- Debugging gradient flow through one of the three channels
- Anything where you want to call backward() on a real model without
spinning up TRL
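The weighting that `compose_loss` applies can be sketched with plain floats.
This is a minimal illustration, not the real implementation: it assumes the
total is the stubbed GRPO term plus `alpha_sdpo` times the SDPO term plus
`beta_replay` times the trace-replay term, which this docstring does not
spell out. `SketchComponents` and `sketch_compose` are hypothetical names,
and the real function works on tensors rather than floats.

```python
from typing import NamedTuple


class SketchComponents(NamedTuple):
    # Hypothetical stand-in for LossComponents, holding plain floats.
    lm_ce: float
    sdpo_jsd: float
    trace_replay_dpo: float
    total: float


def sketch_compose(lm_ce, sdpo_jsd, trace_replay_dpo, alpha_sdpo, beta_replay):
    # Assumed composition: channel 1 unweighted, channels 2 and 3 scaled.
    total = lm_ce + alpha_sdpo * sdpo_jsd + beta_replay * trace_replay_dpo
    return SketchComponents(lm_ce, sdpo_jsd, trace_replay_dpo, total)


parts = sketch_compose(2.0, 0.5, 0.4, alpha_sdpo=0.1, beta_replay=0.05)
print(parts)
```

Under that assumption, setting both weights to zero leaves only the
cross-entropy stub, which is one way to isolate a single channel when
debugging gradient flow.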
### Production trainer (use for actual training runs)
`ComposerReplicationTrainer` is a `trl.GRPOTrainer` subclass that
overrides `_compute_loss(model, inputs)` to compose the same 3 channels
on top of TRL's real GRPO machinery. This is what you train models with.
Use this for:
- Real training runs on HF models with real rollouts + rewards
- Anything where the GRPO channel's policy-gradient signal matters
(i.e., not a memorization smoke)
The verification harness's `compose_loss` is intentionally NOT a
drop-in replacement for `_compute_loss` — they target different
phases of the framework's lifecycle.
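The override pattern described above can be sketched without installing TRL.
In this sketch `GRPOTrainerStub` is a hypothetical stand-in for
`trl.GRPOTrainer`, the extra channels are read from the inputs dict instead
of being computed from the model, and the weighted-sum form is an assumption
rather than the trainer's confirmed formula.

```python
class GRPOTrainerStub:
    # Hypothetical stand-in for trl.GRPOTrainer; returns a precomputed loss.
    def _compute_loss(self, model, inputs):
        return inputs["grpo_loss"]


class ThreeChannelSketch(GRPOTrainerStub):
    def __init__(self, alpha_sdpo, beta_replay):
        self.alpha_sdpo = alpha_sdpo
        self.beta_replay = beta_replay

    def _compute_loss(self, model, inputs):
        # Channel 1: whatever the parent trainer computes.
        grpo = super()._compute_loss(model, inputs)
        # Channels 2 and 3: precomputed here for illustration; a real
        # trainer would derive them from the model and the batch.
        sdpo = inputs.get("sdpo_jsd", 0.0)
        replay = inputs.get("trace_replay_dpo", 0.0)
        return grpo + self.alpha_sdpo * sdpo + self.beta_replay * replay


trainer = ThreeChannelSketch(alpha_sdpo=0.5, beta_replay=0.25)
inputs = {"grpo_loss": 1.0, "sdpo_jsd": 2.0, "trace_replay_dpo": 4.0}
loss = trainer._compute_loss(model=None, inputs=inputs)
print(loss)
```

Delegating channel 1 to `super()` is what keeps the policy-gradient signal
intact; the harness replaces exactly that call with a cross-entropy stub.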
## Quickstart (verification-harness API)
>>> from composer_replication import compose_loss, build_batch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
>>> batch = build_batch(tokenizer)
>>> components = compose_loss(model, batch, alpha_sdpo=0.1, beta_replay=0.05)
>>> components.total.backward()
See `examples/qwen_05b_quickstart/run.py` in the repo for a complete CPU
smoke (verification harness) and `spikes/002a-mini-gpu-smoke/run_gpu_smoke.py`
for a GPU smoke (verification harness, bf16, 50 steps).
For production-trainer usage, see `docs/INTEGRATION_ARCHITECTURE.md` Recipe A.
"""
from __future__ import annotations
# Loss composition (Spike 006)
from composer_replication.loss import LossComponents, compose_loss
from composer_replication.batch import build_batch
# Trace ingestion (Spike 007)
from composer_replication.ingestion.claude_code import (
    SYSTEM_PROMPT,
    ClaudeCodeIngester,
    IngestionStats,
)
# OPSD / SDPO loss (verified extension from siyan-zhao/OPSD, MIT)
from composer_replication.opsd import generalized_jsd_loss
# Teacher replay (Spike 001 → trainer)
from composer_replication.teacher_replay import (
    DEFAULT_TEACHERS,
    DPOPair,
    TeacherCallResult,
    TeacherSpec,
    TraceState,
    extract_dpo_pairs,
    replay_trace,
)
# Trainer (Spike 005)
from composer_replication.trainer import ComposerReplicationTrainer
# DiLoCo (Spike 008) — optional, requires torchft
try:
    from composer_replication.diloco import make_diloco_outer_loop

    _DILOCO_AVAILABLE = True
except ImportError:
    _DILOCO_AVAILABLE = False
    make_diloco_outer_loop = None  # type: ignore[assignment]
__version__ = "0.1.0"
__all__ = [
    # Core loss
    "compose_loss",
    "LossComponents",
    "build_batch",
    "generalized_jsd_loss",
    # Trace ingestion
    "ClaudeCodeIngester",
    "IngestionStats",
    "SYSTEM_PROMPT",
    "TraceState",
    # Teacher replay
    "DEFAULT_TEACHERS",
    "DPOPair",
    "TeacherCallResult",
    "TeacherSpec",
    "extract_dpo_pairs",
    "replay_trace",
    # Trainer
    "ComposerReplicationTrainer",
    # DiLoCo (optional)
    "make_diloco_outer_loop",
    # Meta
    "_DILOCO_AVAILABLE",
    "__version__",
]