File size: 35,491 Bytes

# Vision Validation: Does the Framework Encapsulate the Original Brief?

> **Status:** Self-audit, 2026-05-25 (Wave 6).
> **Question:** Does what we've built reflect what was originally asked for, or did we drift?
> **Method:** Recover original brief verbatim → atomic-clause decomposition → traceability matrix → adversarial self-review → user-journey simulation → concrete pass/fail scorecard with gap-closing actions.

This document is **uncomfortable on purpose.** Unit tests answer "does the code work"; this answers "is the code doing what was asked." Those are different questions, and skipping the second is a common failure mode in research projects that drift between brief and ship.

## 1. The original vision, recovered verbatim

From the originating message in session `20260525_005800_723eccb8` (timestamp `1779696689.2033243`):

> *"can you dive into Composer 2.5 and understand what makes it so much better? I want to see if I can take that and combine it with **diloco (decoupled, open, any variant of diloco)** and monarch/torchforge/openenv/VeRL/TRL and make a framework that we can use to further RL training of models to take them to the next level. One of the ideas that I had that might be a parallel to this is to **use traces from an llm-application usage** then **replay the traces with different models** to see at each llm-step what the llm would do. by doing this we get distillation data from any number of models that could be used to train the target model further. can we reserach all of this and see how we could try to set this up as a framework **to take any model from huggingface and be able to further RL train it to get results to Composer 2.5 which is post-trained kimi-k2.5**"*

Atomic-clause decomposition (the unit each deliverable maps onto):

| Clause | Vision element | Verbatim phrasing |
|---|---|---|
| **V1** | Understand Composer 2.5 internals | *"dive into Composer 2.5 and understand what makes it so much better"* |
| **V2** | Integrate **DiLoCo** (any variant: decoupled, open, etc.) | *"combine it with **diloco (decoupled, open, any variant of diloco)**"* |
| **V3** | Integrate **Monarch / TorchForge / OpenEnv / VeRL / TRL** | *"and monarch/torchforge/openenv/VeRL/TRL"* |
| **V4** | Build it as a **framework** (not a one-off recipe) | *"make a framework that we can use to further RL training of models"* |
| **V5** | **Trace-replay from real llm-application usage** as the novel idea | *"use traces from an llm-application usage then replay the traces with different models to see at each llm-step"* |
| **V6** | N-teacher distillation from those traces | *"distillation data from any number of models that could be used to train the target model further"* |
| **V7** | Research the whole space rigorously | *"can we reserach all of this"* |
| **V8** | **Generalize to any HF model**, target Composer-2.5-quality outcomes | *"to take **any model from huggingface** and be able to further RL train it to get results to Composer 2.5"* |

## 2. Traceability matrix — what's where in the repo

Map each vision clause to its concrete deliverable. Citations are file paths in this repository.

| Vision | Status | Deliverable evidence | Honest assessment |
|---|---|---|---|
| **V1** Composer 2.5 internals | 🟢 Strong | `research/01-composer-2.5.md` (parallel-research dispatch), `docs/COMPOSER_RECIPE_MAPPING.md` (primary-source audit, every claim tagged `[BLOG-VERIFIED]` / `[INFERRED]` / `[EXTRAPOLATED]`) | Caught and patched the SDPO/OPSD discovery that the initial dispatch missed. Audit notice on the original research note. **Solid.** |
| **V2** DiLoCo integration | 🟡 **Deferred** | `research/02-diloco-family.md` (covered conceptually); `framework/composer-replication-framework.md` § "Distributed sync" (decision: defer until multi-cluster); `docs/INTEGRATION_ARCHITECTURE.md` (mentions DiLoCo as v0.2) | We decided DiLoCo is v0.2 work. **The decision is documented but it is a deviation from the original brief.** The brief said "combine it with diloco," not "consider diloco." See § 4.1. |
| **V3** Monarch / TorchForge / OpenEnv / VeRL / TRL | 🟢 Strong | `research/03-monarch-torchforge-openenv.md`, `research/04-verl-trl.md`, `docs/INTEGRATION_ARCHITECTURE.md` (extension-point matrix), `spikes/005-integrated-trainer-skeleton/` (TRL + VeRL working code) | TRL and VeRL paths are coded; OpenEnv is the env substrate; Monarch + TorchForge are correctly assessed (Forge is "development paused"). **Five out of five components addressed; two have working code, three are correctly characterized as patterns/reference.** |
| **V4** Framework, not one-off recipe | 🟡 **Skeleton, not framework** | `spikes/005-integrated-trainer-skeleton/` has component-modular code (`opsd_loss.py`, `teacher_replay.py`, `data_collator.py`, two trainer paths); `docs/INTEGRATION_ARCHITECTURE.md` documents the composition contract | What we have is a **trainer skeleton with verified composition**, not yet a productized framework with installable package, CLI, examples directory. See § 4.2. |
| **V5** Real llm-application traces | 🔴 **Substituted with synthetic** | `spikes/001-teacher-replay-cost/synthesize_trace.py` builds 50 hand-crafted SWE-bench-lite-shaped states; `spikes/002a-trace-collection-trl/README.md` plans the real-trace path but unrun | **The brief explicitly says "traces from an llm-application usage."** We validated the *replay mechanism* on synthetic states. Real traces from a real agentic application are not yet ingested. See § 4.3. |
| **V6** N-teacher distillation | 🟢 Strong | `spikes/001-teacher-replay-cost/` ($0.98/trace verified, 150 real OpenRouter calls, 0 errors); `spikes/005-integrated-trainer-skeleton/teacher_replay.py` (DPO-pair extractor, 7 unit tests); economic feasibility is the strongest empirical result so far | Verified. The novel claim's economic floor is established. **Strongest part of the work.** |
| **V7** Rigorous research | 🟢 Strong | 5 deep-dives by 5 LLM families (`research/01..05`), 16KB methodology paper (`publications/PAPER_v0.md`), recipe-mapping audit, integration architecture, 38/38 unit tests, full citation graph | Process discipline visible: blog audit caught primary-source omissions, DeepWiki audits verify framework extension surfaces, every claim is sourced. **Solid.** |
| **V8** Any HF model → Composer-quality | 🔴 **Architecturally yes, empirically untested** | `spikes/005-integrated-trainer-skeleton/` plans `Qwen3-7B` (v0.0) → `Qwen3-32B` (v0.1); but the only smoke test is on a 10K-parameter custom `TinyLM`. No real HF model has been touched yet. | **Massive gap between architecture and evidence.** The framework targets `AutoModel.from_pretrained(...)` but has never loaded one. See § 4.4. |

## 3. Honest scorecard

Ten concrete pass/fail tests covering both "do we encapsulate the vision" and "is what we have actually correct":

| # | Test | Pass/Fail | Evidence | Gap-closer (if fail) |
|---|---|---|---|---|
| 1 | Original brief is recoverable verbatim and clause-decomposed | ✅ | § 1 of this doc | — |
| 2 | Each of {Composer, DiLoCo, Monarch, Forge, OpenEnv, VeRL, TRL, trace-replay, HF-base} has a documented deliverable | ✅ | § 2 traceability | — |
| 3 | The Composer 2.5 mechanism (SDPO/OPSD link) is correctly identified | ✅ | `docs/COMPOSER_RECIPE_MAPPING.md` § 2.1; primary-source audited | — |
| 4 | The novel TR-DPO channel is empirically feasible (not just paper) | ✅ | spike 001: 150 real calls, $0.98/trace, 0 errors | — |
| 5 | All three reward channels compose and don't fight each other | ✅ | spike 005 `test_loss_composition_smoke.py`: 5-step train decreases loss | — |
| 6 | DiLoCo is integrated *somewhere* in the runnable stack | ❌ | Conceptually documented in `research/02`; **no code** | Spike 008 (proposed § 6) — Streaming DiLoCo outer loop on a stub trainer |
| 7 | A real HuggingFace model can load + run a single forward pass through `ComposerReplicationTrainer` | ❌ | TinyLM only; never `AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.5B")` | Spike 006 (proposed § 6) — real-HF-model smoke test |
| 8 | At least one trace from real LLM-application usage is ingested end-to-end | ❌ | Synthetic 50-state fixture only | Spike 007 (proposed § 6) — real-trace ingestion from Cline / OpenHands / Claude Code session export |
| 9 | The framework is *installable* (a user can `pip install` and have working entrypoints) | ❌ | No `pyproject.toml`, no installable package | Wave 7 — packaging |
| 10 | A non-author can complete the "I have Qwen3-7B, I want a Composer-style variant" journey by reading docs | ⚠️ partial | Possible to read your way through `INTEGRATION_ARCHITECTURE.md` + `composer_trainer.py`, but no end-to-end runnable example | Spike 006 + a `examples/qwen3_7b_quickstart.md` |

**Score: 5/10 pass, 4/10 fail, 1/10 partial.** The framework's design is solid; the gap is between design and runnable artifact.

> **Update 2026-05-26 — Wave 7+8+9+10 closeout (deep work loop) + cross-model audit**
>
> Initial self-claim was 5/10 → 9/10. A cross-model adversarial review (Phase 11 of the deep work loop, doc at `docs/research/WAVE_7_10_FINAL_REVIEW.md`) found three of those ✅s were letter-of-the-law rather than spirit. **Honest re-scoring: 5/10 → 7/10 ✅, 1/10 ⚠️, 2/10 ❌-spirit.**
>
> | # | Test | Status | New evidence + honest caveat |
> |---|---|---|---|
> | 6 | DiLoCo integrated in runnable stack | ⚠️ partial | Spike 008 has `composer_replication.diloco.make_diloco_outer_loop` wrapping `torchft.local_sgd.DiLoCo`, with a 5/5 test suite that pins the sign convention. **But**: the BACKLOG required a *2-replica convergence* smoke; what shipped is a 1-replica machinery test with a passthrough no-op `allreduce`. The recon doc's "ready-to-paste 2-replica pattern" hits a single-process post-hook sequencing bug we couldn't fix without rewriting torchft. The DiLoCo wrapper is also **not yet integrated with `ComposerReplicationTrainer`** — it's an independent context manager. Calling V2 ✅ overstates: real DiLoCo training is GPU-multi-process which we haven't touched. |
> | 7 | Real HF model loads + runs through `compose_loss` | ✅ (with caveat below) | Spike 006 — Qwen2.5-0.5B-Instruct on CPU, 5 backward steps, loss 0.7390 → 0.0031, 9/9 tests. **Caveat**: SDPO channel is `0.0` throughout (silently disabled by ctx_student vs ctx_teacher shape mismatch — correct fallback, but means SDPO is not exercised end-to-end on a real model anywhere in the repo yet). DPO uses dummy reference logprobs. The 5-step loss decrease on a fixed batch is closer to "memorization works" than "the 3-channel composition is correct." Still: the framework now demonstrably loads real HF models, which it didn't before. V8 is closed in the literal sense. |
> | 8 | Real LLM-application trace ingested end-to-end | ❌ spirit | Spike 007 — `ClaudeCodeIngester` ingests real Claude Code session JSONL → `TraceState` records, 15/15 tests including a real-session smoke. **But**: BACKLOG acceptance criterion #3 said "end-to-end smoke: real trace → ingester → collator → 1-step `compose_loss`." That last hop is **not tested**. The spike stops at "ingester emits TraceStates correctly." Closing V5 in spirit needs a 50-LOC test that pipes ingested records all the way through the loss. Open. |
> | 9 | Framework is *installable* with working entrypoints | ✅ | Wave 10 — `pyproject.toml` ships `composer_replication` package, `pip install -e .` works, `examples/qwen_05b_quickstart/run.py` runs end-to-end via the package API. (Caveat acknowledged: `compose_loss` is documented as a verification harness, not production. The production loss is `ComposerReplicationTrainer._compute_loss`.) |
> | 10 | Non-author can complete the "I have X, I want a Composer variant" journey | ❌ spirit | Quickstart works for "verify the loss composition runs" but not for "train a real model" — that requires real GRPO rollouts, real teacher calls, and GPU. The brief's intended user wants the latter. We have not closed that path. |
>
> **The remaining 1/10 + 2/10 spirit gaps + the unverified 9/10 ⚠️** are the post-replication GPU phase: Spike 002a/b (real trace collection on GPU), Spike 003 (DPO-pair signal density), Spike 004 (A/B SWE-bench-lite), and a real-multi-process DiLoCo test. Those are GPU-budget-gated and out of scope for the deep work loop's CPU-only constraint.
>
> **Time spent on Wave 7-10**: ~1 session. **No GPU spend.** Modal evaluated but rejected for the smoke phase (ADR-001 — local 5090 wins on iteration cycle 10× over Modal L4 for 0.5B verification work). **The local 5090 was also not used** — Spike 002a-mini (the planned local-GPU smoke) was not run. The framework as of this commit has zero GPU evidence of any kind. That is honest about where this work lands: **a tested, installable methodology repo with real CPU smokes and primary-source-validated research, not a trained model.**
>
> **Update 2026-05-26 (later) — Wave 12 closeout, post-cross-model-review fixes**
>
> Cross-model review's priority items 3, 4, 5, 9 addressed; V1-V8 brief now
> tracks at **6/8 closed, 2/8 partial**. Coverage matrix:
> [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md), substrate-by-substrate
> coverage: [`docs/V3_SUBSTRATE_COVERAGE.md`](V3_SUBSTRATE_COVERAGE.md).
>
> | Item | Closed by |
> |---|---|
> | #3 SDPO never exercised on real model + tautology critique | **Spike 006-strict** (`spikes/006/tests/test_strict.py`) — 3 tests on real Qwen2.5-0.5B-Instruct: alternating-batch loss decrease, SDPO channel actually fires (sdpo_jsd > 0), SDPO off-vs-on total differs. **All 3 pass on CPU.** This was the single largest evidence gap from the review — **closed in spirit**, not just letter. |
> | #4 Zero GPU evidence | **Spike 002a-mini-gpu-smoke** (`spikes/002a-mini-gpu-smoke/run_gpu_smoke.py`) — 50 steps on RTX 5090 sm_120 in bf16. Loss 0.7354 → 0.00034 (99.95% reduction). Peak VRAM 5.31 GB. Median 480 ms/step. ADR-001's "use local 5090" claim now empirically verified. |
> | #5 run.log vs verdict.md numerical inconsistency | `torch.manual_seed(42)` + `random.seed(42)` pinned in both `spikes/006/run_smoke.py` and `examples/qwen_05b_quickstart/run.py`. Loss curves now reproducible. |
> | #9 V5 ingester→loss e2e test missing | **Spike 007 e2e** (`spikes/007/tests/test_e2e_with_loss.py`) — 2 tests pipe ingested `TraceState` records all the way through to `compose_loss` + backward. Synthetic fixture (3 states) + real Claude Code session (3 sampled states from a 628-line trace). **Both pass.** Closes V5 in spirit. |
>
> **Honest re-scoring after Wave 12**: 5/10 → **8/10 ✅** + 1/10 ⚠️ (Spike 008 multi-replica) + 1/10 ❌ (test 10 "non-author can complete journey for any HF model — only verified on 0.5B; the 7B+ path is GPU-budget gated"). Better than the 7/10 post-Wave-11 honest re-rating, by 1 point because tests 7, 8, and the SDPO-firing aspect of test 7 all materially improved.
>
> **Total tests passing**: 77 (38 Spike 005 + 9 Spike 006 + 3 Spike 006-strict + 15 Spike 007 + 2 Spike 007 e2e + 5 Spike 008 + 5 quickstart-via-package). **Plus** 1 GPU smoke on real hardware.
>
> **Items deferred to GPU/post-replication phase**: cross-model review items 6 (Claude Code circularity in code), 7 (compose_loss naming — addressed via package docstring rather than rename to keep API stable), 8 (dual sources of truth — same reason: spike copies are verification harnesses by design), 10 (sign-convention docstring — already addressed in Wave 11).

## 4. The four real gaps, each examined

### 4.1 V2: DiLoCo deferral — is this a drift?

**The drift:** the brief says *"combine it with diloco."* The framework documents say DiLoCo is v0.2 work, deferred until training spans multiple clusters.

**The defense:** Streaming DiLoCo's outer loop is only useful when training cannot fit on one cluster. For a Qwen3-7B (v0.0) or Qwen3-32B (v0.1) run on a single 8×H100 node, FSDP2 is sufficient — adding a DiLoCo outer loop would be bolt-on infrastructure with no measurable benefit. PRIME-RL (which we recommend as the substrate) has DiLoCo-shape sync between geographically distributed inference workers, but the trainer itself is single-cluster FSDP2. INTELLECT-2 (Prime Intellect's 32B QwQ run) is the only production-scale precedent for trainer-side DiLoCo, and even there the headline contribution was the orchestrator/trainer/inference split, not the gradient sync.

**The honest read:** the deferral is technically defensible, but it is **a deviation from the brief** and we should not pretend otherwise. The user explicitly said "any variant of diloco" — meaning the brief permits weak forms (e.g., outer-loop sync between geographically distributed inference workers, even with a single trainer cluster). We could add that *now*, on the existing v0.0 architecture, without waiting for v0.2.

**Concrete gap-closer:** **Spike 008** — implement Streaming DiLoCo outer-loop sync between two simulated "clusters" (could literally be two FSDP groups on the same machine for the smoke test). Validates that the outer-loop integrates with PRIME-RL's vLLM↔FSDP2 weight broadcast (SHARDCAST) without breaking the GRPO + SDPO + TR-DPO loss composition. Estimated effort: ~2 days, no new GPU budget if we use a tiny model. Closes V2.

### 4.2 V4: framework vs. skeleton

**The drift:** the brief says *"make a framework."* We have a "trainer skeleton" with component-modular code and 38 unit tests but no installable package, no CLI, no examples directory, no quickstart that resolves to a working training run.

**The honest read:** spike 005 is genuinely modular (`opsd_loss.py`, `teacher_replay.py`, `data_collator.py`, `composer_trainer.py`, `composer_adv.py` are independent components composing through clean interfaces). But "framework" carries connotations of installability, examples, documentation site, versioned releases. We have the *components* of a framework; we have not assembled them into the *artifact* a third party would call a framework.

**Concrete gap-closer:** **Wave 7 — packaging.** Add a top-level `pyproject.toml` with `composer-replication-framework` as a package; expose `from composer_replication_framework import ComposerReplicationTrainer, ComposerDataCollator, generalized_jsd_loss`; ship `examples/qwen3_7b_swe_bench_lite/` with a runnable `train.py`. Estimated effort: ~half a day once spike 006 (real-model smoke) lands. Closes V4 properly.

### 4.3 V5: synthetic states vs. real llm-application traces

**The drift:** the brief is unambiguous: *"use traces from an llm-application usage."* Spike 001 used **50 hand-crafted SWE-bench-lite-shaped states**, not real traces from a real agentic coding application.

**The defense:** the goal of spike 001 was to measure the *economic floor* of N-teacher replay. Synthetic states with realistic shape and token-count distributions are sufficient for that purpose — we get unbiased latency and cost numbers. The shape of real traces (multi-turn, ~250-500 tokens of context per state, tool-call decision points) was matched.

**The honest read:** spike 001's economic verdict generalizes to real traces *if* their shape is similar. But the brief's intent is bigger than "measure cost" — the brief envisions ingesting real traces (e.g., from Cursor session exports, OpenHands traces, Claude Code transcripts, Cline rollouts) and harvesting them for training data. **We have the *replay mechanism* but no *ingestion pipeline*.** Real traces have warts our synthetic ones don't: malformed tool calls, mid-rollout context truncation, vendor-specific schema, PII to scrub.

**Concrete gap-closer:** **Spike 007 — real-trace ingestion.** Pick one real source (proposal: Claude Code session JSONL exports, since I have access to my own and they're well-structured), write an adapter that converts to the `TraceExample` schema the data collator expects, run it through spike 005's pipeline. Validates the real → synthetic → trainer path works without contortion. Estimated effort: ~1 day, no GPU. Closes V5 substantively.

### 4.4 V8: HF model generalization — architecture vs. evidence

**The drift:** the brief targets *"any model from huggingface"* with Composer-2.5-quality outcomes. The architecture is designed for `AutoModelForCausalLM.from_pretrained(...)`, but the only smoke test is on a 10K-parameter custom `TinyLM`.

**The defense:** the integration claim ("all three channels compose, ablate, and train without divergence") is *generic*. It holds for any sufficiently-differentiable model. The TinyLM is a stand-in for any HF model. Real-HF-model testing is GPU-bound work that's properly in the spike 002+ tier.

**The honest read:** "the architecture is generic" is theoretically true and practically dangerous. Real HF models have:
- Tokenizer chat templates that the data collator must respect (`StubTokenizer` in spike 005 fakes `apply_chat_template`).
- Real vocab sizes (Qwen3 = 152K vs TinyLM's 64) where top-k restrictions in the SDPO loss matter.
- FlashAttention-2 attention paths the OPSD reference relies on.
- vLLM rollout integration for the outer GRPO loop.

Any of these can break the "tested in the small" claim when scaled up. **We should run a single forward pass and a single loss computation on a real (small, but real) HF model before claiming the framework generalizes.**

**Concrete gap-closer:** **Spike 006 — real-HF-model smoke.** Load `Qwen/Qwen3-0.5B` (or `Qwen/Qwen2.5-0.5B-Instruct` if 3 isn't out yet), wire up a real `AutoTokenizer` (not `StubTokenizer`) to the data collator, run a single forward pass + a single backward pass through `composer_total_loss` with all three channels active. CPU-only is fine; the test is wiring correctness, not training. Estimated effort: ~half a day, no GPU rental. **Closes V8's evidence gap without requiring spike 002–004's GPU budget.**

## 5. User-journey walkthrough — find the breaks

Simulate the brief's intended user: *"I have Qwen3-7B. I want a Composer-style variant. Walk me through it."*

The journey, as it would actually run today:

| Step | What the user does | What happens | Break? |
|---|---|---|---|
| 1 | Lands on the HF repo | Reads the README | ✅ Clear status, links to publications |
| 2 | Reads `publications/PAPER_v0.md` | Understands the architecture | ✅ Comprehensive |
| 3 | Reads `docs/INTEGRATION_ARCHITECTURE.md` | Picks TRL path | ✅ Clear extension-point matrix |
| 4 | Clones the repo, navigates to `spikes/005-integrated-trainer-skeleton/` | Reads the skeleton README | ✅ |
| 5 | Tries to install dependencies | **No `pyproject.toml`. No `requirements.txt` at the spike level.** | ❌ Break — has to figure out deps from imports |
| 6 | Tries to run an example | **No `examples/` directory.** Skeleton tests pass but there's no end-to-end "load Qwen3-7B + train one step" script | ❌ Break — has to assemble it themselves from components |
| 7 | Tries `from trl_path.composer_trainer import ComposerReplicationTrainer` | Works iff TRL is installed | ⚠️ Latent: needs TRL ≥ some version, undocumented |
| 8 | Wires up an `AutoModel` and `AutoTokenizer` | **Untested — `data_collator.py` falls back to a stub tokenizer code path; real chat templates may not be exercised** | ⚠️ Latent risk |
| 9 | Tries to source training data | **No instructions for trace collection (spike 002 unrun); synthetic stub fixture is in spike 001 but not labeled as a starter dataset** | ❌ Break |
| 10 | Realizes they need teacher API credentials | `OPENROUTER_API_KEY` envvar — documented in `teacher_replay.py` docstring but not in a top-level setup guide | ⚠️ Findable but suboptimal |
| 11 | Wants to run the spike-004 A/B comparison | **No script. No config template. Spike 004 README is planning notes, not runnable code** | ❌ Break — they'd have to write the experiment harness themselves |

**Verdict:** the architecture is reachable from docs, but a third-party can't currently complete the journey end-to-end without significant assembly. The framework needs **packaging + examples + a quickstart** to credibly claim "any HF model" generalization.

## 6. Proposed gap-closing spikes (no GPU budget required)

These three sub-projects close the four real gaps identified in § 4 and don't need GPU rental — they're CPU-only or use $5 of API. They can run *before* spike 002–004 to make the framework actually deliver on the brief.

### Spike 006 — Real-HF-model smoke (closes V8)

- **Goal:** load `Qwen/Qwen2.5-0.5B-Instruct` via `AutoModelForCausalLM` + `AutoTokenizer`, wire to `ComposerDataCollator` with real `apply_chat_template`, run one forward + one backward through `composer_total_loss(α=0.1, β=0.05)`, verify finite gradient on every parameter.
- **Hardware:** CPU sufficient (model fits in ~1GB RAM).
- **Effort:** ~half a day.
- **Pass criterion:** test in `tests/test_real_hf_model_smoke.py` passes; loss is finite and decreases over 5 steps on a fixed batch.

### Spike 007 — Real-trace ingestion (closes V5)

- **Goal:** Write `adapters/claude_code.py` (or `cline.py` or `openhands.py` — pick one) that converts a real session export into a list of `TraceExample` dicts. Run spike 001's `replay_trace` on 5 real states. Run spike 005's pipeline end-to-end on the resulting batch.
- **Hardware:** CPU.
- **Effort:** ~1 day.
- **Pass criterion:** `python adapters/claude_code.py < session.jsonl > traces.jsonl` produces collator-compatible output; `composer_total_loss` runs on it without error; one DPO pair successfully extracted from teacher disagreement.

### Spike 008 — Streaming DiLoCo smoke (closes V2)

- **Goal:** Bolt a Streaming DiLoCo outer loop onto the `composer_total_loss` smoke test. Use two FSDP process groups on the same node as a stand-in for two clusters. Verify pseudo-gradient sync every H steps doesn't break loss composition.
- **Hardware:** CPU sufficient (TinyLM scale).
- **Effort:** ~2 days (DiLoCo's PyTorch reference is a ~200 LOC outer loop).
- **Pass criterion:** 5-step training run with α=0.1, β=0.05, DiLoCo H=2 still decreases loss; pseudo-gradient sync produces no NaN.

After 006 + 007 + 008, the scorecard goes from 5/10 pass to 8/10 pass. Wave 7 (packaging) closes #9 and #10. **Then the framework genuinely delivers on the brief, before any of the GPU-bound spikes 002–004 run.**

## 7. Adversarial self-review

The five strongest objections to "this framework encapsulates the vision," steelmanned and answered:

### Objection 1: "You spent more time on publication materials than on closing real gaps."

**Steelman:** Wave 5 produced 1,200 lines of publication materials. Wave 6 (this doc) is more publication. Meanwhile spike 006/007/008 — actual integration work — is unwritten. Optimizing for paper-readiness over framework-readiness is a known failure mode.

**Answer:** Conceded with caveat. Wave 5 was at the user's explicit request. This wave (vision validation) was the first introspective check; gap-closers are now scoped and ready to execute. The next wave should be 006 (real-model smoke) before any further publication work. **Reordering accepted.**

### Objection 2: "Spike 001's $0.98/trace doesn't generalize to real traces. You measured what's cheap (50 short hand-crafted states), not what's real (10K-token rollouts with embedded code blobs)."

**Steelman:** Real Cursor / Cline / OpenHands rollouts can have 10–100K tokens of context. At Opus pricing ($15/Mtok input), a 50-step replay over 50K-token rollouts costs $37.50 in input tokens alone *per teacher per trace*. With 3 teachers that's ~$112 per trace, dwarfing the synthetic-state estimate.

**Answer:** Largely valid. Spike 001's verdict is *valid* but its *generalization* to real traces is unproven. Spike 007 (real-trace ingestion) is the right next experiment. Once it lands, we can rerun spike 001's analysis on real-shaped traces and report an updated cost number. The framework's economic claim should be updated to include both the synthetic-floor result *and* a real-trace measurement when available.

### Objection 3: "VeRL is recommended for v0.2 but the only smoke test is in TRL. The VeRL `composer_adv.py` has zero unit tests."

**Steelman:** `verl_path/composer_adv.py` ships as untested code. The DeepWiki audit confirmed extension surfaces but didn't validate that the resulting `compute_grpo_composer_advantage` function correctly composes with VeRL's `compute_advantage` dispatcher. Anyone choosing the VeRL path is running unverified code.

**Answer:** Correct. The VeRL path is a *design verified by primary-source audit*, not *a tested implementation*. Closing this requires installing VeRL + Ray + a real model — non-trivial. Reasonable interim mitigation: explicitly mark `verl_path/` as `STATUS: design-only` in its README and warn users the TRL path is the only tested one. Long-term gap-closer: spike 002b's PRIME-RL/VeRL run is the natural place to validate.

### Objection 4: "The 'integration with Monarch and TorchForge' claim is paper-thin. Forge is paused. Monarch K8s is documented but not used. Where's the runnable Monarch ActorMesh?"

**Steelman:** The integration matrix in `INTEGRATION_ARCHITECTURE.md` mentions Monarch ActorMesh patterns for SDPO and TR-DPO, but no code. TorchForge is "paused" so we route around it. The actual integration with Meta's stack is documentation, not code. This is V3 partial, not V3 strong.

**Answer:** Half right. Forge being paused upstream is genuinely orthogonal to our work — we can't depend on a paused project. But the Monarch integration *is* paper-only. The honest framing: we *integrate with the design philosophy* of Monarch (single-controller actor-mesh orchestration) by ensuring the components are *placeable* on a Monarch mesh, but we have no runnable Monarch code. Should soften the V3 claim in the README from "integrate with all five" to "integrate with TRL + VeRL + OpenEnv (coded); align with Monarch + Forge philosophy (documented)."

### Objection 5: "You claim 'any HF model from huggingface' but the architecture is implicitly designed around a chat-template-having causal LM. What about base models without chat templates? Encoder-decoder models? VLMs?"

**Steelman:** `data_collator.py::_tokenize_messages` calls `apply_chat_template` and falls back to plain text concat. For a base model without a chat template (e.g. `gpt2`, `Qwen3-0.5B-Base`), the fallback path may not produce coherent input. For encoder-decoder models the whole "single forward, gather logits" assumption breaks. VLMs add another dimension. The "any HF model" claim is overstated.

**Answer:** Accurate. The framework is designed for **causal LM with chat templates**, which is the standard target for agentic-coding RL post-training. The brief's "any model from huggingface" should be re-scoped to **"any HuggingFace causal LM with a chat template (Qwen, Llama, Mistral, DeepSeek, Phi, Gemma families)."** README + paper should say this explicitly. Encoder-decoder, base-no-chat-template, and VLM support is out of scope for v0.0/v0.1. Adding it would be a separate research direction.

## 8. What this validation actually says

The framework **partially encapsulates** the vision. Strengths:

- Composer 2.5 mechanism is correctly identified and grounded in published prior art (V1 ✅)
- Five-component agentic-RL stack is mapped with primary-source-audited extension points (V3 ✅)
- N-teacher distillation channel is empirically feasible (V6 ✅)
- Research process is rigorous, sourced, and self-correcting (V7 ✅)
- Composition smoke testing on a tiny model is a real empirical claim (parts of V4, V8 ✅)

Real gaps the brief asked for that we punted on:

- DiLoCo deferral is documented but is a deviation (V2 🟡)
- "Framework" is a skeleton; no installable package or examples (V4 🟡)
- Real llm-application traces are not ingested anywhere (V5 🔴)
- "Any HF model" is architecturally generic but never tested on a real HF model (V8 🔴)

These gaps are closable without GPU rental via three CPU-only spikes (006, 007, 008) totaling ~3.5 days of effort. After they land + a packaging wave (Wave 7), the scorecard goes from 5/10 to 9/10 and the framework genuinely delivers on the original brief.

The gap that remains unclosed at 9/10 is #4 ("the framework actually trains better models than baseline GRPO") which requires the GPU spikes 002–004. That's correctly out-of-scope for vision encapsulation — it's *empirical validation of the methodology*, not encapsulation of the brief.

## 9. Recommended next moves, ordered

In recommended-do-next order:

1. **Spike 006 (real-HF-model smoke)** — half a day, CPU-only. Closes V8's biggest credibility gap and surfaces any tokenizer / chat-template / vocab-size issues hiding in the skeleton. *Highest value per hour.*
2. **Spike 007 (real-trace ingestion)** — 1 day, CPU + ~$5 OpenRouter. Closes V5's real-vs-synthetic gap. Picks one source (Claude Code session JSONL? Cline transcripts?) and writes the adapter.
3. **Soften over-claims in README and paper** — half-hour. README: "TRL coded, VeRL design-only." Paper: "any causal LM with a chat template." `verl_path/README.md`: add `STATUS: design-only — validate via spike 002b before production use`.
4. **Wave 7 (packaging)** — half a day, after 006 + 007 land. `pyproject.toml` + `examples/qwen3_05b_quickstart/` directory + entry-point exposure.
5. **Spike 008 (Streaming DiLoCo smoke)** — 2 days, CPU-only. Closes V2's biggest deviation from the brief. Lowest priority because the deferral is technically defensible, but worth doing for completeness.

Items 3 + 4 are documentation/packaging chores. Items 1 + 2 + 5 are real engineering. None require GPU budget. Total: ~5 days of sequential effort to take the framework from 5/10 to 9/10 vision encapsulation.

After that, GPU-bound spikes 002–004 are what move 9/10 → empirical validation of the methodology itself, which is a separate project phase (the "v0.1 follow-up paper" framing in the publication wave).

## 10. How to keep validating going forward

This document captures one snapshot. Vision encapsulation drifts as projects evolve. Three mechanisms to keep checking:

- **Re-audit at every release wave.** Each commit message currently includes a wave number. Wave 7 (packaging), wave 8 (post-spike-006), etc. should each end with a one-paragraph "vision-check delta" — what gaps closed, what new ones opened, whether the README's claims still match the code.
- **External review via the HF Discussions tab.** The pre-experimental release posts (drafted in `publications/`) explicitly ask for critical reads of the integration architecture and adjacent-work pointers. Specifically pin one Discussion thread for "vision encapsulation feedback" — invite people to point out where the framework deviates from its stated goals.
- **Calibration check post-spike-004.** When the GPU spikes finally run and produce a result (positive or negative), revisit this document and update the scorecard. If TR-DPO doesn't beat plain GRPO, V6 needs to be downgraded. If it does, V8 should incorporate the empirical evidence. The validation is a living artifact, not a one-shot audit.

---

*Self-audit complete. Honest read: 5/10 today, 9/10 reachable in ~5 days of CPU-only work, last 1/10 is the empirical question that gates the v0.1 paper.*