SDG SFT Round-2 LoRA Adapter (v0.1)

A second-round LoRA adapter on Qwen/Qwen3.5-9B-Base, trained on a corpus generated by zndx/sdg-sft-r1. Released specifically to demonstrate a negative result on naive iterated self-distillation: overall reward continues to rise, but only by raising the floor on hard scenarios. The ceiling on easy scenarios saturates and AUC discrimination regresses.

Status: v0.1, peer-review preview. Curator: @zndx

Headline result

Held-out 50-scenario evaluation, mean R across 4 generations per scenario:

| Stage | Overall mean R | good_mean | bad_mean | R_A pass rate | AUC |
|---|---|---|---|---|---|
| Base | 0.208 | 0.205 | 0.210 | 0.55 | 0.478 |
| SFT-r1 | 0.289 | 0.311 | 0.268 | 0.68 | 0.590 |
| SFT-r2 (this adapter) | 0.318 | 0.309 | 0.327 | 0.76 | 0.475 |

The +10 % round-2 gain in overall mean R comes entirely from bad-scenario reward (+22 %). Good-scenario reward is effectively tied between r1 and r2 (0.311 vs 0.309). AUC discrimination regresses to ~0.48, slightly below the base model's 0.478: the model no longer distinguishes scenarios by quality. R_A pass rate continues climbing (0.55 → 0.68 → 0.76).
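The percentages above can be re-derived from the headline table. The short Python sketch below does only that arithmetic; the dictionary layout and the `relative_change` helper are illustrative names, not part of the release.

```python
# Values copied from the headline table (mean R per stage).
results = {
    "Base":   {"overall": 0.208, "good": 0.205, "bad": 0.210},
    "SFT-r1": {"overall": 0.289, "good": 0.311, "bad": 0.268},
    "SFT-r2": {"overall": 0.318, "good": 0.309, "bad": 0.327},
}


def relative_change(new: float, old: float) -> float:
    """Relative change from `old` to `new`, in percent."""
    return 100.0 * (new - old) / old


r1, r2 = results["SFT-r1"], results["SFT-r2"]

# Round-1 to round-2 change for each column.
for key in ("overall", "good", "bad"):
    print(f"{key:8s} r1 -> r2: {relative_change(r2[key], r1[key]):+.1f} %")

# Good-minus-bad gap per stage: a positive gap means good scenarios
# receive more reward than bad ones.
for stage, row in results.items():
    print(f"{stage:8s} good - bad = {row['good'] - row['bad']:+.3f}")
```

The good-minus-bad gap is positive for SFT-r1 and negative for SFT-r2, which is consistent with the AUC falling back to roughly chance level.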

Why this happens (mode collapse)

Round-2 corpus statistics show the mechanism directly:

| Corpus | Source policy | n samples | R mean | Unique template_ids |
|---|---|---|---|---|
| v2 | base | 665 | 0.522 | 171 / 540 |
| v3 (used here) | SFT-r1 | 740 | 0.513 | 88 / 540 |

The SFT-r1 policy strongly prefers a narrower set of catalog templates. When that policy generates the round-2 corpus, the new training distribution is roughly half as diverse as the round-1 corpus, measured by unique template_ids (88 vs 171 out of 540). SFT-r2 then over-specialises on that narrower subset, raising its average reward on the kinds of samples it has seen while losing the flexibility to generalise.
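A diversity figure like the "Unique template_ids" column could be computed along the following lines. This is a minimal sketch under stated assumptions: the corpus is taken to be JSONL with one object per line, and each object is assumed to carry a `template_id` field (the field name is inferred from the table header, not confirmed by the release). The catalog size of 540 comes from the table.

```python
import json
from typing import Iterable, Tuple

CATALOG_SIZE = 540  # catalog size as reported in the corpus table


def template_coverage(lines: Iterable[str],
                      field: str = "template_id") -> Tuple[int, float]:
    """Return (distinct template ids, fraction of the catalog covered).

    `field` is an assumed key name; adjust it to the real corpus schema.
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        record = json.loads(line)
        seen.add(record[field])
    return len(seen), len(seen) / CATALOG_SIZE


# Hypothetical usage against the round-2 corpus file:
# with open("rejection_samples_v3.jsonl", encoding="utf-8") as fh:
#     n_unique, coverage = template_coverage(fh)
#     print(n_unique, f"{coverage:.1%}")
```

With the reported counts, coverage drops from 171 / 540 (about 32 %) in v2 to 88 / 540 (about 16 %) in v3.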

This is a clean experimental demonstration of why naive iterated self-distillation requires explicit diversity preservation: mix-in of round-1 samples, anti-clustering penalties in the reward, or higher round-2 sampling temperature.
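As an illustration of the first option, the sketch below mixes round-1 samples back into a round-2 corpus. It was not used to train this adapter; the function name `mix_corpora`, the default `round1_fraction` of 0.3, and the representation of samples as dicts are all hypothetical choices made for the example.

```python
import random
from typing import List, Sequence


def mix_corpora(round1: Sequence[dict], round2: Sequence[dict],
                round1_fraction: float = 0.3, seed: int = 0) -> List[dict]:
    """Keep every round-2 sample and add round-1 samples so that they make
    up roughly `round1_fraction` of the mixed corpus."""
    if not 0.0 <= round1_fraction < 1.0:
        raise ValueError("round1_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n1 / (n1 + n2) = f for n1, where n2 = len(round2).
    n1 = round(round1_fraction * len(round2) / (1.0 - round1_fraction))
    n1 = min(n1, len(round1))  # cannot draw more than round 1 contains
    mixed = list(round2) + rng.sample(list(round1), n1)
    rng.shuffle(mixed)  # avoid training on the two rounds in blocks
    return mixed
```

Sampling round-1 items uniformly is the simplest choice; sampling preferentially from template_ids that are missing in round 2 would target the collapse more directly, at the cost of more bookkeeping.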

Training details

Identical hyperparameters to SFT-r1 except for the input corpus:

| Hyperparameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-9B-Base |
| Source corpus | rejection_samples_v3.jsonl (740 samples from SFT-r1) |
| Trainable params | 29.1M / 8.98B (0.32 %) |
| LoRA rank r | 16 |
| Epochs | 2 |
| Total grad steps | 94 |
| Final train loss | 0.139 (vs SFT-r1's 0.216, 36 % lower) |
| Final token accuracy | 96.4 % |
| Final entropy | 0.126 |
| Wall time | 77.8 min on 2× RTX 4090 |

Lower final train loss is consistent with mode collapse: the corpus is more self-similar, so SFT can fit it more tightly.

When to use this vs SFT-r1

  • For most generation tasks: use SFT-r1 (zndx/sdg-sft-r1). It generalises better and the AUC discrimination is meaningful.
  • For research on iterated self-distillation / mode collapse: use SFT-r2 to reproduce the negative result, or as a "before" baseline for a diversity-preserving variant.
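Either adapter could be loaded along the lines below. This is a sketch, not a tested recipe: it assumes both repositories are stored in the standard PEFT adapter format and that the base checkpoint resolves under the id given in the training table. The `load_adapter` helper and the `ADAPTERS` mapping are illustrative names. Dtype and device placement are left at library defaults and will usually need adjusting for a 9B model.

```python
BASE_MODEL = "Qwen/Qwen3.5-9B-Base"
ADAPTERS = {
    "r1": "zndx/sdg-sft-r1",  # recommended for most generation tasks
    "r2": "zndx/sdg-sft-r2",  # mode-collapse baseline (this adapter)
}


def load_adapter(round_name: str = "r1"):
    """Load the base model and attach the chosen LoRA adapter.

    Imports are local so that this module can be imported without
    transformers or peft installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    adapter_id = ADAPTERS[round_name]
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    # Wrap the base model with the LoRA weights from the adapter repo.
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()
    return tokenizer, model


# Hypothetical usage:
# tokenizer, model = load_adapter("r2")
```

Loading both adapters on the same base and comparing generations on an identical scenario set is one way to set up the "before" baseline mentioned above.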

Related artifacts

Citation

@misc{sdg-sft-r2-v01,
  title  = {SDG SFT Round-2 LoRA Adapter (v0.1): Iterated Self-Distillation Mode-Collapse Baseline},
  author = {Hill, Ryan and contributors},
  year   = {2026},
  url    = {https://huggingface.co/zndx/sdg-sft-r2}
}