Building ProofKit: fine-tuning a small model without losing the product

Community Article Published June 12, 2026

Field notes from the Hugging Face Build Small Hackathon.

TL;DR — I fine-tuned and distilled small models for a real product (ProofKit, a work-sample generator for job seekers), trained across my own RTX 2070 Super and rented Modal GPUs, distilled a 20B teacher down to a free, CPU-runnable 0.5B served through llama.cpp, and let a multi-judge eval settle every "is X better than Y?" argument. The headline: a tiny distilled model, running free and offline, ties a fine-tuned 20B and beats a 3B baseline on ProofKit's own rubric — once you fix the training data, not the method.

🎯 The problem I kept coming back to

ProofKit started with a very human problem: career changers often have the skills for a role, but not the clean resume line that proves it. My fiance had sales, operations, hourly-work grit, a WGU Business Management degree, a coursera Project Management Certificate and even a PMI AI Project Management Certification but kept hearing some version of "not enough experience" for Project Manager and Business Analyst roles he could plausibly grow into, even for entry level positions...

So the product idea became: stop only polishing claims. Help people build proof.

But "build proof" needs unpacking, because the obvious reading is wrong - and my fiancé called it out: recruiters don't ask you for a document proving you can do the job. He's never once been handed that request. The proof isn't for a gatekeeper who demands it. It's leverage in a job market where the old funnel has quietly broken.

The numbers are bleak in a specific way. Applications have exploded - LinkedIn has reported them climbing something like 45% year-over-year even as the number of openings fell, a lot of it AI-generated - so a single competitive role can pull a thousand applicants in days and recruiters are buried. A meaningful slice of postings are "ghost jobs" that were never going to hire anyone; studies land somewhere between a fifth and a third of listings. And recruiters increasingly don't wait for the pile or wade through it at all - they source, reaching out to candidates who are visible and credible. Applications are roughly half the volume and a minority of the hires.

So "apply to more jobs" is advice for a machine that's jammed. The move that still works is being findable and demonstrably able to do the work - and that's what a work sample is actually for. It isn't a document you hand over on request; it's:

something to post, so you show your thinking where recruiters look and active profiles get noticed,
a source of specific resume bullets and interview talking points grounded in something real you made,
and reps at producing the exact kind of artifact the job will demand on day one.

(I'm careful not to oversell the posting angle - I haven't seen a clean stat that "posting beats applying," so I treat it as getting visible where recruiters already source, which the data does support, rather than a guaranteed lever.)

A 2026 LinkedIn article on optimizing for the new algorithm says that what really drives performance now is comments, shares, profile views after a post, and messages/connection requests after posting. That’s basically LinkedIn saying: posts that trigger DMs/requests send a strong signal to its ranking systems. ~(Axia 2026, How to optimize your LinkedIn profile and posts for the new algorithm).
Career‑coaching and recruiter discussions point out that LinkedIn Recruiter tends to surface “most engaged” users higher in search—people actively commenting and posting, not just keyword‑stuffed profiles. ~ (Nick Rickards 2026, LinkedIn Algorithm Update: What Recruiters Need to Know)
You also see individual case studies where targeted posts lead to inbound recruiter outreach (e.g., a 2026 story about a job seeker whose niche, keyword‑rich posts around their target role led to recruiters finding and contacting them) ~ (Alexia Palau 2026, How Targeted LinkedIn Posts Can Boost Your Interview Chances)

That reframe shaped the product: ProofKit doesn't just generate a work sample, it generates the LinkedIn post, the résumé bullets, and the talking points around it - because the sample is only worth anything if it travels.

Sources for the funnel claims above: the application surge and AI flood - HeroHunt's 2025 recruiting review and Resume-Now's employer survey; ghost jobs at roughly a fifth to a third of listings - CNBC and the Clarify Capital study via Entrepreneur; recruiters sourcing passive, visible candidates over the inbound pile - The Interview Guys and LinkedIn's own talent resources; and getting discovered through an active presence - Built In.

ProofKit takes a target role, background, skills to prove, weak spots, and optional job/resume context, then produces a realistic simulated work sample: a fictional company, a role-specific challenge, a guided builder, a readiness review, and a portfolio packet. The integrity rules are load-bearing. The app never claims real employment or real client work, labels metrics as hypothetical, and includes an ethical disclosure in exports.

🧩 Why small models fit this product

Build Small's constraint turned out to be useful. ProofKit is not an open-ended chatbot. It asks the model to do narrow, structured jobs:

draft one section of a known artifact type,
revise text toward a specific rubric,
turn a finished artifact into interview talking points,
enrich a fictional work sample challenge,
stay honest about fictional companies and hypothetical metrics.

That kind of bounded writing task is exactly where a small model can be useful, especially when the app surrounds it with retrieval, templates, fallbacks, and integrity checks.

How Codex fit the build

Codex was most useful as a continuity layer: not "make me an app," but "stay inside this messy repo with me and help move it forward." I drove the product calls, but used Codex to inspect diffs, trace regressions, patch Gradio/UI behavior, tighten loading states, clean up exports, and keep the Hugging Face Space demo usable.

It also became the engineering partner for the model loop. Across the commits, Codex helped wire together fine-tuning scripts, Modal training, checkpointed evaluation, Qwen 0.5B retraining, distillation/quantization experiments, llama.cpp serving, Transformers-on-Space support, JSON-constrained generation, fallback behavior, and the final judge/evaluation writeups.

The biggest lesson was that agent help worked best when it was grounded in the actual repo: existing code, failing behavior, commit history, and test results. Codex did not replace the product judgment; it compressed the engineering loop between "this is broken" and "here is the smallest change that makes it work."

🛠️ Fine-tuning: from a custom loop to default tools

The hardest part wasn't LoRA — it was getting Windows + PyTorch + CUDA + Jupyter + Hugging Face packages to agree long enough to train. Two scars worth keeping:

ModuleNotFoundError: No module named 'torch'

🐍 A notebook/kernel problem, not a model problem — VS Code was using an interpreter without the training deps. Pinning ipykernel>=6.29,<7 stabilized the notebook, but the command-line script became the reliable path.
🔀 Import order matters on Windows. Importing torch before parts of the HF data stack could hang or crash. Fix: import datasets / pyarrow before torch. Lesson — "the package is installed" ≠ "the runtime is stable."

My first version used a custom transformers.Trainer loop because I needed something explicit while the environment shifted under me. Once the product direction was clear, the better move was to drop back to the default HF post-training tools:

python data/finetune/build_dataset.py
python scripts/finetune_lora.py --epochs 3

That script now uses TRL SFTTrainer + PEFT LoraConfig instead of a raw loop. The dataset builder still produces chat-format examples from ProofKit's own templates, profiles, and prompts — license-safe and domain-specific — while TRL handles the mechanics.

LoRA stayed; the trainer changed. The base model moved from a Qwen2.5-0.5B proof-of-concept to openai/gpt-oss-20b (much more capable, still Build-Small-sized). Full fine-tuning a 20B isn't the point — ProofKit needs a behavior adapter for a narrow workflow:

trainer:         trl.SFTTrainer
adapter:         peft.LoraConfig
target_modules:  all-linear        # attention + router, every layer

gpt-oss adds one twist: it's a Mixture-of-Experts model, so its fused expert tensors can also be adapted via PEFT target_parameters — but that only fits on a big GPU (the Modal H200 path below), not the default recipe.

Repo hygiene that the workflow forced on me:

📦 App code → GitHub + the HF Space repo
🧠 Trained weights → a HF model repo (visproj/proofkit-gpt-oss-20b)
🚫 Large binaries / screenshots → never in Git history (I had to scrub accidental PNGs before pushes were accepted)

🖥️➡️☁️ Two GPUs, one recipe: my RTX 2070 → Modal

All the early fine-tuning ran locally on my own NVIDIA RTX 2070 Super (8 GB VRAM). That was genuinely how the workflow got proven: a 0.5B + LoRA fits in 8 GB, and training on my own card meant a fast edit-run-inspect loop with no cloud round trip.

But 8 GB is a hard ceiling. gpt-oss-20b is ~40 GB in bf16 — not within an order of magnitude of fitting, even with LoRA. So the 20B work moved to Modal (hackathon GPU credits), same TRL + PEFT recipe, push the adapter to the Hub:

$env:BASE_MODEL="openai/gpt-oss-20b"
$env:MODEL_REPO="visproj/proofkit-gpt-oss-20b-lora"
$env:EPOCHS="1"
modal run scripts/modal_train_gpt_oss.py

Two expensive lessons came out of the move to rented GPUs.

⚠️ The "use the latest" trap (MoE edition)

My image listed deps with open lower bounds (transformers>=4.55, kernels>=0.9…) — "newest forever." On build day that resolved to the newest major of everything, and gpt-oss broke in three escalating ways:

1. import:   kernels 0.15 changed an API transformers 5.x calls the old way → ValueError
2. config:   peft's ParamWrapper (for MoE experts) rejects lora_dropout != 0 → crash
3. backward: with kernels removed to dodge #1, the native MoE backward fails →
             "GroupedMmBackward0 ... expected device meta but got cuda:0"

The tell was #3: gpt-oss's experts are fused 3D tensors. Adapting them needs PEFT target_parameters; running them forward and backward needs kernels, which must match transformers. The fix was the opposite of my instinct — pin the validated set instead of chasing newest:

transformers>=4.55,<4.60
kernels>=0.9,<0.10
peft>=0.17,<0.18        # ParamWrapper for the MoE experts
trl>=0.20,<0.24

💡 Dense models (the Qwen candidates) have none of this — no experts, no fused kernels, no version pact. For an MoE model, the training stack is a tested set, not a pile of independent "newest" packages.

💾 The 80 GB wall, and choosing the GPU on purpose

Pinned and training, the run hit a harder wall:

torch.OutOfMemoryError: CUDA out of memory.
GPU 0 has 79.25 GiB total, 155 MiB free.

The arithmetic is unforgiving: gpt-oss-20b dequantizes to ~~40 GB in bf16, and putting LoRA on the MoE experts makes PEFT rebuild each full expert as W + delta (~~1 GB) in the backward pass. PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True bought free memory from 155 MB → ~930 MB; shorter sequences got within ~85 MB of fitting — and "within 85 MB" still crashes.

The honest realization: which GPU is a design parameter, not a given. An H100 has the same 80 GB as the A100 — it buys speed, not headroom. Only a higher-memory card helps (H200 = 141 GB, B200 ≈ 180 GB). So rather than guess whether the experts were worth it, I ran both:

🟦 A100-80GB → attention-only adapter (target_modules="all-linear"), experts frozen
🟩 H200-141GB → full expert-adapting recipe (target_parameters re-enabled)

Same base, same data, same hyperparameters — the only difference is whether the experts were adapted. The eval (below) then answered the question with numbers instead of a hunch.

⏱️ The timeout that ate three hours

The two gpt-oss runs were each ~3.8 hr (7,000 examples, 1 epoch, 875 steps, eager attention + gradient checkpointing on a 20B). The Modal function carried a 3-hour timeout I never questioned, and the trainer ran save_strategy="no" — so it would only push the adapter once, at the very end.

FunctionTimeoutError: Task's current input hit its timeout of 10800s

It died at step 684/875 (78%, 2h59m). No checkpoints meant restart from zero, not 78%. Two fixes, the second mattering more:

⏲️ Timeout with real margin (8 hr). A cap should be a backstop against a hung run, not a guillotine that lands mid-training.
💾 Checkpoint to a persistent volume and commit on every save. A timeout kill isn't a clean shutdown, so buffered files never flush — the trainer needs a callback calling volume.commit() per save, so resume_from_checkpoint works.

🔥 save_strategy="no" + a tight timeout is a silent way to light hours of paid GPU on fire. I walked straight into it.

🧪 Things I tried and cut

Not everything that got built shipped — and that's fine:

🦙 Llama 3B → switched to Qwen 3B. I reached for a Llama 3B for the off-grid / llama.cpp path first, but it's gated and clicking through a license mid-flow is friction. An ungated Qwen of the same size does the identical job, and llama.cpp runs any GGUF.
☁️ Hugging Face Jobs — tried, didn't land. I used HF Jobs as a first stab at the off-grid path (a small dense Qwen 3B on a T4) and it did train — but it's not what I shipped. The 3B got replaced by a smaller, better-fitted distilled 0.5B (below). And when I tried to push gpt-oss onto HF Jobs, the MoE model needs MXFP4 handling and a 141 GB card the Jobs flavors don't offer — it hung without timing out and thrashed ~17 hours through my credits before I caught it. The lesson: match the model to the platform. gpt-oss belongs on Modal; HF Jobs was the wrong venue for it.

🍰 Distillation: shrinking the teacher for a free Space

There was still a gap between "I have a fine-tuned gpt-oss-20b" and "a stranger can use it for free." The tuned 20B is ~40 GB dequantized — no free Space runs that. And you can't shrink the adapter: a LoRA is welded to the teacher's exact tensors.

So I distilled the behavior, not the weights:

👨‍🏫 Let the big tuned teacher answer 7,000 of ProofKit's own prompts.
🧑‍🎓 Train a tiny Qwen 0.5B to imitate those answers — it never sees the 20B's internals, only its outputs, but it picks up the house style, structure, honesty.
📦 Convert to GGUF and serve through llama.cpp — free, offline, CPU-only.

Two gotchas:

⚡ vLLM, not model.generate(). A plain fixed-batch loop crawled (every sequence waiting on the slowest in its batch). vLLM retires finished sequences and slots new prompts in per token-step — ~10–20× faster for identical outputs.
🏷️ The final bug. Every distilled answer came out as final{"talking_points"…}. That's a gpt-oss "harmony" channel marker — the special token gets stripped on decode, but the channel name survives as plain text. Left in, the student would have learned to say "final" before every reply. One regex caught it before training. Look at the data before you train on it.

That left two easy-to-confuse 0.5B models worth keeping apart: one trained directly on the 7k set, one distilled from the teacher's answers. Rather than argue which to ship, I kept both and let the eval referee.

📊 Letting the eval settle every argument

By the end there were more "is X actually better than Y?" questions than I could answer by squinting at samples: base vs tuned gpt-oss, attention-only vs experts, direct-SFT vs distilled. Vibes don't scale.

So: one eval, 8 models, the same 15 held-out prompts, the same scoring, all greedy for fairness. The score blends two layers:

final = 100 * (0.6 * judge_mean/5  +  0.4 * gate_pass_rate)

🧮 40% deterministic gates — does the answer have a heading, name a fictional context, state assumptions, name a tradeoff, carry the ethical disclosure, hit a sane length? These encode ProofKit's load-bearing house rules (un-gameable by fluency, but gameable by keyword).
🧑‍⚖️ 60% LLM judge — instruction-following, depth/specificity, house style, integrity (1–5 each).

The piece I'm gladdest I added is a ceiling. A leaderboard ranks your models against each other but can't tell you if the best one is any good. So I dropped in gpt-5.5 — not as a competitor, as a yardstick. If fine-tuning a 0.5B did nothing, I'd see the tuned and untuned models bunch together while a frontier model towered over the lot.

⚠️ A contamination pass had to be fixed first. The raw sweep leaked gpt-oss's hidden reasoning channel into its answers (the judge mistook it for depth) and returned empty gpt-5.5 generations (the reasoning model spent its whole output budget thinking). Strip the channel, give the reasoning model room to answer, re-judge — then the board is trustworthy.

The headline leaderboard

After fixing the training data (more on that next) and re-grading under a three-judge panel — Claude Opus 4.7 (this agent, hand-scoring), gpt-5.5 (blind API), and a local Qwen-3B — here's where it landed (gpt-oss experts is kept un-retrained as a stale reference):

model                            Claude  GPT-5.5  Qwen-3B   AVG
gpt-5.5 (frontier ceiling)        94.6    95.6     90.8    93.7   the yardstick, not a competitor
gpt-oss attn (retrained teacher)  82.0    66.8     81.4    76.7
qwen-0.5b distilled (served)      79.0    68.6     82.2    76.6   ← 40× smaller, free on CPU, ~ties the 20B
qwen-0.5b direct 7k (served)      78.6    64.4     82.0    75.0
gpt-oss experts (STALE old-data)  67.6    68.6     81.8    72.7
qwen-3b base (untuned)            62.1    67.1     80.5    69.9
gpt-oss base (untuned, 20B)       55.4    53.8     68.2    59.1
qwen-0.5b base (untuned)          36.5    44.5     67.9    49.7

Three things fall out:

🏔️ The ceiling is real and far away. gpt-5.5 averages ~94; the best thing I trained is ~77 — a gap no small-model tuning closed. The pitch is not "rivals frontier." (gpt-5.5 also self-prefers, ranking itself #1 — another reason to keep a panel rather than one grader.)
🎯 Fine-tuning bought reliability of a required format. Untuned models (even the 20B) fail the house-rule gates — gpt-oss base names a tradeoff 0% of the time and skips the ethical disclosure. Tuning drives those gates to ~100%. For ProofKit those rules are load-bearing safety constraints, not stylistic preferences — an eloquent sample that omits "this is fictional" is a defect, not a strong sample.
🪶 The Build-Small win is the small one. Across all three judges the distilled 0.5B (76.6 avg) all but ties its own fine-tuned 20B teacher (76.7) — 40× smaller, running free on CPU through llama.cpp — and both served retrained 0.5Bs beat the stale experts and every untuned base.

⚖️ One judge is one opinion — and a data fix that paid off

A single LLM judge is one opinion in a confident costume, so the same 8×15 answers were re-graded by three judges from different families — gpt-5.5 (frontier, blind), a local Qwen-3B (blind, different lineage), and Claude Opus 4.7 (this in-session coding agent). The ranking is judge-dependent — Claude punishes the stale model's formulaic template hard while gpt-5.5 rewards its surface coherence — but what every judge agrees on is in the headline board above: gpt-5.5 is the unanimous ceiling, and both served retrained 0.5Bs beat the stale experts and every untuned base.

🪧 I'm honest about my own column being the weakest. The Claude row was scored inside this coding session — model labels visible, the prior analysis already in context. That's a primed, interested opinion, not a blind grader. The saving grace: the finding (retrained > stale > base) survives deleting my column entirely — the two blind judges rank the served retrained models above the stale experts too.

The one lever that mattered: the data

The root cause of the repetitive reviews and input-ignoring drafts was a template scaffold baked into the training data. build_dataset.py rendered the synthetic user answers and the target drafts from the same skill/constraint slots, so the model learned target = template, not target = f(input) — it hit every gate while describing the work instead of doing it, and dropped whatever the user actually typed. A tester caught it bluntly: "say something about rabbits so I know my answers are in the document" — and got no rabbits.

So I rebuilt the dataset — faithfulness anchors (a distinctive token shared between the user's answer and the target, so the model can't ignore inputs) plus seeded per-example variation across every task (the readiness review went from ~4 canned reasoning strings to 86 distinct across 125 rows) — and retrained the whole chain: teacher → distillation set → student → GGUF. build_dataset.py is the single root cause for everything downstream, so nothing short of a full rebuild moves the needle.

🪤 The most expensive lesson: green exit codes lied. The cloud trainer checkpoints to a persistent volume, and the student silently resumed a prior run's old-data checkpoint (far past the new run's total), decided it was "done," and re-published the old model — exit 0 the whole way. A twin of the trap lived in the model cache (the eval kept scoring a stale GGUF). The only thing that caught either was loading the actual artifact and reading its output. Verify the artifact, not the job status — especially when a persistent volume is involved.

🔬 Stay honest about what improved. The win is in input-faithfulness, format reliability, and variety — a 0.5B still can't reliably copy a truly arbitrary novel token (the runtime baseline-fallback covers that), and judged depth is still capped by model size. The template-scaffold ceiling is dented, not gone.

🎓 Meta-lesson: a single LLM judge is a vibe with a number attached. If a conclusion matters, make a few different models argue about it, and only trust the parts they can't disagree their way out of.

The full combined board is data/eval/leaderboard_3judge_retrain.json; the pre-fix run is preserved under data/eval/old_pretemplatefix/. (OpenAI quota knocked out a fourth, gpt-4o, judge this round — so it's three judges, not four.)

🚀 Serving it: a hosted baseline and a borrowed GPU

The original plan was clean — push the model to a HF repo and point the app at it:

HF_MODEL=visproj/proofkit-gpt-oss-20b

But a model repo is not a guaranteed serverless inference path. Calls through InferenceClient failed with provider/support errors — the repo was valid, but the serverless route wouldn't serve it as a chat model. So serving split by model:

🌐 Standard gpt-oss-20b → hosted HF inference (managed loading, routing, retries).
🧠 Fine-tuned ProofKit 0.5B (proofkit-qwen0.5b-7k) → loads inside the Space with Transformers.
🦙 Distilled 0.5B (proofkit-distilled-qwen0.5b-gguf) → llama.cpp, fully off-grid on CPU.
🔀 The UI toggles between them; if a fine-tuned path is slow or unavailable, the app falls back to gpt-oss-20b and tells the user.

The deployment lesson was ZeroGPU. Rather than pay for a persistent GPU Space, ProofKit runs the in-Space Transformers model on ZeroGPU: an H200 is attached only while a @spaces.GPU-decorated function runs, then released between calls. For a bursty, draft-one-section-at-a-time workload that's a near-perfect fit — but it comes with one hard rule.

🔧 CUDA may only be touched inside the decorated function. So the model loads on CPU at startup and only moves to the H200 inside the generation window, then comes back. The same code path degrades to a no-op decorator off-ZeroGPU, so it runs unchanged on a plain CPU box or my laptop. I'd assumed this needed a big refactor and nearly shipped a plain T4 Space instead — but scoping the GPU work into one function turned out to be the whole job.

PROOFKIT_BACKEND=hf                          # hosted baseline + in-Space fine-tunes
PROOFKIT_BASELINE_MODEL=openai/gpt-oss-20b   # hosted fallback
PROOFKIT_GENERATION_MAX_SECONDS=45           # ZeroGPU clamps each generation to ~45s

The app starts on standard gpt-oss-20b; the fine-tuned models are opt-in from the sidebar; a slow or failed generation triggers a visible fallback notice.

🧱 What's generated vs deterministic

Not every output should be pure LLM text. ProofKit deliberately mixes:

🔎 semantic RAG for role matching
🧩 deterministic templates for structure
📏 heuristics for readiness labels
✨ LLM enrichment where it helps (Builder, autopilot, revisions, scenario/review enrichment, portfolio packet)
🛟 fallback templates when models fail
🔒 explicit integrity rewriting

Recommendations and export formatting are intentionally deterministic — more reliable, and easier to explain.

🎛️ UI and state were harder than the model

A product workflow carries a lot of state — selected model, current goal, role match, selected artifact, challenge, draft sections, review, portfolio packet, export files. Importing a demo profile must reset all downstream state; Start Over must clear the same things; model switching must not mutate unrelated state.

🏗️ The biggest architectural lesson of the final stretch: small-model demos become real apps fast, and real apps need boring, predictable state boundaries. (A future refactor should treat model choice as explicit session state, not env mutation + singleton reset.)

👤 What user testing changed

The model and deployment work were necessary, but the changes that made ProofKit usable came from watching a non-technical recruiter friend try it — exactly who it's for. Three surprises:

🗣️ Vocabulary. "Artifact" and "Preferred artifact style" meant nothing; "Skills to prove" read as skills you haven't proven yet — the opposite of the intent. Fix: plain language ("Type of work sample to create," "What skills do you want to demonstrate?"), example placeholders, helper text. The boxes that said "one per line" only made sense once the placeholder showed example lines. The words in the form are part of the product.
🧭 Navigation. He'd generate something and not know which tab to click; after "Autopilot all sections," the draft rendered below a tall stack of controls so it looked like nothing happened. Fixes: an explicit forward button on every step, and splitting Builder into a Builder (do the work) and a Work Sample tab (see the result), auto-advancing so output lands where the eyes already are.
🌍 Coverage anxiety. Great when a template/KB entry existed for the exact role, worse when it didn't (generic case + unrelated Role Match). Rather than define every job, I leaned on the small model for graceful degradation — when a role isn't covered, ProofKit asks the model to name the artifact that profession produces and synthesize a grounded role profile. (I also added demos/templates for the paths people kept reaching for: recruiting, sales, clinical/nursing, pharmacy, executive, product.) For a long-tail product, graceful degradation beats exhaustive coverage.

Smaller wins from testing: exports now include a reusable "Your Inputs & Answers" sheet; revisions can apply to the whole document, not just one section; the portfolio packet lost its GitHub README (only ever made sense for developers).

🎓 The through-line: most of these were product/UX decisions, not model decisions. A small model can carry a real product — but only if the workflow around it is legible to someone seeing it for the first time.

🔒 The prompts became part of the model

The strangest bug didn't look like a bug. After wiring the fine-tuned 0.5B into the app, output turned to mush — Chinese tokens mid-sentence, JSON key names leaking into a LinkedIn post, resume bullets that described bullets instead of being them. My instinct was to blame the model. The cause was on my side of the API.

The app kept evolving after training — prompts reworded, integrity rules added, schema descriptions inlined — but the model didn't evolve with it. A fine-tune this small doesn't learn "follow instructions in general." It learns these prompt shapes → these outputs. Reword the prompt and you're handing it a distribution it's never seen, and the Qwen base leaks through.

The fix: a shared, frozen prompt contract. One module (prompt_formats.py) builds every prompt for both the dataset generator and the live app, byte-for-byte identical, with a test that regenerating the training set produces zero diff. New runtime context is only allowed in as extra Label: value lines in the same visual style the model trained on.

Three layered lessons I didn't expect:

🔐 Fine-tuning converts your prompts from copy you can edit into an interface you must version. With a hosted model, prompts are a writing problem. With a small fine-tune, "just improve the wording" is a breaking change requiring a dataset rebuild and retrain.
🧱 It freezes the task, not just the wording. I asked the readiness review for free-form prose (it was trained on strict JSON); the distilled model returned a tidy section draft instead — it reached for the nearest task it did know. A 0.5B holds a lookup table of (exact prompt shape → output shape); anything off the table degrades to the nearest neighbor. So the app carries two prompt sets — rich, editable ones for the big baseline, frozen trained shapes for the small models — selected by backend.
🪤 A distilled student inherits the ceiling of the data it imitated. My SFT targets were template-generated, so the student learned to reproduce templates; no inference-time cleverness (I tried nudging temperature) pulls out variety that was never there. You can't distill quality you never put in the dataset — the only real lever is upstream (richer targets, distill again). This is exactly what the retrain above set out to fix.

The detail that surprised me most: running my two 0.5B models on the same review, the directly fine-tuned one writes a real, content-aware review while the distilled-then-q4-quantized one reproduces the template word-for-word. Same data, opposite behavior — the directly-tuned model generalized the task; the distilled-and-compressed one memorized it. Two effects stack on the small one: sequence-level distillation teaches it to copy near-template answers, and 4-bit quantization sands off the low-probability tokens that would let it deviate. How you shrink a model is its own quality decision — the cheaper-to-run the artifact, the more rigid it tends to be.

📝 What I'd tell myself on day one

🖥️ Use the command-line training script earlier; notebooks on Windows are a convenience, not the source of truth. Verify the active interpreter before debugging "missing" packages, and mind import order (datasets/pyarrow before torch).
🔧 Once the proof-of-concept works, move custom training code back to default HF tools (TRL + PEFT) — less maintenance risk than your own tokenization loop.
📌 For an MoE model like gpt-oss, pin the training stack — kernels and transformers must match. "Install the latest of everything" is a bug, not a best practice. Dense models don't have this constraint.
🎛️ On rented GPUs, the card is a design choice. An H100 has the same 80 GB as an A100; only a higher-memory card (H200 141 GB, B200 ~180 GB) buys headroom. Match GPU memory to the recipe before assuming you're stuck.
💾 Give long cloud runs a timeout with margin AND checkpoints that survive an ungraceful kill (commit to the volume on every save). A tight timeout + save_strategy="no" silently loses hours of paid compute.
👀 Look at a few rows of any generated dataset before training on it. One leaked channel marker would have taught the student to babble "final" before every answer.
📐 Give every eval a ceiling. Ranking your models against each other can't tell you whether the best one is any good; a frontier reference can.
🧑‍⚖️ One LLM judge is one opinion. If a conclusion matters, make several diverse judges argue, and trust only what they agree on.
🎯 A small model can lose on prose and still win the job if the job is conformance to a format — that gap is the reason to fine-tune at all.
🗂️ The real lever for a distilled model's quality is the data, not the method. Fix the dataset (faithfulness anchors, less templating) and re-distill before reaching for temperature/prompt tweaks. Our retrain proved this — tuned models jumped ~7–9 points once the data leakage was fixed.
🔒 Fine-tuning freezes your prompts and the task. Share one prompt-builder module between the dataset script and the app; treat any wording change as a breaking change. Ask a JSON-trained model for prose and it produces the nearest task it knows, not the one you asked for.
🪚 How you shrink the model is a quality decision. On the same data, the directly fine-tuned 0.5B generalized; the distilled-then-q4 one memorized the template. Direct SFT leaves more headroom than distill-then-quantize.
🚢 A model repo ≠ a guaranteed serverless inference path. Hosted inference is smoother because it's managed; ZeroGPU lets you borrow an H200 only while a @spaces.GPU function runs (load on CPU, touch CUDA only inside the window) instead of paying for a persistent GPU Space. Make fine-tuned an option, not a single point of failure — and tell the user when fallback happens.
🧱 Keep deterministic scaffolding around the model; put it in front of a real non-technical user early; plain words beat jargon in every field; for a long-tail product, graceful degradation beats trying to cover every case.

ProofKit is still the same idea it started as: help people turn career claims into credible, ethical proof. The fine-tuned model matters — but the bigger lesson is that the model is only one part of the system. The product works because small-model generation, retrieval, templates, guardrails, exports, and state management all cooperate. And the single most valuable finding, the one four judges couldn't argue away: a tiny, free, offline model — distilled the right way, from the right data — can do the right-shaped work reliably, and match a model forty times its size at it.

Signal Garden: A Game Engine That Keeps Mutating

June 16, 2026

Noteworthy

June 15, 2026

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote