FitCheck spec parser (Qwen3-1.7B LoRA)

Turns messy human descriptions of computers — "my dad's old Dell, i5, 16 gigs, some nvidia card" — into the structured spec JSON used by FitCheck, the honest "what AI can your computer run" advisor. This powers its paste box.

The rule it is trained toward: missing information should become null, not a guess. The LoRA follows that rule far more often than the base model, but not perfectly: on a builder-blind sealed test it still invents a value in about 18% of the cases where it should say null (vs 37% for the base model). See Evaluation for the honest numbers.

Training data: grounded, not synthetic-echo

Labels are never model-generated. Every training example starts from a real machine, with GPU and VRAM figures drawn from a mix of vendor pages and community-compiled spec tables (e.g. canirun.ai), covering 212 cards plus Apple chips. Only the phrasing varies, across ~24 registers that mimic how people actually write (casual chat, dxdiag dumps, Task Manager pastes, seller listings, consoles, comparisons, half-remembered specs, several languages). About 39% of examples have no GPU to extract; these are the don't-invent cases. Trained with Unsloth (bf16 LoRA, completion-only loss) on a single RTX 5090 laptop.
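A minimal sketch of the "fixed label, varying phrasing" idea described above. The record layout, the two templates, and the `make_example` helper are all hypothetical; the project's actual generator and its ~24 registers are not shown in this README. The sketch also assumes that naming a specific card implies its provider.

```python
# Hypothetical machine record. The real data comes from vendor pages and
# community spec tables; this layout is an assumption for illustration.
MACHINE = {
    "computer": "Windows desktop",
    "ram_gb": 32,
    "provider": "nvidia",
    "gpu": "RTX 3060",
    "vram_gb": 12,
}

# Each template is paired with the set of fields its text actually reveals.
# Placeholder registers, not the project's real ones.
TEMPLATES = [
    ("got a {gpu} with {vram_gb}gb, {ram_gb} gigs of ram",
     {"gpu", "vram_gb", "ram_gb", "provider"}),
    ("selling my pc, {ram_gb}GB RAM, works great",
     {"ram_gb"}),  # a "don't invent" case: no GPU is mentioned
]


def make_example(machine, template, mentioned):
    """Render one phrasing and derive its label from the machine record.

    Fields the text does not mention become None (JSON null), so the label
    never contains information the text does not carry.
    """
    text = template.format(**machine)
    label = {key: (value if key in mentioned else None)
             for key, value in machine.items()}
    return {"text": text, "label": label}


examples = [make_example(MACHINE, tpl, fields) for tpl, fields in TEMPLATES]
```

The point of the design is that the label is computed from the record and the template's field set, never from a model reading the rendered text.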

Evaluation

Dev set (human-written, builder-iterated, optimistic)

Evaluated on a 45-example human-written dev set (never generator output; multilingual, consoles, buying-intent traps, pure refusals). The builder iterated against this set, so these are dev numbers, optimistically biased by adaptive iteration and labelled as such:

round	field accuracy	invented-field rate (hallucination)
1	77.3%	32.5%
3 (answer-only loss + explicit rules)	85.8%	12.0%
5 (final)	91.6%	1.2%
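For readers who want the two columns made concrete, here is one plausible way to compute them. It assumes that field accuracy is exact match per field and that the invented-field rate is measured only over fields whose gold label is null; scripts/eval_spec_lora.py may define either metric differently. The `score` function and the sample records are illustrative, not taken from the project.

```python
FIELDS = ("computer", "ram_gb", "provider", "gpu", "vram_gb")


def score(predictions, golds):
    """Return field accuracy and invented-field rate for paired spec dicts."""
    correct = total = invented = should_be_null = 0
    for pred, gold in zip(predictions, golds):
        for field in FIELDS:
            p, g = pred.get(field), gold.get(field)
            total += 1
            correct += (p == g)
            if g is None:  # the model should have said null here
                should_be_null += 1
                invented += (p is not None)
    return {
        "field_accuracy": correct / total,
        "invented_field_rate": (invented / should_be_null
                                if should_be_null else 0.0),
    }


# Made-up example: the prediction invents a GPU the text never named.
golds = [{"computer": None, "ram_gb": 16, "provider": "nvidia",
          "gpu": None, "vram_gb": None}]
preds = [{"computer": None, "ram_gb": 16, "provider": "nvidia",
          "gpu": "GTX 1650", "vram_gb": None}]
result = score(preds, golds)
```

In this toy case four of five fields match (accuracy 0.8), and one of the three null-gold fields was filled in (invented rate 1/3).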

Sealed test (builder-blind, evaluated once), the honest number

A 40-example sealed test, generated by a separate LLM that never saw the training data, checked for zero overlap with train and dev, and evaluated exactly once. It is machine-generated rather than human-written, and labelled as such:

model	field accuracy	invented-field rate
base Qwen3-1.7B, zero-shot	71.5%	37.1%
this LoRA	88.0%	17.7%

The LoRA clearly beats the base model (accuracy +16.5 points, invented rate roughly halved), but it does NOT clear the ship gate's under-5% invented-field target on builder-blind data: the real hallucination rate is about 18%, far above the 1.2% the adaptively-iterated dev set suggested. Reported unedited, because catching exactly that optimism is what a sealed test is for.

Caveat: the sealed labels are machine-generated and unaudited, and some of the "inventions" are debatable integrated-graphics cases (the model extracts an iGPU the generator marked null), so the absolute figure carries some upward bias; a human-audited sealed set would tighten it. The direction is unambiguous.
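The README does not say how the zero-overlap check against train and dev was done. As a sketch only, the simplest version is exact match after normalising case and whitespace; the `normalize` and `overlapping` helpers and the sample strings below are hypothetical.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def overlapping(sealed, *other_splits):
    """Return sealed inputs whose normalised text appears in any other split."""
    seen = {normalize(t) for split in other_splits for t in split}
    return [t for t in sealed if normalize(t) in seen]


# Made-up inputs for illustration.
train = ["My PC has  16GB RAM"]
dev = ["rtx 4070 laptop"]
sealed = ["my pc has 16gb ram", "old thinkpad, no idea what's in it"]

leaks = overlapping(sealed, train, dev)
```

Exact matching is the weakest useful check: it catches copies but not paraphrases, so a near-duplicate check (for example n-gram overlap) would be a stricter alternative.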

Ship gate (beat base zero-shot AND keep invented-field rate under 5%): clears the beat-base half, fails the under-5% half on the sealed set. Treat this as a strong extractor that nulls far more often than the base model, not a near-zero-hallucination one. Reproduce with scripts/eval_spec_lora.py --testfile <sealed> --baseline <base.json>; signed result artifacts are in the project repo under artifacts/.
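The gate logic can be written out in a few lines using the sealed-test numbers from the table above. The `ship_gate` function is a hypothetical helper, and it assumes "beat base zero-shot" means higher field accuracy; the project's script may encode the gate differently.

```python
def ship_gate(lora, base, max_invented=0.05):
    """Check both halves of the gate and report each separately."""
    beats_base = lora["field_accuracy"] > base["field_accuracy"]
    under_limit = lora["invented_field_rate"] < max_invented
    return {
        "beats_base": beats_base,
        "under_limit": under_limit,
        "ships": beats_base and under_limit,
    }


# Sealed-test figures from this README, as fractions.
base = {"field_accuracy": 0.715, "invented_field_rate": 0.371}
lora = {"field_accuracy": 0.880, "invented_field_rate": 0.177}

gate = ship_gate(lora, base)
# gate == {"beats_base": True, "under_limit": False, "ships": False}
```

Reporting the two halves separately keeps the partial pass visible instead of collapsing it into a single failed flag.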

Output schema

{"computer": "Windows laptop|Windows desktop|Mac|Linux PC|Mini PC / Raspberry Pi|null",
 "ram_gb": "number|null", "provider": "nvidia|amd|apple|intel|none|null",
 "gpu": "string|null", "vram_gb": "number|null"}

Notable learned rules: "none" only when the text says there's no graphics card (unknown → null); a series alone ("gtx") is a provider, not a GPU; a stated VRAM figure beats the model's knowledge of that card; dxdiag's "Display Memory" is not system RAM; "8gb dev kit" on a Jetson is unified RAM, not VRAM; two machines compared → extract nothing.
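Because the schema is small and closed, a consumer can check the model's output mechanically before trusting it. The sketch below is one way to do that with the standard library; `validate_spec` is a hypothetical helper, and it assumes the model emits JSON null (not the string "null") for unknown fields.

```python
import json

COMPUTERS = {"Windows laptop", "Windows desktop", "Mac", "Linux PC",
             "Mini PC / Raspberry Pi"}
PROVIDERS = {"nvidia", "amd", "apple", "intel", "none"}
EXPECTED_KEYS = {"computer", "ram_gb", "provider", "gpu", "vram_gb"}


def validate_spec(raw):
    """Return a list of problems; an empty list means the output fits."""
    try:
        spec = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(spec, dict):
        return ["top level is not an object"]

    problems = []
    if set(spec) != EXPECTED_KEYS:
        problems.append(f"key mismatch: {sorted(set(spec) ^ EXPECTED_KEYS)}")

    # Enumerated string fields: null or one of the allowed values.
    for key, allowed in (("computer", COMPUTERS), ("provider", PROVIDERS)):
        value = spec.get(key)
        if value is not None and not (isinstance(value, str) and value in allowed):
            problems.append(f"{key}: {value!r} is not an allowed value")

    # Numeric fields: null or a real number (bool is excluded on purpose).
    for key in ("ram_gb", "vram_gb"):
        value = spec.get(key)
        if value is not None and (isinstance(value, bool)
                                  or not isinstance(value, (int, float))):
            problems.append(f"{key}: {value!r} is not a number")

    # Free-text field: null or a string.
    if spec.get("gpu") is not None and not isinstance(spec["gpu"], str):
        problems.append(f"gpu: {spec['gpu']!r} is not a string")
    return problems


good = ('{"computer": null, "ram_gb": 16, "provider": "nvidia", '
        '"gpu": null, "vram_gb": null}')
bad = ('{"computer": "Chromebook", "ram_gb": "16", "provider": "nvidia", '
       '"gpu": null, "vram_gb": null}')
```

Here `validate_spec(good)` returns an empty list, while `validate_spec(bad)` reports the unknown computer type and the string-typed RAM value. A check like this only confirms shape; it cannot tell whether a non-null value was invented.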

Part of the FitCheck project (Build Small hackathon): a deterministic engine does the math; small models appear only where they earn their place.