ScrubData planner — Qwen3-4B fine-tuned for tabular cleaning plans

A ≤4B planner for hands-off data cleaning: it reads an aggregated column profile (per-value frequency counts) and emits a structured JSON cleaning plan that a deterministic pandas executor applies. Built for the Build Small Hackathon (🏡 Backyard AI · Tiny Titan · Well-Tuned).

Live demo: https://huggingface.co/spaces/build-small-hackathon/scrubdata · Code (open source): https://github.com/ricalanis/scrubdata-hackathon · Paper: docs/paper/ in the repo · Traces: build-small-hackathon/scrubdata-traces

What's special about the training data

Every training example is execution-verified: a candidate (dirty table, plan) pair is kept only if running the executor on it provably recovers the known-clean table. Mix: synthetic high-cardinality categorical tables (Zipf long-tail + realistic typos) + 20% real-derived pairs from the Raha benchmarks (cell-aligned, learnable canonicalizations only).

Shipped composition (WS1 — verified union planner): in the product, every model-proposed mapping is scored by a deterministic verifier (errors-are-rare frequency gates, variant similarity, reference agreement; threshold SCRUBDATA_TAU, default 0.5) and unioned with the grounded heuristic. Measured on hospital's 509 real errors: 0.905 precision @ 0.413 coverage (gated model plan alone: 0.993 @ 0.287 — 146/147 committed changes correct; seed-robust: 0.891 ± 0.012 @ 0.396 ± 0.025 over 3 training seeds). Dropped merges become review flags, never silent skips.

Measured

  • Canonicalization micro-F1 0.90 (best single run; 0.80 ± 0.01 over 3 training seeds) (vs 0.45 for a much larger zero-shot generic model, 0.13 for a rule heuristic) on frozen held-out gold.
  • Real hospital typos (Raha, OOD): repair recall 0.00 → 0.42 from adding the real-derived 20% (synthetic-only fails to transfer — documented honestly).
  • In production the model is wrapped with reference grounding + calibrated abstention (it never free-generates a canonical for a grounded column type).

Post-freeze system (v2, June 2026) and the central finding

The shipped pipeline around this model gained four deterministic capabilities — bounded suspect surfacing for high-cardinality columns, a generic entity reference (exact-hit typing floor), cross-row majority voting with a false-consensus guard, and convention-conservatism gates. Measured on the WildClean benchmark: unseen-source macro F1 0.363 @ damage 0.0219 (35 unseen-source pairs of the 42-pair benchmark; full-bench 0.343 @ 0.023), 0 silent edits across 35 wild tables and a 239-table GitTables trust audit.

Honest scope of these weights: five further fine-tunes and a 3-arm GRPO pilot (executor as verifiable reward) all failed to move held-out generalization — at this scale the weights contribute format competence and in-distribution skill; the deterministic machinery + plan-level verifier carry never-seen-table generalization. Details in the paper. In the same verify+union harness, two of three zero-shot 24–31B open-weights planners exceed this operating point (0.915 @ 0.485 vs 0.905 @ 0.413, paper §scaling); the 4B remains the most precise gated planner at usable coverage and the only locally-measured one — the architecture, not the fine-tune, is the portable contribution.

How to run

Ollama / llama.cpp (recommended): use the non-thinking Modelfile from the Space repo (notebooks/Modelfile). Q8_0 GGUF: ricalanis/scrubdata-qwen3-4b-v6-q8 (Q4_K_M corrupts this model on Unsloth 2026.6.x exports — use Q8_0).

Transformers (bf16 + adapter): suppress the tool-call tokens at decode time or the base model's tool-calling prior dominates:

model.generate(..., suppress_tokens=[151657, 151658])  # <tool_call>, </tool_call>

Limitations

English-centric; plans use a closed op vocabulary; canonicalization quality on entity columns depends on the reference taxonomy's coverage; not a de-identification guarantee.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ricalanis/scrubdata-qwen3-4b

Finetuned
(1740)
this model

Space using ricalanis/scrubdata-qwen3-4b 1