QuantSafe Certifier
Signed release-gate records for quantized small models.
Last winter, security researchers used Shodan to find hundreds of exposed Clawdbot (now OpenClaw) instances. They were leaking their owners' Anthropic API keys, Telegram and Slack tokens, and months of private conversations to internet-wide scans, and in the worst cases they exposed full root access. It was the clearest sign yet of something we all feel: people are racing to run AI on their own hardware, and security is trailing the excitement.
Clawdbot leaked at the infrastructure layer — misconfigured gateways, plaintext secrets, auth bypass. Real, and now widely discussed. But the local-AI attack surface is layered, and there is one layer almost nobody is watching: the model itself.
You can't run a 7B (let alone a 70B) model on a laptop or a single consumer GPU at full precision. So you quantize it — GPTQ, AWQ, GGUF, bitsandbytes — to fit. We treat that as a quality decision: does the quantized model still pass my evals? Usually, yes.
What we don't check is whether quantization changed the model's safety behavior — and it can, invisibly, because the damage never shows up in a quality eval.
A concrete example from my own published checkpoints: phi-2 quantized to GPTQ 4-bit holds its task-quality benchmarks almost perfectly, while its refusal rate on a harmful-prompt probe set collapses from 91% to 1%. It still answers your normal questions well. It just stopped refusing the ones it shouldn't answer. Quality is not a safety proxy under quantization.
I'd shipped a model that looked fine and had quietly stopped refusing requests it should have refused, and I couldn't find a lightweight screen that compared a quantized checkpoint to its baseline and routed that drift to deeper evaluation. So I built one.
QuantSafe is a Refusal Stability screen. Instead of relabeling safety by hand, it compares a candidate (quantized) model against its full-precision baseline on four behavioral features of the model's refusal shape, including how consistent its refusal openings are, how diverse they are, and how long they are. It combines those into a single drift score using calibrated weights, classifies the result LOW / MODERATE / HIGH, and signs a tamper-evident, issuer-pinned record. That record proves what was screened and with which configuration, not that the model is safe. A HIGH result doesn't say "unsafe"; it routes the configuration to a full safety battery. The method, calibration, and validation (a 45-cell matched matrix; family-held-out AUC) are in the paper.
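To make the shape of that computation concrete, here is a minimal sketch of a weighted drift score with a three-way classification. It is not QuantSafe's implementation: the feature names, the weights, the thresholds, and the relative-shift measure are all placeholders chosen for illustration, and the calibrated values live in the paper.

```python
# Illustrative only: every name, weight, and threshold below is hypothetical.
# The fourth feature is an unnamed placeholder.
WEIGHTS = {
    "opening_consistency": 0.4,
    "opening_diversity": 0.3,
    "mean_length": 0.2,
    "placeholder_feature": 0.1,
}
MODERATE_AT = 0.15  # assumed cut-off, not a calibrated value
HIGH_AT = 0.35      # assumed cut-off, not a calibrated value


def relative_shift(baseline: float, candidate: float) -> float:
    """Absolute change relative to the baseline value, capped at 1.0."""
    if baseline == 0:
        return 0.0 if candidate == 0 else 1.0
    return min(abs(candidate - baseline) / abs(baseline), 1.0)


def drift_score(baseline: dict, candidate: dict) -> tuple:
    """Return (total score, per-feature contributions)."""
    contributions = {
        name: weight * relative_shift(baseline[name], candidate[name])
        for name, weight in WEIGHTS.items()
    }
    return sum(contributions.values()), contributions


def classify(score: float) -> str:
    """Map a drift score onto the three screening bands."""
    if score >= HIGH_AT:
        return "HIGH"
    if score >= MODERATE_AT:
        return "MODERATE"
    return "LOW"


# Made-up feature values for a baseline and a quantized candidate.
baseline = {"opening_consistency": 0.9, "opening_diversity": 0.2,
            "mean_length": 40.0, "placeholder_feature": 1.0}
candidate = {"opening_consistency": 0.1, "opening_diversity": 0.8,
             "mean_length": 10.0, "placeholder_feature": 1.0}

score, contributions = drift_score(baseline, candidate)
print(classify(score), round(score, 3), contributions)
```

Returning the per-feature contributions alongside the total keeps the result inspectable: a reviewer can see which feature drove a HIGH band before committing to the full safety battery.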
The easy trap with a project like this is grading your own homework: label a corpus yourself, then report how well your judges agree with your labels. So I did the uncomfortable thing and re-ran the safety guards against an external, third-party human-labeled benchmark (PKU-Alignment/BeaverTails, 400 items). The measured accuracies come down — and that's the point. On independent labels the guards land in the low-to-mid 80s, and a 0.6B guard (Qwen3Guard) matches an 8B one (Granite Guardian) — a small-model result worth taking seriously. I now lead with the external numbers everywhere, because a screen you can't trust on labels you didn't write isn't worth much.
As a prospective, out-of-distribution check, I applied the frozen screen — no recalibration — to two model families absent from the calibration matrix, quantized with a method it was never tuned on. It correctly cleared Falcon3-3B (no refusal loss → LOW) and flagged SmolLM2-1.7B (a measured 10-point refusal-rate drop → MODERATE, material loss). That's a direction check on n=2, not a powered generalization claim — and I say so in the app and the field notes.
It's a screen that tells you when to go run the real evaluation. It is not a safety certificate, it does not prove a model is safe, and the signed record proves the integrity of the check, not the safety of the model. Those limits are stated in the app, the README, and the field notes — because the honesty is the point.
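The distinction between record integrity and model safety is easy to show in code. The sketch below signs a screening record and detects tampering, using an HMAC from the Python standard library as a stand-in. That choice is an assumption made to keep the example self-contained: an issuer-pinned record would more plausibly use an asymmetric signature checked against a pinned public key, and the record fields shown here are hypothetical, not QuantSafe's schema.

```python
import hashlib
import hmac
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize deterministically so the same record always signs the same way."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_record(record: dict, key: bytes) -> dict:
    """Attach a digest and an HMAC-SHA256 tag (stand-in for a real signature)."""
    body = canonical_bytes(record)
    return {
        "record": record,
        "sha256": hashlib.sha256(body).hexdigest(),
        "signature": hmac.new(key, body, hashlib.sha256).hexdigest(),
    }


def verify_record(signed: dict, key: bytes) -> bool:
    """True only if the record is byte-for-byte what the issuer signed."""
    body = canonical_bytes(signed["record"])
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])


# Hypothetical record fields and key, for illustration only.
issuer_key = b"demo-key-not-for-production"
record = {
    "candidate": "example-org/example-model-gptq-4bit",
    "baseline": "example-org/example-model",
    "band": "HIGH",
    "config_version": "demo",
}
signed = sign_record(record, issuer_key)
print(verify_record(signed, issuer_key))
```

Verification succeeds for any band, including HIGH. The check answers "is this the record the issuer produced?" and says nothing about whether the screened model is safe.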
You don't have to take my word for it on my models. QuantSafe exposes a public API (/screen_external_manifest) that screens aggregate evidence — no raw prompts, completions, or weights leave your machine. You compute a handful of refusal-shape features for your baseline and your quantized variant locally, send those numbers, and get back a provisional, unsigned screening recommendation with per-feature contributions and feedback. Run it with gradio_client in a few lines.
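A rough sketch of that workflow follows, under heavy assumptions. The feature definitions are illustrative guesses at "refusal-shape" statistics, not the definitions QuantSafe uses, and the manifest layout passed to the endpoint is hypothetical. Only the Space ID and the endpoint name come from this post; before sending anything, inspect the real parameter list with `client.view_api()`.

```python
from collections import Counter


def refusal_shape_features(completions: list, opening_words: int = 5) -> dict:
    """Aggregate statistics over responses to a probe set; no raw text is kept.

    Feature definitions are hypothetical illustrations.
    """
    if not completions:
        raise ValueError("need at least one completion")
    n = len(completions)
    # Normalize each response's first few words into an "opening".
    openings = [" ".join(text.lower().split()[:opening_words]) for text in completions]
    counts = Counter(openings)
    return {
        # Share of responses that start with the single most common opening.
        "opening_consistency": counts.most_common(1)[0][1] / n,
        # Distinct openings per response.
        "opening_diversity": len(counts) / n,
        # Mean response length in whitespace-separated words.
        "mean_length": sum(len(text.split()) for text in completions) / n,
    }


def screen_remote(baseline_features: dict, candidate_features: dict):
    """Send aggregate features to the public endpoint (requires network access)."""
    from gradio_client import Client  # pip install gradio_client

    client = Client("build-small-hackathon/quantsafe-certifier")
    # Assumption: the endpoint takes one manifest-like argument.
    # Run client.view_api() to see the actual signature before relying on this.
    manifest = {"baseline": baseline_features, "candidate": candidate_features}
    return client.predict(manifest, api_name="/screen_external_manifest")


# Local step only; toy completions standing in for real probe-set responses.
baseline_out = ["I can't help with that request.",
                "I can't help with that either, sorry."]
candidate_out = ["Sure, here is how", "I can't help with that request."]
base = refusal_shape_features(baseline_out, opening_words=3)
cand = refusal_shape_features(candidate_out, opening_words=3)
print(base, cand)
# result = screen_remote(base, cand)  # uncomment to call the Space
```

Everything up to `screen_remote` runs offline, which is the point of the design: the prompts and completions stay on your machine and only the aggregate numbers are transmitted.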
This started as a screen for my own published catalog; it should be bigger than that. If you're a researcher, engineer, or lab working on local-model safety, quantization, or deployment, I'd love to collaborate — co-authorship, joint work, or piloting QuantSafe on your own deployments.
Built for the Hugging Face Build Small hackathon using Nemotron, Modal, OpenAI Codex, OpenBMB MiniCPM, and Gradio.
→ Try it: https://huggingface.co/spaces/build-small-hackathon/quantsafe-certifier
→ Read the method: arXiv:2606.10154