Uhm — filler-word detection

A frame-level classifier that runs on-device and finds "uh", "um", "hmm", and other filler sounds in audio, with one prediction every 20 ms.

Trained on English speech; produces high-confidence detections on conversational Spanish, French, German, and Dutch audio without retraining.

Try it

Live demo: detail-co/uhm-demo.

Models

Two tiers. uhm-base.* is the free default (HuBERT-base, higher recall). uhm-pro.* is the Pro tier (DistilHuBERT, the distilled model: smaller, faster, more precise).

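Tier	Model ID	Backbone	Core ML	iPhone 17 Pro (analyze time · realtime factor)	iPhone 15 Pro (analyze time · realtime factor)	iPad Pro M4 (analyze time · realtime factor)	M4 Max (realtime factor only)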
Default (free)	`uhm-base.*`	HuBERT-base	8-bit, 90 MB	2.20 s · 136×	3.42 s · 88×	2.16 s · 139×	204×
Pro	`uhm-pro.*`	DistilHuBERT	fp16, 45 MB	1.01 s · 296×	1.78 s · 169×	1.08 s · 279×	445×

Realtime factor = audio duration ÷ analyze time. Times are warm analyze runs over a real 5-minute speech clip (1 cold load + 5 warm runs); model load is excluded because it happens once per process. iPhone and iPad were measured via the examples/ios-bench app (iOS 26.5 / iPadOS 26.3.1); M4 Max via coremltools predict (macOS 15). Pro (the distil) runs about 1.9× to 2.2× faster than the default depending on the device, and is 2× smaller.
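As a quick check of the definition above, the realtime factors in the table can be recomputed from the analyze times. This is a minimal sketch that assumes the benchmark clip is exactly 300 seconds long; the function and variable names are illustrative, not part of the package.

```python
def realtime_factor(audio_seconds: float, analyze_seconds: float) -> float:
    """Audio duration divided by warm analyze time."""
    return audio_seconds / analyze_seconds

CLIP_SECONDS = 5 * 60  # assumption: the 5-minute clip is exactly 300 s

# Warm analyze times (seconds) copied from the table above.
analyze_times = {
    ("uhm-base", "iPhone 17 Pro"): 2.20,
    ("uhm-base", "iPhone 15 Pro"): 3.42,
    ("uhm-pro", "iPhone 15 Pro"): 1.78,
}

factors = {key: realtime_factor(CLIP_SECONDS, t) for key, t in analyze_times.items()}
for (model, device), rtf in factors.items():
    print(f"{model} on {device}: {rtf:.0f}x realtime")
```

Small differences from the published figures are expected if the real clip is not exactly 300 s or the times were rounded.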

Accuracy: precision (Pro) versus recall (default)

Measured on a held-out real-world set: 251 hand-labeled Pro-vs-default detection disagreements across 37 clips not in training.

On its unique fires, Pro (the distil) holds 81% precision; the default (HuBERT) holds 73%.
Pro uniquely caught 73 real fillers that the default missed. The default uniquely caught 103 that Pro missed, at the cost of 22 extra false fires.
The default fires roughly 15% more often overall, but the surplus is mostly low-confidence noise (the gap widens to about 30% at the recall preset).

So it is a precision versus recall split, not a strict quality ranking. Pick Pro when a flagged filler gets cut without review; pick the default when you want maximum coverage and a human confirms.
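The precision figures can be related to the counts with precision = true fires ÷ all fires. The sketch below backs out approximate unique-fire totals from the rounded percentages; the derived totals are estimates for illustration, not published numbers, and the helper name is hypothetical.

```python
def implied_fires(true_positives: int, precision: float) -> int:
    """Total unique fires implied by precision = TP / (TP + FP)."""
    return round(true_positives / precision)

pro_fires = implied_fires(73, 0.81)     # Pro: 73 real fillers at 81% precision
base_fires = implied_fires(103, 0.73)   # default: 103 real fillers at 73% precision

pro_false = pro_fires - 73
base_false = base_fires - 103
extra_false = base_false - pro_false

print(pro_fires, pro_false)     # 90 17
print(base_fires, base_false)   # 141 38
print(extra_false)              # 21
```

Because the percentages are rounded, the estimate of 21 extra false fires lands within one of the stated 22.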

Out of the box, the decision settings favor precision over recall. Use the --bias knob or a per-frame threshold to shift that trade-off if you need higher recall.
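A per-frame threshold can be applied directly to the softmax output. The following is a minimal sketch, assuming a (T, 6) probability array and class index 0 = not_filler as listed under Input / output. The 0.5 and 0.3 values are illustrative rather than the shipped settings, and how --bias maps onto a threshold is not covered here.

```python
import numpy as np

NOT_FILLER = 0  # class index from the Input / output section

def filler_mask(probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Flag frames whose total filler probability reaches `threshold`.

    probs: (T, 6) per-frame softmax. Returns a boolean array of shape (T,).
    """
    filler_prob = 1.0 - probs[:, NOT_FILLER]
    return filler_prob >= threshold

probs = np.array([
    [0.90, 0.05, 0.05, 0.00, 0.00, 0.00],  # confident not_filler
    [0.60, 0.30, 0.10, 0.00, 0.00, 0.00],  # borderline
    [0.10, 0.70, 0.20, 0.00, 0.00, 0.00],  # confident filler
])

strict = filler_mask(probs, threshold=0.5)   # [False, False, True]
lenient = filler_mask(probs, threshold=0.3)  # [False, True, True]
```

Lowering the threshold flags the borderline frame as well, which raises recall at the cost of more false fires.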

Files

The filenames are tier-named: uhm-base.* is the HuBERT default, uhm-pro.* is the DistilHuBERT Pro. Both tiers ship three formats. The Core ML variants preserve 100% argmax agreement with the fp32 PyTorch reference on test inputs.

Tier	File	Format	Size	Use
Default (HuBERT)	`uhm-base.mlpackage.zip`	Core ML 8-bit	~88 MB (90 MB unpacked)	iOS / macOS on-device inference
Default (HuBERT)	`uhm-base-web-fp16.onnx`	ONNX fp16	~189 MB	Browser, server, Python — runs anywhere with `onnxruntime`
Default (HuBERT)	`uhm-base.onnx`	ONNX fp32	~378 MB	Quantization-free reference
Pro (DistilHuBERT)	`uhm-pro.mlpackage.zip`	Core ML fp16	45 MB unpacked	iOS / macOS on-device inference
Pro (DistilHuBERT)	`uhm-pro-web-fp16.onnx`	ONNX fp16	~51 MB	Browser, server, Python — runs anywhere with `onnxruntime`
Pro (DistilHuBERT)	`uhm-pro.onnx`	ONNX fp32	~98 MB	Quantization-free reference

Why Pro ships fp16, not 8-bit: DistilHuBERT is more sensitive to quantization than HuBERT-base. An 8-bit distil keeps only 97% (raw) / 95% (enhanced) detection agreement against fp16, while the 8-bit HuBERT holds 98% / 98%. fp16 is about 5% slower than 8-bit on-device but lossless, and still roughly 2× faster than the default, so it is the right choice for the precision tier. The free default stays 8-bit at 90 MB (an fp16 HuBERT would be about 180 MB).
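To compare a quantized export against a reference yourself, a frame-level argmax agreement is one simple measure. This sketch assumes two (T, num_classes) outputs for the same audio. It computes argmax agreement only; the exact definition of "detection agreement" used above is not specified here and may differ.

```python
import numpy as np

def argmax_agreement(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Fraction of frames where both outputs pick the same class.

    Both arrays have shape (T, num_classes).
    """
    if reference.shape != candidate.shape:
        raise ValueError("outputs must have the same shape")
    same = reference.argmax(axis=-1) == candidate.argmax(axis=-1)
    return float(same.mean())

# Toy two-class outputs: the third frame flips after quantization.
ref = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
quant = np.array([[0.8, 0.2], [0.3, 0.7], [0.4, 0.6], [0.1, 0.9]])

agreement = argmax_agreement(ref, quant)  # 0.75 (3 of 4 frames agree)
```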

Source weights (safetensors-checkpoint/) are provided as a fine-tuning starting point alongside config.json, preprocessor_config.json, and labels.json.

Input / output

Input: 16 kHz mono audio, up to 30-second windows
Output: per-frame softmax over 6 classes, one prediction every 20 ms
Class indices: 0 = not_filler, 1 = uh, 2 = um, 3 = hmm, 4 = and, 5 = other
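Per-frame class indices can be merged into timestamped segments. The sketch below assumes frame i covers the interval from i × 20 ms to (i + 1) × 20 ms, which is an approximation of the model's receptive field. The function name and tuple layout are illustrative.

```python
import numpy as np

FRAME_SECONDS = 0.020
LABELS = ["not_filler", "uh", "um", "hmm", "and", "other"]

def frames_to_segments(class_ids):
    """Merge consecutive frames of one filler class into (label, start_s, end_s)."""
    segments = []
    start = None
    current = 0  # 0 = not_filler, so no segment is open
    for i, c in enumerate(list(class_ids) + [0]):  # sentinel closes a trailing run
        c = int(c)
        if c != current:
            if current != 0:
                segments.append((
                    LABELS[current],
                    round(start * FRAME_SECONDS, 3),
                    round(i * FRAME_SECONDS, 3),
                ))
            start = i
            current = c
    return segments

preds = np.array([0, 0, 1, 1, 1, 0, 2, 2, 0])
segments = frames_to_segments(preds)
print(segments)  # [('uh', 0.04, 0.1), ('um', 0.12, 0.16)]
```

If subtype labels matter less than filler vs not_filler for your use, map every non-zero class to a single label before merging.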

Usage

Python (PyTorch — source weights)

import soundfile as sf
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioFrameClassification

extractor = AutoFeatureExtractor.from_pretrained("detail-co/uhm")
model     = AutoModelForAudioFrameClassification.from_pretrained("detail-co/uhm")

audio, sr = sf.read("in.wav", dtype="float32")  # expects 16 kHz mono
assert sr == 16000 and audio.ndim == 1, "resample / downmix to 16 kHz mono first"

inputs = extractor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():                        # inference only, no gradients
    logits = model(**inputs).logits          # (1, T, 6)
preds = logits.argmax(-1)                    # (1, T) class index per 20 ms frame

Python (ONNX — pick a tier)

from huggingface_hub import hf_hub_download
import onnxruntime as ort

path    = hf_hub_download("detail-co/uhm", "uhm-base-web-fp16.onnx")  # or uhm-pro-...
session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
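Once the session is created, inference might look like the sketch below. It assumes the export takes a single (1, num_samples) waveform input, which is not confirmed here. The input name and dtype are read from the session instead of being hard-coded, and `run_onnx` is a hypothetical helper name.

```python
import numpy as np

def run_onnx(session, audio):
    """Run one window through an onnxruntime.InferenceSession.

    Assumes a single (1, num_samples) waveform input; the input name and
    dtype come from the session metadata.
    """
    spec = session.get_inputs()[0]
    dtype = np.float16 if spec.type == "tensor(float16)" else np.float32
    batch = np.asarray(audio, dtype=dtype)[np.newaxis, :]
    # Passing None as the output list returns every model output; take the first.
    return session.run(None, {spec.name: batch})[0]
```

With the `session` from the snippet above and a 16 kHz mono array `audio`, `frames = run_onnx(session, audio)` would return the per-frame output. Check `session.get_inputs()` and `session.get_outputs()` if the export turns out to have more than one input or output.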

Core ML (iOS / macOS)

Download uhm-base.mlpackage.zip or uhm-pro.mlpackage.zip, unzip, load with MLModel. Input shape (1, 480000) float32 — 30 s at 16 kHz mono. Output (1, 1499, 6) softmax probabilities. Requires iOS 17 / macOS 14 or newer.
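Because the Core ML input length is fixed, longer or shorter audio has to be cut into 480,000-sample windows first. The sketch below zero-pads the final window, which is an assumption. Whether the package also expects normalized input is not stated here, so check preprocessor_config.json. The frame-count helper uses the standard HuBERT convolutional front end (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2), which reproduces the 1499 frames quoted above.

```python
import numpy as np

WINDOW_SAMPLES = 480_000  # 30 s at 16 kHz, the fixed Core ML input length

def split_into_windows(audio: np.ndarray) -> np.ndarray:
    """Zero-pad mono audio and reshape to (num_windows, 480000) float32."""
    audio = np.asarray(audio, dtype=np.float32)
    num_windows = max(1, -(-len(audio) // WINDOW_SAMPLES))  # ceiling division
    padded = np.zeros(num_windows * WINDOW_SAMPLES, dtype=np.float32)
    padded[: len(audio)] = audio
    return padded.reshape(num_windows, WINDOW_SAMPLES)

def conv_frames(num_samples: int) -> int:
    """Frames produced by the standard HuBERT convolutional front end."""
    kernels = (10, 3, 3, 3, 3, 2, 2)
    strides = (5, 2, 2, 2, 2, 2, 2)
    length = num_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1
    return length

windows = split_into_windows(np.zeros(16_000 * 45))  # 45 s of audio
print(windows.shape)                # (2, 480000)
print(conv_frames(WINDOW_SAMPLES))  # 1499
```

Frames that fall in the zero-padded tail of the last window should be discarded before reporting timestamps.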

Versioning

main always points at the latest released weights. Tagged releases (e.g. v2.0.0) are immutable snapshots — pin to one of these for reproducible builds.

The Hugging Face Hub download APIs accept a revision= parameter that locks a download to a tag or commit SHA:

hf_hub_download("detail-co/uhm", "uhm-base.mlpackage.zip", revision="v2.0.0")

LFS file content is content-addressed via SHA-256; verify against the lfs.sha256 field returned by HfApi.list_repo_tree(...) if your deployment needs integrity checks.
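The local half of that integrity check needs only the standard library. This sketch hashes a downloaded file in chunks and compares it with an expected digest. Fetching the expected value from the Hub is left out, and both function names are illustrative.

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path: str, expected_sha256: str) -> bool:
    """Compare a file's SHA-256 with the digest reported by the Hub."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

Pass the path returned by hf_hub_download together with the lfs.sha256 value for the same file and revision.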

Limitations

Trained on English; non-English performance is by acoustic transfer and has not been measured against per-language ground truth.
Best on podcast / meeting / talking-head audio. Heavy background music, laughter, or multi-speaker overlap will degrade quality.
Type labels (uh / um / hmm / and / other) are secondary — trust filler vs not_filler more than the specific subtype.

Built on

Base architecture and pretrained weights: ntu-spml/distilhubert — Apache 2.0. A distilled variant of facebook/hubert-base-ls960 — Apache 2.0.
Public fine-tuning audio: AMI Meeting Corpus (edinburghcstr/ami, IHM split) — CC BY 4.0, Edinburgh CSTR.
Video content created by the Detail team — proprietary.

License

CC BY-NC 4.0. Free for research, evaluation, and personal use with attribution. Commercial use requires a separate license — contact paul@detail.co.
