Uhm: filler-word detection
On-device filler-word detection. Frame-level classifier that finds "uh", "um", "hmm", and other filler sounds in audio with 20 ms timestamps.
Trained on English speech; produces high-confidence detections on conversational Spanish, French, German, and Dutch audio without retraining.
Try it
Live demo: detail-co/uhm-demo.
Models
Two tiers. uhm-base.* is the free default (HuBERT-base, higher recall). uhm-pro.* is the Pro tier (DistilHuBERT, the distilled model: smaller, faster, more precise).
| Tier | Model ID | Backbone | Core ML | iPhone 17 Pro | iPhone 15 Pro | iPad Pro M4 | M4 Max |
|---|---|---|---|---|---|---|---|
| Default (free) | uhm-base.* | HuBERT-base | 8-bit, 90 MB | 2.20 s · 136× | 3.42 s · 88× | 2.16 s · 139× | 204× |
| Pro | uhm-pro.* | DistilHuBERT | fp16, 45 MB | 1.01 s · 296× | 1.78 s · 169× | 1.08 s · 279× | 445× |
Realtime factor = audio duration ÷ analyze time. Warm analyze over a real 5-minute speech clip (1 cold load + 5 warm runs), model load excluded (one-time per process). iPhone and iPad measured via the examples/ios-bench app (iOS 26.5 / iPadOS 26.3.1); M4 Max via coremltools predict (macOS 15). Pro (the distil) runs about 2.2× faster on-device and is 2× smaller than the default.
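The realtime-factor arithmetic can be reproduced in a few lines. This is a minimal sketch that assumes the 5-minute clip is exactly 300 s long (the exact clip length is not stated); `realtime_factor` is a hypothetical helper name, not part of the released package.

```python
def realtime_factor(audio_seconds: float, analyze_seconds: float) -> float:
    """Realtime factor = audio duration / analyze time (higher is faster)."""
    if analyze_seconds <= 0:
        raise ValueError("analyze_seconds must be positive")
    return audio_seconds / analyze_seconds


CLIP_SECONDS = 300.0  # assumption: the "5-minute" clip is exactly 300 s

# Warm analyze times on iPhone 17 Pro, taken from the table above.
base_rtf = realtime_factor(CLIP_SECONDS, 2.20)
pro_rtf = realtime_factor(CLIP_SECONDS, 1.01)
speedup = 2.20 / 1.01  # how much faster Pro is than the default

print(f"base {base_rtf:.0f}x, pro {pro_rtf:.0f}x, speedup {speedup:.1f}x")
```

Because the published times are rounded to two decimals and the clip may not be exactly 300 s, the computed factors can differ from the table by a unit or so.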
Accuracy: precision (Pro) versus recall (default)
Measured on a held-out real-world set: 251 hand-labeled Pro-vs-default detection disagreements across 37 clips not in training.
- On its unique fires, Pro (the distil) holds 81% precision; the default (HuBERT) holds 73%.
- Pro uniquely caught 73 real fillers the default missed. The default uniquely caught 103 that Pro missed, but at the cost of 22 extra false fires.
- The default fires roughly 15% more total, but the surplus is mostly low-confidence noise (the gap widens to about 30% more at the recall preset).
So it is a precision versus recall split, not a strict quality ranking. Pick Pro when a flagged filler gets cut without review; pick the default when you want maximum coverage and a human confirms.
Out of the box, detection is tuned for precision over recall. Use the --bias knob or the per-frame threshold to shift that trade-off if you need higher recall.
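One way to picture a per-frame threshold: treat a frame as a filler when its total filler probability clears a cutoff, then lower the cutoff for more recall or raise it for more precision. The sketch below assumes that interpretation. The 0.5 and 0.3 cutoffs and the sample frames are made up for illustration, `filler_frames` is a hypothetical helper, and nothing here reproduces what `--bias` does internally.

```python
from typing import List, Sequence

NOT_FILLER = 0  # class index 0 is not_filler (see Input / output)


def filler_frames(probs: Sequence[Sequence[float]], threshold: float = 0.5) -> List[bool]:
    """Flag frames whose total filler probability reaches `threshold`.

    `probs` holds one softmax row (6 values) per frame; the filler
    probability of a frame is taken as 1 - P(not_filler).
    """
    return [(1.0 - row[NOT_FILLER]) >= threshold for row in probs]


# Three made-up frames: clearly speech, borderline, clearly filler.
frames = [
    [0.95, 0.01, 0.01, 0.01, 0.01, 0.01],
    [0.60, 0.20, 0.10, 0.05, 0.03, 0.02],
    [0.10, 0.10, 0.70, 0.05, 0.03, 0.02],
]

strict = filler_frames(frames, threshold=0.5)   # higher cutoff favors precision
lenient = filler_frames(frames, threshold=0.3)  # lower cutoff favors recall
```

Only the borderline frame changes between the two settings, which is the kind of detection a human reviewer would be expected to confirm.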
Files
The filenames are tier-named: uhm-base.* is the HuBERT default, uhm-pro.* is the DistilHuBERT Pro. Both tiers ship three formats. The Core ML variants preserve 100% argmax agreement with the fp32 PyTorch reference on test inputs.
| Tier | File | Format | Size | Use |
|---|---|---|---|---|
| Default (HuBERT) | uhm-base.mlpackage.zip | Core ML 8-bit | ~88 MB (90 MB unpacked) | iOS / macOS on-device inference |
| Default (HuBERT) | uhm-base-web-fp16.onnx | ONNX fp16 | ~189 MB | Browser, server, Python (runs anywhere with onnxruntime) |
| Default (HuBERT) | uhm-base.onnx | ONNX fp32 | ~378 MB | Quantization-free reference |
| Pro (DistilHuBERT) | uhm-pro.mlpackage.zip | Core ML fp16 | 45 MB unpacked | iOS / macOS on-device inference |
| Pro (DistilHuBERT) | uhm-pro-web-fp16.onnx | ONNX fp16 | ~51 MB | Browser, server, Python (runs anywhere with onnxruntime) |
| Pro (DistilHuBERT) | uhm-pro.onnx | ONNX fp32 | ~98 MB | Quantization-free reference |
Why Pro ships fp16, not 8-bit: the distil is more quant-sensitive than the HuBERT. An 8-bit distil keeps only 97% (raw) / 95% (enhanced) detection agreement against fp16, while the 8-bit HuBERT holds 98% / 98%. fp16 is about 5% slower than 8-bit on-device but lossless, and still 2.2× faster than the default, so it is the right ship for the precision tier. The free default stays 8-bit at 90 MB (an fp16 HuBERT would be about 180 MB).
Source weights (safetensors-checkpoint/) are provided as a fine-tuning starting point alongside config.json, preprocessor_config.json, and labels.json.
Input / output
- Input: 16 kHz mono audio, up to 30-second windows
- Output: per-frame softmax over 6 classes, one prediction every 20 ms
- Class indices:
0 = not_filler, 1 = uh, 2 = um, 3 = hmm, 4 = and, 5 = other
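The per-frame output can be turned into timestamped filler segments by merging runs of identical class indices. This is a minimal sketch, assuming frame `i` covers the interval from `i * 20 ms` to `(i + 1) * 20 ms`; `frames_to_segments` is a hypothetical helper and the sample predictions are made up.

```python
from typing import List, Tuple

FRAME_SECONDS = 0.020  # one prediction every 20 ms
LABELS = ["not_filler", "uh", "um", "hmm", "and", "other"]


def frames_to_segments(pred_ids: List[int]) -> List[Tuple[float, float, str]]:
    """Merge runs of the same filler class into (start_s, end_s, label) tuples."""
    segments = []
    start = None  # index of the first frame of the currently open run
    # A trailing 0 acts as a sentinel so a run that ends the clip is closed.
    for i, cls in enumerate(list(pred_ids) + [0]):
        if start is not None and cls != pred_ids[start]:
            segments.append((
                round(start * FRAME_SECONDS, 3),
                round(i * FRAME_SECONDS, 3),
                LABELS[pred_ids[start]],
            ))
            start = None
        if start is None and cls != 0 and i < len(pred_ids):
            start = i  # open a new run on any filler class
    return segments


# Made-up argmax output: an "um" for 3 frames, then an "uh" for 2 frames.
example = frames_to_segments([0, 0, 2, 2, 2, 0, 1, 1])
```

Given the Limitations note on subtype reliability, a caller may prefer to merge adjacent segments regardless of label and keep only the filler vs not_filler boundary.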
Usage
Python (PyTorch, source weights)
from transformers import AutoModelForAudioFrameClassification, AutoFeatureExtractor
import soundfile as sf
import torch
extractor = AutoFeatureExtractor.from_pretrained("detail-co/uhm")
model = AutoModelForAudioFrameClassification.from_pretrained("detail-co/uhm")
audio, sr = sf.read("in.wav") # expects 16 kHz mono
# Pass the file's own rate so a non-16 kHz file is rejected instead of silently misread.
inputs = extractor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad(): # inference only, no gradients needed
    logits = model(**inputs).logits # (1, T, 6)
preds = logits.argmax(-1) # (1, T): class index per 20 ms frame
Python (ONNX, pick a tier)
from huggingface_hub import hf_hub_download
import onnxruntime as ort
path = hf_hub_download("detail-co/uhm", "uhm-base-web-fp16.onnx") # or uhm-pro-...
session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
Core ML (iOS / macOS)
Download uhm-base.mlpackage.zip or uhm-pro.mlpackage.zip, unzip, and load with MLModel. Input shape (1, 480000) float32: 30 s at 16 kHz mono. Output (1, 1499, 6) softmax probabilities. Requires iOS 17 / macOS 14 or newer.
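Because the Core ML input is a fixed 480000-sample window, longer recordings have to be cut into 30 s pieces before inference. The sketch below shows one simple way to do that in Python. It assumes non-overlapping windows and zero-padding of the final window, neither of which is specified here. `iter_windows` is a hypothetical helper, and a filler that straddles a window boundary may be split under this scheme.

```python
from typing import Iterator, List, Tuple

SAMPLE_RATE = 16_000
WINDOW_SAMPLES = 480_000  # 30 s at 16 kHz, matching the (1, 480000) input


def iter_windows(samples: List[float]) -> Iterator[Tuple[float, List[float]]]:
    """Yield (offset_seconds, window) pairs of exactly WINDOW_SAMPLES each."""
    for start in range(0, len(samples), WINDOW_SAMPLES):
        window = list(samples[start:start + WINDOW_SAMPLES])
        # Assumption: pad the last, shorter window with silence (zeros).
        window += [0.0] * (WINDOW_SAMPLES - len(window))
        yield start / SAMPLE_RATE, window


# 70 s of silence is cut into three windows; the last one is mostly padding.
windows = list(iter_windows([0.0] * (70 * SAMPLE_RATE)))
offsets = [offset for offset, _ in windows]
```

Adding each window's offset to the per-frame times from that window gives timestamps relative to the full recording.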
Versioning
main always points at the latest released weights. Tagged releases (e.g. v2.0.0) are immutable snapshots; pin to one of these for reproducible builds.
The Hugging Face Hub APIs accept a revision= parameter on every download call to lock to a tag or commit SHA:
hf_hub_download("detail-co/uhm", "uhm-base.mlpackage.zip", revision="v2.0.0")
LFS file content is content-addressed via SHA-256; verify against the lfs.sha256 field returned by HfApi.list_repo_tree(...) if your deployment needs integrity checks.
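An integrity check of this kind can be sketched with the standard library alone. The example assumes you have already obtained the expected digest (for instance from the lfs.sha256 field mentioned above) and the local path of the downloaded file; `sha256_of_file` and `verify_download` are hypothetical helper names.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_sha256: str) -> bool:
    """Compare a local file's digest with the expected hex digest, ignoring case."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

A deployment script would call `verify_download(local_path, expected)` right after the download and refuse to load the model when it returns `False`.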
Limitations
- Trained on English; non-English performance relies on acoustic transfer and has not been measured against per-language ground truth.
- Best on podcast / meeting / talking-head audio. Heavy background music, laughter, or multi-speaker overlap will degrade quality.
- Type labels (uh / um / hmm / and / other) are secondary: trust filler vs not_filler more than the specific subtype.
Built on
- Base architecture and pretrained weights: ntu-spml/distilhubert (Apache 2.0). A distilled variant of facebook/hubert-base-ls960 (Apache 2.0).
- Public fine-tuning audio: AMI Meeting Corpus (edinburghcstr/ami, IHM split), CC BY 4.0, Edinburgh CSTR.
- Video content created by the Detail team: proprietary.
License
CC BY-NC 4.0. Free for research, evaluation, and personal use with attribution. Commercial use requires a separate license; contact paul@detail.co.