Uhm: filler-word detection

On-device filler-word detection. A frame-level classifier that finds "uh", "um", "hmm", and other filler sounds in audio and timestamps them at 20 ms resolution.

Trained on English speech; produces high-confidence detections on conversational Spanish, French, German, and Dutch audio without retraining.

Try it

Live demo: detail-co/uhm-demo.

Models

Two tiers. uhm-base.* is the free default (HuBERT-base, higher recall). uhm-pro.* is the Pro tier (DistilHuBERT, the distilled model: smaller, faster, more precise).

| Tier | Model ID | Backbone | Core ML | iPhone 17 Pro | iPhone 15 Pro | iPad Pro M4 | M4 Max |
|---|---|---|---|---|---|---|---|
| Default (free) | uhm-base.* | HuBERT-base | 8-bit, 90 MB | 2.20 s · 136× | 3.42 s · 88× | 2.16 s · 139× | 204× |
| Pro | uhm-pro.* | DistilHuBERT | fp16, 45 MB | 1.01 s · 296× | 1.78 s · 169× | 1.08 s · 279× | 445× |

Realtime factor = audio duration ÷ analyze time. Warm analyze over a real 5-minute speech clip (1 cold load + 5 warm runs), model load excluded (one-time per process). iPhone and iPad measured via the examples/ios-bench app (iOS 26.5 / iPadOS 26.3.1); M4 Max via coremltools predict (macOS 15). Pro (the distil) runs about 2.2× faster on-device and is 2× smaller than the default.
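As a quick check of the realtime-factor definition, the sketch below recomputes the iPhone 15 Pro column from the analyze times in the table. The function name is illustrative and not part of any shipped tool.

```python
def realtime_factor(audio_seconds: float, analyze_seconds: float) -> float:
    """Realtime factor = audio duration / analyze time (higher is faster)."""
    if analyze_seconds <= 0:
        raise ValueError("analyze_seconds must be positive")
    return audio_seconds / analyze_seconds


clip_seconds = 5 * 60  # the 5-minute benchmark clip

# iPhone 15 Pro analyze times taken from the table above.
print(f"default: {realtime_factor(clip_seconds, 3.42):.0f}x")  # default: 88x
print(f"pro:     {realtime_factor(clip_seconds, 1.78):.0f}x")  # pro:     169x
```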

Accuracy: precision (Pro) versus recall (default)

Measured on a held-out real-world set: 251 hand-labeled Pro-vs-default detection disagreements across 37 clips not in training.

  • On its unique fires, Pro (the distil) holds 81% precision; the default (HuBERT) holds 73%.
  • Pro uniquely caught 73 real fillers that the default missed. The default uniquely caught 103 that Pro missed, but at the cost of 22 extra false fires.
  • The default fires roughly 15% more often overall, but the surplus is mostly low-confidence noise (the gap widens to about 30% at the recall preset).

So it is a precision versus recall split, not a strict quality ranking. Pick Pro when a flagged filler gets cut without review; pick the default when you want maximum coverage and a human confirms.

The shipped settings are tuned for precision over recall. Use the --bias knob or the per-frame threshold to shift that trade-off if you need higher recall.
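To illustrate how a per-frame threshold moves the precision/recall trade-off, here is a minimal sketch. It assumes the filler probability of a frame is 1 minus the not_filler probability and that a frame is flagged once that value reaches the threshold. The function name and the 0.5 default are hypothetical, and this is not a description of how --bias is implemented.

```python
NOT_FILLER = 0  # class index 0 = not_filler


def flag_frames(frame_probs, threshold=0.5):
    """Return one bool per frame: True when the filler probability reaches the threshold.

    frame_probs is a sequence of per-frame softmax rows (6 values each).
    The 0.5 default is a placeholder, not the shipped setting.
    """
    return [(1.0 - probs[NOT_FILLER]) >= threshold for probs in frame_probs]


frames = [
    [0.90, 0.05, 0.02, 0.01, 0.01, 0.01],  # confidently not a filler
    [0.30, 0.50, 0.10, 0.05, 0.03, 0.02],  # likely "uh"
]
strict = flag_frames(frames, threshold=0.6)    # fewer flags, higher precision
lenient = flag_frames(frames, threshold=0.05)  # more flags, higher recall
```

Raising the threshold drops low-confidence frames first; lowering it trades extra false fires for coverage.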

Files

The filenames are tier-named: uhm-base.* is the HuBERT default, uhm-pro.* is the DistilHuBERT Pro. Both tiers ship three formats. The Core ML variants preserve 100% argmax agreement with the fp32 PyTorch reference on test inputs.

| Tier | File | Format | Size | Use |
|---|---|---|---|---|
| Default (HuBERT) | uhm-base.mlpackage.zip | Core ML 8-bit | ~88 MB (90 MB unpacked) | iOS / macOS on-device inference |
| Default (HuBERT) | uhm-base-web-fp16.onnx | ONNX fp16 | ~189 MB | Browser, server, Python; runs anywhere with onnxruntime |
| Default (HuBERT) | uhm-base.onnx | ONNX fp32 | ~378 MB | Quantization-free reference |
| Pro (DistilHuBERT) | uhm-pro.mlpackage.zip | Core ML fp16 | 45 MB unpacked | iOS / macOS on-device inference |
| Pro (DistilHuBERT) | uhm-pro-web-fp16.onnx | ONNX fp16 | ~51 MB | Browser, server, Python; runs anywhere with onnxruntime |
| Pro (DistilHuBERT) | uhm-pro.onnx | ONNX fp32 | ~98 MB | Quantization-free reference |

Why Pro ships fp16, not 8-bit: the distilled model is more sensitive to quantization than HuBERT. An 8-bit distil keeps only 97% (raw) / 95% (enhanced) detection agreement against fp16, while the 8-bit HuBERT holds 98% / 98%. fp16 is about 5% slower than 8-bit on-device but lossless, and still 2.2× faster than the default, so it is the right format to ship for the precision tier. The free default stays 8-bit at 90 MB (an fp16 HuBERT would be about 180 MB).

Source weights (safetensors-checkpoint/) are provided as a fine-tuning starting point alongside config.json, preprocessor_config.json, and labels.json.

Input / output

  • Input: 16 kHz mono audio, up to 30-second windows
  • Output: per-frame softmax over 6 classes, one prediction every 20 ms
  • Class indices: 0 = not_filler, 1 = uh, 2 = um, 3 = hmm, 4 = and, 5 = other
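The per-frame output can be collapsed into timestamped detections. The sketch below assumes you already have one class index per frame (for example from an argmax over the 6 classes) and merges consecutive frames with the same filler label into a segment. The function name and the tuple layout are illustrative choices, not part of the model's API.

```python
LABELS = ["not_filler", "uh", "um", "hmm", "and", "other"]
FRAME_MS = 20  # one prediction every 20 ms


def frames_to_segments(class_ids, frame_ms=FRAME_MS):
    """Merge runs of identical filler frames into (label, start_ms, end_ms) tuples."""
    segments = []
    run_start = 0
    current = 0  # 0 = not_filler, so nothing is open at the start
    # A trailing not_filler sentinel flushes a run that reaches the last frame.
    for i, class_id in enumerate(list(class_ids) + [0]):
        if class_id != current:
            if current != 0:
                segments.append((LABELS[current], run_start * frame_ms, i * frame_ms))
            run_start, current = i, class_id
    return segments


example = frames_to_segments([0, 0, 2, 2, 2, 0, 1, 1])
print(example)  # [('um', 40, 100), ('uh', 120, 160)]
```

Given the note under Limitations that subtype labels are secondary, a consumer may prefer to treat every non-zero class as a single "filler" label before merging.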

Usage

Python (PyTorch, source weights)

from transformers import AutoModelForAudioFrameClassification, AutoFeatureExtractor
import soundfile as sf

extractor = AutoFeatureExtractor.from_pretrained("detail-co/uhm")
model     = AutoModelForAudioFrameClassification.from_pretrained("detail-co/uhm")

audio, sr = sf.read("in.wav")  # 16 kHz mono
inputs    = extractor(audio, sampling_rate=16000, return_tensors="pt")
logits    = model(**inputs).logits          # (1, T, 6)
preds     = logits.argmax(-1)

Python (ONNX, pick a tier)

from huggingface_hub import hf_hub_download
import onnxruntime as ort

path    = hf_hub_download("detail-co/uhm", "uhm-base-web-fp16.onnx")  # or uhm-pro-...
session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
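The snippet above stops at session creation. A minimal sketch of running one window through it follows, under two assumptions that are not stated in this card: the ONNX graph takes a single waveform input laid out as (1, num_samples), and its first output is the (1, T, 6) score tensor. The input name and dtype are read from the session, not hard-coded, because the fp16 export may expect float16 or float32 input. The helper name is hypothetical.

```python
import numpy as np


def run_filler_model(session, audio):
    """Run one audio window through an onnxruntime.InferenceSession.

    Returns one class index per frame. Assumes (1, num_samples) input and
    a (1, T, 6) first output; check session.get_inputs() / get_outputs()
    against the actual export before relying on this.
    """
    model_input = session.get_inputs()[0]
    # NodeArg.type is a string such as "tensor(float)" or "tensor(float16)".
    dtype = np.float16 if "float16" in model_input.type else np.float32
    batch = np.asarray(audio, dtype=dtype)[np.newaxis, :]
    scores = session.run(None, {model_input.name: batch})[0]
    return scores.argmax(-1)[0]
```

Call it as run_filler_model(session, audio) with the session from above and a 16 kHz mono waveform.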

Core ML (iOS / macOS)

Download uhm-base.mlpackage.zip or uhm-pro.mlpackage.zip, unzip it, and load it with MLModel. The input has shape (1, 480000) float32, which is 30 s at 16 kHz mono. The output has shape (1, 1499, 6) and holds softmax probabilities. Requires iOS 17 / macOS 14 or newer.
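Because the Core ML input is a fixed 480000 samples, longer or shorter audio has to be cut into windows of exactly that length. The sketch below shows one way to do it, in Python for consistency with the other examples. Zero-padding the last window and using non-overlapping windows are assumptions; the card does not say how the reference pipeline handles window boundaries.

```python
WINDOW_SAMPLES = 30 * 16_000  # 480000 samples = 30 s at 16 kHz


def split_into_windows(samples, window=WINDOW_SAMPLES):
    """Split a mono waveform into fixed-length windows, zero-padding the last one."""
    windows = []
    for start in range(0, len(samples), window):
        chunk = list(samples[start:start + window])
        chunk += [0.0] * (window - len(chunk))  # pad only the final short chunk
        windows.append(chunk)
    return windows


# Tiny window size so the behaviour is easy to see.
demo = split_into_windows([1.0] * 5, window=4)
print(demo)  # [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, 0.0]]
```

Detections in the padded tail of the last window should be discarded, since those frames do not correspond to real audio.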

Versioning

main always points at the latest released weights. Tagged releases (e.g. v2.0.0) are immutable snapshots; pin to one of these for reproducible builds.

The Hugging Face Hub APIs accept a revision= parameter on every download call to lock to a tag or commit SHA:

hf_hub_download("detail-co/uhm", "uhm-base.mlpackage.zip", revision="v2.0.0")

LFS file content is content-addressed via SHA-256; verify against the lfs.sha256 field returned by HfApi.list_repo_tree(...) if your deployment needs integrity checks.
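A minimal sketch of the local half of that integrity check: hash the downloaded file with SHA-256 and compare it to an expected hex digest. The helper names are illustrative, and fetching the expected value from the Hub is left out; the code only assumes you already have it as a hex string.

```python
import hashlib


def sha256_of_file(path, chunk_bytes=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_sha256):
    """Compare a file's digest to an expected hex string, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

Reading in chunks keeps memory flat even for the ~378 MB fp32 file.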

Limitations

  • Trained on English; non-English performance is by acoustic transfer and has not been measured against per-language ground truth.
  • Best on podcast / meeting / talking-head audio. Heavy background music, laughter, or multi-speaker overlap will degrade quality.
  • Type labels (uh / um / hmm / and / other) are secondary: trust filler vs not_filler more than the specific subtype.

Built on

License

CC BY-NC 4.0. Free for research, evaluation, and personal use with attribution. Commercial use requires a separate license; contact paul@detail.co.
