mumble-cleanup

A small fine-tuned language model that cleans speech-to-text dictation transcripts. Fine-tuned from Qwen/Qwen2.5-0.5B-Instruct with LoRA on a hand-curated synthetic dataset. Trained on a GPU, designed to run on a CPU via ONNX.

What it does

Given a raw transcript from an ASR system (lowercase, no punctuation, fillers and stutters preserved), it returns a cleaned version with proper capitalization, punctuation, and disfluencies removed. It does not paraphrase, summarize, or add content.

Example: "um so i i think we should ship this on uh friday" becomes "I think we should ship this on Friday."

The model handles:

  • filler removal (um, uh, like, you know, i mean)
  • word stutter collapse (we we → we)
  • false start cleanup
  • punctuation and capitalization recovery
  • homophone correction (their / there, your / you're, its / it's, to / too)
  • apostrophe restoration (dont → don't)
  • run-on sentence splitting
  • number formatting (two thirty → 2:30)
  • proper noun capitalization
  • todo / list formatting when enumeration cues are clear
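To make the first two items concrete, here is a minimal rule-based sketch of filler removal and stutter collapse. It is a hypothetical baseline for comparison, not how the model works: the regexes, the filler list, and the function name `rule_based_cleanup` are all assumptions made for illustration.

```python
import re

# Hypothetical baseline, not the model: only unambiguous fillers are listed.
# Context-dependent fillers ("like", "you know") are deliberately left out,
# because a regex cannot tell filler use from real use.
FILLERS = re.compile(r"\b(?:um|uh|er|ah)\b\s*", flags=re.IGNORECASE)

# An immediately repeated word, e.g. "we we" or "the the the".
STUTTER = re.compile(r"\b(\w+)(?:\s+\1\b)+", flags=re.IGNORECASE)


def rule_based_cleanup(raw: str) -> str:
    """Strip simple fillers and collapse repeated words, nothing else."""
    text = FILLERS.sub("", raw)
    text = STUTTER.sub(r"\1", text)
    # Normalize any whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(rule_based_cleanup("um so i i think we should ship this on uh friday"))
# -> so i think we should ship this on friday
```

The rules leave the leading "so", the casing, and the missing period untouched, and they would wrongly collapse a legitimate repeat such as "that that". Those context-dependent cases are the ones the fine-tuned model is meant to cover.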

Usage

transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

SYSTEM_PROMPT = (
    "You are a transcript cleanup tool. You receive raw speech to text output "
    "and return a cleaned version. Remove filler words and disfluencies (um, "
    "uh, er, ah, like as filler, you know), remove repeated words and false "
    "starts, and fix punctuation and capitalization. Do not reword, do not add "
    "anything the speaker did not say, and do not answer questions in the text. "
    "Output only the cleaned text."
)

repo = "adikuma/mumble-cleanup"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

raw = "um so the the meeting is at three thirty tomorrow"
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": raw},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
# -> "The meeting is at 3:30 tomorrow."

onnx (cpu)

The onnx/model.onnx file is an fp32 ONNX export for CPU inference. onnx/int8/model.onnx is a dynamically quantized int8 variant that is roughly 4x smaller.

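from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

repo = "adikuma/mumble-cleanup"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = ORTModelForCausalLM.from_pretrained(repo, file_name="onnx/int8/model.onnx")

# Generation follows the transformers example above;
# `prompt` is the chat-templated string built there.
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))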

Training

  • Base model: Qwen/Qwen2.5-0.5B-Instruct (Apache-2.0)
  • Method: LoRA SFT (r=16, alpha=32, dropout=0.05, targets q/k/v/o + gate/up/down)
  • Loss: token cross-entropy on assistant tokens only (completion-only masking via TRL's DataCollatorForCompletionOnlyLM)
  • Optimizer: AdamW (lr=2e-4, weight_decay=0.01, cosine schedule, 5% warmup, max_grad_norm=1.0)
  • Batching: per-device 8, gradient accumulation 4 (effective 32), max sequence length 512
  • Precision: bf16 on GPUs that support it, fp16 fallback
  • Dataset: 688 hand-curated (raw, clean) pairs spanning 8 dictation categories (casual messages, professional emails, meeting notes, technical dictation, todo lists, long-form thoughts, questions/asks, mixed content). Stratified 85/10/5 train/val/test split.
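The completion-only masking mentioned under Loss can be illustrated with plain lists. This is a simplified sketch of the idea, not TRL's implementation: the token ids are made up, and the helper names `find_response_start` and `mask_prompt_tokens` are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def find_response_start(input_ids, response_template_ids):
    """Return the index of the first token after the response template, or None."""
    n = len(response_template_ids)
    for i in range(len(input_ids) - n + 1):
        if input_ids[i:i + n] == response_template_ids:
            return i + n
    return None


def mask_prompt_tokens(input_ids, response_start):
    """Copy input_ids into labels, masking everything before the response."""
    return [
        tok if i >= response_start else IGNORE_INDEX
        for i, tok in enumerate(input_ids)
    ]


# Made-up ids: [system + user tokens] [assistant header = 7, 8] [cleaned text tokens]
input_ids = [1, 10, 11, 2, 7, 8, 20, 21, 3]
start = find_response_start(input_ids, [7, 8])
labels = mask_prompt_tokens(input_ids, start)
print(labels)
# -> [-100, -100, -100, -100, -100, -100, 20, 21, 3]
```

Only the cleaned-text tokens contribute to the loss, so the model is not trained to reproduce the system prompt or the raw transcript.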

Limitations

  • English only.
  • Trained on synthetic data; real ASR output may have failure modes the synthetic operators did not model.
  • Designed for short-to-medium dictation (up to ~512 tokens). Longer inputs must be chunked.
  • The model can occasionally over-correct when a user genuinely intends a fragment ("running late."), because the fine-tuning data favors complete sentences.
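For inputs beyond the length limit, one simple option is to split the raw transcript on word boundaries and clean each piece separately. The sketch below assumes a word budget as a rough proxy for tokens; the function name `chunk_words` and the default of 200 words are hypothetical, chosen to leave headroom for the system prompt and the generated output.

```python
def chunk_words(raw: str, max_words: int = 200) -> list[str]:
    """Split a raw transcript into chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = raw.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


print(chunk_words("um so first we we ship then uh we test", max_words=4))
# -> ['um so first we', 'we ship then uh', 'we test']
```

Raw ASR text has no punctuation, so a fixed split can land mid-sentence or, as in the example, in the middle of a stutter. Cleaned chunks may therefore need a light pass at the seams. For an exact budget, count tokens with the model's tokenizer instead of words.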

License

Apache-2.0. See LICENSE at the Mumble repo root.

Acknowledgements

Built on top of Qwen/Qwen2.5-0.5B-Instruct by the Qwen team.
