mumble-cleanup
A small fine-tuned language model that cleans speech-to-text dictation transcripts. Fine-tuned from Qwen/Qwen2.5-0.5B-Instruct with LoRA on a hand-curated synthetic dataset. Trained on a GPU, designed to run on a CPU via ONNX.
What it does
Given a raw transcript from an ASR system (lowercase, no punctuation, fillers and stutters preserved), it returns a cleaned version with proper capitalization and punctuation and with disfluencies removed. It does not paraphrase, summarize, or add content.
Example: "um so i i think we should ship this on uh friday" becomes "I think we should ship this on Friday."
The model handles:
- filler removal (um, uh, like, you know, i mean)
- word stutter collapse (we we → we)
- false start cleanup
- punctuation and capitalization recovery
- homophone correction (their / there, your / you're, its / it's, to / too)
- apostrophe restoration (dont → don't)
- run-on sentence splitting
- number formatting (two thirty → 2:30)
- proper noun capitalization
- todo / list formatting when enumeration cues are clear
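The "does not add content" property can be spot-checked downstream with a simple word-level comparison. The following is a minimal sketch, not part of the model or its repository: the helper names `normalize` and `added_words` are hypothetical, and the check assumes that comparing lowercased, punctuation-free words is a good enough proxy.

```python
import re

def normalize(text: str) -> list[str]:
    """Lowercase, drop apostrophes (so "dont" matches "don't"), keep alphanumeric words."""
    text = text.lower().replace("'", "").replace("\u2019", "")
    return re.findall(r"[a-z0-9]+", text)

def added_words(raw: str, cleaned: str) -> list[str]:
    """Hypothetical helper: words in `cleaned` that never occur in `raw`."""
    seen = set(normalize(raw))
    return [w for w in normalize(cleaned) if w not in seen]

raw = "um so i i think we should ship this on uh friday"
cleaned = "I think we should ship this on Friday."
print(added_words(raw, cleaned))  # -> []
```

Some of the handled operations legitimately introduce new tokens (number formatting turns "two thirty" into "2:30", and homophone correction swaps one word for another), so a non-empty result is a prompt for review rather than proof of an error.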
Usage
transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

SYSTEM_PROMPT = (
    "You are a transcript cleanup tool. You receive raw speech to text output "
    "and return a cleaned version. Remove filler words and disfluencies (um, "
    "uh, er, ah, like as filler, you know), remove repeated words and false "
    "starts, and fix punctuation and capitalization. Do not reword, do not add "
    "anything the speaker did not say, and do not answer questions in the text. "
    "Output only the cleaned text."
)

repo = "adikuma/mumble-cleanup"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

raw = "um so the the meeting is at three thirty tomorrow"
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": raw},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
# -> "The meeting is at 3:30 tomorrow."
onnx (cpu)
The onnx/model.onnx file is an fp32 ONNX export for CPU inference. onnx/int8/model.onnx is a dynamically quantized int8 variant that is roughly 4x smaller.
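from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

repo = "adikuma/mumble-cleanup"
tokenizer = AutoTokenizer.from_pretrained(repo)
# Load the int8 variant; pass file_name="onnx/model.onnx" for the fp32 export instead.
model = ORTModelForCausalLM.from_pretrained(repo, file_name="onnx/int8/model.onnx")
# Prompt construction and model.generate(...) then follow the transformers example above.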
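The "roughly 4x smaller" figure follows from bytes per weight: fp32 stores 4 bytes per parameter and int8 stores 1. The back-of-the-envelope sketch below assumes a nominal 0.5 billion parameters (taken from the "0.5B" in the base model name, not measured from the exported files) and ignores graph overhead and any tensors that dynamic quantization leaves in fp32.

```python
# Assumption: nominal parameter count implied by the "0.5B" in the base model name.
NOMINAL_PARAMS = 0.5e9
BYTES_PER_WEIGHT = {"fp32": 4, "int8": 1}

def approx_size_gb(dtype: str, params: float = NOMINAL_PARAMS) -> float:
    """Weights-only size estimate in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_WEIGHT[dtype] / 1e9

fp32_gb = approx_size_gb("fp32")
int8_gb = approx_size_gb("int8")
print(f"fp32 ~{fp32_gb:.1f} GB, int8 ~{int8_gb:.1f} GB, ratio {fp32_gb / int8_gb:.0f}x")
# -> fp32 ~2.0 GB, int8 ~0.5 GB, ratio 4x
```

The actual files on disk may differ from these estimates, which is why the ratio is best read as approximate.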
Training
- Base model: Qwen/Qwen2.5-0.5B-Instruct (Apache-2.0)
- Method: LoRA SFT (r=16, alpha=32, dropout=0.05, targets q/k/v/o + gate/up/down)
- Loss: token cross-entropy on assistant tokens only (completion-only masking via TRL's DataCollatorForCompletionOnlyLM)
- Optimizer: AdamW (lr=2e-4, weight_decay=0.01, cosine schedule, 5% warmup, max_grad_norm=1.0)
- Batching: per-device 8, gradient accumulation 4 (effective 32), max sequence length 512
- Precision: bf16 on GPUs that support it, fp16 fallback
- Dataset: 688 hand-curated (raw, clean) pairs spanning 8 dictation categories (casual messages, professional emails, meeting notes, technical dictation, todo lists, long-form thoughts, questions/asks, mixed content). Stratified 85/10/5 train/val/test split.
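The batching and split figures above can be turned into rough per-epoch numbers. This is a sketch of the arithmetic only: it assumes a single GPU, and the split sizes are rounded estimates, since the exact counts produced by the stratified split are not stated.

```python
import math

DATASET_SIZE = 688
SPLIT = {"train": 0.85, "val": 0.10, "test": 0.05}
PER_DEVICE_BATCH = 8
GRAD_ACCUM = 4
NUM_DEVICES = 1  # assumption: single-GPU training

# Rounded estimates; a stratified split may land a few examples off per bucket.
approx_sizes = {name: round(DATASET_SIZE * frac) for name, frac in SPLIT.items()}

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES

# Upper estimate: the exact count depends on how the trainer treats the final partial batch.
steps_per_epoch = math.ceil(approx_sizes["train"] / effective_batch)

print(approx_sizes)      # -> {'train': 585, 'val': 69, 'test': 34}
print(effective_batch)   # -> 32
print(steps_per_epoch)   # -> 19
```

With fewer than 20 optimizer steps per epoch, the 5% warmup covers only a handful of steps unless training runs for many epochs.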
Limitations
- English only.
- Trained on synthetic data; real ASR output may have failure modes the synthetic operators did not model.
- Designed for short-to-medium dictation (up to ~512 tokens). Longer inputs must be chunked.
- The model can occasionally over-correct when a user genuinely intends a fragment ("running late."), because the fine-tuning data favors complete, fixed-up sentences.
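For inputs beyond the length limit, one simple chunking approach is to split on whitespace, since raw transcripts carry no punctuation to split on. The sketch below is an assumption-laden illustration: `chunk_words` is a hypothetical helper, word count is only a proxy for token count, and the default of 200 words is an arbitrary budget chosen to leave headroom for the system prompt and the generated output within ~512 tokens.

```python
def chunk_words(text: str, max_words: int = 200) -> list[str]:
    """Hypothetical helper: split a raw transcript into chunks of at most `max_words` words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

transcript = "um so first we we need to uh review the budget and then like plan the launch"
for chunk in chunk_words(transcript, max_words=6):
    print(chunk)
```

Each chunk would then be cleaned separately and the results joined. Fixed-size cuts can land mid-sentence, so capitalization and punctuation near chunk boundaries may need a second look.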
License
Apache-2.0. See LICENSE at the Mumble repo root.
Acknowledgements
Built on top of Qwen/Qwen2.5-0.5B-Instruct by the Qwen team.