Instructions to use adikuma/mumble-cleanup with libraries, inference providers, notebooks, and local apps.

Libraries

How to use adikuma/mumble-cleanup with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="adikuma/mumble-cleanup")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly (a text-only causal LM, so AutoModelForCausalLM is the matching class)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("adikuma/mumble-cleanup")
model = AutoModelForCausalLM.from_pretrained("adikuma/mumble-cleanup")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use adikuma/mumble-cleanup with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "adikuma/mumble-cleanup"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "adikuma/mumble-cleanup",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
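The curl call above sends a generic question, but this model is trained to clean transcripts under a specific system prompt. The following is a minimal Python sketch of the same request with that prompt attached. The system prompt is copied from the Usage section of this card; the helper names `build_payload` and `clean_via_server`, and the `temperature` and `max_tokens` values, are illustrative assumptions rather than settings documented by the author.

```python
import json
import urllib.request

# System prompt copied from the transformers example in this model card.
SYSTEM_PROMPT = (
    "You are a transcript cleanup tool. You receive raw speech to text output "
    "and return a cleaned version. Remove filler words and disfluencies (um, "
    "uh, er, ah, like as filler, you know), remove repeated words and false "
    "starts, and fix punctuation and capitalization. Do not reword, do not add "
    "anything the speaker did not say, and do not answer questions in the text. "
    "Output only the cleaned text."
)


def build_payload(raw_transcript, model="adikuma/mumble-cleanup"):
    """Build an OpenAI-style chat request body for one raw transcript."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": raw_transcript},
        ],
        # Assumed values: deterministic decoding, mirroring do_sample=False
        # and max_new_tokens=128 in the transformers example.
        "temperature": 0,
        "max_tokens": 128,
    }


def clean_via_server(raw_transcript, url="http://localhost:8000/v1/chat/completions"):
    """POST the payload to a running server and return the cleaned text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(raw_transcript)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the vLLM server from the commands above running, `clean_via_server("um so the the meeting is at three thirty tomorrow")` would return the model's cleaned text. For the SGLang server, pass the port 30000 URL instead.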

Use Docker

docker model run hf.co/adikuma/mumble-cleanup

SGLang

How to use adikuma/mumble-cleanup with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "adikuma/mumble-cleanup" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "adikuma/mumble-cleanup",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "adikuma/mumble-cleanup" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use adikuma/mumble-cleanup with Docker Model Runner:
```
docker model run hf.co/adikuma/mumble-cleanup
```

mumble-cleanup

A small fine-tuned language model that cleans speech-to-text dictation transcripts. Fine-tuned from Qwen/Qwen2.5-0.5B-Instruct with LoRA on a hand-curated synthetic dataset. Trained on a GPU, designed to run on a CPU via ONNX.

What it does

Given a raw transcript from an ASR system (lowercase, no punctuation, fillers and stutters preserved), it returns a cleaned version with proper capitalization and punctuation and with disfluencies removed. It does not paraphrase, summarize, or add content.

Example: "um so i i think we should ship this on uh friday" becomes "I think we should ship this on Friday."
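Because the model is meant to clean rather than rewrite, a caller may want a cheap guard that flags outputs containing words the speaker never said. The sketch below is one such heuristic, not part of the model or its repository; the function names `words` and `added_words` are hypothetical.

```python
import re


def words(text):
    """Lowercase the text and split it into word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def added_words(raw, cleaned):
    """Return words in the cleaned text that never occur in the raw transcript."""
    seen = set(words(raw))
    return [w for w in words(cleaned) if w not in seen]


raw = "um so i i think we should ship this on uh friday"
cleaned = "I think we should ship this on Friday."
print(added_words(raw, cleaned))  # -> []
```

Legitimate edits such as number formatting (two thirty → 2:30), apostrophe restoration, and homophone correction also introduce new tokens, so a non-empty result is a prompt for review rather than proof of paraphrasing.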

The model handles:

filler removal (um, uh, like, you know, i mean)
word stutter collapse (we we → we)
false start cleanup
punctuation and capitalization recovery
homophone correction (their / there, your / you're, its / it's, to / too)
apostrophe restoration (dont → don't)
run-on sentence splitting
number formatting (two thirty → 2:30)
proper noun capitalization
todo / list formatting when enumeration cues are clear
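To make the first two items concrete, here is a naive rule-based baseline that only drops unambiguous fillers and collapses immediate word repeats. It is an illustration of what those operations mean, not how the model works; `FILLERS` and `naive_cleanup` are hypothetical names.

```python
FILLERS = {"um", "uh"}  # unambiguous fillers only; "like" and "you know" need context


def naive_cleanup(raw):
    """Drop filler words, collapse immediate word repeats, then capitalize and add a period."""
    kept = []
    for word in raw.split():
        if word in FILLERS:
            continue  # filler removal
        if kept and kept[-1] == word:
            continue  # stutter collapse: "we we" -> "we"
        kept.append(word)
    sentence = " ".join(kept)
    return sentence[:1].upper() + sentence[1:] + "."


print(naive_cleanup("um so i i think we should ship this on uh friday"))
# -> "So i think we should ship this on friday."
```

The rules leave the leading "so", the lowercase "i", and the uncapitalized "friday" in place, and they would wrongly collapse intentional repeats such as "that that". Those context-dependent cases are where a learned model is expected to help.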

Usage

transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

SYSTEM_PROMPT = (
    "You are a transcript cleanup tool. You receive raw speech to text output "
    "and return a cleaned version. Remove filler words and disfluencies (um, "
    "uh, er, ah, like as filler, you know), remove repeated words and false "
    "starts, and fix punctuation and capitalization. Do not reword, do not add "
    "anything the speaker did not say, and do not answer questions in the text. "
    "Output only the cleaned text."
)

repo = "adikuma/mumble-cleanup"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

raw = "um so the the meeting is at three thirty tomorrow"
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": raw},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
# -> "The meeting is at 3:30 tomorrow."

onnx (cpu)

The file onnx/model.onnx is an fp32 ONNX export for CPU inference; onnx/int8/model.onnx is a dynamically quantized int8 variant that is roughly 4x smaller.

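from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

repo = "adikuma/mumble-cleanup"
tokenizer = AutoTokenizer.from_pretrained(repo)
# Load the int8 export; the fp32 export lives at onnx/model.onnx.
model = ORTModelForCausalLM.from_pretrained(repo, file_name="onnx/int8/model.onnx")
# model.generate(...) can then be called as in the transformers example above.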
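The "roughly 4x smaller" figure follows from bytes per weight: fp32 stores 4 bytes per parameter and int8 stores 1. A back-of-the-envelope sketch, assuming the nominal 0.5B parameter count and ignoring quantization scales and any tensors left in fp32:

```python
def weight_size_gb(n_params, bytes_per_param):
    """Approximate on-disk size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 0.5e9  # nominal size from the model name; the exact count differs slightly

fp32 = weight_size_gb(N_PARAMS, 4)  # 4 bytes per fp32 weight
int8 = weight_size_gb(N_PARAMS, 1)  # 1 byte per int8 weight
print(f"fp32 ~{fp32:.1f} GB, int8 ~{int8:.1f} GB, ratio {fp32 / int8:.0f}x")
# -> fp32 ~2.0 GB, int8 ~0.5 GB, ratio 4x
```

Actual file sizes will deviate from this estimate because dynamic quantization typically converts only some operators.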

Training

Base model: Qwen/Qwen2.5-0.5B-Instruct (Apache-2.0)
Method: LoRA SFT (r=16, alpha=32, dropout=0.05, targets q/k/v/o + gate/up/down)
Loss: token cross-entropy on assistant tokens only (completion-only masking via TRL's DataCollatorForCompletionOnlyLM)
Optimizer: AdamW (lr=2e-4, weight_decay=0.01, cosine schedule, 5% warmup, max_grad_norm=1.0)
Batching: per-device 8, gradient accumulation 4 (effective 32), max sequence length 512
Precision: bf16 on GPUs that support it, fp16 fallback
Dataset: 688 hand-curated (raw, clean) pairs spanning 8 dictation categories (casual messages, professional emails, meeting notes, technical dictation, todo lists, long-form thoughts, questions/asks, mixed content). Stratified 85/10/5 train/val/test split.
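The completion-only masking named under Loss can be pictured as follows: prompt positions receive the label -100, which PyTorch's cross-entropy ignores by default, so only assistant tokens contribute to the loss. This is a simplified pure-Python sketch of the idea, not TRL's implementation; the token ids, per-token losses, and function names are made up for illustration, and the one-position label shift used in causal LM training is omitted.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default


def mask_prompt_labels(input_ids, response_start):
    """Copy input_ids into labels, hiding every position before the assistant response."""
    return [IGNORE_INDEX] * response_start + list(input_ids[response_start:])


def masked_mean_nll(token_nlls, labels):
    """Average per-token negative log-likelihood over unmasked positions only."""
    kept = [nll for nll, label in zip(token_nlls, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Hypothetical sequence: 4 prompt tokens (system + user) followed by 2 assistant tokens.
input_ids = [11, 12, 13, 14, 15, 16]
labels = mask_prompt_labels(input_ids, response_start=4)
print(labels)  # -> [-100, -100, -100, -100, 15, 16]

# Hypothetical per-token losses; the four prompt positions are ignored.
print(masked_mean_nll([2.0, 2.0, 2.0, 2.0, 0.5, 1.5], labels))  # -> 1.0
```

Masking the prompt keeps the model from spending capacity on reproducing the system prompt and the raw transcript, so the gradient comes only from the cleaned output.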

Limitations

English only.
Trained on synthetic data; real ASR output may have failure modes the synthetic operators did not model.
Designed for short-to-medium dictation (up to ~512 tokens). Longer inputs must be chunked.
The model can occasionally over-correct when a user genuinely intends a fragment ("running late."), because the fine-tune favors fixed-up sentences.
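For inputs beyond the supported length, one simple approach is to split the raw transcript into fixed-size word windows and clean each window separately. The sketch below assumes a word budget as a rough proxy for tokens; `chunk_transcript` and the default of 150 words are illustrative choices, not values from the author, and should leave room for the system prompt and the generated output within the ~512-token limit.

```python
def chunk_transcript(raw, max_words=150):
    """Split a transcript into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    tokens = raw.split()
    return [
        " ".join(tokens[start:start + max_words])
        for start in range(0, len(tokens), max_words)
    ]


chunks = chunk_transcript("a b c d e f g", max_words=3)
print(chunks)  # -> ['a b c', 'd e f', 'g']
```

Raw ASR text has no punctuation to split on, so a window boundary can fall mid-sentence; the cleaned chunks may then need their joining punctuation and capitalization reviewed.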

License

Apache-2.0. See LICENSE at the Mumble repo root.

Acknowledgements

Built on top of Qwen/Qwen2.5-0.5B-Instruct by the Qwen team.

Model size: 0.5B params (Safetensors, tensor type F32)

Model tree for adikuma/mumble-cleanup

Base model: Qwen/Qwen2.5-0.5B
Finetuned: Qwen/Qwen2.5-0.5B-Instruct
Adapter: this model