Whisper-Small for English-Chinese Code-Switching ASR

Fine-tuned openai/whisper-small (244M params) for English-Chinese (Mandarin) code-switching automatic speech recognition, targeting customer service use cases.

Training Recipe

Component     Detail
Base Model    openai/whisper-small (244M params)
Method        LoRA (r=32, α=64) on q_proj, v_proj, k_proj, out_proj (sketch below)
Dataset       CAiRE/ASCEND — 10.62h spontaneous EN-ZH code-switching
Key Trick     Switching Tokenizer: language-aware prefix tokens per utterance
Optimizer     AdamW, lr=1e-3, warmup=100 steps
Epochs        10
Batch Size    8 × 2 gradient accumulation = 16 effective
Hardware      A10G (24GB VRAM)
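
A minimal PEFT setup matching this recipe might look like the following sketch; lora_dropout is an assumption, since the recipe does not specify it.

from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

# Wrap the base model with LoRA adapters on the attention projections
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj", "k_proj", "out_proj"],
    lora_dropout=0.05,  # assumption: dropout is not stated in the recipe
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a few percent of the 244M params train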

Literature Basis

This model implements findings from multiple papers:

  1. "Improving Code Switching with SFT and GELU Adapters" (2506.00291) — Switching Tokenizer trick reduces ASCEND Total MER from 25.5% → 17.1% (→ 9.4% with GELU adapters)
  2. "LoRA-Whisper: Parameter-Efficient Multilingual ASR" (2406.06619) — LoRA r=32 optimal, matches full fine-tuning at 5% trainable params
  3. "CS-Dialogue" (2502.18913) — Whisper-Medium fine-tuned achieves 7.53% MER on 104h code-switching dialogue

Critical Design Decisions

  • forced_decoder_ids = None: disabled so the model can switch languages mid-utterance (Whisper's default decoding forces a single language for the whole clip)
  • suppress_tokens = []: allow every token, including both language tokens
  • Language-aware tokenization: Chinese utterances are prefixed with <|zh|>, English with <|en|>, and mixed utterances with <|zh|> as the matrix language (see the sketch after this list)
  • No new tokens added: extending Whisper's tokenizer breaks the pretrained embeddings (per 2506.00291)
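
A minimal sketch of how the per-utterance language prefixes can be applied when encoding training labels. set_prefix_tokens is the standard WhisperTokenizer API; the zh/en/mixed labels come from ASCEND.

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", task="transcribe")

def encode_labels(text: str, language: str) -> list[int]:
    # Mixed utterances take the matrix language (Chinese); otherwise zh/en as labeled
    lang = "chinese" if language in ("zh", "mixed") else "english"
    tokenizer.set_prefix_tokens(language=lang, task="transcribe")
    return tokenizer(text).input_ids  # prefix tokens (<|zh|> or <|en|>) included

print(encode_labels("我想refund这个product", "mixed")[:4])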

Usage

Inference

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel
import librosa
import torch

# Load base model + LoRA adapter
processor = WhisperProcessor.from_pretrained("timliau/whisper-small-zh-en-code-switching")
base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(base_model, "timliau/whisper-small-zh-en-code-switching")
model = model.merge_and_unload()  # merge LoRA weights into the base model for faster inference
model.eval()

# Disable forced language so the model can switch mid-utterance
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

# Load audio at Whisper's expected 16 kHz sampling rate
audio, sr = librosa.load("your_audio.wav", sr=16000)
input_features = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features

# Transcribe
with torch.no_grad():
    predicted_ids = model.generate(input_features)

transcription = processor.tokenizer.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
# Example output: "我刚刚跟customer service讲了一下他们说可以refund"

Training

pip install transformers datasets peft accelerate evaluate jiwer librosa soundfile torch trackio
python train.py
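
For reference, training arguments consistent with the recipe above might look like this sketch; output_dir and fp16 are assumptions not stated in the recipe.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-zh-en-cs",  # assumption: path not given
    learning_rate=1e-3,
    warmup_steps=100,
    num_train_epochs=10,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,  # 8 × 2 = 16 effective batch size
    fp16=True,  # assumption: mixed precision to fit the 24GB A10G comfortably
)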

Dataset: CAiRE/ASCEND

  • 10.62 hours of spontaneous code-switching conversation
  • 23 bilingual speakers from Hong Kong
  • 3 splits: train (8.5h), validation (1h), test (~1h)
  • Language labels: zh, en, mixed per utterance

Language  Train Count
mixed     ~5,000
zh        ~3,500
en        ~1,500
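
A quick way to verify the splits and per-language counts above, assuming the dataset exposes a language column with zh/en/mixed values:

from collections import Counter
from datasets import load_dataset

ascend = load_dataset("CAiRE/ASCEND")
print(ascend)  # train / validation / test splits

# Per-language utterance counts in the training split
print(Counter(ascend["train"]["language"]))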

Improving Further

For better performance, consider:

  1. Scale up the base model: openai/whisper-large-v3 (1.5B params); the papers cited above consistently show larger models handle code-switching better
  2. Add GELU encoder adapters: the adapter stage of 2506.00291 pushes MER from 17.1% → 9.4%
  3. More data: apply for SEAME (115h EN-ZH CS) or CS-Dialogue (104h)
  4. Consider SenseVoice: FunAudioLLM/SenseVoiceSmall reaches 6.71% MER zero-shot on code-switching and includes built-in emotion detection, useful for customer service
  5. Domain adaptation: fine-tune further on your own customer service call recordings

Evaluation Metrics

The standard metric for code-switching ASR is Mixed Error Rate (MER): Chinese text is split into characters and English text into words, and a single edit-distance error rate is computed over the mixed token sequence. This amounts to CER on the Chinese segments and WER on the English segments. A sketch of the computation follows.
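
A minimal sketch of the usual MER tokenization using jiwer (already in the training requirements); this illustrates the metric, not necessarily the exact scoring script behind the numbers above.

import re
import jiwer

def mixed_tokenize(text: str) -> str:
    # Treat each CJK character and each Latin word as one token, so a
    # word-level error rate over these tokens is exactly the MER
    tokens = re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text)
    return " ".join(tokens)

reference = "我刚刚跟customer service讲了一下"
hypothesis = "我刚跟customer service讲一下"
mer = jiwer.wer(mixed_tokenize(reference), mixed_tokenize(hypothesis))
print(f"MER: {mer:.2%}")  # 2 deletions over 10 reference tokens = 20.00%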

License

Apache 2.0 (same as Whisper)
