Improving Code Switching with Supervised Fine Tuning and GELU Adapters
Paper: 2506.00291
Fine-tuned openai/whisper-small (244M params) for English-Chinese (Mandarin) code-switching automatic speech recognition, targeting customer service use cases.
| Component | Detail |
|---|---|
| Base Model | openai/whisper-small (244M params) |
| Method | LoRA (r=32, α=64) on q_proj, v_proj, k_proj, out_proj |
| Dataset | CAiRE/ASCEND — 10.62h spontaneous EN-ZH code-switching |
| Key Trick | Switching Tokenizer: language-aware prefix tokens per utterance |
| Optimizer | AdamW, lr=1e-3, warmup=100 steps |
| Epochs | 10 |
| Batch Size | 8 × 2 gradient accumulation = 16 effective |
| Hardware | A10G (24GB VRAM) |
The decoding configuration follows findings from the code-switching ASR literature:
- `forced_decoder_ids = None`: disabled so the model can handle code-switching (the default forces a single language)
- `suppress_tokens = []`: allow all tokens, including both language tokens
- Language prefixes: Chinese utterances use `<|zh|>`, English use `<|en|>`, mixed use `<|zh|>` (the matrix language)

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel
import torch
import librosa

# Load base model + LoRA adapter
processor = WhisperProcessor.from_pretrained("timliau/whisper-small-zh-en-code-switching")
base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(base_model, "timliau/whisper-small-zh-en-code-switching")
model = model.merge_and_unload()  # merge LoRA weights for faster inference

# Disable forced language so the model can switch freely
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

# Transcribe
audio, sr = librosa.load("your_audio.wav", sr=16000)
input_features = processor.feature_extractor(
    audio, sampling_rate=16000, return_tensors="pt"
).input_features

with torch.no_grad():
    predicted_ids = model.generate(input_features)

transcription = processor.tokenizer.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
# Example output: "我刚刚跟customer service讲了一下他们说可以refund"
# (≈ "I just talked to customer service and they said they can refund")
```
```shell
pip install transformers datasets peft accelerate evaluate jiwer librosa soundfile torch trackio
python train.py
```
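The hyperparameters in the table above map onto `transformers` training arguments roughly as follows. This is a hedged sketch of what `train.py` might configure, not the card's actual script; the output directory and `fp16` flag are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# Values taken from the hyperparameter table above;
# output_dir and fp16 are assumptions (fp16 is plausible on a 24GB A10G).
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-zh-en-cs",  # assumed path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,          # 8 × 2 = 16 effective
    learning_rate=1e-3,
    warmup_steps=100,
    num_train_epochs=10,
    fp16=True,
    predict_with_generate=True,
)
```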
Each utterance carries a language label (zh, en, or mixed):

| Language | Train Count |
|---|---|
| mixed | ~5,000 |
| zh | ~3,500 |
| en | ~1,500 |
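The "Switching Tokenizer" trick maps these per-utterance labels onto Whisper's language tokens as described earlier (Chinese and mixed utterances get `<|zh|>`, since Mandarin is the matrix language; English gets `<|en|>`). A minimal sketch, with a hypothetical helper name:

```python
def prefix_token(lang_label: str) -> str:
    """Map an ASCEND-style utterance label to a Whisper language token.

    Mixed utterances use <|zh|> because Mandarin is the matrix language.
    """
    return "<|en|>" if lang_label == "en" else "<|zh|>"

# During training, the chosen token is prepended to each utterance's
# decoder prompt.
print(prefix_token("mixed"))  # -> <|zh|>
```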
For better performance, consider:
- openai/whisper-large-v3 (1.5B params): the literature shows that larger models handle code-switching better

The standard metric for code-switching ASR is Mixed Error Rate (MER): a combination of Word Error Rate (WER) for English segments and Character Error Rate (CER) for Chinese segments.
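One common way to compute MER is to tokenize mixed text so that each Chinese character and each English word counts as one unit, then run an ordinary edit-distance error rate over the mixed sequence. A minimal pure-Python sketch (function names are illustrative, not from the card):

```python
import re

def mixed_tokens(text: str) -> list[str]:
    """Split text into tokens: one per CJK character, one per Latin word."""
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text)

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Standard Levenshtein distance over token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def mer(reference: str, hypothesis: str) -> float:
    """Mixed Error Rate: edit distance over mixed CJK-char/Latin-word tokens."""
    ref, hyp = mixed_tokens(reference), mixed_tokens(hypothesis)
    return edit_distance(ref, hyp) / len(ref)

# "我要refund" vs "我想refund": one substituted character out of 3 tokens
print(round(mer("我要refund", "我想refund"), 3))  # -> 0.333
```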
License: Apache 2.0 (same as Whisper)