MMS-1B Robust Language Identification (LID) with Viterbi Decoding
This model is a robust Language Identification (LID) system based on Meta's MMS-1B (wav2vec2) architecture. It identifies 19 languages + Silence using Frame-wise Cross Entropy Loss and employs a Viterbi Decoder for stable, smoothed timestamp generation.
Unlike standard classification models that suffer from "flickering" (rapidly switching languages), this model uses a transition penalty (probability shift) to enforce segment continuity. This makes it ideal for Code-Switching detection, Speech Segmentation, and ASR preprocessing.
🌟 Key Features
- Backbone: `facebook/mms-1b-fl102` (frozen feature extractor + trainable top layers).
- Method: Frame-wise classification (Cross Entropy) instead of CTC.
- Decoding: Viterbi Algorithm with Transition Penalty to reduce noise.
- Output: Precise start/end timestamps for each language segment.
- Support: 19 Languages + Silence detection.
- Easy Deployment: Pip-installable package with automatic model downloading.
📦 Installation
You can install the inference package via PyPI:
```bash
pip install mms-lid
```
Note: Requires Python 3.8+ and PyTorch 2.0+.
🚀 Usage
The model weights will be automatically downloaded from this Hugging Face repository upon the first execution.
1. Python API
Use the `LIDPipeline` class to process audio files or tensors directly in your Python code.

```python
from mms_lid import LIDPipeline

# Initialize the pipeline
# (downloads the model automatically to ./weights/ on the first run)
pipeline = LIDPipeline()

# Predict (supports a file path or a torch tensor)
audio_path = "path/to/your/audio.wav"
segments = pipeline.predict(audio_path)

# Print the results
print(f"{'Start':<8} | {'End':<8} | {'Language'}")
print("-" * 30)
for seg in segments:
    print(f"{seg['start']:<8.2f} | {seg['end']:<8.2f} | {seg['label']}")
```
2. Command Line Interface (CLI)
You can also use the terminal command `mms_lid` to process files and save the results to JSON.

```bash
# Basic usage
mms_lid input.wav --output result.json

# Specify a custom model path (optional)
mms_lid input.wav --model_path ./my_model.pth
```
Output format (JSON):

```json
[
  {
    "label": "ko",
    "start": 0.0,
    "end": 4.22,
    "id": 0
  },
  {
    "label": "en",
    "start": 4.22,
    "end": 8.50,
    "id": 1
  }
]
```
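Because the output is plain JSON, downstream tooling only needs the standard library. As an example, the short script below (field names taken from the sample above) totals the detected time per language:

```python
import json
from collections import defaultdict

with open("result.json") as f:
    segments = json.load(f)

# Sum segment durations per language label.
totals = defaultdict(float)
for seg in segments:
    totals[seg["label"]] += seg["end"] - seg["start"]

for label, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{label:<10} {seconds:6.2f} s")
```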
🌍 Supported Languages
The model distinguishes between Silence and the following 19 languages:
| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| ko | Korean | en | English | ja | Japanese |
| zh | Chinese | fr | French | de | German |
| es | Spanish | it | Italian | pt | Portuguese |
| ru | Russian | ar | Arabic | hi | Hindi |
| tr | Turkish | ms | Malay | da | Danish |
| fi | Finnish | nl | Dutch | no | Norwegian |
| sv | Swedish | `<silence>` | Non-speech | | |
🛠 Model Details
Architecture
- Input: 16 kHz raw audio waveform (mono).
- Encoder: Wav2Vec2 (MMS-1B); the last 6 encoder layers were fine-tuned.
- Head: Linear projection to 20 classes (19 languages + 1 silence); see the sketch below.
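As a rough illustration only (not the packaged implementation), the setup above can be sketched as a Wav2Vec2 encoder with everything frozen except the last few transformer layers, topped by a per-frame linear classifier. The class and parameter names below are hypothetical.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

NUM_CLASSES = 20  # 19 languages + silence

class FrameLID(nn.Module):
    """Sketch: MMS-1B encoder with a per-frame linear LID head."""

    def __init__(self, backbone: str = "facebook/mms-1b-fl102", trainable_layers: int = 6):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(backbone)

        # Freeze everything, then unfreeze only the last N transformer layers.
        for p in self.encoder.parameters():
            p.requires_grad = False
        for layer in self.encoder.encoder.layers[-trainable_layers:]:
            for p in layer.parameters():
                p.requires_grad = True

        self.head = nn.Linear(self.encoder.config.hidden_size, NUM_CLASSES)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        # input_values: (batch, samples) of 16 kHz mono audio
        hidden = self.encoder(input_values).last_hidden_state  # (batch, frames, dim)
        return self.head(hidden)                               # (batch, frames, 20)
```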
Decoding Logic (Viterbi)
This model does not use simple argmax. Instead, it uses Viterbi decoding with a transition scale parameter.
- Transition Penalty: A penalty is applied when the predicted language changes. This discourages the model from changing languages too frequently due to short noise or uncertain frames.
- Result: This produces cleaner, more coherent segmentation compared to standard frame-wise classification (see the sketch below).
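A minimal sketch of the idea, independent of the packaged decoder: run Viterbi over the per-frame log-probabilities with a flat transition matrix that subtracts a fixed penalty whenever the label changes. The function name and penalty value are illustrative.

```python
import numpy as np

def viterbi_smooth(log_probs: np.ndarray, switch_penalty: float = 5.0) -> np.ndarray:
    """Viterbi decoding over frame-wise class log-probabilities.

    log_probs: (frames, classes); switch_penalty: cost subtracted whenever
    the predicted label changes between consecutive frames.
    Returns one smoothed label index per frame.
    """
    n_frames, n_classes = log_probs.shape
    # Transition scores: 0 for staying in the same class, -penalty for switching.
    transition = np.full((n_classes, n_classes), -switch_penalty)
    np.fill_diagonal(transition, 0.0)

    score = log_probs[0].copy()
    backptr = np.zeros((n_frames, n_classes), dtype=np.int64)
    for t in range(1, n_frames):
        cand = score[:, None] + transition      # cand[prev, cur]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_probs[t]

    # Backtrace the best path.
    path = np.empty(n_frames, dtype=np.int64)
    path[-1] = int(score.argmax())
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path
```

With `switch_penalty = 0` this reduces to per-frame argmax; larger values produce longer, more stable segments.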
Training Configuration
- Loss Function: CrossEntropyLoss
- Dataset: FLEURS (Fine-tuned for LID tasks)
- Data Augmentation: Random mixing of multi-language audio segments during training to simulate code-switching scenarios (a toy sketch follows below).
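To make the mixing and the frame-wise loss concrete, the toy sketch below concatenates two single-language clips, builds frame-level targets at an assumed wav2vec2-style frame rate (~1 frame per 20 ms), and applies frame-wise cross entropy. The constants, label IDs, and helper names are assumptions for illustration, not the actual training code.

```python
import torch
import torch.nn.functional as F

FRAMES_PER_SECOND = 50  # wav2vec2-style encoders emit ~1 frame per 20 ms (assumption)

def mix_clips(clip_a, clip_b, label_a, label_b, sample_rate=16000):
    """Concatenate two single-language clips and build frame-level LID targets."""
    audio = torch.cat([clip_a, clip_b])
    frames_a = int(len(clip_a) / sample_rate * FRAMES_PER_SECOND)
    frames_b = int(len(clip_b) / sample_rate * FRAMES_PER_SECOND)
    targets = torch.cat([
        torch.full((frames_a,), label_a, dtype=torch.long),
        torch.full((frames_b,), label_b, dtype=torch.long),
    ])
    return audio, targets

def frame_ce_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """CrossEntropyLoss applied per frame: logits (frames, classes), targets (frames,)."""
    # In practice the encoder's exact frame count may differ slightly from the
    # estimate above, so targets would be trimmed or padded to match the logits.
    return F.cross_entropy(logits, targets)
```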
⚖️ License
This project is licensed under the CC-BY-NC 4.0 License. Based on Meta's MMS Model (CC-BY-NC 4.0).
👨‍💻 Author
- Developed by: N01N9
- Repository: HuggingFace Repo