MMS-1B Robust Language Identification (LID) with Viterbi Decoding
This model is a robust Language Identification (LID) system based on Meta's MMS-1B (wav2vec2) architecture. It identifies 19 languages + Silence using Frame-wise Cross Entropy Loss and employs a Viterbi Decoder for stable, smoothed timestamp generation.
Unlike standard classification models that suffer from "flickering" (rapidly switching languages), this model uses a transition penalty (probability shift) to enforce segment continuity. This makes it ideal for Code-Switching detection, Speech Segmentation, and ASR preprocessing.
🌟 Key Features
- Backbone: `facebook/mms-1b-fl102` (frozen feature extractor + trainable top layers).
- Method: Frame-wise classification (Cross Entropy) instead of CTC.
- Decoding: Viterbi Algorithm with Transition Penalty to reduce noise.
- Output: Precise start/end timestamps for each language segment.
- Support: 19 Languages + Silence detection.
- Easy Deployment: Pip-installable package with automatic model downloading.
📦 Installation
You can install the inference package via PyPI:
```bash
pip install mms-lid
```
Note: Requires Python 3.8+ and PyTorch 2.0+.
🚀 Usage
The model weights will be automatically downloaded from this Hugging Face repository upon the first execution.
1. Python API
Use the `LIDPipeline` class to process audio files or tensors directly in your Python code.

```python
from mms_lid import LIDPipeline

# Initialize the pipeline
# (downloads the model automatically to ./weights/ on the first run)
pipeline = LIDPipeline()

# Predict (supports a file path or a torch tensor)
audio_path = "path/to/your/audio.wav"
segments = pipeline.predict(audio_path)

# Print the results
print(f"{'Start':<8} | {'End':<8} | {'Language'}")
print("-" * 30)
for seg in segments:
    print(f"{seg['start']:<8.2f} | {seg['end']:<8.2f} | {seg['label']}")
```
2. Command Line Interface (CLI)
You can also use the terminal command `mms_lid` to process files and save the results to JSON.

```bash
# Basic usage
mms_lid input.wav --output result.json

# Specify a custom model path (optional)
mms_lid input.wav --model_path ./my_model.pth
```
Output format (JSON):

```json
[
  {
    "label": "ko",
    "start": 0.0,
    "end": 4.22,
    "id": 0
  },
  {
    "label": "en",
    "start": 4.22,
    "end": 8.50,
    "id": 1
  }
]
```
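Because the output is plain JSON, downstream tooling only needs the standard library. As an example, the short script below (field names taken from the sample above) totals the detected time per language:

```python
import json
from collections import defaultdict

with open("result.json") as f:
    segments = json.load(f)

# Sum segment durations per language label.
totals = defaultdict(float)
for seg in segments:
    totals[seg["label"]] += seg["end"] - seg["start"]

for label, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{label:<10} {seconds:6.2f} s")
```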
🌍 Supported Languages
The model distinguishes between Silence and the following 19 languages:
| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| ko | Korean | en | English | ja | Japanese |
| zh | Chinese | fr | French | de | German |
| es | Spanish | it | Italian | pt | Portuguese |
| ru | Russian | ar | Arabic | hi | Hindi |
| tr | Turkish | ms | Malay | da | Danish |
| fi | Finnish | nl | Dutch | no | Norwegian |
| sv | Swedish | `<silence>` | Non-speech | | |
🛠 Model Details
Architecture
- Input: 16 kHz raw audio waveform (mono).
- Encoder: Wav2Vec2 (MMS-1B); the last 6 encoder layers were fine-tuned.
- Head: Linear projection to 20 classes (19 languages + 1 silence); see the sketch below.
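As a rough illustration only (not the packaged implementation), the setup above can be sketched as a Wav2Vec2 encoder with everything frozen except the last few transformer layers, topped by a per-frame linear classifier. The class and parameter names below are hypothetical.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

NUM_CLASSES = 20  # 19 languages + silence

class FrameLID(nn.Module):
    """Sketch: MMS-1B encoder with a per-frame linear LID head."""

    def __init__(self, backbone: str = "facebook/mms-1b-fl102", trainable_layers: int = 6):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(backbone)

        # Freeze everything, then unfreeze only the last N transformer layers.
        for p in self.encoder.parameters():
            p.requires_grad = False
        for layer in self.encoder.encoder.layers[-trainable_layers:]:
            for p in layer.parameters():
                p.requires_grad = True

        self.head = nn.Linear(self.encoder.config.hidden_size, NUM_CLASSES)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        # input_values: (batch, samples) of 16 kHz mono audio
        hidden = self.encoder(input_values).last_hidden_state  # (batch, frames, dim)
        return self.head(hidden)                               # (batch, frames, 20)
```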
Decoding Logic (Viterbi)
This model does not use simple argmax. Instead, it uses Viterbi decoding with a transition scale parameter.
- Transition Penalty: A penalty is applied when the predicted language changes. This discourages the model from changing languages too frequently due to short noise or uncertain frames.
- Result: This produces cleaner, more coherent segmentation compared to standard frame-wise classification (see the sketch below).
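A minimal sketch of the idea, independent of the packaged decoder: run Viterbi over the per-frame log-probabilities with a flat transition matrix that subtracts a fixed penalty whenever the label changes. The function name and penalty value are illustrative.

```python
import numpy as np

def viterbi_smooth(log_probs: np.ndarray, switch_penalty: float = 5.0) -> np.ndarray:
    """Viterbi decoding over frame-wise class log-probabilities.

    log_probs: (frames, classes); switch_penalty: cost subtracted whenever
    the predicted label changes between consecutive frames.
    Returns one smoothed label index per frame.
    """
    n_frames, n_classes = log_probs.shape
    # Transition scores: 0 for staying in the same class, -penalty for switching.
    transition = np.full((n_classes, n_classes), -switch_penalty)
    np.fill_diagonal(transition, 0.0)

    score = log_probs[0].copy()
    backptr = np.zeros((n_frames, n_classes), dtype=np.int64)
    for t in range(1, n_frames):
        cand = score[:, None] + transition      # cand[prev, cur]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_probs[t]

    # Backtrace the best path.
    path = np.empty(n_frames, dtype=np.int64)
    path[-1] = int(score.argmax())
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path
```

With `switch_penalty = 0` this reduces to per-frame argmax; larger values produce longer, more stable segments.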
Training Configuration
- Loss Function: CrossEntropyLoss
- Dataset: FLEURS (Fine-tuned for LID tasks)
- Data Augmentation: Random mixing of multi-language audio segments during training to simulate code-switching scenarios (a toy sketch follows below).
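To make the mixing and the frame-wise loss concrete, the toy sketch below concatenates two single-language clips, builds frame-level targets at an assumed wav2vec2-style frame rate (~1 frame per 20 ms), and applies frame-wise cross entropy. The constants, label IDs, and helper names are assumptions for illustration, not the actual training code.

```python
import torch
import torch.nn.functional as F

FRAMES_PER_SECOND = 50  # wav2vec2-style encoders emit ~1 frame per 20 ms (assumption)

def mix_clips(clip_a, clip_b, label_a, label_b, sample_rate=16000):
    """Concatenate two single-language clips and build frame-level LID targets."""
    audio = torch.cat([clip_a, clip_b])
    frames_a = int(len(clip_a) / sample_rate * FRAMES_PER_SECOND)
    frames_b = int(len(clip_b) / sample_rate * FRAMES_PER_SECOND)
    targets = torch.cat([
        torch.full((frames_a,), label_a, dtype=torch.long),
        torch.full((frames_b,), label_b, dtype=torch.long),
    ])
    return audio, targets

def frame_ce_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """CrossEntropyLoss applied per frame: logits (frames, classes), targets (frames,)."""
    # In practice the encoder's exact frame count may differ slightly from the
    # estimate above, so targets would be trimmed or padded to match the logits.
    return F.cross_entropy(logits, targets)
```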
⚖️ License
This project is licensed under the CC-BY-NC 4.0 License. Based on Meta's MMS Model (CC-BY-NC 4.0).
👨‍💻 Author
- Developed by: N01N9
- Repository: HuggingFace Repo