# LAU-Soloni 114M (MSE Semantic Anchor, λ=1)
lau-soloni-114m-mse-k1 is an end-to-end Speech Translation (ST) model that incorporates Listen, Attend, Understand (LAU) semantic regularization. It translates Bambara audio directly into French text. Unlike standard ST models, it uses a semantic anchor during training to stabilize the acoustic encoder against high-variance "amateur" labels.
## 🚨 Important Note

This model is a research artifact focused on semantic stability in low-resource, high-variance settings. As noted in the associated research, it was trained on "amateur" translations that exhibit high variance. Users should expect:
- High performance on semantic intent but potential orthographic mistakes in the French output.
- Better performance using the CTC decoding branch for this specific checkpoint.
## NVIDIA NeMo: Custom Model Class

To use this model, you must use the custom `HybridRNNTCTCLAUModel` class, which extends the standard NeMo `EncDecHybridRNNTCTCBPEModel` to support the semantic loss and projection-head integration.

The full implementation of this class, along with training and evaluation scripts, is available in our Anonymous GitHub repository.

```bash
pip install "nemo-toolkit[asr]"
# Ensure the custom LAU model class from our repository is on your Python path.
```
## How to Use This Model

### Load Model

```python
import nemo.collections.asr as nemo_asr

# Load the custom LAU-regularized model
st_model = nemo_asr.models.HybridRNNTCTCLAUModel.from_pretrained(model_name="anonymousnowhere/lau-soloni-114m-mse-k1")
```
### Translate Audio (CTC Recommended)

```python
# Switch to the CTC decoding branch (recommended for this checkpoint)
ctc_decoding_cfg = st_model.cfg.aux_ctc.decoding
st_model.change_decoding_strategy(decoder_type='ctc', decoding_cfg=ctc_decoding_cfg)

# Translate
st_model.transcribe(['bambara_sample.wav'])
```
## Model Architecture

This model features a FastConformer encoder. A projection head is attached to the encoder's output and is used only during training: it regularizes the encoder with a Mean Squared Error (MSE) loss against a frozen, high-resource semantic text embedding, "anchoring" the acoustic features to a known linguistic space.
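The anchoring idea above can be sketched numerically. This is an illustrative NumPy sketch only, not the repository's implementation: the mean pooling, the shapes, and the `W_proj` linear head are assumptions made for the example.

```python
import numpy as np

def semantic_anchor_loss(encoder_out, W_proj, text_emb):
    """MSE between the pooled, projected acoustic features and a frozen
    sentence embedding of the reference translation (the semantic anchor)."""
    # encoder_out: (T, d_enc) acoustic frames from the encoder
    # W_proj:      (d_enc, d_sem) trainable projection head (training-only)
    # text_emb:    (d_sem,) frozen embedding from a high-resource text encoder
    pooled = encoder_out.mean(axis=0)      # (d_enc,) utterance-level vector
    projected = pooled @ W_proj            # (d_sem,) mapped into the semantic space
    return float(np.mean((projected - text_emb) ** 2))

# Toy shapes: 50 frames, 512-d encoder, 768-d semantic space
rng = np.random.default_rng(0)
enc = rng.normal(size=(50, 512))
W = rng.normal(scale=0.01, size=(512, 768))
anchor = rng.normal(size=768)
loss = semantic_anchor_loss(enc, W, anchor)
```

During training this term would be added to the translation loss (here with λ=1); at inference the projection head is discarded.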
## Training

The training followed the LAU framework:
- Pre-training: Initialized from soloni-114m-tdt-ctc-v0.
- Semantic Regularization: Fine-tuned on Jeli-ASR (30h) using a dual objective: the standard translation loss plus the MSE semantic auxiliary loss.
- Hyperparameters: AdamW optimizer, Noam scheduler, 1,000-step warmup, and a peak LR of 0.001.
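For reference, a Noam schedule with these settings (1,000 warmup steps, peak LR of 0.001) can be sketched as below. Normalizing the scale so the learning rate peaks exactly at `peak_lr` when warmup ends is one common convention, assumed here.

```python
def noam_lr(step: int, warmup: int = 1000, peak_lr: float = 1e-3) -> float:
    """Noam schedule: linear warmup to peak_lr over `warmup` steps,
    then inverse-square-root decay."""
    step = max(step, 1)
    scale = peak_lr * warmup ** 0.5  # chosen so lr == peak_lr at step == warmup
    return scale * min(step ** -0.5, step * warmup ** -1.5)

# LR rises during warmup and decays afterwards
lrs = [noam_lr(s) for s in (100, 1000, 10000)]
```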
## Dataset

The model was trained on Jeli-ASR, a corpus of ~30 hours of Bambara speech. The translations are "semi-professional": a significant portion was provided by native speakers without formal linguistic training, creating the high-variance environment that LAU is designed to handle.
## Evaluation

Performance is measured on the Jeli-ASR test set using Word Error Rate (WER), Character Error Rate (CER), and BLEU.
| Benchmark | Decoding | WER (%) ↓ | CER (%) ↓ | BLEU ↑ |
|---|---|---|---|---|
| Jeli-ASR Test | CTC | 76.08 | 58.64 | 14.29 |
| Jeli-ASR Test | TDT | 85.27 | 69.60 | 7.45 |
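For readers reproducing these numbers: WER and CER are both Levenshtein edit distances normalized by reference length, computed over words and characters respectively. A minimal self-contained sketch follows; the repository's evaluation scripts may apply different text normalization.

```python
def edit_distance(ref, hyp):
    """Standard dynamic-programming Levenshtein distance over token lists."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def wer(ref: str, hyp: str) -> float:
    r = ref.split()
    return edit_distance(r, hyp.split()) / max(len(r), 1)

def cer(ref: str, hyp: str) -> float:
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

# 1 substitution (chat -> chien) + 1 deletion (ici) over 4 reference words
example_wer = wer("le chat dort ici", "le chien dort")
```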
## License

This model is released under the CC-BY-4.0 license.
Base model: nvidia/parakeet-tdt_ctc-110m