LAU-Soloni 114M (MSE Semantic Anchor, λ=1)

lau-soloni-114m-mse-k1 is an end-to-end Speech Translation (ST) model that incorporates Listen, Attend, Understand (LAU) semantic regularization. It translates Bambara audio directly into French text. Unlike standard ST models, it uses a semantic anchor during training to stabilize the acoustic encoder against high-variance "amateur" labels.

🚨 Important Note

This model is a research artifact focused on semantic stability in low-resource, high-variance settings. As noted in the associated research, it was trained on "amateur" translations, which exhibit high variance. Users should expect:

  • High performance on semantic intent but potential orthographic mistakes in the French output.
  • Better performance using the CTC decoding branch for this specific checkpoint.

NVIDIA NeMo: Custom Model Class

To use this model, you must use the custom HybridRNNTCTCLAUModel class, which overrides the standard NeMo EncDecHybridRNNTCTCBPEModel to support the semantic loss and head integration.

The full implementation of this class, along with training and evaluation scripts, is available on our Anonymous GitHub.

pip install nemo-toolkit['asr']
# Ensure you have the custom LAU model class from our repository in your python path.

How to Use This Model

Load Model

import nemo.collections.asr as nemo_asr
# Loading the custom LAU-regularized model
st_model = nemo_asr.models.HybridRNNTCTCLAUModel.from_pretrained(model_name="anonymousnowhere/lau-soloni-114m-mse-k1")

Translate Audio (CTC Recommended)

# Switch to the CTC decoding branch (recommended for this checkpoint)
ctc_decoding_cfg = st_model.cfg.aux_ctc.decoding
st_model.change_decoding_strategy(decoder_type='ctc', decoding_cfg=ctc_decoding_cfg)

# Translate
st_model.transcribe(['bambara_sample.wav'])

Model Architecture

This model features a FastConformer encoder. A projection head is attached to the encoder's output; this head is used only during training, where a Mean Squared Error (MSE) loss regularizes it against a frozen, high-resource semantic text embedding. This "anchors" the acoustic features to a known linguistic space.
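As a minimal, framework-free sketch of the auxiliary objective described above: the projected acoustic representation is pulled toward the frozen text embedding via MSE. The function and variable names here are illustrative assumptions, not the repository's actual implementation.

```python
# Illustrative sketch of the LAU semantic-anchor term (pure Python).
# `acoustic_features` stands in for the pooled output of the projection head;
# `text_embedding` for the frozen high-resource sentence embedding of the
# reference translation. Both names are assumptions for illustration.

def mse_anchor_loss(acoustic_features, text_embedding):
    """Mean squared error between the projected acoustic vector and the
    frozen semantic text embedding (the auxiliary term, weighted by lambda=1)."""
    assert len(acoustic_features) == len(text_embedding)
    return sum((a - t) ** 2 for a, t in zip(acoustic_features, text_embedding)) / len(acoustic_features)

# The total training loss would then combine the translation loss with this
# anchor term: loss = translation_loss + lambda * anchor, with lambda = 1 here.
```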

Training

The training followed the LAU framework:

  1. Pre-training: Initialized from soloni-114m-tdt-ctc-v0.
  2. Semantic Regularization: Fine-tuned on Jeli-ASR (30h) using a dual objective: the standard translation loss plus an MSE semantic auxiliary loss.
  3. Hyperparameters: AdamW optimizer, Noam scheduler, 1,000-step warmup, and a peak LR of 0.001.
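The warmup-then-decay shape of the schedule above can be sketched in a few lines. Note this is an illustration of the Noam-style curve only: linear warmup to the peak learning rate over 1,000 steps, then inverse-square-root decay. NeMo's actual Noam scheduler is parameterized by the model dimension, so the constants here are assumptions chosen to match the stated peak LR.

```python
# Sketch of a Noam-style schedule: linear warmup to PEAK_LR over
# WARMUP_STEPS, then inverse-square-root decay. Values from the model card.

PEAK_LR = 1e-3
WARMUP_STEPS = 1000

def lr_at(step):
    """Learning rate at a given (1-indexed) optimizer step."""
    step = max(step, 1)
    return PEAK_LR * min(step / WARMUP_STEPS, (WARMUP_STEPS / step) ** 0.5)
```

The peak of 0.001 is reached exactly at step 1,000, after which the rate decays as 1/sqrt(step).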

Dataset

The model was trained on Jeli-ASR, a corpus of ~30 hours of Bambara speech. The translations are "semi-professional," with a significant portion provided by native speakers without formal linguistic training, creating the high-variance environment that LAU is designed to handle.

Evaluation

Performance is measured on the Jeli-ASR test set using Word Error Rate (WER), Character Error Rate (CER), and BLEU.
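For reference, WER is the word-level Levenshtein (edit) distance between reference and hypothesis, normalized by reference length; CER is the same computation at the character level. A self-contained illustration (not the exact scorer used for the reported numbers):

```python
# Word Error Rate via word-level edit distance (Levenshtein).
# Counts substitutions, insertions, and deletions, normalized by
# the number of reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```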

| Benchmark     | Decoding | WER (%) ↓ | CER (%) ↓ | BLEU ↑ |
|---------------|----------|-----------|-----------|--------|
| Jeli-ASR Test | CTC      | 76.08     | 58.64     | 14.29  |
| Jeli-ASR Test | TDT      | 85.27     | 69.60     | 7.45   |

License

This model is released under the CC-BY-4.0 license.
