Ganda NeMo (Experimental): Luganda TTS

A research-preview Luganda text-to-speech system built on NVIDIA NeMo, targeting on-device and edge deployment on mobile phones. Ships as two NeMo checkpoints: a FastPitch acoustic model and a HiFi-GAN vocoder.

Status: experimental. Released as-is for research and evaluation.

Model summary

Component        File                      Architecture                               Size
Acoustic model   luganda_fastpitch.nemo    FastPitch (FFTransformer, 6L, d=384)       187 MB
Vocoder          luganda_hifigan.nemo      HiFi-GAN v1 (upsample [8,8,2,2], 512 ch)   339 MB
  • Language: Luganda (ISO 639-1 lg)
  • Sample rate: 22,050 Hz
  • Mel config: 80 bins, n_fft=1024, hop=256, win=1024, range 0–8000 Hz
  • License: Apache-2.0
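A quick sanity check on the mel configuration above: at 22,050 Hz with a hop of 256 samples, each mel frame covers roughly 11.6 ms, so the acoustic model emits about 86 frames per second of speech. The arithmetic:

```python
# Frame arithmetic implied by the mel configuration above.
SAMPLE_RATE = 22_050
HOP_LENGTH = 256

frames_per_second = SAMPLE_RATE / HOP_LENGTH   # ~86.1 mel frames per second
frame_ms = 1000 * HOP_LENGTH / SAMPLE_RATE     # ~11.6 ms of audio per frame

print(f"{frames_per_second:.1f} frames/s, {frame_ms:.1f} ms/frame")
```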

Intended use

  • Research on low-resource African-language TTS.
  • Prototyping on-device / edge Luganda voice output on mobile (primary deployment target).
  • Benchmarking and comparison against other Luganda / Bantu-language TTS systems.

Training data

Trained on the Luganda subset of the Sunbird SALT corpus, a mixed male/female multi-speaker dataset. Approximately 2,380 clips (~2.69 hours) were used.
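For a sense of scale, the figures above imply an average clip length of about four seconds:

```python
# Average clip length implied by the training-data figures above.
clips = 2380
hours = 2.69
avg_seconds = hours * 3600 / clips
print(f"~{avg_seconds:.2f} s per clip")  # ~4.07 s per clip
```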

Model architecture & training

FastPitch (acoustic model)

  • FFTransformer encoder/decoder: 6 layers, 1 head, d_model=384, d_inner=1536, dropout 0.1
  • Duration + pitch predictors: 2-layer temporal predictors, filter size 256
  • Learned alignment (learn_alignment: true), bin-loss warmup 100 epochs
  • Pitch stats (z-score): ΞΌ=190.65 Hz, Οƒ=51.25 Hz
  • Optimizer: Adam, lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-6, batch size 24
  • Training steps: ~20,000
  • NeMo version at training time: 1.8.0rc0
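The pitch statistics above are the mean and standard deviation used to z-score-normalize F0 during training. A minimal sketch of that normalization (the function names here are illustrative, not NeMo API):

```python
# Z-score pitch normalization with the stats from the model card.
PITCH_MEAN = 190.65  # Hz (mu)
PITCH_STD = 51.25    # Hz (sigma)

def normalize_pitch(f0_hz: float) -> float:
    """Map an F0 value in Hz to z-score units relative to the corpus stats."""
    return (f0_hz - PITCH_MEAN) / PITCH_STD

def denormalize_pitch(z: float) -> float:
    """Map a z-score back to Hz."""
    return z * PITCH_STD + PITCH_MEAN

# A frame at 241.90 Hz sits exactly one sigma above the corpus mean.
assert abs(normalize_pitch(241.90) - 1.0) < 1e-9
```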

HiFi-GAN (vocoder)

  • Generator: upsample rates [8, 8, 2, 2], kernel sizes [16, 16, 4, 4], initial channels 512
  • Resblock type 1, kernel sizes [3, 7, 11], dilations [[1,3,5], [1,3,5], [1,3,5]]
  • Optimizer: AdamW, lr=2e-4, betas=(0.8, 0.99), batch size 16
  • Training steps: ~20,000
  • NeMo version at training time: 1.23.0
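Note that the generator's upsampling rates multiply out to exactly the mel hop length, which is what lets HiFi-GAN expand each mel frame into 256 waveform samples:

```python
from math import prod

UPSAMPLE_RATES = [8, 8, 2, 2]  # from the generator config above
HOP_LENGTH = 256               # mel hop length from the model summary

# The product of the upsample rates must equal the hop length.
assert prod(UPSAMPLE_RATES) == HOP_LENGTH

frames = 100
samples = frames * prod(UPSAMPLE_RATES)  # 25,600 samples, ~1.16 s at 22,050 Hz
```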

Limitations

Text frontend: English G2P. Luganda does not yet have a mature open-source grapheme-to-phoneme (G2P) resource, so this FastPitch uses NeMo's EnglishPhonemesTokenizer with EnglishG2p (CMUdict) as its text frontend. This is a stopgap that works tolerably because Luganda's orthography is largely phonemic. Building a proper Luganda phonemizer is an obvious next step, and contributions are welcome.

Single output voice. The model is trained on multi-speaker data but emits one averaged voice. No speaker conditioning is available at inference.

Low-resource training. ~2.7 hours of speech is small for TTS. Expect audible artifacts, uneven prosody on long sentences, and reduced robustness on numerals, code-switched English, and out-of-distribution domains.

Text normalization. The packaged text normalizer is nemo_text_processing.text_normalization.Normalizer with lang: en. Non-trivial Luganda text normalization (e.g., number reading, abbreviations) is not handled; pre-normalize input in your pipeline.
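A minimal sketch of such a pre-normalization step, reading digit strings out digit by digit. The Luganda digit words below are an assumption for illustration and should be verified with a Luganda speaker before any real use:

```python
import re

# Illustrative digit words (assumed, not from the model card); verify before use.
DIGIT_WORDS = {
    "0": "zeero", "1": "emu", "2": "bbiri", "3": "ssatu", "4": "nnya",
    "5": "ttaano", "6": "mukaaga", "7": "musanvu", "8": "munaana", "9": "mwenda",
}

def prenormalize(text: str) -> str:
    """Replace each digit run with space-separated digit words (digit-by-digit reading)."""
    def expand(match: re.Match) -> str:
        return " ".join(DIGIT_WORDS[d] for d in match.group(0))
    return re.sub(r"\d+", expand, text)

print(prenormalize("Essaawa 10"))  # -> "Essaawa emu zeero"
```

A production normalizer would also need grammatical number agreement, dates, currency, and abbreviation expansion; this sketch only removes raw digits, which the English frontend cannot verbalize in Luganda.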

Usage

from nemo.collections.tts.models import FastPitchModel, HifiGanModel
import soundfile as sf

# Restore both checkpoints and switch to inference mode
fastpitch = FastPitchModel.restore_from("luganda_fastpitch.nemo").eval()
hifigan = HifiGanModel.restore_from("luganda_hifigan.nemo").eval()

text = "Oli otya?"
parsed = fastpitch.parse(text)                                  # text -> token IDs
spectrogram = fastpitch.generate_spectrogram(tokens=parsed)     # tokens -> mel spectrogram
audio = hifigan.convert_spectrogram_to_audio(spec=spectrogram)  # mel -> waveform

sf.write("out.wav", audio.to("cpu").detach().numpy().squeeze(), 22050)

Loading from the Hub

from huggingface_hub import hf_hub_download
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

fp = hf_hub_download("Cal3bd3v/ganda-nemo-experimental", "luganda_fastpitch.nemo")
hg = hf_hub_download("Cal3bd3v/ganda-nemo-experimental", "luganda_hifigan.nemo")

fastpitch = FastPitchModel.restore_from(fp)
hifigan = HifiGanModel.restore_from(hg)

Edge / on-device deployment

The primary deployment target is mobile. Suggested paths:

  • Export FastPitch and HiFi-GAN to ONNX via NeMo's exporter, then run with ONNX Runtime Mobile / ExecuTorch.
  • HiFi-GAN dominates runtime; a distilled / smaller vocoder (e.g., HiFi-GAN v3 or iSTFTNet) is recommended for phone-class CPUs.
  • Streaming is possible by chunking mel output and vocoding incrementally; latency budget depends on device.
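For the streaming path, the latency budget falls out of the mel frame rate. With a hypothetical 50-frame chunk (the chunk size is an assumption to be tuned per device):

```python
# Back-of-envelope streaming latency for chunked vocoding.
SAMPLE_RATE = 22_050
HOP_LENGTH = 256       # matches the mel config and the HiFi-GAN upsampling product
CHUNK_FRAMES = 50      # hypothetical chunk size; tune for the target device

chunk_audio_seconds = CHUNK_FRAMES * HOP_LENGTH / SAMPLE_RATE
print(f"Each {CHUNK_FRAMES}-frame chunk yields {chunk_audio_seconds * 1000:.0f} ms of audio")
# Real-time streaming requires the vocoder to process each chunk in less
# than chunk_audio_seconds of wall-clock time on the phone CPU.
```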

Quantization, pruning, and distillation have not been applied in this release.

Ethical considerations

  • The speaker(s) in the Sunbird SALT corpus consented to the original dataset's terms; downstream use must respect those terms.
  • Because the model emits a synthesized Luganda voice, downstream applications should disclose synthetic speech to end users where appropriate (accessibility, consent, anti-impersonation).

Attribution

License

Apache-2.0. Use of the model must also comply with the license of the Sunbird SALT dataset used for training.
