bol-tts-marathi — Kokoro-82M fine-tuned for Marathi

A Marathi (मराठी) text-to-speech fine-tune of hexgrad/Kokoro-82M, trained with the semidark/kokoro-deutsch recipe. Handles pure Marathi and Minglish (Marathi + English code-switching) via a client-side Devanagari-transliteration preprocessor.

Voice catalog (25 voices)

Marathi-trained (4)

| ID | Display | Source | Default speed |
|---|---|---|---|
| mf_asha | Asha (आशा) | Rasa marathi_female | 1.00× |
| mm_vivek | Vivek (विवेक) | Rasa marathi_male | 1.00× |
| mf_mukta | Mukta (मुक्ता) | SPRINGLab female | 0.80× |
| mm_dnyanesh | Dnyanesh (ज्ञानेश) | SPRINGLab male | 0.80× |

Stock-Kokoro crossovers (19)

Stock voicepacks from hexgrad/kokoro.js, used as `ref_s` on this fine-tune. Because v0.2 is a continuation fine-tune, the encoder latent space stays close enough to stock Kokoro's that stock voicepacks plug in directly. Each voice was pre-screened with a peak-amplitude check (peak < 0.95) to filter out voicepacks that clip.

| ID | Display | Source language |
|---|---|---|
| af_heart | Svara (स्वरा) | US English F |
| af_alloy | Anvita (अन्विता) | US English F |
| af_aoede | Sanika (सानिका) | US English F |
| af_bella | Naina (नैना) | US English F |
| af_jessica | Ishani (ईशानी) | US English F |
| af_nova | Tara (तारा) | US English F |
| af_sarah | Kavya (काव्या) | US English F |
| af_sky | Akasha (आकाशा) | US English F |
| am_liam | Atharv (अथर्व) | US English M |
| bf_isabella | Ira (इरा) | UK English F |
| bm_fable | Aaryan (आर्यन) | UK English M |
| ff_siwis | Esha (ईशा) | French F |
| hm_omega | Vihaan (विहान) | Hindi M |
| im_nicola | Niraj (निरज) | Italian M |
| pf_dora | Rhea (रिया) | Portuguese F |
| zf_xiaoni | Nyra (नयरा) | Mandarin F |
| zf_xiaoxiao | Pari (परी) | Mandarin F (kid) |
| zf_xiaoyi | Vir (वीर) | Mandarin F (perceived M kid) |
| zm_yunyang | Aakash (आकाश) | Mandarin M |
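The peak screen described above is a one-liner per voice; a minimal sketch of the idea (not the repo's actual screening script), where `render` is a hypothetical stand-in for a pipeline call that synthesizes a fixed test sentence with a candidate voicepack:

```python
import numpy as np

def passes_peak_screen(audio: np.ndarray, threshold: float = 0.95) -> bool:
    """True if the rendered audio stays below the clipping threshold."""
    return float(np.abs(audio).max()) < threshold

def screen_voicepacks(render, candidate_ids):
    """Keep only candidate voice IDs whose test render does not clip."""
    return [vid for vid in candidate_ids if passes_peak_screen(render(vid))]
```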

Synthetic — generated arithmetically with no reference audio (2)

| ID | Display | Recipe |
|---|---|---|
| syn_sama | Sama (समा) | Centroid (mean) of 5 modern English female voicepacks |
| syn_navya | Navya (नव्या) | Centroid + per-position Gaussian noise (1σ) |

The voicepack tensor [510, 1, 256] is a plain embedding — it can be constructed by averaging existing voicepacks, sampling near the centroid, or interpolating. See voicepack zoo in the repo for recipes.
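Both synthetic recipes reduce to a few tensor ops; a minimal sketch, assuming the source voicepacks are already loaded as [510, 1, 256] tensors. Function names are illustrative, and taking 1σ as the per-position standard deviation across the source packs is an assumption — the repo's voicepack zoo may define it differently:

```python
import torch

def centroid_voice(packs: list[torch.Tensor]) -> torch.Tensor:
    """syn_sama-style: elementwise mean of a list of [510, 1, 256] voicepacks."""
    return torch.stack(packs).mean(dim=0)

def jittered_voice(packs: list[torch.Tensor], generator=None) -> torch.Tensor:
    """syn_navya-style: centroid plus Gaussian noise, scaled per position by the
    std across the source packs (an assumed reading of "1σ")."""
    stacked = torch.stack(packs)
    sigma = stacked.std(dim=0)
    noise = torch.randn(stacked.shape[1:], generator=generator)
    return stacked.mean(dim=0) + noise * sigma
```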

Usage

```python
import torch, soundfile as sf
from kokoro import KModel, KPipeline
import kokoro.pipeline as _kp

_kp.LANG_CODES["m"] = "mr"  # monkey-patch Marathi lang code

kmodel = KModel(
    repo_id="shreyask/bol-tts-marathi",
    config="config.json",
    model="kokoro-mr-v0_2.pth",
)
kmodel.train(False)  # inference mode

pipeline = KPipeline(lang_code="m", repo_id="shreyask/bol-tts-marathi", model=kmodel)
voice = torch.load("voices/mf_asha.pt", map_location="cpu", weights_only=True)

text = "नमस्कार, मी मराठी बोलतो."
chunks = [audio for _gs, _ps, audio in pipeline(text, voice=voice, speed=1.0)]

sf.write("out.wav", (chunks[0] if len(chunks) == 1 else torch.cat(chunks)).numpy(), 24000)
```

Minglish (loanword) handling

For Marathi mixed with English ("Friday ला Zomato वर dinner order करूया का?"), use the loanword preprocessor first to transliterate Latin tokens to Devanagari before phonemization:

```python
from preprocess_loanwords import preprocess

text = preprocess("Friday ला Zomato वर dinner order करूया का?")
# → "फ्रायडे ला झोमॅटो वर डिनर ऑर्डर करूया का?"
# Then feed `text` to the pipeline as usual.
```

Source + ~19,500-entry lookup table: scripts/preprocess_loanwords.py and data/loanword_map.json.
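The lookup step itself is simple; a minimal sketch of the idea (not the repo's implementation), assuming a dict mapping lowercase Latin tokens to Devanagari:

```python
import re

def transliterate_loanwords(text: str, loanword_map: dict[str, str]) -> str:
    """Replace each Latin-script token with its Devanagari mapping if known;
    unknown tokens pass through unchanged (espeak-mr handles them downstream)."""
    def repl(m: re.Match) -> str:
        return loanword_map.get(m.group(0).lower(), m.group(0))
    return re.sub(r"[A-Za-z]+", repl, text)
```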

Per-voice timestamps

Kokoro predicts per-phoneme durations. KModel.forward_with_tokens returns (audio, pred_dur). pred_dur is in predictor frames where 1 frame = 600 audio samples at 24 kHz (the prosody predictor runs at half the mel-frame rate; the decoder upsamples 2× before iSTFT):

```python
audio, pred_dur = kmodel.forward_with_tokens(input_ids, ref_s, speed=1.0)
durations_sec = pred_dur.squeeze().cpu().numpy() * 600 / 24000
starts = durations_sec.cumsum() - durations_sec
# (starts[i], starts[i] + durations_sec[i]) is the time span of phoneme[i]
```
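The frame-to-seconds arithmetic above can be packaged as a small helper; a sketch with a toy duration vector (`phoneme_spans` is not part of the repo):

```python
import numpy as np

FRAME_SAMPLES = 600     # one predictor frame = 600 audio samples
SAMPLE_RATE = 24_000    # Kokoro output rate

def phoneme_spans(pred_dur: np.ndarray) -> list[tuple[float, float]]:
    """Convert per-phoneme frame counts to (start_sec, end_sec) spans."""
    secs = pred_dur.astype(np.float64) * FRAME_SAMPLES / SAMPLE_RATE
    ends = secs.cumsum()
    return list(zip((ends - secs).tolist(), ends.tolist()))
```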

Training

| Phase | Details |
|---|---|
| Base | hexgrad/Kokoro-82M |
| Stage 1 | 10 epochs, bs=12, fp32, ~9 h on A100 SXM 80 GB; final val_loss ≈ 0.23 |
| Stage 2 | 10 epochs, bs=8, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, ~13 h |
| Train utts | 24,676 (95/5 split) |
| Speakers | 331 (2 Rasa + 329 IndicVoices-R) + SPRINGLab IndicTTS-Marathi (single F + single M) |
| Vocab change | Added ɭ (U+026D, retroflex lateral approximant) at Kokoro slot 144, a Marathi phoneme absent from Hindi |

Full methodology: TRAINING_GUIDE.md.

Limitations

  • Pure-English-only sentences — the decoder hallucinates Marathi acoustics if you don't give it any Devanagari context. The Minglish trick handles mixed input via Devanagari transliteration; pure English needs a different fallback.
  • Long-tail loanwords — the 19,500-entry map covers high-frequency English words in Indian usage; rarer words fall through to espeak-mr unchanged.
  • Decoder English-leakage is accidental, not designed — v0.2's decoder happens to render /ɟʰ/ (Devanagari झ) with an English-flavored /z/ quality, so "amazing" → अमेझिंग comes out audibly as "amazing." The follow-up v0.5 retraining lost this property by modeling Marathi more faithfully; v0.6 is planned to preserve the leakage deliberately.
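One possible guard for the first limitation (an assumption on my part, not part of the repo): detect input that contains Latin letters but no Devanagari at all, and route it to a separate fallback instead of the pipeline:

```python
import re

DEVANAGARI = re.compile(r"[\u0900-\u097F]")

def needs_english_fallback(text: str) -> bool:
    """True when the text has Latin letters but no Devanagari context,
    i.e. the Minglish transliteration trick alone won't anchor the decoder."""
    return bool(re.search(r"[A-Za-z]", text)) and not DEVANAGARI.search(text)
```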

License

Apache 2.0. Training data under their respective licenses (Rasa CC-BY-4.0, IndicVoices-R CC-BY-4.0, SPRINGLab IITM EULA).

Citation

```bibtex
@software{bol_tts_marathi_2026,
  title={bol-tts-marathi: Kokoro-82M fine-tuned for Marathi},
  author={Karnik, Shreyas},
  year={2026},
  url={https://github.com/shreyaskarnik/bol-tts-marathi},
  license={Apache-2.0}
}
@software{kokoro_2025,
  title={Kokoro-82M},
  author={hexgrad},
  year={2025},
  url={https://github.com/hexgrad/kokoro}
}
@software{kokoro_deutsch_2026,
  title={kokoro-deutsch},
  author={semidark},
  year={2026},
  url={https://github.com/semidark/kokoro-deutsch}
}
```