# Ind-QwenTTS

A lightweight multilingual Text-to-Speech system with accent control for English and Gujarati.
## Features
- Multilingual: English + Gujarati
- Accent Control: Indian & Gujarati accents
- 4 voices (2 male, 2 female)
- Accent transfer capability
- Fast inference with 0.5B parameters
## Supported Voices

| Speaker ID | Language | Accent | Gender |
|---|---|---|---|
| SPK_EN_M_001 | English | Indian | Male |
| SPK_EN_F_001 | English | Indian | Female |
| SPK_GU_M_001 | Gujarati | Gujarati | Male |
| SPK_GU_F_001 | Gujarati | Gujarati | Female |
## Installation

```shell
pip install transformers torch torchaudio snac torchcodec
```
## Usage

```python
import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer, the TTS language model, and the SNAC audio codec.
tokenizer = AutoTokenizer.from_pretrained("AryanNsc/IND-QWENTTS-V1", fix_mistral_regex=True)
model = AutoModelForCausalLM.from_pretrained("AryanNsc/IND-QWENTTS-V1", torch_dtype=torch.bfloat16).to(device).eval()
snac = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to(device).eval()

def generate_speech(text, language="english", accent="indian", gender="M", speaker=None, output_file="output.wav"):
    # Pick a default speaker for the (language, gender) pair if none is given.
    if speaker is None:
        speaker_map = {
            ("english", "M"): "SPK_EN_M_001",
            ("english", "F"): "SPK_EN_F_001",
            ("gujarati", "M"): "SPK_GU_M_001",
            ("gujarati", "F"): "SPK_GU_F_001",
        }
        speaker = speaker_map.get((language, gender), "SPK_EN_M_001")

    # Condition the model with control tags followed by the text to speak.
    prompt = f"<lang>{language}</lang><accent>{accent}</accent><gender>{gender}</gender><speaker>{speaker}</speaker> {text}"
    input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(device)

    # Append the special tokens that mark the start of audio generation.
    start_tokens = torch.tensor([
        tokenizer.convert_tokens_to_ids("<|endoftext|>"),
        tokenizer.convert_tokens_to_ids("<soh>"),
        tokenizer.convert_tokens_to_ids("<soa>"),
        tokenizer.convert_tokens_to_ids("<sos>"),
    ], device=device).unsqueeze(0)
    full_input = torch.cat([input_ids, start_tokens], dim=1)

    with torch.no_grad():
        output = model.generate(
            full_input,
            max_new_tokens=1500,
            temperature=0.7,
            top_p=0.85,
            repetition_penalty=1.15,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.convert_tokens_to_ids("<eos>"),
        )

    # Keep only the newly generated audio tokens and strip a trailing <eos>.
    generated_ids = output[0, full_input.shape[1]:]
    eos_id = tokenizer.convert_tokens_to_ids("<eos>")
    if len(generated_ids) > 0 and generated_ids[-1] == eos_id:
        generated_ids = generated_ids[:-1]

    # SNAC codes come in 7-token frames; drop any incomplete trailing frame.
    if len(generated_ids) % 7 != 0:
        trunc_len = (len(generated_ids) // 7) * 7
        generated_ids = generated_ids[:trunc_len]
    if len(generated_ids) == 0:
        print("Error: No audio generated.")
        return

    # Map token IDs back to SNAC codebook indices.
    codes = generated_ids.reshape(-1, 7).T
    snac_offset = model.config.vocab_size - 4096
    codes = codes - snac_offset
    codes = torch.clamp(codes, min=0)

    # De-interleave the 7 codes per frame into SNAC's three codebook layers.
    l1 = codes[0, :]
    l2 = torch.stack([codes[1, :], codes[4, :]], dim=1).flatten()
    l3 = torch.stack([codes[2, :], codes[3, :], codes[5, :], codes[6, :]], dim=1).flatten()

    with torch.inference_mode():
        audio = snac.decode([l1.unsqueeze(0), l2.unsqueeze(0), l3.unsqueeze(0)])

    audio_tensor = audio.squeeze(0).cpu()
    torchaudio.save(output_file, audio_tensor, 24000)
    print(f"Saved to {output_file}")

generate_speech(
    text="The competition results will be announced tomorrow morning.",
    language="english",
    accent="indian",
    gender="M",
    output_file="test_english.wav",
)
```
## Examples

Basic English synthesis:

```python
generate_speech("Hello world, this is a test.", language="english", accent="indian", gender="M")
```

Gujarati synthesis:

```python
# The text reads "Hello, how are you?"
generate_speech("નમસ્તે, તમે કેમ છો?", language="gujarati", accent="gujarati", gender="F")
```
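Both examples ultimately condition the model on the same tagged control prefix. A standalone sketch of that prompt format (`build_prompt` is a hypothetical helper that reproduces the f-string used inside `generate_speech`):

```python
def build_prompt(text, language, accent, gender, speaker):
    # Control tags come first, then a space and the text to synthesize.
    return (f"<lang>{language}</lang><accent>{accent}</accent>"
            f"<gender>{gender}</gender><speaker>{speaker}</speaker> {text}")

print(build_prompt("Hello", "english", "indian", "M", "SPK_EN_M_001"))
# <lang>english</lang><accent>indian</accent><gender>M</gender><speaker>SPK_EN_M_001</speaker> Hello
```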
## Audio Samples

Here are some samples generated by the model.

| Description | Speaker | Audio |
|---|---|---|
| Indian English Standard Generation | Male (SPK_EN_M_001) | |
| Indian English Long Narrative | Female (SPK_EN_F_001) | |
| Gujarati Native Speech | Female (SPK_GU_F_001) | |
## Parameters

- `text`: Text to synthesize
- `language`: `"english"` or `"gujarati"`
- `accent`: `"indian"` or `"gujarati"`
- `gender`: `"M"` (male) or `"F"` (female)
- `speaker`: Optional specific speaker ID (auto-selected if not provided)
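The `speaker` auto-selection can be sketched on its own; `default_speaker` below is a hypothetical helper that reproduces the fallback table from `generate_speech`:

```python
def default_speaker(language, gender):
    # (language, gender) -> speaker ID; unknown pairs fall back to the
    # default English male voice.
    table = {
        ("english", "M"): "SPK_EN_M_001",
        ("english", "F"): "SPK_EN_F_001",
        ("gujarati", "M"): "SPK_GU_M_001",
        ("gujarati", "F"): "SPK_GU_F_001",
    }
    return table.get((language, gender), "SPK_EN_M_001")

print(default_speaker("gujarati", "F"))  # SPK_GU_F_001
```

Passing an explicit `speaker` bypasses this table entirely, which is how accent transfer works: pair any supported speaker ID with a different `language`/`accent` tag.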
## Training Code

The training pipeline and scripts will be open-sourced soon.
## Citation

```bibtex
@misc{ind-qwentts-2025,
  title={Ind-QwenTTS: Multilingual Accent-Aware TTS},
  author={Aryan Purohit},
  year={2025}
}
```
## Base Model

`AryanNsc/IND-QWENTTS-V1` is built on [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B).