Ind-QwenTTS

A lightweight multilingual Text-to-Speech system with accent control for English and Gujarati.

Features

  • Multilingual: English + Gujarati
  • Accent Control: Indian & Gujarati accents
  • 4 voices (2 male, 2 female)
  • Accent transfer capability
  • Fast inference with 0.5B parameters

Supported Voices

Speaker ID      Language    Accent      Gender
SPK_EN_M_001    English     Indian      Male
SPK_EN_F_001    English     Indian      Female
SPK_GU_M_001    Gujarati    Gujarati    Male
SPK_GU_F_001    Gujarati    Gujarati    Female

Installation

pip install transformers torch torchaudio snac torchcodec

Usage

import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("AryanNsc/IND-QWENTTS-V1", fix_mistral_regex=True)
model = AutoModelForCausalLM.from_pretrained("AryanNsc/IND-QWENTTS-V1", torch_dtype=torch.bfloat16).to(device).eval()
snac = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to(device).eval()

def generate_speech(text, language="english", accent="indian", gender="M", speaker=None, output_file="output.wav"):
    if speaker is None:
        speaker_map = {
            ("english", "M"): "SPK_EN_M_001",
            ("english", "F"): "SPK_EN_F_001",
            ("gujarati", "M"): "SPK_GU_M_001",
            ("gujarati", "F"): "SPK_GU_F_001"
        }
        speaker = speaker_map.get((language, gender), "SPK_EN_M_001")
    
    prompt = f"<lang>{language}</lang><accent>{accent}</accent><gender>{gender}</gender><speaker>{speaker}</speaker> {text}"
    
    input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(device)
    
    start_tokens = torch.tensor([
        tokenizer.convert_tokens_to_ids("<|endoftext|>"),
        tokenizer.convert_tokens_to_ids("<soh>"),
        tokenizer.convert_tokens_to_ids("<soa>"),
        tokenizer.convert_tokens_to_ids("<sos>")
    ], device=device).unsqueeze(0)
    
    full_input = torch.cat([input_ids, start_tokens], dim=1)
    
    with torch.no_grad():
        output = model.generate(
            full_input,
            max_new_tokens=1500,
            temperature=0.7,
            top_p=0.85,
            repetition_penalty=1.15,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.convert_tokens_to_ids("<eos>")
        )
    
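    # Keep only the newly generated tokens (drop the prompt and start tokens).
    generated_ids = output[0, full_input.shape[1]:]
    
    # Strip the trailing <eos> token if generation stopped on it.
    eos_id = tokenizer.convert_tokens_to_ids("<eos>")
    if len(generated_ids) > 0 and generated_ids[-1] == eos_id:
        generated_ids = generated_ids[:-1]
    
    # Audio is emitted in frames of 7 tokens; drop any incomplete final frame.
    if len(generated_ids) % 7 != 0:
        trunc_len = (len(generated_ids) // 7) * 7
        generated_ids = generated_ids[:trunc_len]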
    
    if len(generated_ids) == 0:
        print("Error: No audio generated.")
        return

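    # One column per frame, one row per position within the frame.
    codes = generated_ids.reshape(-1, 7).T
    
    # Shift token IDs into SNAC codebook indices; this treats the last 4096
    # vocabulary entries as audio codes. Any other ID goes negative and is clamped to 0.
    snac_offset = model.config.vocab_size - 4096
    codes = codes - snac_offset
    codes = torch.clamp(codes, min=0)
    
    # Regroup each frame into SNAC's three levels: 1 coarse, 2 medium, 4 fine codes.
    l1 = codes[0, :]
    l2 = torch.stack([codes[1, :], codes[4, :]], dim=1).flatten()
    l3 = torch.stack([codes[2, :], codes[3, :], codes[5, :], codes[6, :]], dim=1).flatten()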
    
    with torch.inference_mode():
        audio = snac.decode([l1.unsqueeze(0), l2.unsqueeze(0), l3.unsqueeze(0)])
    
    audio_tensor = audio.squeeze(0).cpu()
    torchaudio.save(output_file, audio_tensor, 24000)
    print(f"Saved to {output_file}")

generate_speech(
    text="The competition results will be announced tomorrow morning.",
    language="english",
    accent="indian",
    gender="M",
    output_file="test_english.wav"
)
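
The decoding step in generate_speech regroups every 7-token frame into SNAC's three codebook levels. The sketch below reproduces that regrouping with plain Python lists, so the layout can be checked without loading the model. The helper names `split_frames` and `estimate_seconds` are hypothetical, and the figure of 2048 audio samples per frame is an assumption about the `snac_24khz` codec rather than something this model card states.

```python
FRAME_SIZE = 7  # tokens per frame, as in generate_speech


def split_frames(codes):
    """Regroup a flat list of codes into the three SNAC levels.

    Mirrors the indexing in generate_speech: position 0 goes to level 1,
    positions 1 and 4 to level 2, positions 2, 3, 5 and 6 to level 3.
    Trailing codes that do not fill a whole frame are dropped.
    """
    usable = len(codes) - len(codes) % FRAME_SIZE
    l1, l2, l3 = [], [], []
    for start in range(0, usable, FRAME_SIZE):
        frame = codes[start:start + FRAME_SIZE]
        l1.append(frame[0])
        l2.extend([frame[1], frame[4]])
        l3.extend([frame[2], frame[3], frame[5], frame[6]])
    return l1, l2, l3


def estimate_seconds(num_tokens, samples_per_frame=2048, sample_rate=24000):
    # ASSUMPTION: one 7-token frame decodes to 2048 samples at 24 kHz.
    frames = num_tokens // FRAME_SIZE
    return frames * samples_per_frame / sample_rate


l1, l2, l3 = split_frames(list(range(15)))  # two full frames plus one leftover code
print(l1)  # [0, 7]
print(l2)  # [1, 4, 8, 11]
print(l3)  # [2, 3, 5, 6, 9, 10, 12, 13]
print(round(estimate_seconds(1500), 2))  # 18.26 under the assumption above
```

If the frame-length assumption holds, max_new_tokens=1500 caps a single call at roughly 18 seconds of audio, so longer text would need to be synthesized in several calls.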

Examples

Basic English synthesis:

generate_speech("Hello world, this is a test.", language="english", accent="indian", gender="M")

Gujarati synthesis:

generate_speech("નમસ્તે, તમે કેમ છો?", language="gujarati", accent="gujarati", gender="F")
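
Accent transfer:

The feature list mentions accent transfer, but the examples above pair each language only with its own accent. One plausible reading, and it is only an assumption, is that transfer is requested by combining a language with the other accent tag. The hypothetical helper `build_prompt` below mirrors the prompt string built inside generate_speech so the resulting tags can be inspected without running the model; output quality for this combination is not confirmed by the model card.

```python
def build_prompt(text, language, accent, gender, speaker):
    # Same tag layout as the f-string inside generate_speech.
    return (
        f"<lang>{language}</lang><accent>{accent}</accent>"
        f"<gender>{gender}</gender><speaker>{speaker}</speaker> {text}"
    )


# ASSUMPTION: accent transfer means English text with the Gujarati accent tag.
prompt = build_prompt("Good morning.", "english", "gujarati", "F", "SPK_EN_F_001")
print(prompt)
```

With the function from Usage, the equivalent call would be generate_speech("Good morning.", language="english", accent="gujarati", gender="F").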

Audio Samples

Here are some samples generated by the model.

Description                            Speaker
Indian English, Standard Generation    Male (SPK_EN_M_001)
Indian English, Long Narrative         Female (SPK_EN_F_001)
Gujarati, Native Speech                Female (SPK_GU_F_001)

Parameters

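  • text: Text to synthesize
  • language: "english" or "gujarati" (default "english")
  • accent: "indian" or "gujarati" (default "indian")
  • gender: "M" (male) or "F" (female) (default "M")
  • speaker: Optional speaker ID; if omitted, it is chosen from language and gender
  • output_file: Path of the WAV file to write (default "output.wav")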
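
Note that generate_speech falls back to SPK_EN_M_001 without warning when the language and gender pair is not recognized. A stricter lookup could reject unsupported values before any generation happens. The sketch below is one way to do that; `resolve_speaker` is a hypothetical helper, and the allowed values are taken from the parameter list and the Supported Voices table.

```python
SPEAKERS = {
    ("english", "M"): "SPK_EN_M_001",
    ("english", "F"): "SPK_EN_F_001",
    ("gujarati", "M"): "SPK_GU_M_001",
    ("gujarati", "F"): "SPK_GU_F_001",
}
ACCENTS = {"indian", "gujarati"}


def resolve_speaker(language, gender, accent="indian", speaker=None):
    """Return a speaker ID, raising ValueError on unsupported values."""
    language = language.lower()
    gender = gender.upper()
    if accent.lower() not in ACCENTS:
        raise ValueError(f"unsupported accent: {accent!r}")
    if (language, gender) not in SPEAKERS:
        raise ValueError(f"no voice for language={language!r}, gender={gender!r}")
    if speaker is None:
        return SPEAKERS[(language, gender)]
    if speaker not in SPEAKERS.values():
        raise ValueError(f"unknown speaker ID: {speaker!r}")
    return speaker


print(resolve_speaker("Gujarati", "f", accent="gujarati"))  # SPK_GU_F_001
```

The returned ID can then be passed as the speaker argument of generate_speech.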

Training Code

Training pipeline and scripts will be open-sourced soon.

Citation

@misc{ind-qwentts-2024,
  title={Ind-QwenTTS: Multilingual Accent-Aware TTS},
  author={Aryan Purohit},
  year={2025}
}

Model tree for AryanNsc/IND-QWENTTS-V1

Base model: Qwen/Qwen2.5-0.5B (this model is a fine-tune)