---
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
license: mit
title: VoiceAPI
tags:
- tts
- text-to-speech
- indian-languages
- vits
- multilingual
- speech-synthesis
language:
- hi
- bn
- mr
- te
- kn
- en
- bho
- mai
- mag
- hne
- gu
---

# 🎙️ VoiceAPI - Multi-lingual Indian Language TTS

An advanced **multi-speaker, multilingual text-to-speech (TTS) synthesizer** supporting 11 languages (10 Indian languages plus English) with 21 voice options.

## 🌟 Features

- **11 Languages**: Hindi, Bengali, Marathi, Telugu, Kannada, Gujarati, Bhojpuri, Chhattisgarhi, Maithili, Magahi, and English
- **21 Voice Options**: Male and female voices for every language except Gujarati, which has a female voice only
- **High-Quality Audio**: 22050 Hz sample rate (16000 Hz for Gujarati), natural prosody
- **REST API**: Simple GET/POST endpoints for easy integration
- **Real-time Synthesis**: Fast inference on CPU/GPU

## 🗣️ Supported Languages

| | Language | Code | Female | Male | Script | |
| |----------|------|--------|------|--------| |
| | Hindi | hi | ✅ | ✅ | देवनागरी | |
| | Bengali | bn | ✅ | ✅ | বাংলা | |
| | Marathi | mr | ✅ | ✅ | देवनागरी | |
| | Telugu | te | ✅ | ✅ | తెలుగు | |
| | Kannada | kn | ✅ | ✅ | ಕನ್ನಡ | |
| | Gujarati | gu | ✅ (MMS) | - | ગુજરાતી | |
| | Bhojpuri | bho | ✅ | ✅ | देवनागरी | |
| | Chhattisgarhi | hne | ✅ | ✅ | देवनागरी | |
| | Maithili | mai | ✅ | ✅ | देवनागरी | |
| | Magahi | mag | ✅ | ✅ | देवनागरी | |
| | English | en | ✅ | ✅ | Latin | |
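
The table above can also be kept as a small lookup structure, which makes it easy to check whether a voice exists before sending a request. The following is a minimal sketch, not the project's actual `config.py`: the names `SUPPORTED_VOICES`, `list_voices`, and `has_voice`, and the `<code>_<gender>` voice ID format, are assumptions made for illustration.

```python
# Languages and available voice genders, transcribed from the table above.
# The "<code>_<gender>" voice IDs are hypothetical, used only for illustration.
SUPPORTED_VOICES = {
    "hi": ("female", "male"),
    "bn": ("female", "male"),
    "mr": ("female", "male"),
    "te": ("female", "male"),
    "kn": ("female", "male"),
    "gu": ("female",),  # MMS model, female voice only
    "bho": ("female", "male"),
    "hne": ("female", "male"),
    "mai": ("female", "male"),
    "mag": ("female", "male"),
    "en": ("female", "male"),
}


def list_voices():
    """Return every (language code, gender) pair as a hypothetical voice ID."""
    return [
        f"{code}_{gender}"
        for code, genders in SUPPORTED_VOICES.items()
        for gender in genders
    ]


def has_voice(code, gender):
    """Check whether a given language code offers the requested gender."""
    return gender in SUPPORTED_VOICES.get(code, ())
```

Counting the entries returned by `list_voices()` reproduces the 21 voice options: ten languages with two voices each, plus the single Gujarati voice.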

## 📡 API Usage

### Endpoint

```
https://harshil748-voiceapi.hf.space/
```

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `text` | string | Yes | Text to synthesize (lowercase for English) |
| `lang` | string | Yes | Language name (hindi, bengali, etc.) |
| `speaker_wav` | file | Yes | Reference WAV file (for API compatibility) |
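
Since `lang` expects a language name while the language table lists codes, a client may want to normalize and validate its inputs before calling the API. The helper below is a hypothetical sketch of that step. Only `hindi`, `bengali`, and `english` appear as `lang` values in this README; the remaining names in `LANG_NAMES` are assumed to follow the same lowercase pattern and should be checked against the live API.

```python
# Assumed mapping from language codes to the names accepted by `lang`.
# Only "hindi", "bengali", and "english" appear in this README; the rest are guesses.
LANG_NAMES = {
    "hi": "hindi", "bn": "bengali", "mr": "marathi", "te": "telugu",
    "kn": "kannada", "gu": "gujarati", "bho": "bhojpuri",
    "hne": "chhattisgarhi", "mai": "maithili", "mag": "magahi", "en": "english",
}


def build_params(text, lang):
    """Build the query parameters for a synthesis request (hypothetical helper).

    `lang` may be a code from the language table ("hi") or a name ("hindi").
    """
    name = LANG_NAMES.get(lang.lower(), lang.lower())
    if name not in LANG_NAMES.values():
        raise ValueError(f"unsupported language: {lang!r}")
    text = text.strip()
    if not text:
        raise ValueError("text must not be empty")
    if name == "english":
        text = text.lower()  # the parameter table asks for lowercase English text
    return {"text": text, "lang": name}
```

The returned dictionary can be passed directly as the `params` argument of a `requests` call.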

### Example (Python)

```python
import requests

base_url = 'https://harshil748-voiceapi.hf.space/Get_Inference'
wav_path = 'reference.wav'

# Query parameters: the text to synthesize and the language name
params = {
    'text': 'नमस्ते, आप कैसे हैं?',
    'lang': 'hindi',
}

# Send the reference WAV as a multipart file alongside the query parameters
with open(wav_path, 'rb') as audio_file:
    response = requests.get(base_url, params=params, files={'speaker_wav': audio_file.read()})

# On success, the response body is the synthesized audio
if response.status_code == 200:
    with open('output.wav', 'wb') as f:
        f.write(response.content)
    print("Audio saved as 'output.wav'")
else:
    print(f"Request failed with status {response.status_code}")
```

### Example (cURL)

```bash
curl -X POST "https://harshil748-voiceapi.hf.space/Get_Inference?text=hello&lang=english" \
  -F "speaker_wav=@reference.wav" \
  -o output.wav
```

## 🏗️ Model Architecture

- **Base Model**: VITS (Variational Inference with adversarial learning for Text-to-Speech)
- **Encoder**: Transformer-based text encoder (6 layers, 192 hidden channels)
- **Decoder**: HiFi-GAN neural vocoder
- **Duration Predictor**: Stochastic duration predictor for natural prosody
- **Sample Rate**: 22050 Hz (16000 Hz for Gujarati MMS)
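
Because the Gujarati voice uses a different sample rate from the others, client code may prefer to read the rate from the audio header rather than hard-code 22050 Hz. The sketch below uses Python's standard `wave` module and assumes the API returns uncompressed PCM WAV data, which this README does not state explicitly. `wav_info` and `make_silence` are illustrative names, and `make_silence` only stands in for a real API response.

```python
import io
import wave


def wav_info(wav_bytes):
    """Return (sample_rate_hz, duration_seconds) read from a PCM WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate


def make_silence(rate, seconds):
    """Build a silent 16-bit mono WAV in memory (stand-in for an API response)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    return buf.getvalue()
```

In practice, `wav_info(response.content)` would be called on the bytes returned by the endpoint, and the reported rate passed on to whatever plays or resamples the audio.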

## 📊 Training

### Datasets Used

| | Dataset | Languages | Source | License | |
| |---------|-----------|--------|---------| |
| | OpenSLR-103 | Hindi | [OpenSLR](https://www.openslr.org/103/) | CC BY 4.0 | |
| | OpenSLR-37 | Bengali | [OpenSLR](https://www.openslr.org/37/) | CC BY 4.0 | |
| | OpenSLR-64 | Marathi | [OpenSLR](https://www.openslr.org/64/) | CC BY 4.0 | |
| | OpenSLR-66 | Telugu | [OpenSLR](https://www.openslr.org/66/) | CC BY 4.0 | |
| | OpenSLR-79 | Kannada | [OpenSLR](https://www.openslr.org/79/) | CC BY 4.0 | |
| | OpenSLR-78 | Gujarati | [OpenSLR](https://www.openslr.org/78/) | CC BY 4.0 | |
| | Common Voice | Hindi, Bengali | [Mozilla](https://commonvoice.mozilla.org/) | CC0 | |
| | IndicTTS | Multiple | [IIT Madras](https://www.iitm.ac.in/donlab/tts/) | Research | |
| | Indic-Voices | Multiple | [AI4Bharat](https://ai4bharat.iitm.ac.in/indic-voices/) | CC BY 4.0 | |

### Training Configuration

- **Epochs**: 1000
- **Batch Size**: 32
- **Learning Rate**: 2e-4
- **Optimizer**: AdamW
- **FP16 Training**: Enabled
- **Hardware**: NVIDIA V100/A100 GPUs

See the `training/` directory for full training scripts and configurations.
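
To make the scale of these settings concrete, the hyperparameters above can be collected into a small configuration object. This is a sketch only: the class name `TrainConfig`, its field names, and the `total_steps` helper are assumptions for illustration and do not reflect the actual files under `training/configs/`.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters from the list above; field names are illustrative."""
    epochs: int = 1000
    batch_size: int = 32
    learning_rate: float = 2e-4
    optimizer: str = "AdamW"
    fp16: bool = True

    def total_steps(self, num_utterances):
        """Optimizer steps for a dataset of the given size (last partial batch kept)."""
        return self.epochs * math.ceil(num_utterances / self.batch_size)
```

For a hypothetical corpus of 10,000 utterances, `TrainConfig().total_steps(10_000)` gives 313,000 optimizer steps (313 batches per epoch over 1000 epochs).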

## 🚀 Deployment

This API is deployed on HuggingFace Spaces using Docker:

```dockerfile
FROM python:3.10-slim
# ... installs dependencies
# Downloads models from Harshil748/VoiceAPI-Models
# Runs FastAPI server on port 7860
```

Models are hosted separately at [Harshil748/VoiceAPI-Models](https://huggingface.co/Harshil748/VoiceAPI-Models) (~8GB).

## 📁 Project Structure

```
VoiceAPI/
├── app.py                  # HuggingFace Spaces entry point
├── Dockerfile              # Docker configuration
├── requirements.txt        # Python dependencies
├── download_models.py      # Model downloader
├── src/
│   ├── api.py              # FastAPI REST server
│   ├── engine.py           # TTS inference engine
│   ├── config.py           # Voice configurations
│   └── tokenizer.py        # Text tokenization
└── training/
    ├── train_vits.py       # VITS training script
    ├── prepare_dataset.py  # Data preparation
    ├── export_model.py     # Model export
    ├── datasets.csv        # Dataset links
    └── configs/            # Training configs
```

## 📜 License

- **Code**: MIT License
- **Models**: CC BY 4.0 (following SYSPIN licensing)
- **Datasets**: Individual licenses (see `training/datasets.csv`)

## 🙏 Acknowledgments

- [SYSPIN IISc SPIRE Lab](https://syspin.iisc.ac.in/) for pre-trained VITS models
- [Facebook MMS](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) for Gujarati TTS
- [Coqui TTS](https://github.com/coqui-ai/TTS) for the TTS library
- [AI4Bharat](https://ai4bharat.iitm.ac.in/) for Indian language resources

## 📧 Contact

Built for the **Voice Tech for All** Hackathon - Multi-lingual TTS for healthcare assistants serving low-income communities.