| # Voice Tech for All: Technical Report |
|
|
| ## Multi-lingual Text-to-Speech System with Style Transfer |
|
|
| **Hackathon**: Voice Tech for All |
| **Date**: December 2025 |
|
|
| --- |
|
|
| ## Executive Summary |
|
|
| We present a **multi-lingual Text-to-Speech (TTS) system** supporting **11 Indian languages** with **style/prosody control** capabilities. The system is designed for deployment as a healthcare assistant for pregnant mothers in low-income communities, making health information accessible in native languages. |
|
|
| ### Key Achievements |
|
|
| | Metric | Value | |
| | ---------------------- | ----------------------------------------------------------------------------------------------------------- | |
| | Languages Supported | 11 (Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English, Gujarati) | |
| | Voice Variants | 21 (male + female for each of the 10 SYSPIN languages, plus 1 neutral Gujarati MMS voice) | |
| | Style Presets | 9 (default, slow, fast, soft, loud, happy, sad, calm, excited) | |
| | Average Inference Time | ~0.3s (CPU, Apple M2) | |
| | Model Size | ~300MB per voice (VITS), ~145MB (MMS) | |
| | API Latency | <500ms for typical sentences | |
|
|
| --- |
|
|
| ## 1. System Architecture |
|
|
| ### 1.1 Overview |
|
|
| ``` |
| ┌─────────────────────────────────────────────────────────────┐ |
| │                  REST API Server (FastAPI)                  │ |
| ├─────────────────────────────────────────────────────────────┤ |
| │  ┌───────────┐  ┌──────────────┐  ┌───────────────────────┐ │ |
| │  │/synthesize│  │   /voices    │  │        /styles        │ │ |
| │  │  /stream  │  │  /languages  │  │        /health        │ │ |
| │  └───────────┘  └──────────────┘  └───────────────────────┘ │ |
| ├─────────────────────────────────────────────────────────────┤ |
| │                         TTS Engine                          │ |
| │  ┌─────────────────┐   ┌─────────────────┐   ┌────────────┐ │ |
| │  │ Text Normalizer │ → │    Tokenizer    │ → │  VITS/MMS  │ │ |
| │  │ (Indian scripts)│   │  (char-to-ID)   │   │ Inference  │ │ |
| │  └─────────────────┘   └─────────────────┘   └────────────┘ │ |
| │                              │                              │ |
| │  ┌────────────────────────────────────────────────────────┐ │ |
| │  │           Style Processor (Prosody Control)            │ │ |
| │  │  • Pitch Shifting (librosa)                            │ │ |
| │  │  • Time Stretching (speed control)                     │ │ |
| │  │  • Energy/Volume Modification                          │ │ |
| │  └────────────────────────────────────────────────────────┘ │ |
| ├─────────────────────────────────────────────────────────────┤ |
| │                      Model Repository                       │ |
| │  ┌────────────────────┐  ┌────────────────────────────────┐ │ |
| │  │ SYSPIN VITS Models │  │      Facebook MMS Models       │ │ |
| │  │   (10 languages)   │  │           (Gujarati)           │ │ |
| │  └────────────────────┘  └────────────────────────────────┘ │ |
| └─────────────────────────────────────────────────────────────┘ |
| ``` |
|
|
| ### 1.2 Component Details |
|
|
| #### Text Normalizer |
|
|
| - Handles Indian script peculiarities |
| - Converts number notations: `{100}{एकसौ}` → `एकसौ` |
| - Normalizes punctuation across scripts |
| - Handles code-switching (Hindi in English text) |
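The number-notation rule above can be sketched with a regular expression. This is a minimal illustration, assuming every number is marked as `{digits}{spoken form}` exactly as in the example; the function name `expand_number_notation` is hypothetical and not taken from the project.

```python
import re

# Matches "{digits}{spoken form}" pairs, e.g. "{100}{एकसौ}".
# Assumption: the spoken form itself never contains braces.
_NUMBER_NOTATION = re.compile(r"\{\d+\}\{([^{}]+)\}")


def expand_number_notation(text: str) -> str:
    """Replace each {digits}{spoken form} pair with the spoken form only."""
    return _NUMBER_NOTATION.sub(r"\1", text)


print(expand_number_notation("{100}{एकसौ} रुपये"))
```

Text without the brace notation passes through unchanged, so the step is safe to run on every input.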
|
|
| #### VITS Models (SYSPIN) |
|
|
| - **Architecture**: Conditional Variational Autoencoder with Adversarial Learning |
| - **Training Data**: 20-30 hours per speaker from IISc Bangalore |
| - **Output**: 22050 Hz, 16-bit PCM |
| - **Languages**: Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English |
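As an illustration of the stated output format (22050 Hz, 16-bit PCM), the sketch below encodes floating-point samples as a mono WAV file using only the standard library. It assumes the model produces floats in [-1.0, 1.0]; the helper name `float_to_wav_bytes` is hypothetical.

```python
import io
import struct
import wave

SAMPLE_RATE = 22050  # Hz, as stated for the SYSPIN VITS voices


def float_to_wav_bytes(samples, sample_rate=SAMPLE_RATE):
    """Encode float samples in [-1.0, 1.0] as a mono 16-bit PCM WAV file."""
    pcm = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clamp so the integer conversion cannot overflow
        pcm += struct.pack("<h", int(round(s * 32767)))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)  # mono
        wav.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
    return buf.getvalue()
```

Passing `sample_rate=16000` would produce a header that matches the MMS voice instead.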
|
|
| #### MMS Model (Facebook) |
|
|
| - **Architecture**: VITS-based, trained on MMS corpus |
| - **Output**: 16000 Hz |
| - **Languages**: Gujarati (and 1100+ others available) |
| - **Model Size**: 145MB |
|
|
| #### Style Processor |
|
|
| - **Pitch Shifting**: `librosa.effects.pitch_shift` (phase-vocoder time stretch followed by resampling) |
| - **Time Stretching**: `librosa.effects.time_stretch` (phase vocoder) |
| - **Energy Control**: Soft clipping with tanh for natural sound |
|
|
| --- |
|
|
| ## 2. API Specification |
|
|
| ### 2.1 Endpoints |
|
|
| | Endpoint | Method | Description | |
| | -------------------- | ------ | -------------------------------- | |
| | `/` | GET | API info and documentation links | |
| | `/health` | GET | System health and loaded models | |
| | `/voices` | GET | List all available voices | |
| | `/languages` | GET | List supported languages | |
| | `/styles` | GET | List style presets | |
| | `/synthesize` | POST | Generate speech from text | |
| | `/synthesize/get` | GET | Simple synthesis (for testing) | |
| | `/synthesize/stream` | POST | Streaming audio response | |
| | `/preload` | POST | Preload voice into memory | |
| | `/batch` | POST | Batch synthesis | |
|
|
| ### 2.2 Synthesis Request |
|
|
| ```json |
| { |
|   "text": "નમસ્તે, હું તમારી કેવી રીતે મદદ કરી શકું?", |
|   "voice": "gu_mms", |
|   "speed": 1.0, |
|   "pitch": 1.0, |
|   "energy": 1.0, |
|   "style": "calm", |
|   "normalize": true |
| } |
| ``` |
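A client could build and check this body before sending it. The sketch below mirrors the field names from the request above in a dataclass; the default values and the 0.5 to 2.0 bounds are assumptions for illustration, since the report does not state the accepted ranges.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SynthesisRequest:
    """Client-side mirror of the JSON body; defaults are assumed, not documented."""
    text: str
    voice: str
    speed: float = 1.0
    pitch: float = 1.0
    energy: float = 1.0
    style: str = "default"
    normalize: bool = True

    def validate(self) -> None:
        if not self.text.strip():
            raise ValueError("text must not be empty")
        # Hypothetical bounds: chosen only to show where range checks would go.
        for name in ("speed", "pitch", "energy"):
            value = getattr(self, name)
            if not 0.5 <= value <= 2.0:
                raise ValueError(f"{name}={value} is outside the assumed range 0.5-2.0")

    def to_json(self) -> str:
        self.validate()
        # ensure_ascii=False keeps Indic text readable instead of \u escapes.
        return json.dumps(asdict(self), ensure_ascii=False)
```

For example, `SynthesisRequest(text="નમસ્તે", voice="gu_mms", style="calm").to_json()` yields a body with the same seven keys as the request shown above.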
|
|
| ### 2.3 Style Presets |
|
|
| | Preset | Speed | Pitch | Energy | Use Case | |
| | ------- | ----- | ----- | ------ | ---------------------- | |
| | default | 1.0 | 1.0 | 1.0 | Normal speech | |
| | slow | 0.75 | 1.0 | 1.0 | Elderly users, clarity | |
| | fast | 1.25 | 1.0 | 1.0 | Quick information | |
| | soft | 0.9 | 0.95 | 0.7 | Calming content | |
| | loud | 1.0 | 1.05 | 1.3 | Alerts, emphasis | |
| | happy | 1.1 | 1.1 | 1.2 | Positive messages | |
| | sad | 0.85 | 0.9 | 0.8 | Empathetic responses | |
| | calm | 0.9 | 0.95 | 0.85 | Healthcare guidance | |
| | excited | 1.2 | 1.15 | 1.3 | Celebrations | |
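The preset table maps naturally onto a lookup. The sketch below copies the values from the table and resolves a request into final speed, pitch, and energy factors. It assumes an explicitly passed value replaces the preset value rather than multiplying it; the report does not say which rule the engine uses, and `resolve_style` is a hypothetical name.

```python
# (speed, pitch, energy) per preset, copied from the table above.
STYLE_PRESETS = {
    "default": (1.0, 1.0, 1.0),
    "slow": (0.75, 1.0, 1.0),
    "fast": (1.25, 1.0, 1.0),
    "soft": (0.9, 0.95, 0.7),
    "loud": (1.0, 1.05, 1.3),
    "happy": (1.1, 1.1, 1.2),
    "sad": (0.85, 0.9, 0.8),
    "calm": (0.9, 0.95, 0.85),
    "excited": (1.2, 1.15, 1.3),
}


def resolve_style(style="default", speed=None, pitch=None, energy=None):
    """Return (speed, pitch, energy); explicit arguments override the preset."""
    try:
        p_speed, p_pitch, p_energy = STYLE_PRESETS[style]
    except KeyError:
        raise ValueError(f"unknown style preset: {style!r}") from None
    return (
        p_speed if speed is None else speed,
        p_pitch if pitch is None else pitch,
        p_energy if energy is None else energy,
    )
```

Under this assumption, `resolve_style("happy", pitch=1.2)` keeps the preset's speed and energy and changes only the pitch.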
|
|
| --- |
|
|
| ## 3. Supported Languages |
|
|
| | Language | Code | Voices | Model Type | Sample Rate | |
| | ------------- | ---- | ------------ | ------------ | ----------- | |
| | Hindi | hi | Male, Female | SYSPIN VITS | 22050 Hz | |
| | Bengali | bn | Male, Female | SYSPIN VITS | 22050 Hz | |
| | Marathi | mr | Male, Female | SYSPIN VITS | 22050 Hz | |
| | Telugu | te | Male, Female | SYSPIN VITS | 22050 Hz | |
| | Kannada | kn | Male, Female | SYSPIN VITS | 22050 Hz | |
| | Bhojpuri | bho | Male, Female | SYSPIN VITS | 22050 Hz | |
| | Chhattisgarhi | hne | Male, Female | SYSPIN VITS | 22050 Hz | |
| | Maithili | mai | Male, Female | SYSPIN VITS | 22050 Hz | |
| | Magahi | mag | Male, Female | SYSPIN VITS | 22050 Hz | |
| | English | en | Male, Female | SYSPIN VITS | 22050 Hz | |
| | Gujarati | gu | Neutral | Facebook MMS | 16000 Hz | |
|
|
| --- |
|
|
| ## 4. Implementation Details |
|
|
| ### 4.1 Technology Stack |
|
|
| | Component | Technology | |
| | ----------------- | ---------------------------------------- | |
| | Backend Framework | FastAPI | |
| | ML Framework | PyTorch | |
| | TTS Models | VITS (Coqui AI / SYSPIN), MMS (Facebook) | |
| | Audio Processing | librosa, soundfile, scipy | |
| | Model Hub | Hugging Face Hub | |
| | API Documentation | OpenAPI/Swagger | |
|
|
| ### 4.2 Model Architecture - VITS |
|
|
| VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), a conditional variational autoencoder trained adversarially, was chosen for: |
|
|
| - **End-to-End Efficiency**: Combines acoustic modeling and vocoding in a single pass |
| - **High Quality**: Natural-sounding speech comparable to two-stage systems |
| - **Multi-Speaker Support**: Supports different speakers via embeddings |
| - **Fast Inference**: TorchScript JIT compilation for speed |
|
|
| ### 4.3 Style/Accent Transfer Implementation |
|
|
| Our style transfer uses a **post-processing** approach for simplicity and reliability. Pitch, speed, and energy are adjusted on the waveform after synthesis: |
|
|
| 1. **Pitch Shifting**: librosa's `pitch_shift`, which time-stretches with a phase vocoder and then resamples |
|
|
| ```python |
| semitones = 12 * np.log2(pitch_factor) |
| shifted = librosa.effects.pitch_shift(audio, sr=sr, n_steps=semitones) |
| ``` |
|
|
| 2. **Time Stretching**: Phase vocoder via librosa's `time_stretch` |
|
|
| ```python |
| stretched = librosa.effects.time_stretch(audio, rate=speed_factor) |
| ``` |
|
|
| 3. **Energy Control**: Soft clipping for natural sound |
| ```python |
| modified = audio * energy_factor |
| if energy_factor > 1.0: |
|     # Soft clip: tanh keeps the boosted signal within ±0.95 |
|     modified = np.tanh(modified * 2) * 0.95 |
| ``` |
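To make the arithmetic in steps 1 and 3 concrete, the sketch below reproduces the semitone conversion and the tanh soft clip using only the standard library, so it runs without librosa or NumPy. The helper names are illustrative and not part of the project.

```python
import math


def pitch_factor_to_semitones(pitch_factor: float) -> float:
    """An octave doubles the frequency and spans 12 semitones."""
    return 12.0 * math.log2(pitch_factor)


def apply_energy(samples, energy_factor: float):
    """Scale samples; when boosting, soft-clip with tanh as in step 3."""
    scaled = [s * energy_factor for s in samples]
    if energy_factor > 1.0:
        # tanh is bounded by +/-1, so no output sample exceeds 0.95 in magnitude.
        scaled = [math.tanh(s * 2.0) * 0.95 for s in scaled]
    return scaled
```

A pitch factor of 1.1, as used by the `happy` preset, corresponds to roughly 1.65 semitones, and a factor of 2.0 to exactly one octave. Energy factors at or below 1.0 scale linearly and skip the clipping stage.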
|
|
| ### 4.4 Key Design Decisions |
|
|
| 1. **TorchScript Models**: JIT-compiled for faster inference |
| 2. **Lazy Loading**: Models loaded on-demand to minimize memory |
| 3. **CPU Fallback**: Inference falls back to CPU on Apple Silicon, where MPS is not fully supported |
| 4. **Streaming Support**: Progressive audio delivery for real-time apps |
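Lazy loading from the list above can be pictured as a small cache that loads a model the first time its voice is requested. The sketch below is one possible shape, not the project's implementation: the `loader` callable stands in for whatever reads a model from disk, and the least-recently-used eviction with `max_loaded` is an added assumption to show how memory could be bounded.

```python
from collections import OrderedDict
from typing import Any, Callable


class LazyModelCache:
    """Load models on first use and keep at most `max_loaded` in memory."""

    def __init__(self, loader: Callable[[str], Any], max_loaded: int = 2):
        self._loader = loader  # hypothetical: e.g. a function that loads a TorchScript file
        self._max_loaded = max_loaded
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, voice: str) -> Any:
        if voice in self._models:
            self._models.move_to_end(voice)  # mark as most recently used
            return self._models[voice]
        model = self._loader(voice)  # the expensive step, done only on demand
        self._models[voice] = model
        if len(self._models) > self._max_loaded:
            self._models.popitem(last=False)  # evict the least recently used voice
        return model

    def loaded(self):
        """Voices currently held in memory, oldest first."""
        return list(self._models)
```

With roughly 300MB per VITS voice, a bound of this kind keeps memory predictable when many voices are requested over time.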
|
|
| --- |
|
|
| ## 5. Usage Examples |
|
|
| ### 5.1 Python API |
|
|
| ```python |
| from src.engine import TTSEngine |
| |
| # Initialize engine |
| engine = TTSEngine(device="auto") |
| |
| # Basic synthesis |
| # Text: "A healthy diet is very important during pregnancy" |
| output = engine.synthesize( |
|     text="गर्भावस्था में स्वस्थ आहार बहुत महत्वपूर्ण है", |
|     voice="hi_female", |
| ) |
|
| # With style control (text: "Have a good day") |
| output = engine.synthesize( |
|     text="आपका दिन शुभ हो", |
|     voice="hi_male", |
|     style="happy", |
|     pitch=1.1, |
| ) |
|
| # Gujarati (text: "Stay healthy, stay happy") |
| output = engine.synthesize( |
|     text="સ્વસ્થ રહો, ખુશ રહો", |
|     voice="gu_mms", |
|     style="calm", |
| ) |
| ``` |
|
|
| ### 5.2 REST API |
|
|
| ```bash |
| # Basic synthesis |
| curl -X POST "http://localhost:8000/synthesize" \ |
| -H "Content-Type: application/json" \ |
| -d '{"text": "नमस्ते", "voice": "hi_male"}' \ |
| --output speech.wav |
| |
| # With style |
| curl -X POST "http://localhost:8000/synthesize" \ |
| -H "Content-Type: application/json" \ |
| -d '{"text": "आपका स्वागत है", "voice": "hi_female", "style": "happy"}' \ |
| --output welcome.wav |
| |
| # Gujarati |
| curl -X POST "http://localhost:8000/synthesize" \ |
| -H "Content-Type: application/json" \ |
| -d '{"text": "નમસ્તે", "voice": "gu_mms"}' \ |
| --output gujarati.wav |
| ``` |
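The same calls can be made from Python without extra dependencies. The sketch below uses `urllib.request` from the standard library and assumes, as the curl examples suggest, that the server listens on `localhost:8000` and returns the audio file as the response body. The function names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl examples


def build_synthesize_request(text, voice, **options):
    """Build the POST /synthesize request; `options` may carry style, speed, etc."""
    body = json.dumps({"text": text, "voice": voice, **options}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(path, text, voice, **options):
    """Send the request and write the returned audio bytes to `path`."""
    request = build_synthesize_request(text, voice, **options)
    with urllib.request.urlopen(request, timeout=30) as response:
        audio = response.read()
    with open(path, "wb") as f:
        f.write(audio)
```

With the server running, `synthesize_to_file("welcome.wav", "आपका स्वागत है", "hi_female", style="happy")` corresponds to the second curl command.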
|
|
| ### 5.3 Command Line |
|
|
| ```bash |
| # Download models |
| python -m src.cli download --voice hi_male |
| python -m src.cli download --lang hi # All Hindi voices |
| |
| # Synthesize |
| python -m src.cli synthesize --text "नमस्ते" --voice hi_male --output hello.wav |
| |
| # Start server |
| python -m src.cli serve --port 8000 |
| ``` |
|
|
| --- |
|
|
| ## 6. Healthcare Use Case |
|
|
| ### 6.1 Target Application |
|
|
| The TTS system is designed for integration with an **LLM-based healthcare assistant** for pregnant mothers in low-income communities. |
|
|
| ### 6.2 Key Features for Healthcare |
|
|
| 1. **Multi-lingual Support**: Information in native languages |
| 2. **Calm Style Preset**: Reassuring tone for medical guidance |
| 3. **Slow Speed Option**: Clear pronunciation for instructions |
| 4. **Low Latency**: Real-time conversational responses |
|
|
| ### 6.3 Example Healthcare Dialogue |
|
|
| ``` |
| User: "ગર્ભાવસ્થામાં શું ખાવું જોઈએ?" |
| (English: "What should I eat during pregnancy?") |
|
| System Response (TTS with calm style in Gujarati): |
| "ગર્ભાવસ્થામાં તમારે પ્રોટીન, આયર્ન અને ફોલિક એસિડથી ભરપૂર |
| ખોરાક લેવો જોઈએ. દાળ, પાલક, ઈંડા અને દૂધ સારા વિકલ્પો છે." |
| (English: "During pregnancy you should eat food rich in protein, iron, and folic acid. Lentils, spinach, eggs, and milk are good options.") |
| ``` |
|
|
| --- |
|
|
| ## 7. Performance Benchmarks |
|
|
| | Test | Time | Notes | |
| | ----------------------- | ----- | ---------------------------------- | |
| | Hindi synthesis (short) | 0.25s | "नमस्ते" | |
| | Hindi synthesis (long) | 0.45s | 50-word sentence | |
| | Gujarati MMS | 0.35s | Excludes the one-time model download on first load | |
| | Style processing | +0.1s | Pitch + speed adjustment | |
| | API round-trip | 0.5s | Including network overhead | |
|
|
| Hardware: Apple M2 Pro, 16GB RAM, CPU inference |
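Timings like these can be reproduced with a small harness. The report does not describe how the numbers were measured, so the sketch below is only one reasonable method: discard a warm-up call (which would absorb lazy model loading), then take the median of several runs. The real-time factor helper is an additional, commonly used metric that the table does not report.

```python
import statistics
import time


def benchmark(fn, *args, warmup=1, runs=5, **kwargs):
    """Return (median_seconds, all_timings) for calling fn(*args, **kwargs)."""
    for _ in range(warmup):
        fn(*args, **kwargs)  # warm-up call is not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), timings


def real_time_factor(synthesis_seconds, audio_seconds):
    """A value below 1 means audio is generated faster than it plays back."""
    return synthesis_seconds / audio_seconds
```

A call such as `benchmark(engine.synthesize, text="नमस्ते", voice="hi_male")` would time the engine from the Python API example, assuming that engine object is available.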
|
|
| --- |
|
|
| ## 8. Deployment |
|
|
| ### 8.1 Quick Start |
|
|
| ```bash |
| # Clone repository |
| git clone https://github.com/harshil748/VoiceAPI |
| cd VoiceAPI |
| |
| # Setup environment |
| python3 -m venv tts |
| source tts/bin/activate |
| pip install -r requirements.txt |
| |
| # Download a model |
| python -m src.cli download --voice hi_male |
| |
| # Start server |
| python -m src.cli serve --port 8000 |
| ``` |
|
|
| ### 8.2 Docker |
|
|
| ```dockerfile |
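| FROM python:3.10-slim |
| WORKDIR /app |
| # Install dependencies first so this layer is cached across code changes |
| COPY requirements.txt . |
| RUN pip install --no-cache-dir -r requirements.txt |
| COPY . . |
| # Bake the Hindi voices into the image |
| RUN python -m src.cli download --lang hi |
| EXPOSE 8000 |
| CMD ["python", "-m", "src.cli", "serve", "--port", "8000"] |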
| ``` |
|
|
| --- |
|
|
| ## 9. Limitations and Future Work |
|
|
| ### 9.1 Current Limitations |
|
|
| 1. **Model Size**: Each VITS model is ~300MB |
| 2. **MPS Compatibility**: Apple Silicon MPS not fully supported |
| 3. **Real-time Streaming**: Limited to sentence-level granularity; audio is not emitted mid-sentence |
| 4. **Gujarati Gender**: MMS has only neutral voice |
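Sentence-level streaming, as described in the limitations above, can be sketched as a generator that splits the input and synthesizes one sentence at a time. This is an illustration under stated assumptions: the splitting rule (`.`, `!`, `?`, and the Devanagari danda `।` followed by whitespace) and both function names are hypothetical, and `synthesize` stands for any text-to-audio callable.

```python
import re

# Split on whitespace that follows ., !, ? or the danda, keeping the punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?।])\s+")


def split_sentences(text):
    """Split text into sentences so each can be synthesized separately."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def stream_synthesis(text, synthesize):
    """Yield one audio chunk per sentence as soon as it is ready."""
    for sentence in split_sentences(text):
        yield synthesize(sentence)
```

A rule this simple will also split after abbreviations such as "Dr. Rao", so a production splitter would need language-specific handling.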
|
|
| ### 9.2 Future Improvements |
|
|
| 1. **Model Quantization**: INT8 for smaller size |
| 2. **Voice Cloning**: Reference audio-based synthesis |
| 3. **SSML Support**: Markup language for fine control |
| 4. **More Languages**: Odia, Assamese, Punjabi |
| 5. **Fine-tuning**: Custom voice training on SPICOR data |
|
|
| --- |
|
|
| ## 10. Credits |
|
|
| ### Model Sources |
|
|
| | Source | Models | License | |
| | ----------------------- | --------------------- | ------------ | |
| | SYSPIN (IISc Bangalore) | VITS for 10 languages | CC BY 4.0 | |
| | Facebook MMS | Gujarati VITS | CC BY-NC 4.0 | |
|
|
| ### Dataset |
|
|
| - **SPICOR TTS Project**: IISc SPIRE Lab, Bangalore |
| - **Audio Quality**: 48kHz, 24-bit, mono |
|
|
| ### Frameworks |
|
|
| - Coqui TTS, Hugging Face Transformers, FastAPI, librosa |
|
|
| --- |
|
|
| ## 11. Conclusion |
|
|
| We have developed a comprehensive multi-lingual TTS system that: |
|
|
| - ✅ Supports **11 Indian languages** with 21 voice variants |
| - ✅ Provides **9 style presets** for prosody control |
| - ✅ Offers a **REST API** with OpenAPI documentation |
| - ✅ Achieves **<500ms latency** for typical sentences |
| - ✅ Is **production-ready** with proper error handling |
|
|
| The system is well-suited for the healthcare assistant use case, providing clear, natural-sounding speech in native languages to help pregnant mothers access healthcare information. |
|
|
| --- |
|
|
| **Repository**: https://github.com/harshil748/VoiceAPI |
| **API Documentation**: http://localhost:8000/docs |
|
|