---
license: apache-2.0
language:
- vi
library_name: pytorch
pipeline_tag: text-to-speech
tags:
- text-to-speech
- zero-shot-tts
- voice-cloning
- vietnamese
- zipvoice
base_model: k2-fsa/ZipVoice
---

# ViZipVoice

Vietnamese zero-shot TTS / voice cloning fine-tuned from [ZipVoice](https://github.com/k2-fsa/ZipVoice).

- GitHub: https://github.com/iamdinhthuan/ViZipvoice
- Model repo: https://huggingface.co/contextboxai/ViZipvoice
- Space: https://huggingface.co/spaces/dinhthuan/ViZipvoice
- Latest checkpoint: `checkpoint-1860000.pt`, FP16 inference state dict
- Training data: about `7000` total hours, including roughly `6500` hours of Vietnamese and `500` hours of English
- Tokenizer: `SimpleTokenizer`, character-level, `244` tokens
- Sample rate: `24 kHz`
- Default vocoder: `charactr/vocos-mel-24khz`

The wrapper loads the largest `checkpoint-*.pt` automatically and uses `soe-vinorm` for Vietnamese text normalization.

## Audio Demo

Generated with `checkpoint-1860000.pt`, the current wrapper flow, and the demo text in `demo/demo_text.txt`.

**Đinh-Quyết**

[Open audio](https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_01_%C4%90inh-Quy%E1%BA%BFt.wav)

**Nhã-Uyên**

[Open audio](https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_02_Nh%C3%A3-Uy%C3%AAn.wav)

**MC**

[Open audio](https://huggingface.co/contextboxai/ViZipvoice/resolve/main/demo/demo_03_MC.wav)

## Install

```bash
git clone https://github.com/iamdinhthuan/ViZipvoice.git
cd ViZipvoice
pip install -r requirements.txt
export PYTHONPATH="$PWD:$PYTHONPATH"
```

## CLI

```bash
python3 -m zipvoice.bin.infer_vizipvoice \
  --prompt-wav prompt.wav \
  --prompt-text "Xin chào, đây là giọng mẫu của tôi." \
  --text "ViZipVoice có thể tổng hợp giọng nói tiếng Việt từ một đoạn mẫu ngắn." \
  --res-wav-path output.wav
```

The CLI downloads this model repo by default. Use `--model-dir models/ViZipvoice` after downloading files locally.

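
When pointing `--model-dir` at a local folder, it can help to check which checkpoint would be picked up. The following is a minimal sketch of that kind of selection, not the wrapper's actual code. It assumes that "largest" means the highest step number in a filename of the form `checkpoint-<step>.pt`; `find_latest_checkpoint` is a hypothetical helper name and is not part of the ViZipVoice API.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed filename pattern: checkpoint-<step>.pt, e.g. checkpoint-1860000.pt
_CKPT_RE = re.compile(r"^checkpoint-(\d+)\.pt$")


def find_latest_checkpoint(model_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None if none match."""
    best_step = -1
    best_path = None
    for path in Path(model_dir).iterdir():
        match = _CKPT_RE.match(path.name)
        if match is None:
            continue  # skip config.json, tokens.txt, and other files
        step = int(match.group(1))  # compare numerically, not as strings
        if step > best_step:
            best_step, best_path = step, path
    return best_path
```

Comparing the step as an integer matters here: a plain string sort would rank `checkpoint-900000.pt` above `checkpoint-1860000.pt`.
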
## Python

```python
from zipvoice.vizipvoice import ViZipVoiceTTS

tts = ViZipVoiceTTS()
metrics = tts.synthesize(
    prompt_wav="prompt.wav",
    prompt_text="Xin chào, đây là giọng mẫu của tôi.",
    text="Đây là câu tiếng Việt được sinh bởi ViZipVoice.",
    output_path="output.wav",
)
print(metrics)
```

## Reference Audio

`audio/` contains 30 reference prompts. Each audio file has a sidecar `.txt` transcript with the same basename:

```text
audio/Đinh-Quyết.mp3
audio/Đinh-Quyết.txt
```

Names only keep the audio/person name; the original `lar_*` prefix and `Pro` suffix are removed. The Gradio app reads this sidecar format automatically.

```bash
huggingface-cli download contextboxai/ViZipvoice \
  --local-dir models/ViZipvoice \
  --local-dir-use-symlinks False

python3 egs/zipvoice/gradio_app.py --exp-dir models/ViZipvoice
```

## Inference Flow

The CLI, Python wrapper, and Gradio app use the same default flow:

- normalize Vietnamese text with `soe-vinorm`, then clean spaces around punctuation;
- split long text into sentences;
- for a `1`-word sentence: use at least `24` steps and `speed=0.6`;
- for a `2-4` word sentence: use `speed=0.8`;
- generate each segment separately;
- merge segments with silence, crossfade, fade in, and fade out.

Useful knobs:

```bash
--no-vietnamese-normalize
--no-split-sentences
--crossfade-ms 80
--silence-ms 180
--fade-in-ms 20
--fade-out-ms 80
```

## Files

- `checkpoint-1860000.pt`: latest FP16 checkpoint
- `config.json`, `model.json`: model config
- `tokens.txt`: Vietnamese character tokenizer
- `audio/`: 30 reference audios plus `.txt` transcripts
- `demo/`: regenerated audio demos and `metadata.json`
- `vizipvoice.py`: wrapper mirrored from GitHub

## Responsible Use

This model can clone voices from short audio prompts. Use only voices you own or have explicit permission to use. Do not use it for impersonation, fraud, harassment, misinformation, or other harmful content.

## License

Apache License 2.0. Please also credit the original ZipVoice project.
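
As an illustration of the short-sentence rules listed under Inference Flow, the sketch below shows one way the per-segment settings could be chosen. It is not the wrapper's code. The base values `BASE_STEPS` and `BASE_SPEED`, the naive regex splitter, and the function names are all assumptions made for the example; the sketch also assumes the short-sentence speeds replace the base speed rather than scale it.

```python
import re
from typing import List, Tuple

# Hypothetical base values; the wrapper's real defaults are not stated in this card.
BASE_STEPS = 16
BASE_SPEED = 1.0


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after ., !, ? or … followed by whitespace."""
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    return [p for p in parts if p]


def segment_settings(
    sentence: str, steps: int = BASE_STEPS, speed: float = BASE_SPEED
) -> Tuple[int, float]:
    """Return (steps, speed) for one sentence using the short-sentence rules."""
    n_words = len(sentence.split())
    if n_words == 1:
        # 1-word sentence: at least 24 steps and speed=0.6
        return max(steps, 24), 0.6
    if 2 <= n_words <= 4:
        # 2-4 word sentence: speed=0.8, steps unchanged
        return steps, 0.8
    return steps, speed


def plan(text: str) -> List[Tuple[str, int, float]]:
    """Pair each sentence with the settings it would be generated with."""
    return [(s, *segment_settings(s)) for s in split_sentences(text)]
```

Each tuple returned by `plan` would correspond to one segment generated separately before the segments are merged.
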