---
license: apache-2.0
language:
- en
- es
- pt
- ru
- fr
- ja
- ko
- de
- multilingual
tags:
- audio-generation
- text-to-audio
- text-to-speech
- text-to-music
- sound-effects
- diffusion
- multilingual
library_name: transformers
pipeline_tag: text-to-audio
---

# Dasheng-AudioGen-Multilingual

[arXiv](https://arxiv.org/abs/2605.27838) | [GitHub](https://github.com/xiaomi-research/dasheng-audiogen) | [Hugging Face Model](https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual) | [Hugging Face Space](https://huggingface.co/spaces/mispeech/Dasheng-AudioGen) | [Project Page](https://nieeim.github.io/Dasheng-AudioGen-Web/)

[**English**](./README.md) | [**中文**](./README_zh.md)

**Dasheng-AudioGen-Multilingual** is the multilingual variant of Dasheng-AudioGen, a unified audio generation model that can jointly synthesize **intelligible speech, music, sound effects, and environmental acoustics** from text descriptions.

## Models

| Model | HuggingFace | Text Encoder | Language |
|-------|-------------|-------------|:--------:|
| Dasheng-AudioGen | [mispeech/Dasheng-AudioGen](https://huggingface.co/mispeech/Dasheng-AudioGen) | `google/flan-t5-large` | English |
| Dasheng-AudioGen-Multilingual | [mispeech/Dasheng-AudioGen-Multilingual](https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual) | `google/mt5-large` | Multilingual |

### Language Support

| Language | Duration (h) | Proportion |
|----------|------------:|----------:|
| English | 15,367.80 | 58.86% |
| Spanish | 2,740.96 | 10.50% |
| Portuguese | 1,916.24 | 7.34% |
| Russian | 1,217.39 | 4.66% |
| French | 933.91 | 3.58% |
| Japanese | 874.51 | 3.35% |
| Korean | 848.15 | 3.25% |
| German | 842.29 | 3.23% |
| Other | 1,369.16 | 5.24% |

> **Note:** The current multilingual model has notably higher synthesis error rates for all non-English languages. Languages outside the table above are even less reliable. For English-only use cases, the base model (`mispeech/Dasheng-AudioGen`) is recommended.

## Installation

```bash
pip install torch torchaudio "transformers<5" einops
```

> Tested with Python 3.10, torch 2.8.0+cu128, transformers 4.57. Not compatible with transformers 5.x.

## Prompt Format

Dasheng-AudioGen uses structured tags to describe different audio aspects. A valid prompt **must start with the `<|caption|>` tag**, which provides the overall scene description. Other tags are optional and can be included as needed.

| Tag | Description | Required |
|-----|-------------|:--------:|
| `<\|caption\|>` | Overall audio scene description | Yes |
| `<\|speech\|>` | Speaker identity and speaking style | No |
| `<\|asr\|>` | Spoken transcript / dialogue | No |
| `<\|sfx\|>` | Sound effects | No |
| `<\|music\|>` | Background music | No |
| `<\|env\|>` | Environmental ambience | No |

**Rules:**

- The prompt must begin with `<|caption|>`; prompts without it will be rejected.
- Only include tags that are relevant; omit tags with no content (e.g., skip `<|music|>` if there is no music).

> **Multilingual prompt convention:** All descriptive tags (`caption`, `speech`, `sfx`, `music`, `env`) should be written in **English**. Only the `<|asr|>` field (the actual spoken content to be synthesized) should use the target language.

## Quick Start

### Usage 1: Aspect-wise Composition

Pass each aspect as a named argument. The `caption` field is required; all other fields are optional.

```python
import torchaudio
from transformers import AutoModel

# Load the model with its custom code and move it to the GPU
model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

# Descriptive fields are in English; only `asr` uses the target language
prompt = model.compose_prompt(
    caption="A conversation scene on a busy city street.",
    speech="A young woman speaking softly in Spanish.",
    env="Rain and distant traffic noise.",
    asr="Creo que deberíamos irnos ya.",
)

audio = model.generate(prompt)
# Save at a 16 kHz sample rate
torchaudio.save("output.wav", audio.cpu(), 16000)
```

### Usage 2: Pre-formatted Prompt String

Pass a complete tagged string via the `prompt` parameter. The string must start with `<|caption|>`.

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

# The tagged string must begin with <|caption|>
prompt = model.compose_prompt(
    prompt="<|caption|> A conversation scene on a busy city street. <|speech|> A young woman speaking softly in Spanish. <|asr|> Creo que deberíamos irnos ya. <|env|> Rain and distant traffic noise."
)

audio = model.generate(prompt)
torchaudio.save("output.wav", audio.cpu(), 16000)
```

### Batch Inference

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

# Compose one prompt per clip, then generate them in a single call
prompts = [
    model.compose_prompt(caption="A cat meowing softly.", sfx="Soft cat meow."),
    model.compose_prompt(caption="Thunder rolling in the distance.", env="Stormy night ambience."),
    model.compose_prompt(caption="A piano playing a gentle melody.", music="Soft piano ballad."),
]

audios = model.generate(prompts)
for i, audio in enumerate(audios):
    # Add a leading channel dimension before saving each clip
    torchaudio.save(f"output_{i}.wav", audio.unsqueeze(0).cpu(), 16000)
```

### Generation Parameters

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

prompt = model.compose_prompt(caption="A dog barking in a park")
audio = model.generate(
    prompts=prompt,
    num_steps=25,             # number of denoising steps (default: 25)
    guidance_scale=5.0,       # classifier-free guidance scale (default: 5.0)
    sway_sampling_coef=-1.0,  # sway sampling coefficient (default: -1.0, 0 for linear)
)
torchaudio.save("output.wav", audio.cpu(), 16000)
```

## Acknowledgments

Dasheng-AudioGen was developed with contributions from **XIAOMI LLM PLUS** and **SJTU X-LANCE**.

## Citation

```bibtex
@article{mei2026dashengaudiogen,
  title   = {Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text},
  author  = {Jiahao Mei and Heinrich Dinkel and Yadong Niu and Xingwei Sun and Gang Li and Yifan Liao and Jiahao Zhou and Junbo Zhang and Jian Luan and Mengyue Wu},
  journal = {arXiv preprint arXiv:2605.27838},
  year    = {2026}
}
```

## License

This project is released under the [Apache License 2.0](LICENSE).
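
## Appendix: Prompt Format Sketch

The tag rules in the Prompt Format section can be illustrated without loading the model. The following is a minimal sketch in plain Python, not the model's own `compose_prompt` implementation. `TAG_ORDER`, `build_tagged_prompt`, and `is_valid_prompt` are hypothetical names introduced here for illustration. The tag ordering simply follows the tag table, which is an assumption: the source only states that `<|caption|>` must come first.

```python
# Hypothetical helpers (not part of the model's API) that mirror the
# documented tag rules: caption first, optional tags only when non-empty.
TAG_ORDER = ("caption", "speech", "asr", "sfx", "music", "env")


def build_tagged_prompt(caption, **aspects):
    """Assemble a tagged prompt string, skipping aspects with no content."""
    if not caption or not caption.strip():
        raise ValueError("caption is required")
    unknown = set(aspects) - set(TAG_ORDER[1:])
    if unknown:
        raise ValueError(f"unknown aspects: {sorted(unknown)}")

    fields = {"caption": caption, **aspects}
    parts = []
    for tag in TAG_ORDER:  # assumed order, taken from the tag table
        text = (fields.get(tag) or "").strip()
        if text:  # omit tags with no content
            parts.append(f"<|{tag}|> {text}")
    return " ".join(parts)


def is_valid_prompt(prompt):
    """Check only the one hard rule stated in the README: starts with <|caption|>."""
    return prompt.lstrip().startswith("<|caption|>")
```

A string built this way has the same shape as the one shown in Usage 2, so it could presumably be passed as `model.compose_prompt(prompt=...)`. Whether the model applies further checks beyond the leading `<|caption|>` tag is not documented here.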