---
license: apache-2.0
language:
- en
- es
- pt
- ru
- fr
- ja
- ko
- de
- multilingual
tags:
- audio-generation
- text-to-audio
- text-to-speech
- text-to-music
- sound-effects
- diffusion
- multilingual
library_name: transformers
pipeline_tag: text-to-audio
---
Dasheng-AudioGen-Multilingual
Dasheng-AudioGen-Multilingual is the multilingual variant of Dasheng-AudioGen, a unified audio generation model that can jointly synthesize intelligible speech, music, sound effects, and environmental acoustics from text descriptions.
Models
| Model | HuggingFace | Text Encoder | Language |
|---|---|---|---|
| Dasheng-AudioGen | mispeech/Dasheng-AudioGen | google/flan-t5-large | English |
| Dasheng-AudioGen-Multilingual | mispeech/Dasheng-AudioGen-Multilingual | google/mt5-large | Multilingual |
Language Support
| Language | Duration (h) | Proportion |
|---|---|---|
| English | 15,367.80 | 58.86% |
| Spanish | 2,740.96 | 10.50% |
| Portuguese | 1,916.24 | 7.34% |
| Russian | 1,217.39 | 4.66% |
| French | 933.91 | 3.58% |
| Japanese | 874.51 | 3.35% |
| Korean | 848.15 | 3.25% |
| German | 842.29 | 3.23% |
| Other | 1,369.16 | 5.24% |
Note: The current multilingual model has notably higher synthesis error rates for all non-English languages. Languages outside the table above are even less reliable. For English-only use cases, the base model (mispeech/Dasheng-AudioGen) is recommended.
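The proportions in the table can be reproduced from the durations. The following is a minimal sketch with the figures copied from the table above; the variable names (`hours`, `proportions`) are illustrative only.

```python
# Durations in hours, copied from the Language Support table.
hours = {
    "English": 15367.80,
    "Spanish": 2740.96,
    "Portuguese": 1916.24,
    "Russian": 1217.39,
    "French": 933.91,
    "Japanese": 874.51,
    "Korean": 848.15,
    "German": 842.29,
    "Other": 1369.16,
}

# Each proportion is the language's share of the summed duration, in percent.
total = sum(hours.values())
proportions = {lang: round(100 * h / total, 2) for lang, h in hours.items()}

for lang, pct in proportions.items():
    print(f"{lang:<12}{pct:6.2f}%")
```

The durations sum to roughly 26,110 hours, of which the non-English languages together make up about 41%.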
Installation
pip install torch torchaudio "transformers<5" einops
Tested with Python 3.10, torch 2.8.0+cu128, transformers 4.57. Not compatible with transformers 5.x.
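To catch an incompatible environment before loading the model, the installed transformers version can be checked up front. This is a small sketch using only the standard library; the helper name `is_supported` is hypothetical, and the only rule it encodes is the "not compatible with 5.x" constraint stated above.

```python
from importlib.metadata import PackageNotFoundError, version


def is_supported(transformers_version: str) -> bool:
    """Return True if the major version is below 5 (the limit stated above)."""
    major = int(transformers_version.split(".")[0])
    return major < 5


try:
    installed = version("transformers")
    status = "ok" if is_supported(installed) else "unsupported, install transformers<5"
    print(f"transformers {installed}: {status}")
except PackageNotFoundError:
    print("transformers is not installed")
```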
Prompt Format
Dasheng-AudioGen uses structured tags to describe different audio aspects. A valid prompt must start with the <|caption|> tag, which provides the overall scene description. Other tags are optional and can be included as needed.
| Tag | Description | Required |
|---|---|---|
| `<\|caption\|>` | Overall audio scene description | Yes |
| `<\|speech\|>` | Speaker identity and speaking style | No |
| `<\|asr\|>` | Spoken transcript / dialogue | No |
| `<\|sfx\|>` | Sound effects | No |
| `<\|music\|>` | Background music | No |
| `<\|env\|>` | Environmental ambience | No |
Rules:
- The prompt must begin with <|caption|>; prompts without it will be rejected.
- Only include tags that are relevant; omit tags with no content (e.g., skip <|music|> if there is no music).
Multilingual prompt convention: All descriptive tags (caption, speech, sfx, music, env) should be written in English. Only the <|asr|> field (the actual spoken content to be synthesized) should use the target language.
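The rules above can be illustrated with a standalone sketch that assembles and checks a tagged prompt without loading the model. It is not the model's own implementation: `build_prompt` and `is_valid_prompt` are hypothetical helpers, the tag order is assumed to follow the table, and the single-space separator is assumed from the pre-formatted example string in this card.

```python
from typing import Optional

# Tag order follows the tag table; the order itself is an assumption.
TAG_ORDER = ("caption", "speech", "asr", "sfx", "music", "env")


def build_prompt(caption: str, **aspects: Optional[str]) -> str:
    """Hypothetical helper: join non-empty aspects into a tagged prompt string."""
    if not caption or not caption.strip():
        raise ValueError("caption is required")
    unknown = set(aspects) - set(TAG_ORDER)
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    fields = {"caption": caption, **aspects}
    parts = []
    for tag in TAG_ORDER:
        text = fields.get(tag)
        if text and text.strip():  # omit tags with no content
            parts.append(f"<|{tag}|> {text.strip()}")
    return " ".join(parts)


def is_valid_prompt(prompt: str) -> bool:
    """Check only the documented rule: the prompt starts with <|caption|>."""
    return prompt.startswith("<|caption|>")


prompt = build_prompt(
    caption="A conversation scene on a busy city street.",
    speech="A young woman speaking softly in Spanish.",  # descriptive tags stay in English
    asr="Creo que deberíamos irnos ya.",  # only the transcript uses the target language
    music=None,  # no music, so the tag is left out
)
print(prompt)
print(is_valid_prompt(prompt))
```

For actual generation, the model's `compose_prompt` method shown in Quick Start remains the supported route.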
Quick Start
Usage 1: Aspect-wise Composition
Pass each aspect as a named argument. The caption field is required; all other fields are optional.
import torchaudio
from transformers import AutoModel
model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()
prompt = model.compose_prompt(
    caption="A conversation scene on a busy city street.",
    speech="A young woman speaking softly in Spanish.",
    env="Rain and distant traffic noise.",
    asr="Creo que deberíamos irnos ya.",
)
audio = model.generate(prompt)
torchaudio.save("output.wav", audio.cpu(), 16000)
Usage 2: Pre-formatted Prompt String
Pass a complete tagged string via the prompt parameter. The string must start with <|caption|>.
import torchaudio
from transformers import AutoModel
model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()
prompt = model.compose_prompt(
    prompt="<|caption|> A conversation scene on a busy city street. <|speech|> A young woman speaking softly in Spanish. <|asr|> Creo que deberíamos irnos ya. <|env|> Rain and distant traffic noise."
)
audio = model.generate(prompt)
torchaudio.save("output.wav", audio.cpu(), 16000)
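When working with pre-formatted strings, it can help to inspect which aspects a prompt contains before passing it to the model. The sketch below splits a tagged string back into its fields with a regular expression; `parse_prompt` is a hypothetical helper, not part of the model's API, and it assumes tags always take the `<|name|>` form shown in this card.

```python
import re

# Matches tags such as <|caption|> and captures the tag name.
TAG_PATTERN = re.compile(r"<\|(\w+)\|>")


def parse_prompt(prompt: str) -> dict:
    """Hypothetical helper: split a tagged prompt into {tag: text}."""
    if not prompt.startswith("<|caption|>"):
        raise ValueError("prompt must start with <|caption|>")
    # re.split with one capture group yields ['', tag1, text1, tag2, text2, ...]
    pieces = TAG_PATTERN.split(prompt)
    tags, texts = pieces[1::2], pieces[2::2]
    return {tag: text.strip() for tag, text in zip(tags, texts)}


fields = parse_prompt(
    "<|caption|> A conversation scene on a busy city street. "
    "<|speech|> A young woman speaking softly in Spanish. "
    "<|asr|> Creo que deberíamos irnos ya. "
    "<|env|> Rain and distant traffic noise."
)
for tag, text in fields.items():
    print(f"{tag}: {text}")
```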
Batch Inference
import torchaudio
from transformers import AutoModel
model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()
prompts = [
    model.compose_prompt(caption="A cat meowing softly.", sfx="Soft cat meow."),
    model.compose_prompt(caption="Thunder rolling in the distance.", env="Stormy night ambience."),
    model.compose_prompt(caption="A piano playing a gentle melody.", music="Soft piano ballad."),
]
audios = model.generate(prompts)
# Save each generated clip to its own file
for i, audio in enumerate(audios):
    torchaudio.save(f"output_{i}.wav", audio.unsqueeze(0).cpu(), 16000)
Generation Parameters
import torchaudio
from transformers import AutoModel
model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()
prompt = model.compose_prompt(caption="A dog barking in a park")
audio = model.generate(
    prompts=prompt,
    num_steps=25,             # number of denoising steps (default: 25)
    guidance_scale=5.0,       # classifier-free guidance scale (default: 5.0)
    sway_sampling_coef=-1.0,  # sway sampling coefficient (default: -1.0, 0 for linear)
)
torchaudio.save("output.wav", audio.cpu(), 16000)
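This card does not define how `sway_sampling_coef` reshapes the sampling schedule. One plausible reading, stated here as an assumption rather than the model's confirmed behavior, is the sway sampling formula used in F5-TTS, which remaps a uniform grid u in [0, 1] to t = u + s * (cos(pi/2 * u) - 1 + u). Under that assumption, s = 0 leaves the grid linear (consistent with the comment above) and s = -1 concentrates steps near t = 0. The helper `sway_schedule` is hypothetical.

```python
import math


def sway_schedule(num_steps: int, coef: float) -> list:
    """Time steps in [0, 1] after sway remapping (assumed F5-TTS-style formula)."""
    steps = []
    for i in range(num_steps + 1):
        u = i / num_steps  # uniform grid point
        steps.append(u + coef * (math.cos(math.pi / 2 * u) - 1 + u))
    return steps


linear = sway_schedule(4, 0.0)   # coef 0 keeps the uniform grid
swayed = sway_schedule(4, -1.0)  # coef -1 gives t = 1 - cos(pi/2 * u)
print([round(t, 3) for t in linear])
print([round(t, 3) for t in swayed])
```

If the model does follow this formula, a negative coefficient spends more of the `num_steps` budget on the early part of the trajectory while keeping the same endpoints.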
Acknowledgments
Dasheng-AudioGen was developed with contributions from XIAOMI LLM PLUS and SJTU X-LANCE.
Citation
@article{dasheng-audiogen,
title={Dasheng-AudioGen},
author={},
journal={arXiv preprint arXiv:2505.XXXXX},
year={2025}
}
License
This project is released under the Apache License 2.0.