---
license: apache-2.0
language:
  - en
  - es
  - pt
  - ru
  - fr
  - ja
  - ko
  - de
  - multilingual
tags:
  - audio-generation
  - text-to-audio
  - text-to-speech
  - text-to-music
  - sound-effects
  - diffusion
  - multilingual
library_name: transformers
pipeline_tag: text-to-audio
---

# Dasheng-AudioGen-Multilingual

[![arXiv](https://img.shields.io/badge/arXiv-Paper-b31b1b?logo=arxiv)](https://arxiv.org/abs/2605.27838)
[![GitHub](https://img.shields.io/badge/GitHub-Code-181717?logo=github)](https://github.com/xiaomi-research/dasheng-audiogen) 
[![Hugging Face Model](https://img.shields.io/badge/HuggingFace-Model-orange?logo=huggingface)](https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual)
[![Hugging Face Demo](https://img.shields.io/badge/HuggingFace-Demo-orange?logo=huggingface)](https://huggingface.co/spaces/mispeech/Dasheng-AudioGen)
[![Web Demo](https://img.shields.io/badge/Website-Demo-181717?logo=google-chrome)](https://nieeim.github.io/Dasheng-AudioGen-Web/)

[**English**](./README.md) | [**中文**](./README_zh.md)

**Dasheng-AudioGen-Multilingual** is the multilingual variant of Dasheng-AudioGen, a unified audio generation model that can jointly synthesize **intelligible speech, music, sound effects, and environmental acoustics** from text descriptions.

<p align="center">
  <video
    src="https://github.com/user-attachments/assets/497f5688-8731-4830-8ee7-b9cf4234d900"
    controls
    autoplay
    muted
    loop
    playsinline
    width="85%">
  </video>
</p>

## Models

| Model | HuggingFace | Text Encoder | Language |
|-------|-------------|-------------|:--------:|
| Dasheng-AudioGen | [mispeech/Dasheng-AudioGen](https://huggingface.co/mispeech/Dasheng-AudioGen) | `google/flan-t5-large` | English |
| Dasheng-AudioGen-Multilingual | [mispeech/Dasheng-AudioGen-Multilingual](https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual) | `google/mt5-large` | Multilingual |

### Language Support

| Language | Duration (h) | Proportion |
|----------|------------:|----------:|
| English | 15,367.80 | 58.86% |
| Spanish | 2,740.96 | 10.50% |
| Portuguese | 1,916.24 | 7.34% |
| Russian | 1,217.39 | 4.66% |
| French | 933.91 | 3.58% |
| Japanese | 874.51 | 3.35% |
| Korean | 848.15 | 3.25% |
| German | 842.29 | 3.23% |
| Other | 1,369.16 | 5.24% |

> **Note:** The current multilingual model has notably higher synthesis error rates for non-English languages than for English, and languages not listed in the table above are even less reliable. For English-only use cases, the base model (`mispeech/Dasheng-AudioGen`) is recommended.
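
The Proportion column is consistent with each language's duration divided by the total across all rows. A minimal sketch that recomputes it from the figures in the table (the variable names are illustrative and not part of the model's API):

```python
# Durations in hours, copied from the table above.
durations_h = {
    "English": 15367.80,
    "Spanish": 2740.96,
    "Portuguese": 1916.24,
    "Russian": 1217.39,
    "French": 933.91,
    "Japanese": 874.51,
    "Korean": 848.15,
    "German": 842.29,
    "Other": 1369.16,
}

total_h = sum(durations_h.values())

# Share of each language as a percentage of the total, rounded to two decimals.
proportions = {
    lang: round(100 * hours / total_h, 2) for lang, hours in durations_h.items()
}

for lang, pct in proportions.items():
    print(f"{lang:<12}{durations_h[lang]:>12,.2f} h {pct:>7.2f}%")
```

English accounts for more than half of the total, which fits the reliability gap described in the note.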

## Installation

```bash
pip install torch torchaudio "transformers<5" einops
```

> Tested with Python 3.10, torch 2.8.0+cu128, transformers 4.57. Not compatible with transformers 5.x.
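
To catch an incompatible environment before loading the model, the installed `transformers` version can be checked up front. The sketch below is illustrative only: `is_supported_transformers` and `check_environment` are hypothetical helpers, not part of this repository, and they encode nothing beyond the `<5` constraint stated above.

```python
from importlib import metadata


def is_supported_transformers(version: str) -> bool:
    """Return True if the version string has a major version below 5."""
    major = int(version.split(".")[0])
    return major < 5


def check_environment() -> None:
    """Raise if the installed transformers release is outside the supported range."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError as exc:
        raise RuntimeError("transformers is not installed") from exc
    if not is_supported_transformers(installed):
        raise RuntimeError(
            f"transformers {installed} found; install a 4.x release instead"
        )
```

Calling `check_environment()` at the top of a script fails early with a readable message, before the remote model code is imported.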

## Prompt Format

Dasheng-AudioGen uses structured tags to describe different audio aspects. A valid prompt **must start with the `<|caption|>` tag**, which provides the overall scene description. Other tags are optional and can be included as needed.

| Tag | Description | Required |
|-----|-------------|:--------:|
| `<\|caption\|>` | Overall audio scene description | Yes |
| `<\|speech\|>` | Speaker identity and speaking style | No |
| `<\|asr\|>` | Spoken transcript / dialogue | No |
| `<\|sfx\|>` | Sound effects | No |
| `<\|music\|>` | Background music | No |
| `<\|env\|>` | Environmental ambience | No |

**Rules:**
- The prompt must begin with `<|caption|>`; prompts without it are rejected.
- Only include tags that are relevant; omit tags with no content (e.g., skip `<|music|>` if there is no music).

> **Multilingual prompt convention:** All descriptive tags (`caption`, `speech`, `sfx`, `music`, `env`) should be written in **English**. Only the `<|asr|>` field (the actual spoken content to be synthesized) should use the target language.
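
To make the tag layout concrete, here is a plain-Python sketch of how such a prompt string could be assembled and checked. `build_prompt` and `TAG_ORDER` are hypothetical names used for illustration. The tag order follows the table above, which is an assumption, and the model's own `compose_prompt` may differ in details such as spacing or validation.

```python
# Tag order follows the table above; this ordering is an assumption.
TAG_ORDER = ("caption", "speech", "asr", "sfx", "music", "env")


def build_prompt(caption: str, **aspects: str) -> str:
    """Assemble a tagged prompt string, skipping aspects with no content."""
    if not caption or not caption.strip():
        raise ValueError("caption is required")
    unknown = set(aspects) - set(TAG_ORDER)
    if unknown:
        raise ValueError(f"unknown aspects: {sorted(unknown)}")
    fields = {"caption": caption, **aspects}
    parts = [
        f"<|{tag}|> {fields[tag].strip()}"
        for tag in TAG_ORDER
        if fields.get(tag) and fields[tag].strip()
    ]
    return " ".join(parts)


# Descriptive tags in English, transcript in the target language (French here).
prompt = build_prompt(
    caption="A man greets a friend in a quiet cafe.",
    speech="A middle-aged man speaking warmly in French.",
    asr="Bonjour, comment allez-vous ?",
    env="Soft cafe chatter.",
)
print(prompt)
```

Tags with no content (here `sfx` and `music`) are left out of the string entirely.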

## Quick Start

### Usage 1: Aspect-wise Composition

Pass each aspect as a named argument. The `caption` field is required; all other fields are optional.

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

prompt = model.compose_prompt(
    caption="A conversation scene on a busy city street.",
    speech="A young woman speaking softly in Spanish.",
    env="Rain and distant traffic noise.",
    asr="Creo que deberíamos irnos ya.",
)
audio = model.generate(prompt)
torchaudio.save("output.wav", audio.cpu(), 16000)
```

### Usage 2: Pre-formatted Prompt String

Pass a complete tagged string via the `prompt` parameter of `compose_prompt`. The string must start with `<|caption|>`.

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

prompt = model.compose_prompt(
    prompt="<|caption|> A conversation scene on a busy city street. <|speech|> A young woman speaking softly in Spanish. <|asr|> Creo que deberíamos irnos ya. <|env|> Rain and distant traffic noise."
)
audio = model.generate(prompt)
torchaudio.save("output.wav", audio.cpu(), 16000)
```

### Batch Inference

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

prompts = [
    model.compose_prompt(caption="A cat meowing softly.", sfx="Soft cat meow."),
    model.compose_prompt(caption="Thunder rolling in the distance.", env="Stormy night ambience."),
    model.compose_prompt(caption="A piano playing a gentle melody.", music="Soft piano ballad."),
]
audios = model.generate(prompts)

for i, audio in enumerate(audios):
    torchaudio.save(f"output_{i}.wav", audio.unsqueeze(0).cpu(), 16000)
```

### Generation Parameters

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

prompt = model.compose_prompt(caption="A dog barking in a park")
audio = model.generate(
    prompts=prompt,
    num_steps=25,              # number of denoising steps (default: 25)
    guidance_scale=5.0,        # classifier-free guidance scale (default: 5.0)
    sway_sampling_coef=-1.0,   # sway sampling coefficient (default: -1.0; 0 gives a linear schedule)
)
torchaudio.save("output.wav", audio.cpu(), 16000)
```
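
The effect of `sway_sampling_coef` can be illustrated with a small standalone sketch. It assumes the parameter follows the sway sampling definition used in F5-TTS, `t = u + s * (cos(pi/2 * u) - 1 + u)`. That formula is not stated in this README, so treat it as an assumption about the model's internals. `sway_timesteps` is a hypothetical helper, and the grid of `num_steps + 1` points is likewise illustrative.

```python
import math


def sway_timesteps(num_steps: int, coef: float) -> list[float]:
    """Map a uniform grid u in [0, 1] through t = u + coef * (cos(pi/2 * u) - 1 + u)."""
    uniform = [i / num_steps for i in range(num_steps + 1)]
    return [u + coef * (math.cos(math.pi / 2 * u) - 1 + u) for u in uniform]


linear = sway_timesteps(5, 0.0)    # coef = 0 leaves the uniform grid unchanged
swayed = sway_timesteps(5, -1.0)   # coef = -1 reduces to t = 1 - cos(pi/2 * u)

for u, t in zip(linear, swayed):
    print(f"u={u:.1f} -> t={t:.3f}")
```

Under this assumption, a negative coefficient clusters time steps near the start of the trajectory, while `0` reproduces evenly spaced steps.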

## Acknowledgments

Dasheng-AudioGen was developed with contributions from **XIAOMI LLM PLUS** and **SJTU X-LANCE**.

## Citation

```bibtex
@article{mei2026dashengaudiogen,
  title   = {Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text},
  author  = {Jiahao Mei and Heinrich Dinkel and Yadong Niu and Xingwei Sun and Gang Li and Yifan Liao and Jiahao Zhou and Junbo Zhang and Jian Luan and Mengyue Wu},
  journal = {arXiv preprint arXiv:2605.27838},
  year    = {2026}
}
```

## License

This project is released under the [Apache License 2.0](LICENSE).