license: apache-2.0
language:
  - en
  - es
  - pt
  - ru
  - fr
  - ja
  - ko
  - de
  - multilingual
tags:
  - audio-generation
  - text-to-audio
  - text-to-speech
  - text-to-music
  - sound-effects
  - diffusion
  - multilingual
library_name: transformers
pipeline_tag: text-to-audio

Dasheng-AudioGen-Multilingual

English | 中文

Dasheng-AudioGen-Multilingual is the multilingual variant of Dasheng-AudioGen, a unified audio generation model that can jointly synthesize intelligible speech, music, sound effects, and environmental acoustics from text descriptions.

Models

Model	HuggingFace	Text Encoder	Language
Dasheng-AudioGen	mispeech/Dasheng-AudioGen	`google/flan-t5-large`	English
Dasheng-AudioGen-Multilingual	mispeech/Dasheng-AudioGen-Multilingual	`google/mt5-large`	Multilingual

Language Support

Language	Duration (h)	Proportion
English	15,367.80	58.86%
Spanish	2,740.96	10.50%
Portuguese	1,916.24	7.34%
Russian	1,217.39	4.66%
French	933.91	3.58%
Japanese	874.51	3.35%
Korean	848.15	3.25%
German	842.29	3.23%
Other	1,369.16	5.24%

Note: The current multilingual model has notably higher synthesis error rates for non-English languages than for English. Languages not listed individually in the table above are even less reliable. For English-only use cases, the base model (mispeech/Dasheng-AudioGen) is recommended.
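The recommendation in the note can be captured in a small helper. This is a minimal sketch: `choose_model_id` and `LISTED_LANGUAGES` are hypothetical names introduced here for illustration and are not part of the model's API. The language codes mirror the languages listed individually in the table above.

```python
# Hypothetical helper (not part of the model's API): pick a checkpoint by language.
BASE_MODEL = "mispeech/Dasheng-AudioGen"
MULTILINGUAL_MODEL = "mispeech/Dasheng-AudioGen-Multilingual"

# Languages with their own row in the table above, as ISO 639-1 codes.
LISTED_LANGUAGES = {"en", "es", "pt", "ru", "fr", "ja", "ko", "de"}


def choose_model_id(language: str) -> tuple[str, bool]:
    """Return (model_id, is_listed) for an ISO 639-1 language code.

    English maps to the base model, as the note recommends. Every other
    language maps to the multilingual model; is_listed is False for
    languages that fall under "Other" and are therefore less reliable.
    """
    code = language.strip().lower()
    if code == "en":
        return BASE_MODEL, True
    return MULTILINGUAL_MODEL, code in LISTED_LANGUAGES


print(choose_model_id("en"))
print(choose_model_id("es"))
print(choose_model_id("it"))
```

A caller could use the second return value to log a warning before synthesizing speech in a language outside the table.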

Installation

pip install torch torchaudio "transformers<5" einops

Tested with Python 3.10, torch 2.8.0+cu128, transformers 4.57. Not compatible with transformers 5.x.
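Because transformers 5.x is not supported, it can help to fail early with a clear message. The following is a minimal sketch using only the standard library; `is_supported_transformers` and `check_installed_transformers` are hypothetical helper names, and the check looks only at the major version number.

```python
from importlib.metadata import PackageNotFoundError, version


def is_supported_transformers(version_string: str) -> bool:
    """Return True if the major version in version_string is below 5."""
    major = version_string.strip().split(".")[0]
    return major.isdigit() and int(major) < 5


def check_installed_transformers() -> None:
    """Raise RuntimeError if transformers is missing or is version 5 or later."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        raise RuntimeError("transformers is not installed") from None
    if not is_supported_transformers(installed):
        raise RuntimeError(
            f"transformers {installed} found; this model needs a version below 5"
        )


# Call check_installed_transformers() once at startup, before loading the model.
```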

Prompt Format

Dasheng-AudioGen uses structured tags to describe different audio aspects. A valid prompt must start with the <|caption|> tag, which provides the overall scene description. Other tags are optional and can be included as needed.

Tag	Description	Required
`<|caption|>`	Overall audio scene description	Yes
`<|speech|>`	Speaker identity and speaking style	No
`<|asr|>`	Spoken transcript / dialogue	No
`<|sfx|>`	Sound effects	No
`<|music|>`	Background music	No
`<|env|>`	Environmental ambience	No

Rules:

The prompt must begin with <|caption|>; prompts without it will be rejected.
Only include tags that are relevant; omit tags with no content (e.g., skip <|music|> if there is no music).
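These rules can be checked before a prompt string reaches the model. The sketch below is a hypothetical standalone parser, not the model's own validation logic: `parse_prompt` is a name introduced here, and treating repeated or empty tags as errors is an assumption of this sketch rather than documented behavior.

```python
import re

# Tag names from the table above.
KNOWN_TAGS = ("caption", "speech", "asr", "sfx", "music", "env")
TAG_PATTERN = re.compile(r"<\|([a-z]+)\|>")


def parse_prompt(prompt: str) -> dict[str, str]:
    """Split a tagged prompt into {tag: text} and enforce the rules above."""
    text = prompt.strip()
    if not text.startswith("<|caption|>"):
        raise ValueError("prompt must begin with <|caption|>")
    # With one capture group, re.split yields ['', tag1, body1, tag2, body2, ...].
    parts = TAG_PATTERN.split(text)
    fields: dict[str, str] = {}
    for tag, body in zip(parts[1::2], parts[2::2]):
        if tag not in KNOWN_TAGS:
            raise ValueError(f"unknown tag: <|{tag}|>")
        if tag in fields:
            raise ValueError(f"repeated tag: <|{tag}|>")  # assumption of this sketch
        body = body.strip()
        if not body:
            raise ValueError(f"tag with no content: <|{tag}|>")  # should be omitted
        fields[tag] = body
    return fields


example = (
    "<|caption|> A conversation scene on a busy city street. "
    "<|asr|> Creo que deberíamos irnos ya."
)
print(parse_prompt(example))
```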

Multilingual prompt convention: All descriptive tags (caption, speech, sfx, music, env) should be written in English. Only the <|asr|> field (the actual spoken content to be synthesized) should use the target language.
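One rough way to catch violations of this convention is to flag descriptive fields that contain non-ASCII letters. This is only a heuristic sketch with a hypothetical helper name (`non_english_fields`): it misses non-English text written in plain ASCII and flags English text with accented loanwords, so it is a lint aid rather than real language detection.

```python
# Fields that, per the convention above, should be written in English.
DESCRIPTIVE_FIELDS = ("caption", "speech", "sfx", "music", "env")


def non_english_fields(fields: dict[str, str]) -> list[str]:
    """Return the descriptive fields that contain non-ASCII letters.

    The asr field is never inspected, since it is expected to be in the
    target language.
    """
    flagged = []
    for name in DESCRIPTIVE_FIELDS:
        text = fields.get(name) or ""
        if any(ch.isalpha() and not ch.isascii() for ch in text):
            flagged.append(name)
    return flagged


print(non_english_fields({
    "caption": "Una conversación en la calle.",
    "asr": "Creo que deberíamos irnos ya.",
}))
```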

Quick Start

Usage 1: Aspect-wise Composition

Pass each aspect as a named argument to compose_prompt. The caption field is required; all other fields are optional.

import torchaudio
from transformers import AutoModel

# trust_remote_code=True loads the custom model code shipped in the repository.
model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

# Descriptive fields are written in English; only asr uses the target language.
prompt = model.compose_prompt(
    caption="A conversation scene on a busy city street.",
    speech="A young woman speaking softly in Spanish.",
    env="Rain and distant traffic noise.",
    asr="Creo que deberíamos irnos ya.",
)
audio = model.generate(prompt)
# Write the result as a 16 kHz WAV file.
torchaudio.save("output.wav", audio.cpu(), 16000)

Usage 2: Pre-formatted Prompt String

Pass a complete tagged string via the prompt parameter of compose_prompt. The string must start with <|caption|>.

import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

prompt = model.compose_prompt(
    prompt="<|caption|> A conversation scene on a busy city street. <|speech|> A young woman speaking softly in Spanish. <|asr|> Creo que deberíamos irnos ya. <|env|> Rain and distant traffic noise."
)
audio = model.generate(prompt)
torchaudio.save("output.wav", audio.cpu(), 16000)

Batch Inference

import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

prompts = [
    model.compose_prompt(caption="A cat meowing softly.", sfx="Soft cat meow."),
    model.compose_prompt(caption="Thunder rolling in the distance.", env="Stormy night ambience."),
    model.compose_prompt(caption="A piano playing a gentle melody.", music="Soft piano ballad."),
]
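# Passing a list of prompts generates one clip per prompt.
audios = model.generate(prompts)

for i, audio in enumerate(audios):
    # torchaudio.save expects a (channels, samples) tensor, so add a channel dimension.
    torchaudio.save(f"output_{i}.wav", audio.unsqueeze(0).cpu(), 16000)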
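For longer prompt lists it may be preferable to generate in smaller groups rather than in one call; the source does not state a maximum batch size, so the right group size is something to determine experimentally. A minimal sketch of the grouping step, with `chunked` as a hypothetical helper name:

```python
from typing import Iterator, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most batch_size items, in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Intended use with the model loaded as above (not executed here):
# for batch in chunked(prompts, 4):
#     audios = model.generate(batch)

print(list(chunked(["a", "b", "c", "d", "e"], 2)))
```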

Generation Parameters

import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

prompt = model.compose_prompt(caption="A dog barking in a park")
audio = model.generate(
    prompts=prompt,
    num_steps=25,              # number of denoising steps (default: 25)
    guidance_scale=5.0,        # classifier-free guidance scale (default: 5.0)
    sway_sampling_coef=-1.0,   # sway sampling coefficient (default: -1.0, 0 for linear)
)
torchaudio.save("output.wav", audio.cpu(), 16000)
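The comment on sway_sampling_coef indicates that a value of 0 gives a linear schedule, which suggests the coefficient reshapes the grid of denoising time steps. The sketch below illustrates one such mapping, assuming the formulation used in F5-TTS; the source does not confirm that this model uses the same formula, and `sway_timesteps` is a hypothetical name. Returning num_steps + 1 grid points is also an assumption of the sketch.

```python
import math


def sway_timesteps(num_steps: int, sway_sampling_coef: float = -1.0) -> list[float]:
    """Return num_steps + 1 time points in [0, 1] after the sway mapping.

    Assumed mapping (F5-TTS style): t = u + s * (cos(pi/2 * u) - 1 + u),
    where u is a uniform grid point and s is the coefficient.
    With s = 0 the grid stays linear.
    """
    points = []
    for i in range(num_steps + 1):
        u = i / num_steps  # uniform grid point in [0, 1]
        t = u + sway_sampling_coef * (math.cos(math.pi / 2 * u) - 1 + u)
        points.append(t)
    return points


print(sway_timesteps(4, 0.0))   # linear grid
print(sway_timesteps(4, -1.0))  # points shifted toward smaller time values
```

Under this assumed mapping, a negative coefficient keeps the endpoints at 0 and 1 but places more of the steps at small time values.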

Acknowledgments

Dasheng-AudioGen was developed with contributions from XIAOMI LLM PLUS and SJTU X-LANCE.

Citation

@article{dasheng-audiogen,
  title={Dasheng-AudioGen},
  author={},
  journal={arXiv preprint arXiv:2505.XXXXX},
  year={2025}
}

License

This project is released under the Apache License 2.0.