Tech Report

ACE-Step Transcriber

Description

ACE-Step Transcriber is the annotation model used by ACE-Step v1.5 for training data labeling. It is a powerful multilingual audio transcription model capable of transcribing both speech and singing voice with high accuracy.

Key Features

  • 🌍 50+ Languages Support - Covers major world languages and regional dialects
  • 🎤 Speech Transcription - Accurately transcribes spoken content
  • 🎵 Singing Voice Transcription - Specialized in lyrics transcription with musical structure annotations
  • 🏷️ Structure Annotation - Automatically identifies song sections (verse, chorus, bridge, etc.)

Usage

The usage is the same as Qwen2.5 Omni-7B.

Prompt Format

Use the following prompt to transcribe audio:

*Task* Transcribe this audio in detail
<audio>

Output Format

The model outputs structured content in the following format:

# Languages
<language_code>

# Lyrics
[Section Tag - Optional Instrument]

<transcribed content>
...

Example Output

# Languages
en

# Lyrics
[Intro - Acoustic Guitar]

[Verse 1]
Walking down the empty street tonight
Stars are shining oh so bright
...

[Chorus]
This is where we belong
Singing our favorite song
...

Supported Section Tags

  • [Intro], [Outro]
  • [Verse 1], [Verse 2], etc.
  • [Chorus], [Pre-Chorus], [Post-Chorus]
  • [Bridge]
  • [Guitar Interlude], [Instrumental]
  • [Spoken]

Supported Languages (50+)

The model supports transcription in over 50 languages, including but not limited to:

Region Languages
East Asia Chinese (zh), Japanese (ja), Korean (ko)
Southeast Asia Vietnamese (vi), Thai (th), Indonesian (id), Malay (ms), Filipino (tl)
South Asia Hindi (hi), Bengali (bn), Tamil (ta), Urdu (ur)
Europe English (en), German (de), French (fr), Spanish (es), Italian (it), Portuguese (pt), Russian (ru), Polish (pl), Dutch (nl), Greek (el), Turkish (tr)
Middle East Arabic (ar), Hebrew (he), Persian (fa)
Others And many more regional languages...

Use Cases

  • Music Production - Transcribe reference tracks for lyrics extraction
  • Dataset Creation - Generate high-quality labeled data for music AI models
  • Accessibility - Create subtitles and captions for audio content
  • Music Analysis - Extract structural information from songs
Downloads last month
58
Safetensors
Model size
11B params
Tensor type
F32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Paper for ACE-Step/acestep-transcriber