Audio-Text-to-Text
Transformers
Safetensors
English
Chinese
qwen2
text-generation
speech-language-model
streaming
audio
multimodal
qwen2.5-omni
text-generation-inference
Update model card with official links, citation, and paper information
#1
by nielsr HF Staff - opened
README.md
CHANGED
@@ -1,12 +1,12 @@
 ---
 language:
 - en
 - zh
-license: apache-2.0
 library_name: transformers
 pipeline_tag: audio-text-to-text
-datasets:
-- zhifeixie/StreamAudio-2M
 tags:
 - speech-language-model
 - streaming
@@ -14,13 +14,14 @@ tags:
 - multimodal
 - qwen2.5-omni
 ---
 # Audio-Interaction: Streaming Audio-In, Text-Out Conversational Model

-[**Code**](https://github.com/xzf-thu/Audio-Interaction) | [**Model**](https://huggingface.co/zhifeixie/Audio-Interaction) | [**Dataset**](https://huggingface.co/datasets/zhifeixie/

-Audio-Interaction is a streaming

-

 ## Model Details

@@ -54,14 +55,14 @@ Audio-Interaction/

 ## Intended Use

-Audio-Interaction is intended for streaming conversational agents that need to react to audio as it arrives — for example, voice assistants that may interject mid-utterance, alarms that respond to ambient sound, or low-latency dialogue systems where waiting for a full utterance before replying is too slow.

 ## Quick Start

 ### Installation

 ```bash
-git clone https://github.com/xzf-thu/Audio-Interaction.git
 cd Audio-Interaction
 conda create -n Audio-Interaction python=3.10 -y
 conda activate Audio-Interaction
@@ -78,7 +79,7 @@ from huggingface_hub import snapshot_download
 snapshot_download(repo_id="zhifeixie/Audio-Interaction", local_dir="checkpoints")
 ```

-`snapshot_download` is the recommended path — it pulls every file

 ### Python Usage

@@ -92,12 +93,6 @@ run_inference(
 )
 ```

-For interactive use, omit `audio_paths` and `run_inference` will prompt for an audio path each round:
-
-```python
-run_inference(checkpoint_dir="checkpoints", rounds=5, device="cuda:0")
-```
-
 ## Streaming Protocol

 A single session looks like:
@@ -117,43 +112,24 @@ A single session looks like:

 The model is trained to emit at most one `TEXT_BEGIN` per audio chunk. Each assistant turn begins with `TEXT_BEGIN`, followed by an emotion token, the reply tokens, and `TEXT_END`. Turns starting with `KEEP_SILENCE` indicate the model chose not to respond to that chunk.

-## Training Summary
-
-<!-- TODO: fill in once details are public.
-Suggested fields:
-- Pretraining base
-- SFT / instruction-tuning data
-- Streaming-objective data construction (how KEEP_SILENCE / TEXT_BEGIN supervision was generated)
-- Total tokens / hours of audio
-- Hardware and duration
--->
-
-## Evaluation
-
-<!-- TODO: fill in once benchmarks are decided.
-Candidate metrics:
-- Spoken-QA accuracy on held-out audio prompts
-- False-trigger rate on ambient / non-speech audio (lower is better)
-- Response-onset latency in encoder chunks from end of question
-- Text quality of replies (e.g. GPT-judge or human preference)
--->
-
 ## Limitations

 - The model produces text, not speech. Pair it with a TTS system for end-to-end voice interaction.
-- Audio must be 16 kHz mono; non-conforming inputs are resampled
 - Decisions are made at 0.4-second granularity (one encoder chunk), which sets a floor on response-onset latency.
 - Trailing partial audio chunks shorter than 10 encoder frames are dropped before generation.

 ## Citation

-<!-- TODO: replace with the real arxiv id and year once published. -->
 ```bibtex
-@misc{
-
-
-
-
 }
 ```
@@ -1,12 +1,12 @@
 ---
+datasets:
+- zhifeixie/StreamAudio-2M
 language:
 - en
 - zh
 library_name: transformers
+license: apache-2.0
 pipeline_tag: audio-text-to-text
 tags:
 - speech-language-model
 - streaming
@@ -14,13 +14,14 @@ tags:
 - multimodal
 - qwen2.5-omni
 ---
+
 # Audio-Interaction: Streaming Audio-In, Text-Out Conversational Model

+[**Project Page**](https://xzf-thu.github.io/Audio-Interaction/) | [**Code**](https://github.com/xzf-thu/Audio-Interaction) | [**Model**](https://huggingface.co/zhifeixie/Audio-Interaction) | [**Dataset**](https://huggingface.co/datasets/zhifeixie/StreamAudio-2M) | [**Paper**](https://huggingface.co/papers/2606.05121)

+Audio-Interaction is a unified streaming model that listens to audio in real time and decides, at each audio chunk, whether to keep listening or to start replying with text. It formalizes the "perceive-decide-respond" loop, allowing the model to handle conventional offline tasks (ASR, S2TT) while adding online capabilities like proactive intervention and real-time voice chatting.

+The model alternates between a **LISTENING** state, where it consumes one encoder-output chunk per step and emits either `KEEP_SILENCE` or `TEXT_BEGIN`, and a **SPEAKING** state, where it autoregressively generates a text turn until `TEXT_END` and then returns to listening for the next chunk.
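
The LISTENING/SPEAKING alternation described above can be pictured as a small state machine. The sketch below is only an illustration under stated assumptions, not the repository's implementation: `decide` and `generate_turn` are hypothetical stand-ins for the model's per-chunk decision and its text decoder, and the string chunks are placeholders for encoder-output chunks.

```python
from enum import Enum
from typing import Callable, Iterable, List

KEEP_SILENCE = "KEEP_SILENCE"
TEXT_BEGIN = "TEXT_BEGIN"
TEXT_END = "TEXT_END"


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


def run_session(
    chunks: Iterable[str],
    decide: Callable[[str], str],               # hypothetical: chunk -> decision token
    generate_turn: Callable[[str], List[str]],  # hypothetical: chunk -> reply tokens
) -> List[str]:
    """Return the flat token sequence emitted over one toy session."""
    emitted: List[str] = []
    state = State.LISTENING
    for chunk in chunks:
        # LISTENING: consume one chunk and emit exactly one decision token.
        decision = decide(chunk)
        emitted.append(decision)
        if decision == TEXT_BEGIN:
            # SPEAKING: emit the text turn, close it with TEXT_END,
            # then return to LISTENING for the next chunk.
            state = State.SPEAKING
            emitted.extend(generate_turn(chunk))
            emitted.append(TEXT_END)
            state = State.LISTENING
    return emitted


def toy_decide(chunk: str) -> str:
    # Toy rule (an assumption for the demo): reply only to chunks ending in "?".
    return TEXT_BEGIN if chunk.endswith("?") else KEEP_SILENCE


def toy_generate(chunk: str) -> List[str]:
    return ["ok"]


tokens = run_session(["hum", "what time is it?", "hum"], toy_decide, toy_generate)
print(tokens)
```

In this toy run only the middle chunk triggers a reply, so the session emits one decision token per chunk and a single text turn in between.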

 ## Model Details

@@ -54,14 +55,14 @@ Audio-Interaction/

 ## Intended Use

+Audio-Interaction is intended for streaming conversational agents that need to react to audio as it arrives — for example, voice assistants that may interject mid-utterance, alarms that respond to ambient sound, or low-latency dialogue systems where waiting for a full utterance before replying is too slow.

 ## Quick Start

 ### Installation

 ```bash
+git clone https://github.com/xzf-thu/Audio-Interaction.git
 cd Audio-Interaction
 conda create -n Audio-Interaction python=3.10 -y
 conda activate Audio-Interaction
@@ -78,7 +79,7 @@ from huggingface_hub import snapshot_download
 snapshot_download(repo_id="zhifeixie/Audio-Interaction", local_dir="checkpoints")
 ```

+`snapshot_download` is the recommended path — it pulls every file and resumes on interruption.

 ### Python Usage

@@ -92,12 +93,6 @@ run_inference(
 )
 ```

 ## Streaming Protocol

 A single session looks like:
@@ -117,43 +112,24 @@ A single session looks like:

 The model is trained to emit at most one `TEXT_BEGIN` per audio chunk. Each assistant turn begins with `TEXT_BEGIN`, followed by an emotion token, the reply tokens, and `TEXT_END`. Turns starting with `KEEP_SILENCE` indicate the model chose not to respond to that chunk.
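
As a rough illustration of the turn layout described above, the sketch below splits a flat token sequence into per-chunk turns. It is a toy under stated assumptions: the token spellings, the `<happy>` emotion token, and the `parse_turns` helper are hypothetical, and the real tokenizer's special tokens may look different.

```python
from typing import List, Optional, Tuple

# (emotion token, reply tokens); (None, []) marks a chunk the model stayed silent on.
Turn = Tuple[Optional[str], List[str]]


def parse_turns(tokens: List[str]) -> List[Turn]:
    """Group a flat token sequence into one entry per audio chunk."""
    turns: List[Turn] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "KEEP_SILENCE":
            turns.append((None, []))
            i += 1
        elif tok == "TEXT_BEGIN":
            # list.index raises ValueError if the turn is never closed.
            end = tokens.index("TEXT_END", i)
            body = tokens[i + 1 : end]
            if not body:
                raise ValueError("turn has no emotion token")
            # First token after TEXT_BEGIN is the emotion token, the rest is the reply.
            turns.append((body[0], body[1:]))
            i = end + 1
        else:
            raise ValueError(f"unexpected token outside a turn: {tok!r}")
    return turns


stream = ["KEEP_SILENCE", "TEXT_BEGIN", "<happy>", "Hi", "there", "TEXT_END", "KEEP_SILENCE"]
turns = parse_turns(stream)
print(turns)
```

Representing silence as `(None, [])` keeps one entry per chunk, so the position of a turn in the list maps back to the chunk that triggered it.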

 ## Limitations

 - The model produces text, not speech. Pair it with a TTS system for end-to-end voice interaction.
+- Audio must be 16 kHz mono; non-conforming inputs are resampled and padded to 0.4-second boundaries.
 - Decisions are made at 0.4-second granularity (one encoder chunk), which sets a floor on response-onset latency.
 - Trailing partial audio chunks shorter than 10 encoder frames are dropped before generation.
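
The chunk arithmetic behind these limits can be worked through in a few lines. This sketch relies only on the figures stated above (16 kHz mono input, 0.4-second chunks). It assumes zero-padding at the end of the signal, and `pad_to_chunk_boundary` is a hypothetical helper; the repository's actual preprocessing, including resampling and its handling of short trailing chunks, may differ.

```python
SAMPLE_RATE_HZ = 16_000
# 0.4 s at 16 kHz, computed with integer arithmetic: 16000 * 4 / 10 = 6400 samples.
CHUNK_SAMPLES = SAMPLE_RATE_HZ * 4 // 10


def pad_to_chunk_boundary(samples: list) -> list:
    """Zero-pad a mono sample list up to the next multiple of CHUNK_SAMPLES."""
    remainder = len(samples) % CHUNK_SAMPLES
    if remainder == 0:
        return list(samples)
    return list(samples) + [0.0] * (CHUNK_SAMPLES - remainder)


one_second = [0.0] * SAMPLE_RATE_HZ          # 16000 samples of silence
padded = pad_to_chunk_boundary(one_second)   # 16000 -> 19200 samples
num_chunks = len(padded) // CHUNK_SAMPLES    # 3 decision points
print(len(padded), num_chunks)
```

Under these assumptions, one second of audio pads out to 1.2 seconds and yields three chunk-level decisions, which is why the earliest possible response onset falls on a 0.4-second grid.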

 ## Citation

 ```bibtex
+@misc{xie2026audiointeractionmodel,
+title={Audio Interaction Model},
+author={Zhifei Xie and Zihang Liu and Ze An and Xiaobin Hu and Yue Liao and Ziyang Ma and Dongchao Yang and Mingbao Lin and Deheng Ye and Shuicheng Yan and Chunyan Miao},
+year={2026},
+eprint={2606.05121},
+archivePrefix={arXiv},
+primaryClass={cs.SD},
+url={https://arxiv.org/abs/2606.05121},
 }
 ```