Update model card with official links, citation, and paper information

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +19 -43
README.md CHANGED
@@ -1,12 +1,12 @@
1
  ---
 
 
2
  language:
3
  - en
4
  - zh
5
- license: apache-2.0
6
  library_name: transformers
 
7
  pipeline_tag: audio-text-to-text
8
- datasets:
9
- - zhifeixie/StreamAudio-2M
10
  tags:
11
  - speech-language-model
12
  - streaming
@@ -14,13 +14,14 @@ tags:
14
  - multimodal
15
  - qwen2.5-omni
16
  ---
 
17
  # Audio-Interaction: Streaming Audio-In, Text-Out Conversational Model
18
 
19
- [**Code**](https://github.com/xzf-thu/Audio-Interaction) | [**Model**](https://huggingface.co/zhifeixie/Audio-Interaction) | [**Dataset**](https://huggingface.co/datasets/zhifeixie/Audio-Interaction-Data) <!-- TODO: confirm code repo URL and dataset repo id -->
20
 
21
- Audio-Interaction is a streaming speech-language model that listens to audio in real time and decides, at each audio chunk, whether to keep listening or to start replying with text. The model alternates between a **LISTENING** state, where it consumes one encoder-output chunk per step and emits either `KEEP_SILENCE` or `TEXT_BEGIN`, and a **SPEAKING** state, where it autoregressively generates a text turn until `TEXT_END` and then returns to listening for the next chunk.
22
 
23
- This design lets the model handle both spoken questions ("answer it") and ambient sounds ("decide based on the sound whether help is needed") within a single streaming session, without an external VAD or turn-taking heuristic.
24
 
25
  ## Model Details
26
 
@@ -54,14 +55,14 @@ Audio-Interaction/
54
 
55
  ## Intended Use
56
 
57
- Audio-Interaction is intended for streaming conversational agents that need to react to audio as it arrives — for example, voice assistants that may interject mid-utterance, alarms that respond to ambient sound, or low-latency dialogue systems where waiting for a full utterance before replying is too slow. The model is not a transcription system; it produces a conversational reply (or silence) rather than a verbatim transcript.
58
 
59
  ## Quick Start
60
 
61
  ### Installation
62
 
63
  ```bash
64
- git clone https://github.com/xzf-thu/Audio-Interaction.git # TODO: confirm repo URL
65
  cd Audio-Interaction
66
  conda create -n Audio-Interaction python=3.10 -y
67
  conda activate Audio-Interaction
@@ -78,7 +79,7 @@ from huggingface_hub import snapshot_download
78
  snapshot_download(repo_id="zhifeixie/Audio-Interaction", local_dir="checkpoints")
79
  ```
80
 
81
- `snapshot_download` is the recommended path — it pulls every file, resumes on interruption, and is the only way the download counter on this page advances. Please avoid `git clone` of the HF repo or the web "Download" button if you want your run reflected in the stats.
82
 
83
  ### Python Usage
84
 
@@ -92,12 +93,6 @@ run_inference(
92
  )
93
  ```
94
 
95
- For interactive use, omit `audio_paths` and `run_inference` will prompt for an audio path each round:
96
-
97
- ```python
98
- run_inference(checkpoint_dir="checkpoints", rounds=5, device="cuda:0")
99
- ```
100
-
101
  ## Streaming Protocol
102
 
103
  A single session looks like:
@@ -117,43 +112,24 @@ A single session looks like:
117
 
118
  The model is trained to emit at most one `TEXT_BEGIN` per audio chunk. Each assistant turn begins with `TEXT_BEGIN`, followed by an emotion token, the reply tokens, and `TEXT_END`. Turns starting with `KEEP_SILENCE` indicate the model chose not to respond to that chunk.
119
 
120
- ## Training Summary
121
-
122
- <!-- TODO: fill in once details are public.
123
- Suggested fields:
124
- - Pretraining base
125
- - SFT / instruction-tuning data
126
- - Streaming-objective data construction (how KEEP_SILENCE / TEXT_BEGIN supervision was generated)
127
- - Total tokens / hours of audio
128
- - Hardware and duration
129
- -->
130
-
131
- ## Evaluation
132
-
133
- <!-- TODO: fill in once benchmarks are decided.
134
- Candidate metrics:
135
- - Spoken-QA accuracy on held-out audio prompts
136
- - False-trigger rate on ambient / non-speech audio (lower is better)
137
- - Response-onset latency in encoder chunks from end of question
138
- - Text quality of replies (e.g. GPT-judge or human preference)
139
- -->
140
-
141
  ## Limitations
142
 
143
  - The model produces text, not speech. Pair it with a TTS system for end-to-end voice interaction.
144
- - Audio must be 16 kHz mono; non-conforming inputs are resampled by `whisper.load_audio` and padded to 0.4-second boundaries before encoding.
145
  - Decisions are made at 0.4-second granularity (one encoder chunk), which sets a floor on response-onset latency.
146
  - Trailing partial audio chunks shorter than 10 encoder frames are dropped before generation.
147
 
148
  ## Citation
149
 
150
- <!-- TODO: replace with the real arxiv id and year once published. -->
151
  ```bibtex
152
- @misc{xie_miniomni3,
153
- title = {Audio-Interaction: Streaming Audio-In, Text-Out Conversational Modeling},
154
- author = {Zhifei Xie and collaborators},
155
- year = {2026},
156
- note = {Preprint in preparation}
 
 
 
157
  }
158
  ```
159
 
 
1
  ---
2
+ datasets:
3
+ - zhifeixie/StreamAudio-2M
4
  language:
5
  - en
6
  - zh
 
7
  library_name: transformers
8
+ license: apache-2.0
9
  pipeline_tag: audio-text-to-text
 
 
10
  tags:
11
  - speech-language-model
12
  - streaming
 
14
  - multimodal
15
  - qwen2.5-omni
16
  ---
17
+
18
  # Audio-Interaction: Streaming Audio-In, Text-Out Conversational Model
19
 
20
+ [**Project Page**](https://xzf-thu.github.io/Audio-Interaction/) | [**Code**](https://github.com/xzf-thu/Audio-Interaction) | [**Model**](https://huggingface.co/zhifeixie/Audio-Interaction) | [**Dataset**](https://huggingface.co/datasets/zhifeixie/StreamAudio-2M) | [**Paper**](https://huggingface.co/papers/2606.05121)
21
 
22
+ Audio-Interaction is a unified streaming model that listens to audio in real time and decides, at each audio chunk, whether to keep listening or to start replying with text. It formalizes the "perceive-decide-respond" loop, allowing the model to handle conventional offline tasks (ASR, S2TT) while adding online capabilities like proactive intervention and real-time voice chatting.
23
 
24
+ The model alternates between a **LISTENING** state, where it consumes one encoder-output chunk per step and emits either `KEEP_SILENCE` or `TEXT_BEGIN`, and a **SPEAKING** state, where it autoregressively generates a text turn until `TEXT_END` and then returns to listening for the next chunk.
25
 
26
  ## Model Details
27
 
 
55
 
56
  ## Intended Use
57
 
58
+ Audio-Interaction is intended for streaming conversational agents that need to react to audio as it arrives — for example, voice assistants that may interject mid-utterance, alarms that respond to ambient sound, or low-latency dialogue systems where waiting for a full utterance before replying is too slow.
59
 
60
  ## Quick Start
61
 
62
  ### Installation
63
 
64
  ```bash
65
+ git clone https://github.com/xzf-thu/Audio-Interaction.git
66
  cd Audio-Interaction
67
  conda create -n Audio-Interaction python=3.10 -y
68
  conda activate Audio-Interaction
 
79
  snapshot_download(repo_id="zhifeixie/Audio-Interaction", local_dir="checkpoints")
80
  ```
81
 
82
+ `snapshot_download` is the recommended path — it pulls every file and resumes on interruption.
83
 
84
  ### Python Usage
85
 
 
93
  )
94
  ```
95
 
 
 
 
 
 
 
96
  ## Streaming Protocol
97
 
98
  A single session looks like:
 
112
 
113
  The model is trained to emit at most one `TEXT_BEGIN` per audio chunk. Each assistant turn begins with `TEXT_BEGIN`, followed by an emotion token, the reply tokens, and `TEXT_END`. Turns starting with `KEEP_SILENCE` indicate the model chose not to respond to that chunk.
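
The turn format described above can be illustrated with a small parser. This is a minimal sketch, not the repository's decoding code. It assumes each chunk's output is already available as a list of string tokens, uses the literal strings `KEEP_SILENCE`, `TEXT_BEGIN`, and `TEXT_END` as stand-ins for the model's special tokens, and uses `<neutral>` as a hypothetical emotion token. `Turn` and `parse_session` are illustrative names that do not appear in the model card.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence

# String stand-ins for the control tokens named in the model card
# (assumption: the real model uses special token ids, not these literals).
KEEP_SILENCE = "KEEP_SILENCE"
TEXT_BEGIN = "TEXT_BEGIN"
TEXT_END = "TEXT_END"


@dataclass
class Turn:
    """One assistant decision for one audio chunk (hypothetical helper type)."""
    chunk_index: int
    emotion: Optional[str] = None              # None when the model stayed silent
    reply: List[str] = field(default_factory=list)

    @property
    def silent(self) -> bool:
        return self.emotion is None


def parse_session(chunk_outputs: Sequence[Sequence[str]]) -> List[Turn]:
    """Turn per-chunk token lists into Turn records.

    chunk_outputs[i] holds the tokens emitted after audio chunk i: either
    [KEEP_SILENCE] or [TEXT_BEGIN, <emotion>, reply tokens..., TEXT_END].
    """
    turns: List[Turn] = []
    for i, tokens in enumerate(chunk_outputs):
        if not tokens:
            raise ValueError(f"chunk {i}: no tokens emitted")
        head = tokens[0]
        if head == KEEP_SILENCE:
            # LISTENING state: the model chose not to respond to this chunk.
            turns.append(Turn(chunk_index=i))
        elif head == TEXT_BEGIN:
            # SPEAKING state: emotion token, reply tokens, then TEXT_END.
            if list(tokens).count(TEXT_BEGIN) != 1:
                raise ValueError(f"chunk {i}: more than one TEXT_BEGIN")
            if len(tokens) < 3 or tokens[-1] != TEXT_END:
                raise ValueError(f"chunk {i}: turn is missing emotion or TEXT_END")
            turns.append(Turn(chunk_index=i, emotion=tokens[1], reply=list(tokens[2:-1])))
        else:
            raise ValueError(f"chunk {i}: unexpected first token {head!r}")
    return turns
```

Applied to a session such as `[["KEEP_SILENCE"], ["TEXT_BEGIN", "<neutral>", "Hello", "TEXT_END"]]`, the parser yields one silent turn followed by one spoken turn. It rejects a chunk that contains two `TEXT_BEGIN` tokens, which mirrors the "at most one `TEXT_BEGIN` per audio chunk" constraint.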
114
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
115
  ## Limitations
116
 
117
  - The model produces text, not speech. Pair it with a TTS system for end-to-end voice interaction.
118
+ - Audio must be 16 kHz mono; non-conforming inputs are resampled and padded to 0.4-second boundaries.
119
  - Decisions are made at 0.4-second granularity (one encoder chunk), which sets a floor on response-onset latency.
120
  - Trailing partial audio chunks shorter than 10 encoder frames are dropped before generation.
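
The chunk arithmetic implied by the limitations above can be sketched as follows. This is a rough illustration and not the repository's preprocessing code. It assumes padding is applied to the raw waveform at 16 kHz, so one 0.4-second chunk is 6,400 samples. The helper names `padded_length` and `num_chunks` are hypothetical, and the rule that drops trailing partial chunks shorter than 10 encoder frames is not modeled here.

```python
SAMPLE_RATE = 16_000                      # Hz, as stated in the model card
CHUNK_SAMPLES = SAMPLE_RATE * 4 // 10     # 0.4 s per decision step -> 6400 samples


def padded_length(num_samples: int) -> int:
    """Smallest multiple of CHUNK_SAMPLES that is >= num_samples."""
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    remainder = num_samples % CHUNK_SAMPLES
    if remainder == 0:
        return num_samples
    return num_samples + (CHUNK_SAMPLES - remainder)


def num_chunks(num_samples: int) -> int:
    """Number of 0.4-second decision steps after padding (assumed behaviour)."""
    return padded_length(num_samples) // CHUNK_SAMPLES
```

Under these assumptions, a 1-second clip (16,000 samples) pads to 19,200 samples, which gives three decision steps. The model can therefore begin a reply no earlier than the end of the chunk in which the relevant audio arrives.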
121
 
122
  ## Citation
123
 
 
124
  ```bibtex
125
+ @misc{xie2026audiointeractionmodel,
126
+ title={Audio Interaction Model},
127
+ author={Zhifei Xie and Zihang Liu and Ze An and Xiaobin Hu and Yue Liao and Ziyang Ma and Dongchao Yang and Mingbao Lin and Deheng Ye and Shuicheng Yan and Chunyan Miao},
128
+ year={2026},
129
+ eprint={2606.05121},
130
+ archivePrefix={arXiv},
131
+ primaryClass={cs.SD},
132
+ url={https://arxiv.org/abs/2606.05121},
133
  }
134
  ```
135