MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models
Abstract
Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience than traditional half-duplex models. However, existing benchmarks primarily evaluate single-round interactions, neglecting the complexities of multi-round communication. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. In addition, existing benchmarks often focus solely on conversational features, neglecting other critical aspects. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark designed for comprehensive multi-round evaluation of FD-SLMs. MTR-DuplexBench not only segments continuous full-duplex dialogues into discrete turns for turn-by-turn assessment but also covers multiple evaluation aspects, including conversational features, dialogue quality, instruction following, and safety. Experimental results reveal that current FD-SLMs struggle to maintain consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of our benchmark. Code and data are available at: https://github.com/ZhangHe0918/MTR-DuplexBench
Community
We present MTR-DuplexBench, a comprehensive benchmark for evaluating full-duplex speech language models across multi-round conversations. Our benchmark evaluates models on four critical dimensions: Conversational Features (smooth turn-taking, interruption handling, pause handling, and background handling), Instruction Following, Safety, and Dialogue Quality. We evaluate several speech models and reveal significant gaps in their ability to handle real-world conversational dynamics. The dataset and evaluation code are publicly available at https://huggingface.co/datasets/Jeff0918/MTR-DuplexBench and https://github.com/ZhangHe0918/MTR-DuplexBench.
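To illustrate the turn-by-turn evaluation idea, here is a minimal, hypothetical sketch (not the official MTR-DuplexBench code) of how a continuous full-duplex transcript might be segmented into discrete rounds. The `Segment` dataclass and `segment_rounds` helper are assumptions for illustration; a "round" here is one user utterance plus any model speech that overlaps or follows it before the next non-overlapping user turn:

```python
# Hypothetical sketch: segmenting a full-duplex dialogue into rounds
# for turn-by-turn assessment. Names and heuristics are illustrative,
# not the benchmark's actual implementation.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # "user" or "model"
    start: float   # seconds
    end: float     # seconds
    text: str

def segment_rounds(segments: list[Segment]) -> list[list[Segment]]:
    """Group time-ordered segments into rounds, starting a new round at
    each user segment that does not overlap the previous round."""
    rounds: list[list[Segment]] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if seg.speaker == "user" and (
            not rounds or seg.start >= max(s.end for s in rounds[-1])
        ):
            rounds.append([seg])    # new round at a non-overlapping user turn
        elif rounds:
            rounds[-1].append(seg)  # overlapping speech stays in the current round
    return rounds

dialogue = [
    Segment("user", 0.0, 2.0, "What's the weather today?"),
    Segment("model", 2.1, 4.0, "It's sunny and warm."),
    Segment("user", 3.5, 4.5, "And tomorrow?"),  # barge-in overlaps model speech
    Segment("model", 4.6, 6.0, "Rain is expected."),
    Segment("user", 7.0, 8.0, "Thanks!"),
    Segment("model", 8.1, 9.0, "You're welcome."),
]

rounds = segment_rounds(dialogue)
print(len(rounds))  # → 2: the barge-in stays inside round 1
```

Once dialogues are split this way, each round can be scored independently on the four dimensions above, which is what makes per-round consistency measurable at all.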
The following papers were recommended by the Semantic Scholar API:
- SoulX-Duplug: Plug-and-Play Streaming State Prediction Module for Realtime Full-Duplex Speech Conversation (2026)
- DuplexCascade: Full-Duplex Speech-to-Speech Dialogue with VAD-Free Cascaded ASR-LLM-TTS Pipeline and Micro-Turn Optimization (2026)
- $\tau$-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains (2026)
- MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models (2026)
- Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models (2026)
- Semantic-Aware Interruption Detection in Spoken Dialogue Systems: Benchmark, Metric, and Model (2026)
- CoDeTT: A Context-Aware Decision Benchmark for Turn-Taking Evaluation (2026)
arXiv: 2511.10262