arxiv:2504.07402

LauraTSE: Target Speaker Extraction using Auto-Regressive Decoder-Only Language Models

Published on Apr 10, 2025

Authors:

Abstract

A two-stage language model approach combines an auto-regressive decoder-only model with an encoder-only model to extract target speech from mixed audio, achieving improved performance through coarse-to-fine prediction refinement.

AI-generated summary

We propose LauraTSE, an Auto-Regressive Decoder-Only Language Model for Target Speaker Extraction built upon the LauraGPT backbone. LauraTSE employs a small-scale auto-regressive decoder-only language model that generates the initial layers of the target speech's discrete codec representations from the continuous embeddings of both the mixture and reference speech. These outputs serve as coarse-grained predictions. To refine them, a one-step encoder-only language model reconstructs the full codec representation by integrating information from both the mixture and the reference speech, adding fine-grained details. Experimental results show that our approach can achieve promising performance. Additionally, we conduct ablation studies to investigate the data scalability and the contribution of the encoder-only model.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2504.07402 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2504.07402 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2504.07402 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.