
UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking

Xuangeng Chu*1  Ruicong Liu*1†  Yifei Huang1  Yun Liu2  Yichen Peng3  Bo Zheng2
1Shanda AI Research Tokyo, The University of Tokyo, 2Shanda AI Research Tokyo, 3Institute of Science Tokyo
*Equal contribution, †Corresponding author
UniLS generates diverse and natural listening and speaking motions from audio.

Installation

Clone the project

git clone --recurse-submodules git@github.com:xg-chu/UniLS.git
cd UniLS

Build environment

conda env create -f environment.yml
conda activate unils

Or install manually:

pip install torch torchvision torchaudio
pip install accelerate transformers peft einops omegaconf lmdb tqdm scipy wandb
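After installing, a quick sanity check can confirm the dependencies resolve. The snippet below is a small convenience sketch (not part of the repo) that reports any required packages missing from the current environment:

```python
# Convenience sketch (not part of the UniLS repo): report which of the
# required packages cannot be imported in the current environment.
from importlib.util import find_spec

REQUIRED = ["torch", "torchvision", "torchaudio", "accelerate",
            "transformers", "peft", "einops", "omegaconf",
            "lmdb", "tqdm", "scipy", "wandb"]

def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be imported."""
    return [n for n in names if find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages found.")
```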

Pretrained Models

Download the pretrained models from HuggingFace.

Data

Download the UniLS-Talk Dataset.

Training

UniLS follows a three-stage training pipeline:

Stage 1: Motion Codec (VAE)

python train.py -c unils_codec

Stage 2: Audio-Free Autoregressive Generator

Set VAE_PATH in the config file to the Stage 1 checkpoint path, then run:

python train.py -c unils_freegen

Stage 3: Audio-Conditioned LoRA Fine-tuning

Set PRETRAIN_PATH in the config file to the Stage 2 checkpoint path, then run:

python train.py -c unils_loragen

Evaluation

Run evaluation with multi-GPU support via Accelerate:

accelerate launch eval.py -r /path/to/checkpoint --tau 1.0 --cfg 1.5

You can also pass an external dataset config to override the checkpoint's dataset:

accelerate launch eval.py -r /path/to/checkpoint --dataset configs/dataset.yaml

Inference

From Dataset

Generate visualizations from the dataset:

python infer_dataset.py -r /path/to/checkpoint --clip_length 20 --tau 1.0 --cfg 1.5 --num_samples 32
  • --resume_path, -r: Path to the trained model checkpoint.
  • --dataset: Path to a dataset YAML config (optional, uses checkpoint config by default).
  • --clip_length: Duration of the generated clip in seconds (default: 20).
  • --tau: Temperature for sampling (default: 1.0).
  • --cfg: Classifier-free guidance scale (default: 1.5).
  • --num_samples, -n: Number of samples to generate (default: 32).
  • --dump_dir, -d: Output directory (default: ./render_results).
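For intuition about what `--tau` and `--cfg` control, here is an illustrative sketch (not the actual UniLS sampling code): temperature rescales the token logits, while classifier-free guidance extrapolates the conditional logits away from the unconditional ones before sampling.

```python
# Illustrative sketch of temperature (tau) and classifier-free
# guidance (cfg) in autoregressive sampling; not the UniLS code.
import numpy as np

def guided_probs(cond_logits, uncond_logits, tau=1.0, cfg=1.5):
    """Blend logits with CFG, apply temperature, return a softmax distribution."""
    # cfg > 1 pushes the distribution toward the audio-conditioned prediction.
    logits = uncond_logits + cfg * (cond_logits - uncond_logits)
    z = logits / tau           # smaller tau -> sharper distribution
    z = z - z.max()            # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```

Lower `--tau` and higher `--cfg` both make generation more deterministic and more strongly tied to the audio; the defaults (1.0 and 1.5) trade that off against motion diversity.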

From Audio Files

Generate visualizations directly from audio files, supporting one or two speakers:

# Single speaker
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav

# Two speakers (dyadic conversation)
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav --audio2 speaker1.wav
  • --resume_path, -r: Path to the trained model checkpoint.
  • --audio, -a: Path to speaker 0 audio file.
  • --audio2: Path to speaker 1 audio file (optional; if omitted, only speaker 0 motion is generated).
  • --tau: Temperature for sampling (default: 1.0).
  • --cfg: Classifier-free guidance scale (default: 1.5).
  • --dump_dir, -d: Output directory (default: ./render_results).
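Before running inference, it can help to inspect the input audio. The hypothetical helper below (not part of the repo) uses scipy, already among the dependencies, to report a wav file's sample rate, channel count, and duration:

```python
# Hypothetical pre-flight helper (not part of the UniLS repo):
# inspect a wav file before passing it to infer_audio.py.
from scipy.io import wavfile

def audio_info(path):
    """Return (sample_rate_hz, num_channels, duration_seconds) of a wav file."""
    rate, data = wavfile.read(path)
    channels = 1 if data.ndim == 1 else data.shape[1]
    duration = data.shape[0] / rate
    return rate, channels, duration
```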

Acknowledgements

Part of our work is built on FLAME, and we thank the other open-source projects our work draws on.

Citation

If you find our work useful in your research, please consider citing:

@misc{chu2025unils,
      title={UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking}, 
      author={Xuangeng Chu and Ruicong Liu and Yifei Huang and Yun Liu and Yichen Peng and Bo Zheng},
      year={2025},
      eprint={2512.09327},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.09327}, 
}