
UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking

Xuangeng Chu*1  Ruicong Liu*1†  Yifei Huang1  Yun Liu2  Yichen Peng3  Bo Zheng2
1Shanda AI Research Tokyo, The University of Tokyo, 2Shanda AI Research Tokyo, 3Institute of Science Tokyo
*Equal contribution, †Corresponding author
UniLS generates diverse and natural listening and speaking motions from audio.

Installation

Clone the project

git clone --recurse-submodules git@github.com:xg-chu/UniLS.git
cd UniLS

Build environment

conda env create -f environment.yml
conda activate unils

Or install manually:

pip install torch torchvision torchaudio
pip install accelerate transformers peft einops omegaconf lmdb tqdm scipy wandb
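After installing, a quick sanity check can confirm the dependencies resolve. The snippet below is a small convenience sketch (not part of the repo) that reports any required packages missing from the current environment:

```python
# Convenience sketch (not part of the UniLS repo): report which of the
# required packages cannot be imported in the current environment.
from importlib.util import find_spec

REQUIRED = ["torch", "torchvision", "torchaudio", "accelerate",
            "transformers", "peft", "einops", "omegaconf",
            "lmdb", "tqdm", "scipy", "wandb"]

def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be imported."""
    return [n for n in names if find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages found.")
```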

Pretrained Models

Download the pretrained models from HuggingFace.

Data

Download the UniLS-Talk Dataset.

Training

UniLS follows a three-stage training pipeline:

Stage 1: Motion Codec (VAE)

python train.py -c unils_codec

Stage 2: Audio-Free Autoregressive Generator

Set VAE_PATH in the config file to the Stage 1 checkpoint path, then run:

python train.py -c unils_freegen

Stage 3: Audio-Conditioned LoRA Fine-tuning

Set PRETRAIN_PATH in the config file to the Stage 2 checkpoint path, then run:

python train.py -c unils_loragen

Evaluation

Run evaluation with multi-GPU support via Accelerate:

accelerate launch eval.py -r /path/to/checkpoint --tau 1.0 --cfg 1.5

You can also pass an external dataset config to override the checkpoint's dataset:

accelerate launch eval.py -r /path/to/checkpoint --dataset configs/dataset.yaml

Inference

From Dataset

Generate visualizations from the dataset:

python infer_dataset.py -r /path/to/checkpoint --clip_length 20 --tau 1.0 --cfg 1.5 --num_samples 32
  • --resume_path, -r: Path to the trained model checkpoint.
  • --dataset: Path to a dataset YAML config (optional, uses checkpoint config by default).
  • --clip_length: Duration of the generated clip in seconds (default: 20).
  • --tau: Temperature for sampling (default: 1.0).
  • --cfg: Classifier-free guidance scale (default: 1.5).
  • --num_samples, -n: Number of samples to generate (default: 32).
  • --dump_dir, -d: Output directory (default: ./render_results).
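For intuition about what `--tau` and `--cfg` control, here is an illustrative sketch (not the actual UniLS sampling code): temperature rescales the token logits, while classifier-free guidance extrapolates the conditional logits away from the unconditional ones before sampling.

```python
# Illustrative sketch of temperature (tau) and classifier-free
# guidance (cfg) in autoregressive sampling; not the UniLS code.
import numpy as np

def guided_probs(cond_logits, uncond_logits, tau=1.0, cfg=1.5):
    """Blend logits with CFG, apply temperature, return a softmax distribution."""
    # cfg > 1 pushes the distribution toward the audio-conditioned prediction.
    logits = uncond_logits + cfg * (cond_logits - uncond_logits)
    z = logits / tau           # smaller tau -> sharper distribution
    z = z - z.max()            # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```

Lower `--tau` and higher `--cfg` both make generation more deterministic and more strongly tied to the audio; the defaults (1.0 and 1.5) trade that off against motion diversity.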

From Audio Files

Generate visualizations directly from audio files, supporting one or two speakers:

# Single speaker
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav

# Two speakers (dyadic conversation)
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav --audio2 speaker1.wav
  • --resume_path, -r: Path to the trained model checkpoint.
  • --audio, -a: Path to speaker 0 audio file.
  • --audio2: Path to speaker 1 audio file (optional; if omitted, only speaker 0 motion is generated).
  • --tau: Temperature for sampling (default: 1.0).
  • --cfg: Classifier-free guidance scale (default: 1.5).
  • --dump_dir, -d: Output directory (default: ./render_results).
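Before running inference, it can help to inspect the input audio. The hypothetical helper below (not part of the repo) uses scipy, already among the dependencies, to report a wav file's sample rate, channel count, and duration:

```python
# Hypothetical pre-flight helper (not part of the UniLS repo):
# inspect a wav file before passing it to infer_audio.py.
from scipy.io import wavfile

def audio_info(path):
    """Return (sample_rate_hz, num_channels, duration_seconds) of a wav file."""
    rate, data = wavfile.read(path)
    channels = 1 if data.ndim == 1 else data.shape[1]
    duration = data.shape[0] / rate
    return rate, channels, duration
```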

Acknowledgements

Part of our work is built on FLAME, and we thank the other open-source projects our work draws on.

Citation

If you find our work useful in your research, please consider citing:

@misc{chu2025unils,
      title={UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking}, 
      author={Xuangeng Chu and Ruicong Liu and Yifei Huang and Yun Liu and Yichen Peng and Bo Zheng},
      year={2025},
      eprint={2512.09327},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.09327}, 
}