UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking
Xuangeng Chu*1
Ruicong Liu*1†
Yifei Huang1
Yun Liu2
Yichen Peng3
Bo Zheng2
1Shanda AI Research Tokyo, The University of Tokyo,
2Shanda AI Research Tokyo,
3Institute of Science Tokyo
*Equal contribution,
† Corresponding author
Installation
Clone the project
git clone --recurse-submodules git@github.com:xg-chu/UniLS.git
cd UniLS
Build environment
conda env create -f environment.yml
conda activate unils
Or install manually:
pip install torch torchvision torchaudio
pip install accelerate transformers peft einops omegaconf lmdb tqdm scipy wandb
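After a manual install, a quick import check can confirm that the packages listed above are visible to the active interpreter. The sketch below is not part of the repository: the helper name `missing_packages` is hypothetical, and it only checks that each module can be located, not that the installed versions are compatible.

```python
import importlib.util

# Module names taken from the two pip commands above.
REQUIRED = [
    "torch", "torchvision", "torchaudio", "accelerate", "transformers",
    "peft", "einops", "omegaconf", "lmdb", "tqdm", "scipy", "wandb",
]

def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_packages()
    print("All packages found." if not missing else f"Missing: {missing}")
```

Running it inside the `unils` environment should list any package that still needs to be installed.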
Pretrained Models
Download the pretrained models from HuggingFace.
Data
Download the dataset from UniLS-Talk Dataset.
Training
UniLS follows a three-stage training pipeline:
Stage 1: Motion Codec (VAE)
python train.py -c unils_codec
Stage 2: Audio-Free Autoregressive Generator
Set VAE_PATH in the config file to the path of the Stage 1 checkpoint, then run:
python train.py -c unils_freegen
Stage 3: Audio-Conditioned LoRA Fine-tuning
Set PRETRAIN_PATH in the config file to the path of the Stage 2 checkpoint, then run:
python train.py -c unils_loragen
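The ordering of the three stages, and the config key each one needs from the previous stage, can be written down as a small driver. This is a minimal sketch, not a script shipped with UniLS: the names `STAGES`, `stage_command`, and `run_stage` are hypothetical, and the config keys still have to be edited by hand between stages because the layout of the config files is not described here.

```python
import subprocess
import sys

# (config name, config key that must point at the previous stage's checkpoint)
STAGES = [
    ("unils_codec", None),
    ("unils_freegen", "VAE_PATH"),
    ("unils_loragen", "PRETRAIN_PATH"),
]

def stage_command(config_name):
    """Build the training command for one stage."""
    return [sys.executable, "train.py", "-c", config_name]

def run_stage(index, dry_run=False):
    """Run stage `index` (0-based); raise CalledProcessError if it fails."""
    config_name, required_key = STAGES[index]
    if required_key is not None:
        print(f"Reminder: set {required_key} in the config before this stage.")
    cmd = stage_command(config_name)
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)  # stops here on a non-zero exit code
    return cmd
```

Calling `run_stage(0)`, updating the config, then `run_stage(1)` and so on mirrors the manual procedure; `dry_run=True` only prints the command.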
Evaluation
Run evaluation with multi-GPU support via Accelerate:
accelerate launch eval.py -r /path/to/checkpoint --tau 1.0 --cfg 1.5
You can also pass an external dataset config to override the checkpoint's dataset:
accelerate launch eval.py -r /path/to/checkpoint --dataset configs/dataset.yaml
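To compare several sampling settings, the evaluation command can be generated for each combination of `--tau` and `--cfg`. The sketch below only builds command strings from the flags shown above; the helper name `eval_commands` and the extra values 0.8 and 1.0 are illustrative assumptions, and how `eval.py` stores results across repeated runs is not specified here.

```python
import itertools
import shlex

def eval_commands(checkpoint, taus, cfgs):
    """Return one `accelerate launch eval.py` command per (tau, cfg) pair."""
    commands = []
    for tau, cfg in itertools.product(taus, cfgs):
        args = ["accelerate", "launch", "eval.py", "-r", checkpoint,
                "--tau", str(tau), "--cfg", str(cfg)]
        commands.append(shlex.join(args))  # quotes arguments when needed
    return commands

for line in eval_commands("/path/to/checkpoint", taus=[0.8, 1.0], cfgs=[1.0, 1.5]):
    print(line)
```

The printed lines can be pasted into a shell or written to a job script.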
Inference
From Dataset
Generate visualizations from the dataset:
python infer_dataset.py -r /path/to/checkpoint --clip_length 20 --tau 1.0 --cfg 1.5 --num_samples 32
--resume_path, -r: Path to the trained model checkpoint.
--dataset: Path to a dataset YAML config (optional, uses checkpoint config by default).
--clip_length: Duration of the generated clip in seconds (default: 20).
--tau: Temperature for sampling (default: 1.0).
--cfg: Classifier-free guidance scale (default: 1.5).
--num_samples, -n: Number of samples to generate (default: 32).
--dump_dir, -d: Output directory (default: ./render_results).
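For intuition about `--tau` and `--cfg`, the sketch below shows how a sampling temperature and a classifier-free guidance scale are commonly applied to a model's logits. It is a generic illustration, not the UniLS implementation, which may combine these quantities differently; the function name `guided_probs` and the toy logits are hypothetical.

```python
import math

def guided_probs(cond_logits, uncond_logits, cfg=1.5, tau=1.0):
    """Apply a guidance scale to the logits, then a temperature softmax."""
    # cfg = 1.0 reproduces the conditional logits; larger values move
    # further away from the unconditional prediction.
    guided = [u + cfg * (c - u) for c, u in zip(cond_logits, uncond_logits)]
    # tau < 1 sharpens the distribution, tau > 1 flattens it.
    scaled = [g / tau for g in guided]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example with three candidate tokens.
print(guided_probs([2.0, 1.0, 0.0], [1.0, 1.0, 1.0], cfg=1.5, tau=1.0))
```

Under this formulation, raising `cfg` or lowering `tau` both concentrate probability on the highest-scoring candidate, which usually trades diversity for fidelity to the conditioning signal.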
From Audio Files
Generate visualizations directly from audio files, supporting one or two speakers:
# Single speaker
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav
# Two speakers (dyadic conversation)
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav --audio2 speaker1.wav
--resume_path, -r: Path to the trained model checkpoint.
--audio, -a: Path to speaker 0 audio file.
--audio2: Path to speaker 1 audio file (optional; if omitted, only speaker 0 motion is generated).
--tau: Temperature for sampling (default: 1.0).
--cfg: Classifier-free guidance scale (default: 1.5).
--dump_dir, -d: Output directory (default: ./render_results).
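Before running two-speaker inference, it can help to inspect both audio files. The sketch below uses Python's standard `wave` module to report sample rate and duration. It assumes uncompressed PCM WAV input and assumes that the two tracks are meant to cover the same time span, which this README does not state explicitly; the helper names `wav_info` and `durations_match` are hypothetical.

```python
import wave

def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnframes() / rate

def durations_match(path_a, path_b, tolerance_s=0.1):
    """True if the two files differ in length by at most `tolerance_s` seconds."""
    return abs(wav_info(path_a)[1] - wav_info(path_b)[1]) <= tolerance_s
```

For example, `durations_match("speaker0.wav", "speaker1.wav")` returning False would be a hint to check how the two recordings were cut.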
Acknowledgements
Part of our work builds on FLAME. We also thank the following projects:
Citation
If you find our work useful in your research, please consider citing:
@misc{chu2025unils,
title={UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking},
author={Xuangeng Chu and Ruicong Liu and Yifei Huang and Yun Liu and Yichen Peng and Bo Zheng},
year={2025},
eprint={2512.09327},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.09327},
}