The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation
🎬 Demo Page | 🤗 Hugging Face | 🤖 ModelScope | 📄 Paper
In this repository, we present LanDiff, a novel text-to-video generation framework that synergizes the strengths of Language Models and Diffusion Models. LanDiff follows a coarse-to-fine pipeline: a language model first generates a compact sequence of discrete semantic tokens from the text prompt, and a diffusion model then refines these coarse semantics into the final high-fidelity video.
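The division of labor between the two stages can be pictured with the toy sketch below. It is illustrative only and is not the repository's actual API; every name in it (`SemanticLM`, `DiffusionDecoder`, `generate_video`) is a hypothetical stand-in.

```python
# Conceptual LanDiff flow (illustrative only; all names are hypothetical stubs).

class SemanticLM:
    """Stage 1: an autoregressive LM maps the prompt to discrete semantic tokens."""
    def generate(self, prompt: str) -> list[int]:
        # A real model would do next-token prediction over a learned semantic vocabulary.
        return [hash(word) % 16384 for word in prompt.split()]

class DiffusionDecoder:
    """Stage 2: a diffusion model renders video conditioned on the semantic tokens."""
    def sample(self, condition: list[int]) -> str:
        # A real model would iteratively denoise frames guided by the coarse semantics.
        return f"<video decoded from {len(condition)} semantic tokens>"

def generate_video(prompt: str) -> str:
    tokens = SemanticLM().generate(prompt)    # coarse: what should happen
    return DiffusionDecoder().sample(tokens)  # fine: how it should look

print(generate_video("a corgi surfing a wave at sunset"))
```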
Install with [uv](https://github.com/astral-sh/uv):

```bash
git clone https://github.com/LanDiff/LanDiff
cd LanDiff
# Create the environment
uv sync
# Install gradio to run the local demo (optional)
uv sync --extra gradio
```
Alternatively, from the cloned repository, with Conda and pip:

```bash
# Create and activate a Conda environment
conda create -n landiff python=3.10
conda activate landiff
# Install PyTorch (CUDA 12.1 build)
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121
# Install dependencies
pip install -r requirements.txt
# Install gradio to run the local demo (optional)
pip install gradio==5.27.0
```
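After installation, a quick sanity check (a minimal sketch assuming only the PyTorch install above) confirms that the CUDA build is active before downloading any weights:

```python
import torch

# Expect "2.5.1", a CUDA version of "12.1", and True on a GPU machine.
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())
```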
| Model | Download Link (Hugging Face) | Download Link (ModelScope) |
|---|---|---|
| LanDiff | 🤗 Hugging Face | 🤖 ModelScope |
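One way to fetch the weights programmatically is `huggingface_hub.snapshot_download`; a minimal sketch follows, where the repo id `LanDiff/LanDiff` and the target directory are assumptions to be verified against the link in the table above:

```python
from huggingface_hub import snapshot_download

# Assumed repo id -- verify against the Hugging Face link in the table above.
ckpt_dir = snapshot_download(repo_id="LanDiff/LanDiff", local_dir="ckpts/LanDiff")
print(f"Checkpoints downloaded to {ckpt_dir}")
```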
Code derived from CogVideo is licensed under the Apache 2.0 License. Other parts of the code are licensed under the MIT License.
If you find our work helpful, please cite us:

```bibtex
@article{landiff,
  title={The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation},
  author={Yin, Aoxiong and Shen, Kai and Leng, Yichong and Tan, Xu and Zhou, Xinyu and Li, Juncheng and Tang, Siliang},
  journal={arXiv preprint arXiv:2503.04606},
  year={2025}
}
```
We would like to thank the contributors to the CogVideo, Theia, TiTok, flan-t5-xxl, and Hugging Face repositories for their open research.