The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation
🎬 Demo Page | 🤗 Hugging Face | 🤖 ModelScope | 📄 Paper
In this repository, we present LanDiff, a novel text-to-video generation framework that synergizes the strengths of Language Models and Diffusion Models. LanDiff follows a coarse-to-fine pipeline: a language model first generates a compact sequence of discrete semantic tokens from the text prompt, and a diffusion model then refines these coarse semantics into the final high-fidelity video.
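The division of labor between the two stages can be pictured with the toy sketch below. It is illustrative only and is not the repository's actual API; every name in it (`SemanticLM`, `DiffusionDecoder`, `generate_video`) is a hypothetical stand-in.

```python
# Conceptual LanDiff flow (illustrative only; all names are hypothetical stubs).

class SemanticLM:
    """Stage 1: an autoregressive LM maps the prompt to discrete semantic tokens."""
    def generate(self, prompt: str) -> list[int]:
        # A real model would do next-token prediction over a learned semantic vocabulary.
        return [hash(word) % 16384 for word in prompt.split()]

class DiffusionDecoder:
    """Stage 2: a diffusion model renders video conditioned on the semantic tokens."""
    def sample(self, condition: list[int]) -> str:
        # A real model would iteratively denoise frames guided by the coarse semantics.
        return f"<video decoded from {len(condition)} semantic tokens>"

def generate_video(prompt: str) -> str:
    tokens = SemanticLM().generate(prompt)    # coarse: what should happen
    return DiffusionDecoder().sample(tokens)  # fine: how it should look

print(generate_video("a corgi surfing a wave at sunset"))
```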
Install with [uv](https://github.com/astral-sh/uv):

```bash
git clone https://github.com/LanDiff/LanDiff
cd LanDiff
# Create the environment
uv sync
# Install gradio to run the local demo (optional)
uv sync --extra gradio
```
Alternatively, from the cloned repository, with Conda and pip:

```bash
# Create and activate a Conda environment
conda create -n landiff python=3.10
conda activate landiff
# Install PyTorch (CUDA 12.1 build)
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121
# Install dependencies
pip install -r requirements.txt
# Install gradio to run the local demo (optional)
pip install gradio==5.27.0
```
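After installation, a quick sanity check (a minimal sketch assuming only the PyTorch install above) confirms that the CUDA build is active before downloading any weights:

```python
import torch

# Expect "2.5.1", a CUDA version of "12.1", and True on a GPU machine.
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())
```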
| Model | Download Link (Hugging Face) | Download Link (ModelScope) |
|---|---|---|
| LanDiff | 🤗 Hugging Face | 🤖 ModelScope |
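One way to fetch the weights programmatically is `huggingface_hub.snapshot_download`; a minimal sketch follows, where the repo id `LanDiff/LanDiff` and the target directory are assumptions to be verified against the link in the table above:

```python
from huggingface_hub import snapshot_download

# Assumed repo id -- verify against the Hugging Face link in the table above.
ckpt_dir = snapshot_download(repo_id="LanDiff/LanDiff", local_dir="ckpts/LanDiff")
print(f"Checkpoints downloaded to {ckpt_dir}")
```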
Code derived from CogVideo is licensed under the Apache 2.0 License. Other parts of the code are licensed under the MIT License.
If you find our work helpful, please cite us:

```bibtex
@article{landiff,
  title={The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation},
  author={Yin, Aoxiong and Shen, Kai and Leng, Yichong and Tan, Xu and Zhou, Xinyu and Li, Juncheng and Tang, Siliang},
  journal={arXiv preprint arXiv:2503.04606},
  year={2025}
}
```
We would like to thank the contributors to the CogVideo, Theia, TiTok, flan-t5-xxl, and Hugging Face repositories for their open research.