DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation
Hongfei Zhang 1*, Kanghao Chen 1,5*, Zixin Zhang 1,5, Harold H. Chen 1,5, Yuanhuiyi Lyu 1, Yuqi Zhang 3, Shuai Yang 1, Kun Zhou 4, Ying-Cong Chen 1,2,†
1 HKUST(GZ) 2 HKUST 3 Fudan University 4 Shenzhen University 5 Knowin
* Equal Contribution. † Corresponding author.
🧩 Contents
1. 📰 News
2. ☑️ TODO
3. 🎯 Overview
4. 🔧 Installation
5. 🎮 Inference
6. 🔥 Training
📰 News
✅ 2025.11: Released inference pipeline & demo dataset
✅ 2025.11: Uploaded official DualCamCtrl checkpoints to HuggingFace
☑️ TODO
⬜ Release the training code
🎯 Overview
Abstract
This paper presents DualCamCtrl, a novel end-to-end diffusion model for camera-controlled video generation. Recent works have advanced this field by representing camera poses as ray-based conditions, yet they often lack sufficient scene understanding and geometric awareness. DualCamCtrl targets this limitation by introducing a dual-branch framework that mutually generates camera-consistent RGB and depth sequences. To harmonize the two modalities, we further propose the SemantIc Guided Mutual Alignment (SIGMA) mechanism, which performs RGB-depth fusion in a semantics-guided and mutually reinforced manner. These designs collectively enable DualCamCtrl to better disentangle appearance and geometry modeling, generating videos that more faithfully adhere to the specified camera trajectories. Extensive experiments demonstrate that DualCamCtrl achieves more consistent camera-controlled video generation, reducing camera motion errors by over 40% compared with prior methods.
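To make the dual-branch idea concrete, below is a minimal conceptual sketch, not the official SIGMA implementation: it shows one way RGB and depth token streams could exchange information through mutual cross-attention gated by a shared semantic embedding. The class name MutualAlignmentBlock, the gating design, and all shapes are illustrative assumptions.

# Conceptual sketch only -- NOT the paper's actual SIGMA module.
import torch
import torch.nn as nn

class MutualAlignmentBlock(nn.Module):  # hypothetical name
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        # Each branch attends to the other branch's tokens.
        self.rgb_from_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        # A semantic embedding gates how much cross-modal signal is mixed in.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, rgb, depth, semantic):
        # rgb, depth: (B, N, dim) token sequences; semantic: (B, dim)
        g = self.gate(semantic).unsqueeze(1)                # (B, 1, dim)
        rgb_upd, _ = self.rgb_from_depth(rgb, depth, depth)  # RGB queries depth
        depth_upd, _ = self.depth_from_rgb(depth, rgb, rgb)  # depth queries RGB
        # Mutually reinforced residual update, modulated by semantics.
        return rgb + g * rgb_upd, depth + g * depth_upd

# Smoke test with toy shapes.
blk = MutualAlignmentBlock(dim=64)
rgb, depth = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
sem = torch.randn(2, 64)
r, d = blk(rgb, depth, sem)
print(r.shape, d.shape)  # torch.Size([2, 16, 64]) twice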
Results

Comparison between our method and other state-of-the-art approaches. Given the same camera pose and input image as generation conditions, our method achieves the best alignment between camera motion and scene dynamics, producing the most visually accurate video. The "+" signs marked in the figure serve as anchors for visual comparison.

Quantitative comparisons on the I2V setting. ↑ / ↓ denote that higher/lower is better. Best and second-best results are highlighted.

Quantitative comparisons on the T2V setting across RealEstate10K and DL3DV.
🔧 Installation
Clone the repo and create an environment with Python 3.11:
git clone https://github.com/soyouthinkyoucantell/DualCamCtrl.git
conda create -n dualcamctrl python=3.11
conda activate dualcamctrl
Install the DiffSynth-Studio dependencies from source:
cd DualCamCtrl
pip install -e .
Then install GenFusion dependencies:
mkdir dependency
cd dependency
git clone https://github.com/rmbrualla/pycolmap.git
cd pycolmap
pip install -e .
pip install numpy==1.26.4 peft accelerate==1.9.0 decord==0.6.0 deepspeed diffusers omegaconf
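As a quick sanity check that the environment resolved correctly (assuming torch was pulled in as a dependency of DiffSynth-Studio during pip install -e .), you can run:

python -c "import torch, diffusers, peft, decord; print(torch.__version__, torch.cuda.is_available())"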
🎮 Inference
Checkpoints
Get the checkpoints from the HuggingFace repo: DualCamCtrl Checkpoints
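One way to fetch the files is with the huggingface_hub CLI; the repo id below is a placeholder, substitute the actual id from the link above:

pip install -U huggingface_hub
huggingface-cli download <org>/DualCamCtrl --local-dir checkpoints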
Place them in the checkpoints directory, then return to the repo root:
cd ../.. # make sure you are back at the root dir
Your project structure should look like:
DualCamCtrl/
├── checkpoints/    # ← Put downloaded .pt here
│   └── dualcamctrl_diffusion_transformer.pt
├── demo_dataset/   # Small demo dataset structure
├── demo_pic/       # Demo images for quick inference
├── diffsynth/
├── examples/
├── ....
├── requirements.txt
├── README.md
└── setup.py
Test with our demo pictures and depth (run from the repo root):
export PYTHONPATH=.
python -m test_script.test_demo
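If you want to sanity-check the generated clip programmatically, decord (installed above) can read it; the output path below is hypothetical, adjust it to wherever the demo script writes its result:

from decord import VideoReader
vr = VideoReader("outputs/demo.mp4")  # hypothetical path -- check the demo script's output dir
print(len(vr), vr[0].shape)           # frame count and (H, W, C) of the first frame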