hard56961 committed on
Commit
91600c1
·
verified ·
1 Parent(s): 4e2a94a

Upload 8 files

Files changed (8)
  1. .gitignore +20 -0
  2. .pre-commit-config.yaml +25 -0
  3. LICENSE.txt +38 -0
  4. README.md +966 -0
  5. app.py +63 -0
  6. generate_video.py +161 -0
  7. generate_video_df.py +220 -0
  8. requirements.txt +16 -0
.gitignore ADDED
@@ -0,0 +1,20 @@
+ __pycache__/
+ checkpoint/*
+ checkpoint
+ results/*
+ .DS_Store
+ results/*
+ *.png
+ *.jpg
+ *.mp4
+ *.log*
+ *.json
+ scripts/transformer/*
+ compile_cache
+ scripts/.gradio/*
+ *.pkl
+ # *.csv
+ *.jsonl
+ out/*
+ model/
+ run.sh
.pre-commit-config.yaml ADDED
@@ -0,0 +1,25 @@
+ repos:
+   - repo: https://github.com/asottile/reorder-python-imports.git
+     rev: v3.8.3
+     hooks:
+       - id: reorder-python-imports
+         name: Reorder Python imports
+         types: [file, python]
+   - repo: https://github.com/psf/black.git
+     rev: 22.8.0
+     hooks:
+       - id: black
+         additional_dependencies: ['click==8.0.4']
+         args: [--line-length=120]
+         types: [file, python]
+   - repo: https://github.com/pre-commit/pre-commit-hooks.git
+     rev: v4.3.0
+     hooks:
+       - id: check-byte-order-marker
+         types: [file, python]
+       - id: trailing-whitespace
+         types: [file, python]
+       - id: end-of-file-fixer
+         types: [file, python]
+
+
LICENSE.txt ADDED
@@ -0,0 +1,38 @@
1
+ ---
2
+ language:
3
+ - en
4
+ - zh
5
+ license: other
6
+ tasks:
7
+ - text-generation
8
+
9
+ ---
10
+
11
+ <!-- markdownlint-disable first-line-h1 -->
12
+ <!-- markdownlint-disable html -->
13
+
14
+ # <span id="Terms">声明与协议/Terms and Conditions</span>
15
+
16
+ ## 声明/Declaration
17
+
18
+ 我们在此声明,不要利用Skywork模型进行任何危害国家社会安全或违法的活动。另外,我们也要求使用者不要将 Skywork 模型用于未经适当安全审查和备案的互联网服务。我们希望所有的使用者都能遵守这个原则,确保科技的发展能在规范和合法的环境下进行。
19
+
20
+ 我们已经尽我们所能,来确保模型训练过程中使用的数据的合规性。然而,尽管我们已经做出了巨大的努力,但由于模型和数据的复杂性,仍有可能存在一些无法预见的问题。因此,如果由于使用skywork开源模型而导致的任何问题,包括但不限于数据安全问题、公共舆论风险,或模型被误导、滥用、传播或不当利用所带来的任何风险和问题,我们将不承担任何责任。
21
+
22
+ We hereby declare that the Skywork model should not be used for any activities that pose a threat to national or societal security or engage in unlawful actions. Additionally, we request users not to deploy the Skywork model for internet services without appropriate security reviews and records. We hope that all users will adhere to this principle to ensure that technological advancements occur in a regulated and lawful environment.
23
+
24
+ We have done our utmost to ensure the compliance of the data used during the model's training process. However, despite our extensive efforts, due to the complexity of the model and data, there may still be unpredictable risks and issues. Therefore, if any problems arise as a result of using the Skywork open-source model, including but not limited to data security issues, public opinion risks, or any risks and problems arising from the model being misled, abused, disseminated, or improperly utilized, we will not assume any responsibility.
25
+
26
+ ## 协议/Agreement
27
+
28
+ 社区使用Skywork模型需要遵循[《Skywork 模型社区许可协议》](https://github.com/SkyworkAI/Skywork/blob/main/Skywork%20模型社区许可协议.pdf)。Skywork模型支持商业用途,如果您计划将Skywork模型或其衍生品用于商业目的,无需再次申请, 但请您仔细阅读[《Skywork 模型社区许可协议》](https://github.com/SkyworkAI/Skywork/blob/main/Skywork%20模型社区许可协议.pdf)并严格遵守相关条款。
29
+
30
+
31
+ The community usage of Skywork model requires [Skywork Community License](https://github.com/SkyworkAI/Skywork/blob/main/Skywork%20Community%20License.pdf). The Skywork model supports commercial use. If you plan to use the Skywork model or its derivatives for commercial purposes, you must abide by terms and conditions within [Skywork Community License](https://github.com/SkyworkAI/Skywork/blob/main/Skywork%20Community%20License.pdf).
32
+
33
+
34
+
35
+ [《Skywork 模型社区许可协议》》]:https://github.com/SkyworkAI/Skywork/blob/main/Skywork%20模型社区许可协议.pdf
36
+
37
+
38
+ [skywork-opensource@kunlun-inc.com]: mailto:skywork-opensource@kunlun-inc.com
README.md ADDED
@@ -0,0 +1,966 @@
1
+ <p align="center">
2
+ <img src="assets/logo2.png" alt="SkyReels Logo" width="50%">
3
+ </p>
4
+
5
+ <h1 align="center">SkyReels V2: Infinite-Length Film Generative Model</h1>
6
+
7
+ <p align="center">
8
+ 📑 <a href="https://arxiv.org/pdf/2504.13074">Technical Report</a> · 👋 <a href="https://www.skyreels.ai/home?utm_campaign=github_SkyReels_V2" target="_blank">Playground</a> · 💬 <a href="https://discord.gg/PwM6NYtccQ" target="_blank">Discord</a> · 🤗 <a href="https://huggingface.co/collections/Skywork/skyreels-v2-6801b1b93df627d441d0d0d9" target="_blank">Hugging Face</a> · 🤖 <a href="https://www.modelscope.cn/collections/SkyReels-V2-f665650130b144" target="_blank">ModelScope</a>
9
+ </p>
10
+
11
+ ---
12
+ Welcome to the **SkyReels V2** repository! Here, you'll find the model weights and inference code for our infinite-length film generative models. To the best of our knowledge, it is the first open-source video generative model to employ an **AutoRegressive Diffusion-Forcing architecture**, and it achieves **SOTA performance** among publicly available models.
13
+
14
+
15
+ ## 🔥🔥🔥 News!!
16
+ * Jun 1, 2025: 🎉 We published the technical report, [SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers](https://arxiv.org/pdf/2506.00830).
17
+ * May 16, 2025: 🔥 We release the inference code for [video extension](#ve) and [start/end frame control](#se) in the Diffusion Forcing model.
18
+ * Apr 24, 2025: 🔥 We release the 720P models, [SkyReels-V2-DF-14B-720P](https://huggingface.co/Skywork/SkyReels-V2-DF-14B-720P) and [SkyReels-V2-I2V-14B-720P](https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-720P). The former facilitates infinite-length autoregressive video generation, and the latter focuses on Image2Video synthesis.
19
+ * Apr 21, 2025: 👋 We release the inference code and model weights of the [SkyReels-V2](https://huggingface.co/collections/Skywork/skyreels-v2-6801b1b93df627d441d0d0d9) Series Models and the video captioning model [SkyCaptioner-V1](https://huggingface.co/Skywork/SkyCaptioner-V1).
20
+ * Apr 3, 2025: 🔥 We also release [SkyReels-A2](https://github.com/SkyworkAI/SkyReels-A2). This is an open-sourced controllable video generation framework capable of assembling arbitrary visual elements.
21
+ * Feb 18, 2025: 🔥 We released [SkyReels-A1](https://github.com/SkyworkAI/SkyReels-A1). This is an open-sourced and effective framework for portrait image animation.
22
+ * Feb 18, 2025: 🔥 We released [SkyReels-V1](https://github.com/SkyworkAI/SkyReels-V1). This is the first and most advanced open-source human-centric video foundation model.
23
+
24
+ ## 🎥 Demos
25
+ <table>
26
+ <tr>
27
+ <td align="center">
28
+ <video src="https://github.com/user-attachments/assets/f6f9f9a7-5d5f-433c-9d73-d8d593b7ad25" width="100%"></video>
29
+ </td>
30
+ <td align="center">
31
+ <video src="https://github.com/user-attachments/assets/0eb13415-f4d9-4aaf-bcd3-3031851109b9" width="100%"></video>
32
+ </td>
33
+ <td align="center">
34
+ <video src="https://github.com/user-attachments/assets/dcd16603-5bf4-4786-8e4d-1ed23889d07a" width="100%"></video>
35
+ </td>
36
+ </tr>
37
+ </table>
38
+ The demos above showcase 30-second videos generated using our SkyReels-V2 Diffusion Forcing model.
39
+
40
+
41
+ ## 📑 TODO List
42
+
43
+ - [x] <a href="https://arxiv.org/pdf/2504.13074">Technical Report</a>
44
+ - [x] Checkpoints of the 14B and 1.3B Models Series
45
+ - [x] Single-GPU & Multi-GPU Inference Code
46
+ - [x] <a href="https://huggingface.co/Skywork/SkyCaptioner-V1">SkyCaptioner-V1</a>: A Video Captioning Model
47
+ - [x] Prompt Enhancer
48
+ - [x] Diffusers integration
49
+ - [ ] Checkpoints of the 5B Models Series
50
+ - [ ] Checkpoints of the Camera Director Models
51
+ - [ ] Checkpoints of the Step & Guidance Distill Model
52
+
53
+
54
+ ## 🚀 Quickstart
55
+
56
+ #### Installation
57
+ ```shell
58
+ # Clone the repository.
59
+ git clone https://github.com/SkyworkAI/SkyReels-V2
60
+ cd SkyReels-V2
61
+ # Install dependencies. Test environment uses Python 3.10.12.
62
+ pip install -r requirements.txt
63
+ ```
64
+
65
+ #### Model Download
66
+ You can download our models from Hugging Face:
67
+ <table>
68
+ <thead>
69
+ <tr>
70
+ <th>Type</th>
71
+ <th>Model Variant</th>
72
+ <th>Recommended Height/Width/Frame</th>
73
+ <th>Link</th>
74
+ </tr>
75
+ </thead>
76
+ <tbody>
77
+ <tr>
78
+ <td rowspan="5">Diffusion Forcing</td>
79
+ <td>1.3B-540P</td>
80
+ <td>544 * 960 * 97f</td>
81
+ <td>🤗 <a href="https://huggingface.co/Skywork/SkyReels-V2-DF-1.3B-540P">Huggingface</a> 🤖 <a href="https://www.modelscope.cn/models/Skywork/SkyReels-V2-DF-1.3B-540P">ModelScope</a></td>
82
+ </tr>
83
+ <tr>
84
+ <td>5B-540P</td>
85
+ <td>544 * 960 * 97f</td>
86
+ <td>Coming Soon</td>
87
+ </tr>
88
+ <tr>
89
+ <td>5B-720P</td>
90
+ <td>720 * 1280 * 121f</td>
91
+ <td>Coming Soon</td>
92
+ </tr>
93
+ <tr>
94
+ <td>14B-540P</td>
95
+ <td>544 * 960 * 97f</td>
96
+ <td>🤗 <a href="https://huggingface.co/Skywork/SkyReels-V2-DF-14B-540P">Huggingface</a> 🤖 <a href="https://www.modelscope.cn/models/Skywork/SkyReels-V2-DF-14B-540P">ModelScope</a></td>
97
+ </tr>
98
+ <tr>
99
+ <td>14B-720P</td>
100
+ <td>720 * 1280 * 121f</td>
101
+ <td>🤗 <a href="https://huggingface.co/Skywork/SkyReels-V2-DF-14B-720P">Huggingface</a> 🤖 <a href="https://www.modelscope.cn/models/Skywork/SkyReels-V2-DF-14B-720P">ModelScope</a></td>
102
+ </tr>
103
+ <tr>
104
+ <td rowspan="5">Text-to-Video</td>
105
+ <td>1.3B-540P</td>
106
+ <td>544 * 960 * 97f</td>
107
+ <td>Coming Soon</td>
108
+ </tr>
109
+ <tr>
110
+ <td>5B-540P</td>
111
+ <td>544 * 960 * 97f</td>
112
+ <td>Coming Soon</td>
113
+ </tr>
114
+ <tr>
115
+ <td>5B-720P</td>
116
+ <td>720 * 1280 * 121f</td>
117
+ <td>Coming Soon</td>
118
+ </tr>
119
+ <tr>
120
+ <td>14B-540P</td>
121
+ <td>544 * 960 * 97f</td>
122
+ <td>🤗 <a href="https://huggingface.co/Skywork/SkyReels-V2-T2V-14B-540P">Huggingface</a> 🤖 <a href="https://www.modelscope.cn/models/Skywork/SkyReels-V2-T2V-14B-540P">ModelScope</a></td>
123
+ </tr>
124
+ <tr>
125
+ <td>14B-720P</td>
126
+ <td>720 * 1280 * 121f</td>
127
+ <td>🤗 <a href="https://huggingface.co/Skywork/SkyReels-V2-T2V-14B-720P">Huggingface</a> 🤖 <a href="https://www.modelscope.cn/models/Skywork/SkyReels-V2-T2V-14B-720P">ModelScope</a></td>
128
+ </tr>
129
+ <tr>
130
+ <td rowspan="5">Image-to-Video</td>
131
+ <td>1.3B-540P</td>
132
+ <td>544 * 960 * 97f</td>
133
+ <td>🤗 <a href="https://huggingface.co/Skywork/SkyReels-V2-I2V-1.3B-540P">Huggingface</a> 🤖 <a href="https://www.modelscope.cn/models/Skywork/SkyReels-V2-I2V-1.3B-540P">ModelScope</a></td>
134
+ </tr>
135
+ <tr>
136
+ <td>5B-540P</td>
137
+ <td>544 * 960 * 97f</td>
138
+ <td>Coming Soon</td>
139
+ </tr>
140
+ <tr>
141
+ <td>5B-720P</td>
142
+ <td>720 * 1280 * 121f</td>
143
+ <td>Coming Soon</td>
144
+ </tr>
145
+ <tr>
146
+ <td>14B-540P</td>
147
+ <td>544 * 960 * 97f</td>
148
+ <td>🤗 <a href="https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-540P">Huggingface</a> 🤖 <a href="https://www.modelscope.cn/models/Skywork/SkyReels-V2-I2V-14B-540P">ModelScope</a></td>
149
+ </tr>
150
+ <tr>
151
+ <td>14B-720P</td>
152
+ <td>720 * 1280 * 121f</td>
153
+ <td>🤗 <a href="https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-720P">Huggingface</a> 🤖 <a href="https://www.modelscope.cn/models/Skywork/SkyReels-V2-I2V-14B-720P">ModelScope</a></td>
154
+ </tr>
155
+ <tr>
156
+ <td rowspan="3">Camera Director</td>
157
+ <td>5B-540P</td>
158
+ <td>544 * 960 * 97f</td>
159
+ <td>Coming Soon</td>
160
+ </tr>
161
+ <tr>
162
+ <td>5B-720P</td>
163
+ <td>720 * 1280 * 121f</td>
164
+ <td>Coming Soon</td>
165
+ </tr>
166
+ <tr>
167
+ <td>14B-720P</td>
168
+ <td>720 * 1280 * 121f</td>
169
+ <td>Coming Soon</td>
170
+ </tr>
171
+ </tbody>
172
+ </table>
173
+
174
+ After downloading, set the model path in your generation commands.
175
+
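As a minimal sketch of the download step (an assumption, not part of the repository): the weights can be fetched with `huggingface_hub.snapshot_download` and the resulting directory used as the model path. The `checkpoint` root and the helper names `local_model_dir` and `download_model` are hypothetical, and the sketch assumes that `--model_id` also accepts a local directory.

```python
from pathlib import Path


def local_model_dir(model_id: str, root: str = "checkpoint") -> Path:
    """Map a Hub repo id such as 'Skywork/SkyReels-V2-DF-1.3B-540P' to a local folder.

    The 'checkpoint' root is a hypothetical choice for this sketch.
    """
    return Path(root) / model_id.split("/")[-1]


def download_model(model_id: str, root: str = "checkpoint") -> Path:
    """Download all files of a Hub model repo into a local directory and return it."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_model_dir(model_id, root)
    snapshot_download(repo_id=model_id, local_dir=str(target))
    return target


# Example (not executed here, since it downloads several GB):
# model_path = download_model("Skywork/SkyReels-V2-DF-1.3B-540P")
# then: python3 generate_video_df.py --model_id ${model_path} ...
```

Keeping the path mapping separate from the download call makes it easy to check where a model will land before starting a large transfer.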
176
+
177
+ #### Single GPU Inference
178
+
179
+ - **Diffusion Forcing for Long Video Generation**
180
+
181
+ The <a href="https://arxiv.org/abs/2407.01392">**Diffusion Forcing**</a> version of the model allows us to generate infinite-length videos. It supports both **text-to-video (T2V)** and **image-to-video (I2V)** tasks and can run inference in both synchronous and asynchronous modes. Below are two example scripts for long video generation. If you want to adjust inference parameters such as the video duration or the inference mode, read the Note below first.
182
+
183
+ Synchronous generation of a 10s video:
184
+ ```shell
185
+ model_id=Skywork/SkyReels-V2-DF-14B-540P
186
+ # synchronous inference
187
+ python3 generate_video_df.py \
188
+ --model_id ${model_id} \
189
+ --resolution 540P \
190
+ --ar_step 0 \
191
+ --base_num_frames 97 \
192
+ --num_frames 257 \
193
+ --overlap_history 17 \
194
+ --prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
195
+ --addnoise_condition 20 \
196
+ --offload \
197
+ --teacache \
198
+ --use_ret_steps \
199
+ --teacache_thresh 0.3
200
+ ```
201
+
202
+ Asynchronous generation of a 30s video:
203
+ ```shell
204
+ model_id=Skywork/SkyReels-V2-DF-14B-540P
205
+ # asynchronous inference
206
+ python3 generate_video_df.py \
207
+ --model_id ${model_id} \
208
+ --resolution 540P \
209
+ --ar_step 5 \
210
+ --causal_block_size 5 \
211
+ --base_num_frames 97 \
212
+ --num_frames 737 \
213
+ --overlap_history 17 \
214
+ --prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
215
+ --addnoise_condition 20 \
216
+ --offload
217
+ ```
218
+
219
+ Text-to-video with `diffusers`:
+ ```py
+ import torch
+ from diffusers import AutoModel, SkyReelsV2DiffusionForcingPipeline, UniPCMultistepScheduler
+ from diffusers.utils import export_to_video
+
+ vae = AutoModel.from_pretrained("Skywork/SkyReels-V2-DF-14B-540P-Diffusers", subfolder="vae", torch_dtype=torch.float32)
+
+ pipeline = SkyReelsV2DiffusionForcingPipeline.from_pretrained(
+     "Skywork/SkyReels-V2-DF-14B-540P-Diffusers",
+     vae=vae,
+     torch_dtype=torch.bfloat16
+ )
+ flow_shift = 8.0  # 8.0 for T2V, 5.0 for I2V
+ pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config, flow_shift=flow_shift)
+ pipeline = pipeline.to("cuda")
+
+ prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."
+
+ output = pipeline(
+     prompt=prompt,
+     num_inference_steps=30,
+     height=544,  # 720 for 720P
+     width=960,  # 1280 for 720P
+     num_frames=97,
+     base_num_frames=97,  # 121 for 720P
+     ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
+     causal_block_size=5,  # Number of frames in each block for asynchronous processing
+     overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long video generations
+     addnoise_condition=20,  # Improves consistency in long video generation
+ ).frames[0]
+ export_to_video(output, "T2V.mp4", fps=24, quality=8)
+ ```
252
+
253
+ Image-to-video with `diffusers`:
+ ```py
+ import numpy as np
+ import torch
+ import torchvision.transforms.functional as TF
+ from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingImageToVideoPipeline, UniPCMultistepScheduler
+ from diffusers.utils import export_to_video, load_image
+
+ model_id = "Skywork/SkyReels-V2-DF-14B-720P-Diffusers"
+ vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
+ pipeline = SkyReelsV2DiffusionForcingImageToVideoPipeline.from_pretrained(
+     model_id, vae=vae, torch_dtype=torch.bfloat16
+ )
+ flow_shift = 5.0  # 8.0 for T2V, 5.0 for I2V
+ pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config, flow_shift=flow_shift)
+ pipeline.to("cuda")
+
+ first_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
+ last_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")
+
+ def aspect_ratio_resize(image, pipeline, max_area=720 * 1280):
+     aspect_ratio = image.height / image.width
+     mod_value = pipeline.vae_scale_factor_spatial * pipeline.transformer.config.patch_size[1]
+     height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
+     width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
+     image = image.resize((width, height))
+     return image, height, width
+
+ def center_crop_resize(image, height, width):
+     # Calculate resize ratio to match first frame dimensions
+     resize_ratio = max(width / image.width, height / image.height)
+
+     # Resize the image
+     width = round(image.width * resize_ratio)
+     height = round(image.height * resize_ratio)
+     size = [width, height]
+     image = TF.center_crop(image, size)
+
+     return image, height, width
+
+ first_frame, height, width = aspect_ratio_resize(first_frame, pipeline)
+ if last_frame.size != first_frame.size:
+     last_frame, _, _ = center_crop_resize(last_frame, height, width)
+
+ prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
+
+ output = pipeline(
+     image=first_frame, last_image=last_frame, prompt=prompt, height=height, width=width, guidance_scale=5.0
+ ).frames[0]
+ export_to_video(output, "output.mp4", fps=24, quality=8)
+ ```
304
+
305
+ > **Note**:
306
+ > - To run the **image-to-video (I2V)** task, add `--image ${image_path}` to your command. It is also better to use a **text-to-video (T2V)**-style prompt that includes some description of the first-frame image.
307
+ > - For long video generation, just change `--num_frames`, e.g., `--num_frames 257` for a 10s video, `--num_frames 377` for a 15s video, `--num_frames 737` for a 30s video, or `--num_frames 1457` for a 60s video. These numbers do not correspond exactly to the frame count for the stated duration, but they are aligned with some training parameters, so they may perform better. When you use asynchronous inference with `causal_block_size > 1`, set `--num_frames` carefully.
308
+ > - Use `--ar_step 5` to enable asynchronous inference. For asynchronous inference, `--causal_block_size 5` is recommended; it should not be set for synchronous generation. REMEMBER that the number of frame latents fed into the model in every iteration MUST be divisible by `causal_block_size`. This applies both to the base frame latent number (e.g., (97-1)//4+1=25 for base_num_frames=97) and to the latent number of the last iteration (e.g., (237-97-(97-17)x1+17-1)//4+1=20 for base_num_frames=97, num_frames=237, overlap_history=17). If you find it too hard to calculate and set proper values, just use our recommended settings above :). Asynchronous inference takes more steps to diffuse the whole sequence, so it is SLOWER than synchronous mode. In our experiments, asynchronous inference may improve instruction following and visual consistency.
309
+ > - To reduce peak VRAM, lower `--base_num_frames`, e.g., to 77 or 57, while keeping the same target length in `--num_frames`. This may slightly reduce video quality, so it should not be set too small.
310
+ > - `--addnoise_condition` helps smooth long video generation by adding some noise to the clean condition. Too much noise can also cause inconsistency. 20 is the recommended value; you may try larger values, but we recommend not exceeding 50.
311
+ > - Generating a 540P video using the 1.3B model requires approximately 14.7GB peak VRAM, while the same resolution video using the 14B model demands around 51.2GB peak VRAM.
312
+
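The divisibility rule in the Note can be checked before launching a run. The sketch below generalizes the two worked examples from the Note. The chunk layout in `last_chunk_frames` (a first chunk of `base_num_frames`, then strides of `base_num_frames - overlap_history`) is an assumption read off the worked example, not taken from the inference code, and all function names are hypothetical. `num_frames_for_seconds` encodes a pattern that the four listed values happen to follow (24 × seconds + 17); where the offset of 17 comes from is not stated in the Note.

```python
def latent_count(num_frames: int) -> int:
    """Frames -> frame latents, using the (n - 1) // 4 + 1 rule quoted in the Note."""
    return (num_frames - 1) // 4 + 1


def last_chunk_frames(num_frames: int, base_num_frames: int = 97, overlap_history: int = 17) -> int:
    """Frames fed to the model in the final iteration (assumed chunk layout)."""
    if num_frames <= base_num_frames:
        return num_frames  # a single iteration covers everything
    stride = base_num_frames - overlap_history  # new frames added per later iteration
    remaining = num_frames - base_num_frames
    extra_iters = (remaining + stride - 1) // stride  # integer ceiling division
    covered = base_num_frames + stride * (extra_iters - 1)
    return num_frames - covered + overlap_history


def check_async_settings(num_frames, base_num_frames=97, overlap_history=17, causal_block_size=5):
    """Return (base_latents, last_latents, ok) for the given settings."""
    base_latents = latent_count(min(num_frames, base_num_frames))
    last_latents = latent_count(last_chunk_frames(num_frames, base_num_frames, overlap_history))
    ok = base_latents % causal_block_size == 0 and last_latents % causal_block_size == 0
    return base_latents, last_latents, ok


def num_frames_for_seconds(seconds: int, fps: int = 24, offset: int = 17) -> int:
    """Pattern inferred from the values listed in the Note, not an official formula."""
    return fps * seconds + offset


print(check_async_settings(237))   # (25, 20, True), matching the Note's example
print(check_async_settings(257))   # (25, 25, True)
print(num_frames_for_seconds(30))  # 737
```

If `ok` comes back `False`, adjusting `--num_frames` to one of the recommended values is the simplest fix.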
313
+ - **<span id="ve">Video Extension</span>**
314
+ ```shell
315
+ model_id=Skywork/SkyReels-V2-DF-14B-540P
316
+ # video extension
317
+ python3 generate_video_df.py \
318
+ --model_id ${model_id} \
319
+ --resolution 540P \
320
+ --ar_step 0 \
321
+ --base_num_frames 97 \
322
+ --num_frames 120 \
323
+ --overlap_history 17 \
324
+ --prompt ${prompt} \
325
+ --addnoise_condition 20 \
326
+ --offload \
327
+ --use_ret_steps \
328
+ --teacache \
329
+ --teacache_thresh 0.3 \
330
+ --video_path ${video_path}
331
+ ```
332
+ > **Note**:
333
+ > - When performing video extension, you need to pass the `--video_path ${video_path}` parameter to specify the video to be extended.
334
+
335
+ - **<span id="se">Start/End Frame Control</span>**
336
+ ```shell
337
+ model_id=Skywork/SkyReels-V2-DF-14B-540P
338
+ # start/end frame control
339
+ python3 generate_video_df.py \
340
+ --model_id ${model_id} \
341
+ --resolution 540P \
342
+ --ar_step 0 \
343
+ --base_num_frames 97 \
344
+ --num_frames 97 \
345
+ --overlap_history 17 \
346
+ --prompt ${prompt} \
347
+ --addnoise_condition 20 \
348
+ --offload \
349
+ --use_ret_steps \
350
+ --teacache \
351
+ --teacache_thresh 0.3 \
352
+ --image ${image} \
353
+ --end_image ${end_image}
354
+ ```
355
+ > **Note**:
356
+ > - When controlling the start and end frames, you need to pass the `--image ${image}` parameter to control the generation of the start frame and the `--end_image ${end_image}` parameter to control the generation of the end frame.
357
+
358
+ Video extension with `diffusers`:
+ ```py
+ import numpy as np
+ import torch
+ import torchvision.transforms.functional as TF
+ from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingVideoToVideoPipeline, UniPCMultistepScheduler
+ from diffusers.utils import export_to_video, load_video
+
+ model_id = "Skywork/SkyReels-V2-DF-14B-540P-Diffusers"
+ vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
+ pipeline = SkyReelsV2DiffusionForcingVideoToVideoPipeline.from_pretrained(
+     model_id, vae=vae, torch_dtype=torch.bfloat16
+ )
+ flow_shift = 5.0  # 8.0 for T2V, 5.0 for I2V
+ pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config, flow_shift=flow_shift)
+ pipeline.to("cuda")
+
+ video = load_video("input_video.mp4")
+
+ prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
+
+ output = pipeline(
+     video=video, prompt=prompt, height=544, width=960, guidance_scale=5.0,
+     num_inference_steps=30, num_frames=257, base_num_frames=97,  # ar_step=5, causal_block_size=5,
+ ).frames[0]
+ export_to_video(output, "output.mp4", fps=24, quality=8)
+ # Total frames will be the number of frames of given video + 257
+ ```
386
+
387
+ - **Text To Video & Image To Video**
388
+
389
+ ```shell
390
+ # run Text-to-Video Generation
391
+ model_id=Skywork/SkyReels-V2-T2V-14B-540P
392
+ python3 generate_video.py \
393
+ --model_id ${model_id} \
394
+ --resolution 540P \
395
+ --num_frames 97 \
396
+ --guidance_scale 6.0 \
397
+ --shift 8.0 \
398
+ --fps 24 \
399
+ --prompt "A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface." \
400
+ --offload \
401
+ --teacache \
402
+ --use_ret_steps \
403
+ --teacache_thresh 0.3
404
+ ```
405
+ > **Note**:
406
+ > - When using an **image-to-video (I2V)** model, you must provide an input image using the `--image ${image_path}` parameter. `--guidance_scale 5.0` and `--shift 3.0` are recommended for the I2V model.
407
+ > - Generating a 540P video using the 1.3B model requires approximately 14.7GB peak VRAM, while the same resolution video using the 14B model demands around 43.4GB peak VRAM.
408
+
409
+ T2V models with `diffusers`:
+ ```py
+ import torch
+ from diffusers import (
+     SkyReelsV2Pipeline,
+     UniPCMultistepScheduler,
+     AutoencoderKLWan,
+ )
+ from diffusers.utils import export_to_video
+
+ # Load the pipeline
+ # Available models:
+ # - Skywork/SkyReels-V2-T2V-14B-540P-Diffusers
+ # - Skywork/SkyReels-V2-T2V-14B-720P-Diffusers
+ vae = AutoencoderKLWan.from_pretrained(
+     "Skywork/SkyReels-V2-T2V-14B-720P-Diffusers",
+     subfolder="vae",
+     torch_dtype=torch.float32,
+ )
+ pipe = SkyReelsV2Pipeline.from_pretrained(
+     "Skywork/SkyReels-V2-T2V-14B-720P-Diffusers",
+     vae=vae,
+     torch_dtype=torch.bfloat16,
+ )
+ flow_shift = 8.0  # 8.0 for T2V, 5.0 for I2V
+ pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=flow_shift)
+ pipe = pipe.to("cuda")
+
+ prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."
+
+ output = pipe(
+     prompt=prompt,
+     num_inference_steps=50,
+     height=544,
+     width=960,
+     guidance_scale=6.0,  # 6.0 for T2V, 5.0 for I2V
+     num_frames=97,
+ ).frames[0]
+ export_to_video(output, "video.mp4", fps=24, quality=8)
+ ```
449
+
450
+ I2V models with `diffusers`:
+ ```py
+ import torch
+ from diffusers import (
+     SkyReelsV2ImageToVideoPipeline,
+     UniPCMultistepScheduler,
+     AutoencoderKLWan,
+ )
+ from diffusers.utils import export_to_video
+ from PIL import Image
+
+ # Load the pipeline
+ # Available models:
+ # - Skywork/SkyReels-V2-I2V-1.3B-540P-Diffusers
+ # - Skywork/SkyReels-V2-I2V-14B-540P-Diffusers
+ # - Skywork/SkyReels-V2-I2V-14B-720P-Diffusers
+ vae = AutoencoderKLWan.from_pretrained(
+     "Skywork/SkyReels-V2-I2V-14B-720P-Diffusers",
+     subfolder="vae",
+     torch_dtype=torch.float32,
+ )
+ pipe = SkyReelsV2ImageToVideoPipeline.from_pretrained(
+     "Skywork/SkyReels-V2-I2V-14B-720P-Diffusers",
+     vae=vae,
+     torch_dtype=torch.bfloat16,
+ )
+ flow_shift = 5.0  # 8.0 for T2V, 5.0 for I2V
+ pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=flow_shift)
+ pipe = pipe.to("cuda")
+
+ prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."
+ image = Image.open("path/to/image.png")
+
+ output = pipe(
+     image=image,
+     prompt=prompt,
+     num_inference_steps=50,
+     height=544,
+     width=960,
+     guidance_scale=5.0,  # 6.0 for T2V, 5.0 for I2V
+     num_frames=97,
+ ).frames[0]
+ export_to_video(output, "video.mp4", fps=24, quality=8)
+ ```
494
+
495
+ - **Prompt Enhancer**
496
+
497
+ The prompt enhancer is based on <a href="https://huggingface.co/Qwen/Qwen2.5-32B-Instruct">Qwen2.5-32B-Instruct</a> and is enabled via the `--prompt_enhancer` parameter. It works best for short prompts; for long prompts, it might generate an excessively lengthy prompt that could lead to over-saturation in the generated video. Note that peak GPU memory is 64G+ if you use `--prompt_enhancer`. To obtain the enhanced prompt on its own, you can also run the prompt_enhancer script separately for testing. The steps are as follows:
498
+
499
+ ```shell
500
+ cd skyreels_v2_infer/pipelines
501
+ python3 prompt_enhancer.py --prompt "A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface."
502
+ ```
503
+ > **Note**:
504
+ > - `--prompt_enhancer` is not allowed when using `--use_usp`. We recommend running the skyreels_v2_infer/pipelines/prompt_enhancer.py script first to generate an enhanced prompt before enabling the `--use_usp` parameter.
505
+
506
+
507
+ **Advanced Configuration Options**
508
+
509
+ Below are the key parameters you can customize for video generation:
510
+
511
+ | Parameter | Recommended Value | Description |
512
+ |:----------------------:|:---------:|:-----------------------------------------:|
513
+ | --prompt | | Text description for generating your video |
514
+ | --image | | Path to input image for image-to-video generation |
515
+ | --resolution | 540P or 720P | Output video resolution (select based on model type) |
516
+ | --num_frames | 97 or 121 | Total frames to generate (**97 for 540P models**, **121 for 720P models**) |
517
+ | --inference_steps | 50 | Number of denoising steps |
518
+ | --fps | 24 | Frames per second in the output video |
519
+ | --shift | 8.0 or 5.0 | Flow matching scheduler parameter (**8.0 for T2V**, **5.0 for I2V**) |
520
+ | --guidance_scale | 6.0 or 5.0 | Controls text adherence strength (**6.0 for T2V**, **5.0 for I2V**) |
521
+ | --seed | | Fixed seed for reproducible results (omit for random generation) |
522
+ | --offload | True | Offloads model components to CPU to reduce VRAM usage (recommended) |
523
+ | --use_usp | True | Enables multi-GPU acceleration with xDiT USP |
524
+ | --outdir | ./video_out | Directory where generated videos will be saved |
525
+ | --prompt_enhancer | True | Expand the prompt into a more detailed description |
526
+ | --teacache | False | Enables teacache for faster inference |
527
+ | --teacache_thresh | 0.2 | Higher thresholds give a larger speedup at the cost of quality (0.1 for about 2.0x, 0.2 for about 3.0x) |
528
+ | --use_ret_steps | False | Enables retention steps for teacache (faster generation and better quality, according to the script's help text) |
529
+
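The recommended values in the table above depend on the task (T2V or I2V) and on the resolution. As a rough sketch, they can be collected in a small lookup. The names `RecommendedSettings` and `recommended_settings` are hypothetical and do not exist in the repository; the numbers are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecommendedSettings:
    num_frames: int
    shift: float
    guidance_scale: float


def recommended_settings(task: str, resolution: str) -> RecommendedSettings:
    """Return the table's recommended values for a task ("T2V"/"I2V") and resolution."""
    frames_by_resolution = {"540P": 97, "720P": 121}
    shift_by_task = {"T2V": 8.0, "I2V": 5.0}
    guidance_by_task = {"T2V": 6.0, "I2V": 5.0}
    if task not in shift_by_task:
        raise ValueError(f"unknown task: {task}")
    if resolution not in frames_by_resolution:
        raise ValueError(f"unknown resolution: {resolution}")
    return RecommendedSettings(
        num_frames=frames_by_resolution[resolution],
        shift=shift_by_task[task],
        guidance_scale=guidance_by_task[task],
    )
```

For example, `recommended_settings("T2V", "540P")` yields 97 frames, a shift of 8.0, and a guidance scale of 6.0.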
530
+ **Diffusion Forcing Additional Parameters**
531
+ | Parameter | Recommended Value | Description |
532
+ |:----------------------:|:---------:|:-----------------------------------------:|
533
+ | --ar_step | 0 | Controls asynchronous inference (0 for synchronous mode) |
534
+ | --base_num_frames | 97 or 121 | Base frame count (**97 for 540P**, **121 for 720P**) |
535
+ | --overlap_history | 17 | Number of frames to overlap for smooth transitions in long videos |
536
+ | --addnoise_condition | 20 | Improves consistency in long video generation |
537
+ | --causal_block_size | 5 | Recommended when using asynchronous inference (--ar_step > 0) |
538
+ | --video_path | | Path to input video for video extension |
539
+ | --end_image | | Path to input image for end frame control |
540
+
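To see how `--num_frames`, `--base_num_frames`, and `--overlap_history` interact for long videos, the sketch below estimates how many generation segments are needed. It assumes that every segment after the first produces `base_num_frames` frames, of which `overlap_history` are reused from the previous segment, so each extension adds `base_num_frames - overlap_history` new frames. This is an illustrative reading of the parameters, not the pipeline's actual code, and `count_segments` is a hypothetical name.

```python
import math


def count_segments(num_frames: int, base_num_frames: int, overlap_history: int) -> int:
    """Estimate the number of segments needed to reach num_frames (illustrative only)."""
    if num_frames <= base_num_frames:
        return 1  # a single segment already covers the request
    new_per_segment = base_num_frames - overlap_history
    if new_per_segment <= 0:
        raise ValueError("overlap_history must be smaller than base_num_frames")
    extra_frames = num_frames - base_num_frames
    return 1 + math.ceil(extra_frames / new_per_segment)
```

Under this assumption, the settings `--num_frames 257 --base_num_frames 97 --overlap_history 17` correspond to one base segment plus two extensions of 80 new frames each.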
541
+ #### Multi-GPU inference using xDiT USP
542
+
543
+ We use [xDiT](https://github.com/xdit-project/xDiT) USP to accelerate inference. For example, to generate a video with 2 GPUs, you can use the following command:
544
+ - **Diffusion Forcing**
545
+ ```shell
546
+ model_id=Skywork/SkyReels-V2-DF-14B-540P
547
+ # diffusion forcing synchronous inference
548
+ torchrun --nproc_per_node=2 generate_video_df.py \
549
+ --model_id ${model_id} \
550
+ --resolution 540P \
551
+ --ar_step 0 \
552
+ --base_num_frames 97 \
553
+ --num_frames 257 \
554
+ --overlap_history 17 \
555
+ --prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
556
+ --addnoise_condition 20 \
557
+ --use_usp \
558
+ --offload \
559
+ --seed 42
560
+ ```
561
+ - **Text To Video & Image To Video**
562
+ ```shell
563
+ # run Text-to-Video Generation
564
+ model_id=Skywork/SkyReels-V2-T2V-14B-540P
565
+ torchrun --nproc_per_node=2 generate_video.py \
566
+ --model_id ${model_id} \
567
+ --resolution 540P \
568
+ --num_frames 97 \
569
+ --guidance_scale 6.0 \
570
+ --shift 8.0 \
571
+ --fps 24 \
572
+ --offload \
573
+ --prompt "A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface." \
574
+ --use_usp \
575
+ --seed 42
576
+ ```
577
+ > **Note**:
578
+ > - When using an **image-to-video (I2V)** model, you must provide an input image with the `--image ${image_path}` parameter. `--guidance_scale 5.0` and `--shift 3.0` are recommended for I2V models.
579
+
580
+
581
+ ## Contents
582
+ - [Abstract](#abstract)
583
+ - [Methodology of SkyReels-V2](#methodology-of-skyreels-v2)
584
+ - [Key Contributions of SkyReels-V2](#key-contributions-of-skyreels-v2)
585
+ - [Video Captioner](#video-captioner)
586
+ - [Reinforcement Learning](#reinforcement-learning)
587
+ - [Diffusion Forcing](#diffusion-forcing)
588
+ - [High-Quality Supervised Fine-Tuning (SFT)](#high-quality-supervised-fine-tuning-sft)
589
+ - [Performance](#performance)
590
+ - [Acknowledgements](#acknowledgements)
591
+ - [Citation](#citation)
592
+ ---
593
+
594
+ ## Abstract
595
+ Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation.
596
+
597
+ To address these limitations, we introduce SkyReels-V2, the world's first infinite-length film generative model using a Diffusion Forcing framework. Our approach synergizes Multi-modal Large Language Models (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing techniques to achieve comprehensive optimization. Beyond its technical innovations, SkyReels-V2 enables multiple practical applications, including Story Generation, Image-to-Video Synthesis, Camera Director functionality, and multi-subject consistent video generation through our <a href="https://github.com/SkyworkAI/SkyReels-A2">SkyReels-A2</a> system.
598
+
599
+ ## Methodology of SkyReels-V2
600
+
601
+ The SkyReels-V2 methodology consists of several interconnected components. It starts with a comprehensive data processing pipeline that prepares training data of varying quality. At its core is the Video Captioner architecture, which provides detailed annotations for video content. The system employs a multi-task pretraining strategy to build fundamental video generation capabilities. Post-training optimization includes Reinforcement Learning to enhance motion quality, Diffusion Forcing Training for generating extended videos, and High-quality Supervised Fine-Tuning (SFT) stages for visual refinement. The model runs on optimized computational infrastructure for efficient training and inference. SkyReels-V2 supports multiple applications, including Story Generation, Image-to-Video Synthesis, Camera Director functionality, and Elements-to-Video Generation.
602
+
603
+ <p align="center">
604
+ <img src="assets/main_pipeline.jpg" alt="mainpipeline" width="100%">
605
+ </p>
606
+
607
+ ## Key Contributions of SkyReels-V2
608
+
609
+ #### Video Captioner
610
+
611
+ <a href="https://huggingface.co/Skywork/SkyCaptioner-V1">SkyCaptioner-V1</a> serves as our video captioning model for data annotation. It is trained on captions produced by the base model <a href="https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct">Qwen2.5-VL-72B-Instruct</a> and by sub-expert captioners on a balanced video dataset, a carefully curated set of approximately 2 million videos chosen to ensure conceptual balance and annotation quality. Built upon the <a href="https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct">Qwen2.5-VL-7B-Instruct</a> foundation model, <a href="https://huggingface.co/Skywork/SkyCaptioner-V1">SkyCaptioner-V1</a> is fine-tuned to enhance performance in domain-specific video captioning tasks. To compare it with SOTA models, we manually assessed accuracy across different captioning fields on a test set of 1,000 samples. <a href="https://huggingface.co/Skywork/SkyCaptioner-V1">SkyCaptioner-V1</a> achieves the highest average accuracy among the compared models, with especially large gains in the shot-related fields.
612
+
613
+ <p align="center">
614
+ <table align="center">
615
+ <thead>
616
+ <tr>
617
+ <th>model</th>
618
+ <th><a href="https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct">Qwen2.5-VL-7B-Ins.</a></th>
619
+ <th><a href="https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct">Qwen2.5-VL-72B-Ins.</a></th>
620
+ <th><a href="https://huggingface.co/omni-research/Tarsier2-Recap-7b">Tarsier2-Recap-7b</a></th>
621
+ <th><a href="https://huggingface.co/Skywork/SkyCaptioner-V1">SkyCaptioner-V1</a></th>
622
+ </tr>
623
+ </thead>
624
+ <tbody>
625
+ <tr>
626
+ <td>Avg accuracy</td>
627
+ <td>51.4%</td>
628
+ <td>58.7%</td>
629
+ <td>49.4%</td>
630
+ <td><strong>76.3%</strong></td>
631
+ </tr>
632
+ <tr>
633
+ <td>shot type</td>
634
+ <td>76.8%</td>
635
+ <td>82.5%</td>
636
+ <td>60.2%</td>
637
+ <td><strong>93.7%</strong></td>
638
+ </tr>
639
+ <tr>
640
+ <td>shot angle</td>
641
+ <td>60.0%</td>
642
+ <td>73.7%</td>
643
+ <td>52.4%</td>
644
+ <td><strong>89.8%</strong></td>
645
+ </tr>
646
+ <tr>
647
+ <td>shot position</td>
648
+ <td>28.4%</td>
649
+ <td>32.7%</td>
650
+ <td>23.6%</td>
651
+ <td><strong>83.1%</strong></td>
652
+ </tr>
653
+ <tr>
654
+ <td>camera motion</td>
655
+ <td>62.0%</td>
656
+ <td>61.2%</td>
657
+ <td>45.3%</td>
658
+ <td><strong>85.3%</strong></td>
659
+ </tr>
660
+ <tr>
661
+ <td>expression</td>
662
+ <td>43.6%</td>
663
+ <td>51.5%</td>
664
+ <td>54.3%</td>
665
+ <td><strong>68.8%</strong></td>
666
+ </tr>
667
+ <tr>
668
+ <td colspan="5" style="text-align: center; border-bottom: 1px solid #ddd; padding: 8px;"></td>
669
+ </tr>
670
+ <tr>
671
+ <td>TYPES_type</td>
672
+ <td>43.5%</td>
673
+ <td>49.7%</td>
674
+ <td>47.6%</td>
675
+ <td><strong>82.5%</strong></td>
676
+ </tr>
677
+ <tr>
678
+ <td>TYPES_sub_type</td>
679
+ <td>38.9%</td>
680
+ <td>44.9%</td>
681
+ <td>45.9%</td>
682
+ <td><strong>75.4%</strong></td>
683
+ </tr>
684
+ <tr>
685
+ <td>appearance</td>
686
+ <td>40.9%</td>
687
+ <td>52.0%</td>
688
+ <td>45.6%</td>
689
+ <td><strong>59.3%</strong></td>
690
+ </tr>
691
+ <tr>
692
+ <td>action</td>
693
+ <td>32.4%</td>
694
+ <td>52.0%</td>
695
+ <td><strong>69.8%</strong></td>
696
+ <td>68.8%</td>
697
+ </tr>
698
+ <tr>
699
+ <td>position</td>
700
+ <td>35.4%</td>
701
+ <td>48.6%</td>
702
+ <td>45.5%</td>
703
+ <td><strong>57.5%</strong></td>
704
+ </tr>
705
+ <tr>
706
+ <td>is_main_subject</td>
707
+ <td>58.5%</td>
708
+ <td>68.7%</td>
709
+ <td>69.7%</td>
710
+ <td><strong>80.9%</strong></td>
711
+ </tr>
712
+ <tr>
713
+ <td>environment</td>
714
+ <td>70.4%</td>
715
+ <td><strong>72.7%</strong></td>
716
+ <td>61.4%</td>
717
+ <td>70.5%</td>
718
+ </tr>
719
+ <tr>
720
+ <td>lighting</td>
721
+ <td>77.1%</td>
722
+ <td><strong>80.0%</strong></td>
723
+ <td>21.2%</td>
724
+ <td>76.5%</td>
725
+ </tr>
726
+ </tbody>
727
+ </table>
728
+ </p>
729
+
730
+ #### Reinforcement Learning
731
+ Inspired by the previous success of reinforcement learning in LLMs, we propose to enhance the performance of the generative model with Reinforcement Learning. Specifically, we focus on motion quality because we find that the main drawbacks of our generative model are:
732
+
733
+ - the generative model does not handle large, deformable motions well.
734
+ - the generated videos may violate physical laws.
735
+
736
+ To avoid the degradation in other metrics, such as text alignment and video quality, we ensure the preference data pairs have comparable text alignment and video quality, while only the motion quality varies. This requirement poses greater challenges in obtaining preference annotations due to the inherently higher costs of human annotation. To address this challenge, we propose a semi-automatic pipeline that strategically combines automatically generated motion pairs and human annotation results. This hybrid approach not only enhances the data scale but also improves alignment with human preferences through curated quality control. Leveraging this enhanced dataset, we first train a specialized reward model to capture the generic motion quality differences between paired samples. This learned reward function subsequently guides the sample selection process for Direct Preference Optimization (DPO), enhancing the motion quality of the generative model.
737
+
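For readers unfamiliar with Direct Preference Optimization, the sketch below shows the standard DPO loss for a single preference pair, written with scalar log-likelihoods. It is the generic textbook objective and not necessarily the exact formulation used to train SkyReels-V2. The function name, the argument names, and the default `beta` are assumptions made for illustration.

```python
import math


def dpo_loss(
    policy_logp_win: float,
    policy_logp_lose: float,
    ref_logp_win: float,
    ref_logp_lose: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss -log(sigmoid(beta * margin)) for one preference pair."""
    # How much more the policy prefers the winning sample than the reference model does.
    margin = (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    x = beta * margin
    # Numerically stable evaluation of log(1 + exp(-x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

The loss equals log 2 when the policy and the reference agree, and it falls as the policy assigns relatively more likelihood to the preferred (better-motion) sample.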
738
+ #### Diffusion Forcing
739
+
740
+ We introduce the Diffusion Forcing Transformer to unlock our model's ability to generate long videos. Diffusion Forcing is a training and sampling strategy where each token is assigned an independent noise level. This allows tokens to be denoised according to arbitrary, per-token schedules. Conceptually, this approach functions as a form of partial masking: a token with zero noise is fully unmasked, while complete noise fully masks it. Diffusion Forcing trains the model to "unmask" any combination of variably noised tokens, using the cleaner tokens as conditional information to guide the recovery of noisy ones. Building on this, our Diffusion Forcing Transformer can extend video generation indefinitely based on the last frames of the previous segment. Note that the synchronous full sequence diffusion is a special case of Diffusion Forcing, where all tokens share the same noise level. This relationship allows us to fine-tune the Diffusion Forcing Transformer from a full-sequence diffusion model.
741
+
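The per-token noise idea can be illustrated with a toy example. The sketch below treats each token as a single float and mixes it with Gaussian noise as `x_t = (1 - t) * x_0 + t * eps`. The linear interpolation, the uniform sampling of noise levels, and both function names are assumptions chosen for illustration; they are not taken from the SkyReels-V2 training code.

```python
import random
from typing import List


def sample_noise_levels(num_tokens: int, rng: random.Random, synchronous: bool = False) -> List[float]:
    """Draw one noise level in [0, 1) per token, or a single shared level if synchronous."""
    if synchronous:
        shared = rng.random()
        return [shared] * num_tokens  # full-sequence diffusion as a special case
    return [rng.random() for _ in range(num_tokens)]


def add_noise(clean_tokens: List[float], noise_levels: List[float], rng: random.Random) -> List[float]:
    """Noise each token independently: level 0 keeps it clean, level 1 is pure noise."""
    if len(clean_tokens) != len(noise_levels):
        raise ValueError("one noise level is required per token")
    noisy = []
    for x0, t in zip(clean_tokens, noise_levels):
        eps = rng.gauss(0.0, 1.0)
        noisy.append((1.0 - t) * x0 + t * eps)
    return noisy
```

Setting the levels of the history tokens to 0 and those of the new tokens to 1 corresponds to the video-extension case described above: the clean history stays untouched and conditions the recovery of the fully noised continuation.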
742
+ #### High-Quality Supervised Fine-Tuning (SFT)
743
+
744
+ We implement two sequential high-quality supervised fine-tuning (SFT) stages at 540p and 720p resolutions respectively. The first SFT stage is conducted immediately after pretraining but before the reinforcement learning (RL) stage. It serves as a concept-balancing trainer, building on the foundation model's pretraining, which used only 24 fps video data, while removing the FPS embedding components to streamline the architecture. Trained on high-quality, concept-balanced samples, this stage establishes optimized initialization parameters for subsequent training. After the diffusion forcing stage, we run a second, high-resolution SFT at 720p, using the same loss formulation and higher-quality, concept-balanced datasets obtained through manual filtering. This final refinement stage focuses on increasing resolution so that overall video quality is further enhanced.
745
+
746
+ ## Performance
747
+
748
+ To comprehensively evaluate our proposed method, we construct SkyReels-Bench for human assessment and use the open-source <a href="https://github.com/Vchitect/VBench">V-Bench</a> for automated evaluation. This allows us to compare our model with the state-of-the-art (SOTA) baselines, including both open-source and proprietary models.
749
+
750
+ #### Human Evaluation
751
+
752
+ For human evaluation, we design SkyReels-Bench with 1,020 text prompts, systematically assessing four dimensions: Instruction Adherence, Motion Quality, Consistency, and Visual Quality. This benchmark is designed to evaluate both text-to-video (T2V) and image-to-video (I2V) generation models, providing comprehensive assessment across different generation paradigms. To ensure fairness, all models were evaluated under default settings with consistent resolutions, and no post-generation filtering was applied.
753
+
754
+ - Text To Video Models
755
+
756
+ <p align="center">
757
+ <table align="center">
758
+ <thead>
759
+ <tr>
760
+ <th>Model Name</th>
761
+ <th>Average</th>
762
+ <th>Instruction Adherence</th>
763
+ <th>Consistency</th>
764
+ <th>Visual Quality</th>
765
+ <th>Motion Quality</th>
766
+ </tr>
767
+ </thead>
768
+ <tbody>
769
+ <tr>
770
+ <td><a href="https://runwayml.com/research/introducing-gen-3-alpha">Runway-Gen3 Alpha</a></td>
771
+ <td>2.53</td>
772
+ <td>2.19</td>
773
+ <td>2.57</td>
774
+ <td>3.23</td>
775
+ <td>2.11</td>
776
+ </tr>
777
+ <tr>
778
+ <td><a href="https://github.com/Tencent/HunyuanVideo">HunyuanVideo-13B</a></td>
779
+ <td>2.82</td>
780
+ <td>2.64</td>
781
+ <td>2.81</td>
782
+ <td>3.20</td>
783
+ <td>2.61</td>
784
+ </tr>
785
+ <tr>
786
+ <td><a href="https://klingai.com">Kling-1.6 STD Mode</a></td>
787
+ <td>2.99</td>
788
+ <td>2.77</td>
789
+ <td>3.05</td>
790
+ <td>3.39</td>
791
+ <td><strong>2.76</strong></td>
792
+ </tr>
793
+ <tr>
794
+ <td><a href="https://hailuoai.video">Hailuo-01</a></td>
795
+ <td>3.0</td>
796
+ <td>2.8</td>
797
+ <td>3.08</td>
798
+ <td>3.29</td>
799
+ <td>2.74</td>
800
+ </tr>
801
+ <tr>
802
+ <td><a href="https://github.com/Wan-Video/Wan2.1">Wan2.1-14B</a></td>
803
+ <td>3.12</td>
804
+ <td>2.91</td>
805
+ <td>3.31</td>
806
+ <td><strong>3.54</strong></td>
807
+ <td>2.71</td>
808
+ </tr>
809
+ <tr>
810
+ <td>SkyReels-V2</td>
811
+ <td><strong>3.14</strong></td>
812
+ <td><strong>3.15</strong></td>
813
+ <td><strong>3.35</strong></td>
814
+ <td>3.34</td>
815
+ <td>2.74</td>
816
+ </tr>
817
+ </tbody>
818
+ </table>
819
+ </p>
820
+
821
+ The evaluation demonstrates that our model achieves significant advancements in **instruction adherence (3.15)** compared to baseline methods, while maintaining competitive performance in **motion quality (2.74)** without sacrificing the **consistency (3.35)**.
822
+
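The Average column in the text-to-video table appears to be the unweighted mean of the four dimension scores. This is an observation from the published numbers, not something the text states. The short check below recomputes the mean for each row; the dictionary simply transcribes the table, and `max_deviation` is a hypothetical helper.

```python
# model: (reported average, instruction adherence, consistency, visual quality, motion quality)
T2V_SCORES = {
    "Runway-Gen3 Alpha": (2.53, 2.19, 2.57, 3.23, 2.11),
    "HunyuanVideo-13B": (2.82, 2.64, 2.81, 3.20, 2.61),
    "Kling-1.6 STD Mode": (2.99, 2.77, 3.05, 3.39, 2.76),
    "Hailuo-01": (3.0, 2.8, 3.08, 3.29, 2.74),
    "Wan2.1-14B": (3.12, 2.91, 3.31, 3.54, 2.71),
    "SkyReels-V2": (3.14, 3.15, 3.35, 3.34, 2.74),
}


def mean_of_dimensions(row):
    """Unweighted mean of the four dimension scores (everything after the reported average)."""
    dims = row[1:]
    return sum(dims) / len(dims)


def max_deviation(scores):
    """Largest absolute gap between the reported average and the recomputed mean."""
    return max(abs(mean_of_dimensions(row) - row[0]) for row in scores.values())
```

Every recomputed mean lies close to the reported average, which is consistent with rounding of an unweighted mean.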
823
+ - Image To Video Models
824
+
825
+ <p align="center">
826
+ <table align="center">
827
+ <thead>
828
+ <tr>
829
+ <th>Model</th>
830
+ <th>Average</th>
831
+ <th>Instruction Adherence</th>
832
+ <th>Consistency</th>
833
+ <th>Visual Quality</th>
834
+ <th>Motion Quality</th>
835
+ </tr>
836
+ </thead>
837
+ <tbody>
838
+ <tr>
839
+ <td><a href="https://github.com/Tencent/HunyuanVideo">HunyuanVideo-13B</a></td>
840
+ <td>2.84</td>
841
+ <td>2.97</td>
842
+ <td>2.95</td>
843
+ <td>2.87</td>
844
+ <td>2.56</td>
845
+ </tr>
846
+ <tr>
847
+ <td><a href="https://github.com/Wan-Video/Wan2.1">Wan2.1-14B</a></td>
848
+ <td>2.85</td>
849
+ <td>3.10</td>
850
+ <td>2.81</td>
851
+ <td>3.00</td>
852
+ <td>2.48</td>
853
+ </tr>
854
+ <tr>
855
+ <td><a href="https://hailuoai.video">Hailuo-01</a></td>
856
+ <td>3.05</td>
857
+ <td>3.31</td>
858
+ <td>2.58</td>
859
+ <td>3.55</td>
860
+ <td>2.74</td>
861
+ </tr>
862
+ <tr>
863
+ <td><a href="https://klingai.com">Kling-1.6 Pro Mode</a></td>
864
+ <td>3.4</td>
865
+ <td>3.56</td>
866
+ <td>3.03</td>
867
+ <td>3.58</td>
868
+ <td>3.41</td>
869
+ </tr>
870
+ <tr>
871
+ <td><a href="https://runwayml.com/research/introducing-runway-gen-4">Runway-Gen4</a></td>
872
+ <td>3.39</td>
873
+ <td>3.75</td>
874
+ <td>3.2</td>
875
+ <td>3.4</td>
876
+ <td>3.37</td>
877
+ </tr>
878
+ <tr>
879
+ <td>SkyReels-V2-DF</td>
880
+ <td>3.24</td>
881
+ <td>3.64</td>
882
+ <td>3.21</td>
883
+ <td>3.18</td>
884
+ <td>2.93</td>
885
+ </tr>
886
+ <tr>
887
+ <td>SkyReels-V2-I2V</td>
888
+ <td>3.29</td>
889
+ <td>3.42</td>
890
+ <td>3.18</td>
891
+ <td>3.56</td>
892
+ <td>3.01</td>
893
+ </tr>
894
+ </tbody>
895
+ </table>
896
+ </p>
897
+
898
+ Our results demonstrate that both **SkyReels-V2-I2V (3.29)** and **SkyReels-V2-DF (3.24)** achieve state-of-the-art performance among open-source models, significantly outperforming HunyuanVideo-13B (2.84) and Wan2.1-14B (2.85) across all quality dimensions. With an average score of 3.29, SkyReels-V2-I2V demonstrates comparable performance to proprietary models Kling-1.6 (3.4) and Runway-Gen4 (3.39).
899
+
900
+
901
+ #### VBench
902
+ To objectively compare SkyReels-V2 against other leading open-source text-to-video models, we conduct comprehensive evaluations using the public benchmark <a href="https://github.com/Vchitect/VBench">V-Bench</a>. Our evaluation specifically uses the benchmark's longer-version prompts. For a fair comparison, we strictly follow each baseline model's recommended inference settings.
903
+
904
+ <p align="center">
905
+ <table align="center">
906
+ <thead>
907
+ <tr>
908
+ <th>Model</th>
909
+ <th>Total Score</th>
910
+ <th>Quality Score</th>
911
+ <th>Semantic Score</th>
912
+ </tr>
913
+ </thead>
914
+ <tbody>
915
+ <tr>
916
+ <td><a href="https://github.com/hpcaitech/Open-Sora">OpenSora 2.0</a></td>
917
+ <td>81.5 %</td>
918
+ <td>82.1 %</td>
919
+ <td>78.2 %</td>
920
+ </tr>
921
+ <tr>
922
+ <td><a href="https://github.com/THUDM/CogVideo">CogVideoX1.5-5B</a></td>
923
+ <td>80.3 %</td>
924
+ <td>80.9 %</td>
925
+ <td>77.9 %</td>
926
+ </tr>
927
+ <tr>
928
+ <td><a href="https://github.com/Tencent/HunyuanVideo">HunyuanVideo-13B</a></td>
929
+ <td>82.7 %</td>
930
+ <td>84.4 %</td>
931
+ <td>76.2 %</td>
932
+ </tr>
933
+ <tr>
934
+ <td><a href="https://github.com/Wan-Video/Wan2.1">Wan2.1-14B</a></td>
935
+ <td>83.7 %</td>
936
+ <td>84.2 %</td>
937
+ <td><strong>81.4 %</strong></td>
938
+ </tr>
939
+ <tr>
940
+ <td>SkyReels-V2</td>
941
+ <td><strong>83.9 %</strong></td>
942
+ <td><strong>84.7 %</strong></td>
943
+ <td>80.8 %</td>
944
+ </tr>
945
+ </tbody>
946
+ </table>
947
+ </p>
948
+
949
+ The VBench results demonstrate that SkyReels-V2 outperforms all compared models, including HunyuanVideo-13B and Wan2.1-14B, with the highest **total score (83.9%)** and **quality score (84.7%)**. In this evaluation, our semantic score is slightly lower than that of Wan2.1-14B, even though we outperform Wan2.1-14B in human evaluations; we attribute the gap primarily to V-Bench's insufficient evaluation of shot-scenario semantic adherence.
950
+
951
+ ## Acknowledgements
952
+ We would like to thank the contributors of the <a href="https://github.com/Wan-Video/Wan2.1">Wan 2.1</a>, <a href="https://github.com/xdit-project/xDiT">xDiT</a> and <a href="https://qwenlm.github.io/blog/qwen2.5/">Qwen 2.5</a> repositories for their open research and contributions.
953
+
954
+ ## Citation
955
+
956
+ ```bibtex
957
+ @misc{chen2025skyreelsv2infinitelengthfilmgenerative,
958
+ title={SkyReels-V2: Infinite-length Film Generative Model},
959
+ author={Guibin Chen and Dixuan Lin and Jiangping Yang and Chunze Lin and Junchen Zhu and Mingyuan Fan and Hao Zhang and Sheng Chen and Zheng Chen and Chengcheng Ma and Weiming Xiong and Wei Wang and Nuo Pang and Kang Kang and Zhiheng Xu and Yuzhe Jin and Yupeng Liang and Yubing Song and Peng Zhao and Boyuan Xu and Di Qiu and Debang Li and Zhengcong Fei and Yang Li and Yahui Zhou},
960
+ year={2025},
961
+ eprint={2504.13074},
962
+ archivePrefix={arXiv},
963
+ primaryClass={cs.CV},
964
+ url={https://arxiv.org/abs/2504.13074},
965
+ }
966
+ ```
app.py ADDED
@@ -0,0 +1,63 @@
+ # app.py
+ import os
+ import tempfile
+ import torch
+ import gradio as gr
+ from pathlib import Path
+ # Import your real pipeline / model here (e.g. from diffuse_video import VideoGenerator)
+
+ # Minimal example: replace the `generate_video` function with your real pipeline.
+ def load_model():
+     # Example: if you have a model on the HF Hub, you can load it here
+     # model = YourModelClass.from_pretrained("path_or_repo")
+     # model.to(device)
+     device = "cuda" if torch.cuda.is_available() else "cpu"
+     print("Device:", device)
+     # Placeholder: nothing to load here
+     model = None
+     return model, device
+
+ MODEL, DEVICE = load_model()
+
+ def generate_video(prompt: str, duration_sec: int = 3):
+     """
+     Replace the body of this function with the real call to your model.
+     Must return the path to a video file (.mp4).
+     """
+     # === Example stub (generates a silent black video) ===
+     # In the real version, call MODEL and save the output as mp4.
+     import numpy as np
+     import imageio.v2 as imageio
+
+     fps = 24
+     w, h = 320, 240
+     frames = []
+     nframes = max(1, int(duration_sec * fps))
+     for i in range(nframes):
+         # Black frame (placeholder; no text is drawn)
+         frame = np.zeros((h, w, 3), dtype=np.uint8)
+         frames.append(frame)
+
+     out = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False)
+     out_path = out.name
+     out.close()
+     imageio.mimwrite(out_path, frames, fps=fps, macro_block_size=None)  # requires ffmpeg
+     return out_path
+
+ # Gradio interface
+ with gr.Blocks() as demo:
+     gr.Markdown("# AI video demo")
+     with gr.Row():
+         prompt = gr.Textbox(label="Prompt / Description", lines=2, placeholder="Describe what the video should contain")
+         duration = gr.Slider(1, 10, value=3, step=1, label="Duration (sec)")
+     gen_btn = gr.Button("Generate")
+     video_out = gr.Video(label="Generated video")
+
+     def run(prompt, duration):
+         path = generate_video(prompt, duration)
+         return path
+
+     gen_btn.click(run, inputs=[prompt, duration], outputs=[video_out])
+
+ if __name__ == "__main__":
+     demo.launch(server_name="0.0.0.0", server_port=int(os.environ.get("PORT", 7860)))
generate_video.py ADDED
@@ -0,0 +1,161 @@
+ import argparse
+ import gc
+ import os
+ import random
+ import time
+
+ import imageio
+ import torch
+ from diffusers.utils import load_image
+
+ from skyreels_v2_infer.modules import download_model
+ from skyreels_v2_infer.pipelines import Image2VideoPipeline
+ from skyreels_v2_infer.pipelines import PromptEnhancer
+ from skyreels_v2_infer.pipelines import resizecrop
+ from skyreels_v2_infer.pipelines import Text2VideoPipeline
+
+ MODEL_ID_CONFIG = {
+     "text2video": [
+         "Skywork/SkyReels-V2-T2V-14B-540P",
+         "Skywork/SkyReels-V2-T2V-14B-720P",
+     ],
+     "image2video": [
+         "Skywork/SkyReels-V2-I2V-1.3B-540P",
+         "Skywork/SkyReels-V2-I2V-14B-540P",
+         "Skywork/SkyReels-V2-I2V-14B-720P",
+     ],
+ }
+
+
+ if __name__ == "__main__":
+
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--outdir", type=str, default="video_out")
+     parser.add_argument("--model_id", type=str, default="Skywork/SkyReels-V2-T2V-14B-540P")
+     parser.add_argument("--resolution", type=str, choices=["540P", "720P"])
+     parser.add_argument("--num_frames", type=int, default=97)
+     parser.add_argument("--image", type=str, default=None)
+     parser.add_argument("--guidance_scale", type=float, default=6.0)
+     parser.add_argument("--shift", type=float, default=8.0)
+     parser.add_argument("--inference_steps", type=int, default=30)
+     parser.add_argument("--use_usp", action="store_true")
+     parser.add_argument("--offload", action="store_true")
+     parser.add_argument("--fps", type=int, default=24)
+     parser.add_argument("--seed", type=int, default=None)
+     parser.add_argument(
+         "--prompt",
+         type=str,
+         default="A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface.",
+     )
+     parser.add_argument("--prompt_enhancer", action="store_true")
+     parser.add_argument("--teacache", action="store_true")
+     parser.add_argument(
+         "--teacache_thresh",
+         type=float,
+         default=0.2,
+         help="A higher threshold gives more speedup but worse quality -- 0.1 for 2.0x speedup -- 0.2 for 3.0x speedup")
+     parser.add_argument(
+         "--use_ret_steps",
+         action="store_true",
+         help="Using Retention Steps will result in faster generation speed and better generation quality.")
+     args = parser.parse_args()
+
+     args.model_id = download_model(args.model_id)
+     print("model_id:", args.model_id)
+
+     assert (args.use_usp and args.seed is not None) or (not args.use_usp), "usp mode need seed"
+     if args.seed is None:
+         random.seed(time.time())
+         args.seed = int(random.randrange(4294967294))
+
+     if args.resolution == "540P":
+         height = 544
+         width = 960
+     elif args.resolution == "720P":
+         height = 720
+         width = 1280
+     else:
+         raise ValueError(f"Invalid resolution: {args.resolution}")
+
+     image = load_image(args.image).convert("RGB") if args.image else None
+     negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
+     local_rank = 0
+     if args.use_usp:
+         assert not args.prompt_enhancer, "`--prompt_enhancer` is not allowed if using `--use_usp`. We recommend running the skyreels_v2_infer/pipelines/prompt_enhancer.py script first to generate enhanced prompt before enabling the `--use_usp` parameter."
+         from xfuser.core.distributed import initialize_model_parallel, init_distributed_environment
+         import torch.distributed as dist
+
+         dist.init_process_group("nccl")
+         local_rank = dist.get_rank()
+         torch.cuda.set_device(dist.get_rank())
+         device = "cuda"
+
+         init_distributed_environment(rank=dist.get_rank(), world_size=dist.get_world_size())
+
+         initialize_model_parallel(
+             sequence_parallel_degree=dist.get_world_size(),
+             ring_degree=1,
+             ulysses_degree=dist.get_world_size(),
+         )
+
+     prompt_input = args.prompt
+     if args.prompt_enhancer and args.image is None:
+         print("init prompt enhancer")
+         prompt_enhancer = PromptEnhancer()
+         prompt_input = prompt_enhancer(prompt_input)
+         print(f"enhanced prompt: {prompt_input}")
+         del prompt_enhancer
+         gc.collect()
+         torch.cuda.empty_cache()
+
+     if image is None:
+         assert "T2V" in args.model_id, f"check model_id:{args.model_id}"
+         print("init text2video pipeline")
+         pipe = Text2VideoPipeline(
+             model_path=args.model_id, dit_path=args.model_id, use_usp=args.use_usp, offload=args.offload
+         )
+     else:
+         assert "I2V" in args.model_id, f"check model_id:{args.model_id}"
+         print("init img2video pipeline")
+         pipe = Image2VideoPipeline(
+             model_path=args.model_id, dit_path=args.model_id, use_usp=args.use_usp, offload=args.offload
+         )
+         args.image = load_image(args.image)
+         image_width, image_height = args.image.size
+         if image_height > image_width:
+             height, width = width, height
+         args.image = resizecrop(args.image, height, width)
+
+     if args.teacache:
+         pipe.transformer.initialize_teacache(enable_teacache=True, num_steps=args.inference_steps,
+                                              teacache_thresh=args.teacache_thresh, use_ret_steps=args.use_ret_steps,
+                                              ckpt_dir=args.model_id)
+
+
+     kwargs = {
+         "prompt": prompt_input,
+         "negative_prompt": negative_prompt,
+         "num_frames": args.num_frames,
+         "num_inference_steps": args.inference_steps,
+         "guidance_scale": args.guidance_scale,
+         "shift": args.shift,
+         "generator": torch.Generator(device="cuda").manual_seed(args.seed),
+         "height": height,
+         "width": width,
+     }
+
+     if image is not None:
+         kwargs["image"] = args.image.convert("RGB")
+
+     save_dir = os.path.join("result", args.outdir)
+     os.makedirs(save_dir, exist_ok=True)
+
+     with torch.cuda.amp.autocast(dtype=pipe.transformer.dtype), torch.no_grad():
+         print(f"infer kwargs:{kwargs}")
+         video_frames = pipe(**kwargs)[0]
+
+     if local_rank == 0:
+         current_time = time.strftime("%Y-%m-%d_%H-%M-%S", time.localtime())
+         video_out_file = f"{args.prompt[:100].replace('/','')}_{args.seed}_{current_time}.mp4"
+         output_path = os.path.join(save_dir, video_out_file)
+         imageio.mimwrite(output_path, video_frames, fps=args.fps, quality=8, output_params=["-loglevel", "error"])
generate_video_df.py ADDED
@@ -0,0 +1,220 @@
+ import argparse
+ import gc
+ import os
+ import random
+ import time
+
+ import imageio
+ import torch
+ from diffusers.utils import load_image
+
+ from skyreels_v2_infer import DiffusionForcingPipeline
+ from skyreels_v2_infer.modules import download_model
+ from skyreels_v2_infer.pipelines import PromptEnhancer
+ from skyreels_v2_infer.pipelines.image2video_pipeline import resizecrop
+ from moviepy.editor import VideoFileClip
+
+
+ def get_video_num_frames_moviepy(video_path):
+     with VideoFileClip(video_path) as clip:
+         num_frames = 0
+         for _ in clip.iter_frames():
+             num_frames += 1
+         return clip.size, num_frames
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--outdir", type=str, default="diffusion_forcing")
+     parser.add_argument("--model_id", type=str, default="Skywork/SkyReels-V2-DF-1.3B-540P")
+     parser.add_argument("--resolution", type=str, choices=["540P", "720P"])
+     parser.add_argument("--num_frames", type=int, default=97)
+     parser.add_argument("--image", type=str, default=None)
+     parser.add_argument("--end_image", type=str, default=None)
+     parser.add_argument("--video_path", type=str, default='')
+     parser.add_argument("--ar_step", type=int, default=0)
+     parser.add_argument("--causal_attention", action="store_true")
+     parser.add_argument("--causal_block_size", type=int, default=1)
+     parser.add_argument("--base_num_frames", type=int, default=97)
+     parser.add_argument("--overlap_history", type=int, default=None)
+     parser.add_argument("--addnoise_condition", type=int, default=0)
+     parser.add_argument("--guidance_scale", type=float, default=6.0)
+     parser.add_argument("--shift", type=float, default=8.0)
+     parser.add_argument("--inference_steps", type=int, default=30)
+     parser.add_argument("--use_usp", action="store_true")
+     parser.add_argument("--offload", action="store_true")
+     parser.add_argument("--fps", type=int, default=24)
+     parser.add_argument("--seed", type=int, default=None)
+     parser.add_argument(
+         "--prompt",
+         type=str,
+         default="A woman in a leather jacket and sunglasses riding a vintage motorcycle through a desert highway at sunset, her hair blowing wildly in the wind as the motorcycle kicks up dust, with the golden sun casting long shadows across the barren landscape.",
+     )
+     parser.add_argument("--prompt_enhancer", action="store_true")
+     parser.add_argument("--teacache", action="store_true")
+     parser.add_argument(
+         "--teacache_thresh",
+         type=float,
+         default=0.2,
+         help="Higher speedup comes at the cost of quality: 0.1 for 2.0x speedup, 0.2 for 3.0x speedup")
+     parser.add_argument(
+         "--use_ret_steps",
+         action="store_true",
+         help="Using Retention Steps will result in faster generation speed and better generation quality.")
+     args = parser.parse_args()
+
+     args.model_id = download_model(args.model_id)
+     print("model_id:", args.model_id)
+
+     assert (args.use_usp and args.seed is not None) or (not args.use_usp), "USP mode requires an explicit --seed"
+     if args.seed is None:
+         random.seed(time.time())
+         args.seed = int(random.randrange(4294967294))
+
+     if args.resolution == "540P":
+         height = 544
+         width = 960
+     elif args.resolution == "720P":
+         height = 720
+         width = 1280
+     else:
+         raise ValueError(f"Invalid resolution: {args.resolution}")
+
+     num_frames = args.num_frames
+     fps = args.fps
+
+     if num_frames > args.base_num_frames:
+         assert (
+             args.overlap_history is not None
+         ), 'You must specify "overlap_history" for long video generation. Recommended values are 17 and 37.'
+         if args.addnoise_condition > 60:
+             print(
+                 f'You have set "addnoise_condition" to {args.addnoise_condition}. The value is too large and can cause inconsistency in long video generation. The recommended value is 20.'
+             )
+
+     guidance_scale = args.guidance_scale
+     shift = args.shift
+
+     negative_prompt = "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"
+
+     save_dir = os.path.join("result", args.outdir)
+     os.makedirs(save_dir, exist_ok=True)
+     local_rank = 0
+     if args.use_usp:
+         assert not args.prompt_enhancer, "`--prompt_enhancer` is not allowed if using `--use_usp`. We recommend running the skyreels_v2_infer/pipelines/prompt_enhancer.py script first to generate an enhanced prompt before enabling the `--use_usp` parameter."
+         from xfuser.core.distributed import initialize_model_parallel, init_distributed_environment
+         import torch.distributed as dist
+
+         dist.init_process_group("nccl")
+         local_rank = dist.get_rank()
+         torch.cuda.set_device(dist.get_rank())
+         device = "cuda"
+
+         init_distributed_environment(rank=dist.get_rank(), world_size=dist.get_world_size())
+
+         initialize_model_parallel(
+             sequence_parallel_degree=dist.get_world_size(),
+             ring_degree=1,
+             ulysses_degree=dist.get_world_size(),
+         )
+
+     prompt_input = args.prompt
+     if args.prompt_enhancer and args.image is None:
+         print("init prompt enhancer")
+         prompt_enhancer = PromptEnhancer()
+         prompt_input = prompt_enhancer(prompt_input)
+         print(f"enhanced prompt: {prompt_input}")
+         del prompt_enhancer
+         gc.collect()
+         torch.cuda.empty_cache()
+
+     pipe = DiffusionForcingPipeline(
+         args.model_id,
+         dit_path=args.model_id,
+         device=torch.device("cuda"),
+         weight_dtype=torch.bfloat16,
+         use_usp=args.use_usp,
+         offload=args.offload,
+     )
+
+     if args.causal_attention:
+         pipe.transformer.set_ar_attention(args.causal_block_size)
+
+     if args.teacache:
+         if args.ar_step > 0:
+             num_steps = args.inference_steps + (((args.base_num_frames - 1) // 4 + 1) // args.causal_block_size - 1) * args.ar_step
+             print('num_steps:', num_steps)
+         else:
+             num_steps = args.inference_steps
+         pipe.transformer.initialize_teacache(enable_teacache=True, num_steps=num_steps,
+                                              teacache_thresh=args.teacache_thresh, use_ret_steps=args.use_ret_steps,
+                                              ckpt_dir=args.model_id)
+
+     print(f"prompt:{prompt_input}")
+     print(f"guidance_scale:{guidance_scale}")
+
+     if os.path.exists(args.video_path):
+         (v_width, v_height), input_num_frames = get_video_num_frames_moviepy(args.video_path)
+         assert args.overlap_history is not None and input_num_frames >= args.overlap_history, '"overlap_history" must be set and the input video must have at least that many frames.'
+
+         if v_height > v_width:
+             width, height = height, width
+
+         video_frames = pipe.extend_video(
+             prompt=prompt_input,
+             negative_prompt=negative_prompt,
+             prefix_video_path=args.video_path,
+             height=height,
+             width=width,
+             num_frames=num_frames,
+             num_inference_steps=args.inference_steps,
+             shift=shift,
+             guidance_scale=guidance_scale,
+             generator=torch.Generator(device="cuda").manual_seed(args.seed),
+             overlap_history=args.overlap_history,
+             addnoise_condition=args.addnoise_condition,
+             base_num_frames=args.base_num_frames,
+             ar_step=args.ar_step,
+             causal_block_size=args.causal_block_size,
+             fps=fps,
+         )[0]
+     else:
+         if args.image:
+             args.image = load_image(args.image)
+             image_width, image_height = args.image.size
+             if image_height > image_width:
+                 height, width = width, height
+             args.image = resizecrop(args.image, height, width)
+         if args.end_image:
+             args.end_image = load_image(args.end_image)
+             args.end_image = resizecrop(args.end_image, height, width)
+
+         image = args.image.convert("RGB") if args.image else None
+         end_image = args.end_image.convert("RGB") if args.end_image else None
+
+         with torch.amp.autocast("cuda", dtype=pipe.transformer.dtype), torch.no_grad():
+             video_frames = pipe(
+                 prompt=prompt_input,
+                 negative_prompt=negative_prompt,
+                 image=image,
+                 end_image=end_image,
+                 height=height,
+                 width=width,
+                 num_frames=num_frames,
+                 num_inference_steps=args.inference_steps,
+                 shift=shift,
+                 guidance_scale=guidance_scale,
+                 generator=torch.Generator(device="cuda").manual_seed(args.seed),
+                 overlap_history=args.overlap_history,
+                 addnoise_condition=args.addnoise_condition,
+                 base_num_frames=args.base_num_frames,
+                 ar_step=args.ar_step,
+                 causal_block_size=args.causal_block_size,
+                 fps=fps,
+             )[0]
+
+     if local_rank == 0:
+         current_time = time.strftime("%Y-%m-%d_%H-%M-%S", time.localtime())
+         video_out_file = f"{args.prompt[:100].replace('/','')}_{args.seed}_{current_time}.mp4"
+         output_path = os.path.join(save_dir, video_out_file)
+         imageio.mimwrite(output_path, video_frames, fps=fps, quality=8, output_params=["-loglevel", "error"])
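When `--teacache` is combined with `--ar_step > 0`, the script passes TeaCache a step count larger than `--inference_steps`. The sketch below restates that arithmetic as a small function so it can be checked by hand. The function name `teacache_num_steps` and the variable names `latent_frames` and `num_blocks` are hypothetical; reading `(base_num_frames - 1) // 4 + 1` as a latent frame count is an assumption and is not stated in the script.

```python
def teacache_num_steps(inference_steps, base_num_frames, causal_block_size, ar_step):
    """Reproduce the num_steps expression from generate_video_df.py."""
    if ar_step <= 0:
        # No autoregressive stepping: TeaCache sees the plain step count.
        return inference_steps
    # Assumed interpretation: frames after a 4x temporal reduction.
    latent_frames = (base_num_frames - 1) // 4 + 1
    # Number of causal blocks those frames are grouped into.
    num_blocks = latent_frames // causal_block_size
    # Every block after the first adds ar_step extra steps.
    return inference_steps + (num_blocks - 1) * ar_step


if __name__ == "__main__":
    print(teacache_num_steps(30, 97, 5, 5))
```

With the script defaults of 30 steps and 97 base frames, a block size of 5 and `ar_step` of 5 gives 25 // 5 = 5 blocks, so 30 + 4 * 5 = 50 steps.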
requirements.txt ADDED
@@ -0,0 +1,16 @@
+ torch==2.5.1
+ torchvision==0.20.1
+ opencv-python==4.10.0.84
+ diffusers>=0.31.0
+ transformers==4.49.0
+ tokenizers==0.21.1
+ accelerate==1.6.0
+ tqdm
+ imageio
+ easydict
+ ftfy
+ dashscope
+ imageio-ffmpeg
+ flash_attn
+ numpy>=1.23.5,<2
+ xfuser
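`generate_video_df.py` imports `moviepy.editor`, but moviepy does not appear among the 16 requirements listed here. A minimal sketch of how such a gap could be spotted follows. `requirement_names` is a hypothetical helper, not part of the repository, and its parsing only covers simple specifier lines like the ones in this file.

```python
import re


def requirement_names(text):
    """Return the set of package names found in requirements.txt-style text.

    Simplified on purpose: it drops version specifiers, extras and markers,
    lowercases the name and maps '_' to '-'. It does not handle URLs or
    '-r' include lines.
    """
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name = re.split(r"[<>=!~\[;\s]", line, maxsplit=1)[0]
        names.add(name.lower().replace("_", "-"))
    return names


if __name__ == "__main__":
    listed = requirement_names("torch==2.5.1\nimageio\nnumpy>=1.23.5,<2\n")
    print("moviepy" in listed)
```

Running the helper over the full file and testing for `"moviepy"` would flag the import before the script fails at startup.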