---
license: apache-2.0
pipeline_tag: image-text-to-video
---
<div align="center">

<img src="assets/bernini-icon.png" width="560" alt="Bernini"/>

<h4 align="center">Latent Semantic Planning for Video Diffusion</h4>

**Chenchen Liu<sup>\*</sup>, Junyi Chen<sup>\*</sup>, Lei Li<sup>\*</sup>, Lu Chi<sup>\*,§</sup>, Mingzhen Sun<sup>\*</sup>, Zhuoying Li<sup>\*</sup>, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan<sup>✉</sup>**

<sup>\*</sup> Equal contribution&nbsp;&nbsp;<sup>✉</sup> Corresponding author&nbsp;&nbsp;<sup>§</sup> Project lead

[![arXiv](https://img.shields.io/badge/arXiv-2605.22344-b31b1b.svg)](https://arxiv.org/abs/2605.22344)
[![Project Page](https://img.shields.io/badge/Project-Page-blue.svg)](https://bernini-ai.github.io/)
[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Models-yellow)](https://huggingface.co/ByteDance/Bernini)

</div>

## 🎉 News

- **[2026-06-01]** We open-sourced the inference code and model weights of the Bernini Renderer (**Bernini-R**).
- **[2026-05-22]** We released our paper [Bernini: Latent Semantic Planning for Video Diffusion](https://arxiv.org/abs/2605.22344).

## ✨ Highlights

Bernini is a unified framework for video generation and editing that combines an MLLM-based semantic planner with a DiT-based renderer.

On video editing, Bernini ranks in the top tier alongside leading closed-source
commercial models. The leaderboard below comes from our in-house arena
platform, where human annotators vote blind on paired edits and the votes are
aggregated into a Bradley-Terry score and a pairwise win-rate matrix.

<img src="assets/arena.png" width="900" alt="Video editing arena: Bradley-Terry leaderboard and pairwise win-rate matrix"/>

## 📦 Installation

### Requirements

- **Python** 3.11.2.
- **CUDA GPU**: a Hopper GPU (H100/H800/H200) is recommended so FlashAttention-3
  can be used; other CUDA GPUs fall back to FlashAttention-2 or PyTorch SDPA.
- **CUDA toolkit** 12.4 (matches the pinned `torch==2.5.1+cu124`; 12.3+ is the
  minimum if you build FlashAttention-3).
- Pinned in `requirements.txt`: `torch==2.5.1+cu124`, `diffusers==0.35.2`,
  `accelerate==0.34.2`, `transformers==4.57.3`.

Reference environment (Bernini-R is developed and tested on this setup):

| Component | Version      |
|-----------|--------------|
| GPU       | NVIDIA H100  |
| CUDA      | 12.4         |
| Python    | 3.11.2       |
| PyTorch   | 2.5.1+cu124  |

### Install

```bash
git clone https://github.com/bytedance/Bernini.git bernini && cd bernini
pip install -r requirements.txt
```

Optional extras:

- **Multi-GPU sequence parallel** needs [Open-VeOmni](https://github.com/ByteDance-Seed/VeOmni)
  (Apache-2.0, Python 3.11). Use `--no-deps` so VeOmni does not pull in a
  different torch build and override the pinned `torch==2.5.1+cu124`:
  `pip install --no-deps git+https://github.com/ByteDance-Seed/VeOmni.git@v0.1.10`.
  Single-GPU inference does not need it.
- **Faster attention** (auto-detected if installed; otherwise PyTorch SDPA is used):
  - FlashAttention-2, for general CUDA GPUs (incl. A100/A800): `pip install flash-attn==2.8.3`.
  - FlashAttention-3, Hopper only (H100/H800/H200, CUDA ≥ 12.3, PyTorch ≥ 2.4).
    `flash_attn_interface` is not on PyPI; build it from the
    [flash-attention](https://github.com/Dao-AILab/flash-attention) repo's
    `hopper/` directory at tag `v2.8.3`:
    ```bash
    git clone https://github.com/Dao-AILab/flash-attention.git
    cd flash-attention && git checkout v2.8.3
    cd hopper && MAX_JOBS=$(nproc) python3 setup.py install --user
    ```
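
The fallback order described above (FlashAttention-3, then FlashAttention-2, then PyTorch SDPA) can be sketched as a simple import probe. This is a minimal illustration, not the repository's actual detection code: `pick_attention_backend` and the returned labels are hypothetical names, and the probe only checks whether a package is importable, not whether the GPU supports it.

```python
import importlib.util
from typing import Callable


def _is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


def pick_attention_backend(is_installed: Callable[[str], bool] = _is_installed) -> str:
    """Pick the fastest attention backend that is importable.

    Preference order follows the README: FlashAttention-3, then
    FlashAttention-2, then PyTorch SDPA as the always-available fallback.
    The returned labels are hypothetical, chosen for this sketch only.
    """
    if is_installed("flash_attn_interface"):  # FlashAttention-3 (Hopper build)
        return "flash_attn_3"
    if is_installed("flash_attn"):  # FlashAttention-2 package from PyPI
        return "flash_attn_2"
    return "sdpa"  # torch.nn.functional.scaled_dot_product_attention


if __name__ == "__main__":
    print(pick_attention_backend())
```

Passing the probe as a parameter keeps the selection logic checkable on a machine that has neither FlashAttention build installed.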

### Weights

Bernini-R provides two ways to obtain the renderer weights. The **diffusers
format is recommended**: it is a self-contained directory whose
`transformer` / `transformer_2` already hold the Bernini-R weights, so you point
`--config` at it and the weights load directly, with **no** `--high_noise_ckpt` /
`--low_noise_ckpt` needed.

#### Option A: diffusers format (recommended)

A single ready-to-use diffusers-format model from
[`ByteDance/Bernini-R-Diffusers`](https://huggingface.co/ByteDance/Bernini-R-Diffusers).
It bundles the Wan2.2 base components (VAE, UMT5 text encoder, tokenizer) together
with the Bernini-R transformer weights, so nothing else is downloaded at runtime.

```bash
pip install -U "huggingface_hub"
hf download ByteDance/Bernini-R-Diffusers --local-dir Bernini-R-Diffusers
```

Then pass it via `--config` and omit the checkpoint flags, e.g.:

```bash
python infer_single_gpu.py --config Bernini-R-Diffusers \
    --case assets/testcases/t2i/t2i.json --num_frames 1
```

#### Option B: separate checkpoints

The original layout, where Bernini-R uses two sets of weights loaded separately:

1. **Wan2.2 base**: [`Wan-AI/Wan2.2-T2V-A14B-Diffusers`](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers) on Hugging Face. Supplies the
   VAE, UMT5 text encoder, tokenizer, and the transformer architecture/base weights.
   It is downloaded automatically on first run (configured by `wan22_base` in
   `configs/bernini_renderer_wan22/config.json`).
2. **Bernini-R checkpoint**: the trained high-noise / low-noise transformer weights
   (safetensors) from [ByteDance/Bernini-R](https://huggingface.co/ByteDance/Bernini-R), passed with
   `--high_noise_ckpt` / `--low_noise_ckpt`. Both a local directory and a Hugging
   Face repo id are accepted.

Download both repositories with the `hf` CLI:

```bash
pip install -U "huggingface_hub"
hf download Wan-AI/Wan2.2-T2V-A14B-Diffusers --local-dir Wan2.2-T2V-A14B-Diffusers
hf download ByteDance/Bernini-R --local-dir Bernini-R
```
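
The same downloads can be scripted from Python with `huggingface_hub.snapshot_download`, which fetches a whole repository into a local directory. The sketch below is an illustration under assumptions: the `OPTION_A` / `OPTION_B` mappings and the `fetch` helper are hypothetical names, while the repo ids and directory names are the ones used in this README.

```python
# Repo ids are taken from this README; the local directory names mirror
# the `--local-dir` values used in the `hf download` commands.
OPTION_A = {"ByteDance/Bernini-R-Diffusers": "Bernini-R-Diffusers"}
OPTION_B = {
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers": "Wan2.2-T2V-A14B-Diffusers",
    "ByteDance/Bernini-R": "Bernini-R",
}


def fetch(repos: dict[str, str]) -> list[str]:
    """Download each repo snapshot and return the local paths."""
    # Imported lazily so the mappings above can be inspected without the package.
    from huggingface_hub import snapshot_download

    return [
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
        for repo_id, local_dir in repos.items()
    ]
```

Calling `fetch(OPTION_B)` downloads the two repositories for the separate-checkpoint layout; `fetch(OPTION_A)` downloads the single diffusers-format model instead.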

## 🚀 Usage

A run is described by a **case file**, a small JSON under
[`assets/testcases/`](assets/testcases/) that bundles one task's routing and
inputs (`task_type`, `guidance_mode`, `prompt`, source media, `output`). This
keeps long prompts out of the command line. Each task has a directory under
`assets/testcases/` holding one or more case files; see
[`assets/testcases/`](assets/testcases/) for the format and the bundled
`t2i` / `i2i` / `t2v` / `v2v` / `rv2v` / `r2v` examples.
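
Before launching a long run it can help to sanity-check a case file. The sketch below checks only the keys this README names; the helper names are hypothetical, treating all four keys as required is an assumption, and the source-media keys are left out because their names are not given here. The authoritative format is the one documented under `assets/testcases/`.

```python
import json
from pathlib import Path

# Keys the README names explicitly. Treating all of them as required is an
# assumption of this sketch; source-media keys are deliberately not checked.
EXPECTED_KEYS = ("task_type", "guidance_mode", "prompt", "output")


def validate_case(case: object) -> dict:
    """Check that a parsed case file is an object with the expected keys."""
    if not isinstance(case, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [key for key in EXPECTED_KEYS if key not in case]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return case


def load_case(path: str) -> dict:
    """Read a case file from disk and validate it."""
    return validate_case(json.loads(Path(path).read_text(encoding="utf-8")))


def summarize(case: dict, width: int = 60) -> str:
    """One-line summary: task type plus a whitespace-normalized, truncated prompt."""
    prompt = " ".join(str(case["prompt"]).split())
    if len(prompt) > width:
        prompt = prompt[: width - 3] + "..."
    return f"[{case['task_type']}] {prompt}"
```

For example, `summarize(load_case("assets/testcases/t2v/t2v.json"))` prints a short label for the bundled text-to-video case, provided that file contains the keys listed above.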

### Prompt enhancer (highly recommended)

`--use_pe` enhances the prompt through an OpenAI-compatible endpoint and is
recommended for best generation quality. The `openai` SDK is installed by
`requirements.txt`; configure the endpoint with environment variables:

```bash
export BERNINI_PE_API_KEY=...      # or OPENAI_API_KEY
export BERNINI_PE_BASE_URL=...     # or OPENAI_BASE_URL
export BERNINI_PE_MODEL=...        # vision-capable chat model
```
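
The variable fallbacks noted in the comments above can be expressed as a small resolver. This is a sketch under assumptions, not the repository's code: `resolve_pe_config` is a hypothetical helper, and giving the `BERNINI_PE_*` variables precedence over the `OPENAI_*` ones is an assumed order.

```python
import os
from typing import Mapping, Optional


def resolve_pe_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve prompt-enhancer settings from environment variables.

    BERNINI_PE_* is checked before the OPENAI_* fallback (assumed order).
    Unset values are returned as None so the caller decides what is required.
    """
    env = os.environ if env is None else env

    def first(*names: str) -> Optional[str]:
        # Return the first variable that is set to a non-empty value.
        for name in names:
            value = env.get(name)
            if value:
                return value
        return None

    return {
        "api_key": first("BERNINI_PE_API_KEY", "OPENAI_API_KEY"),
        "base_url": first("BERNINI_PE_BASE_URL", "OPENAI_BASE_URL"),
        "model": first("BERNINI_PE_MODEL"),
    }


if __name__ == "__main__":
    settings = resolve_pe_config()
    unset = [key for key, value in settings.items() if value is None]
    print("unset prompt-enhancer settings:", unset)
```

Running the script before a `--use_pe` job shows which settings are still unset without printing the key itself.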

### Examples by task type

Unless an example specifies otherwise, inference outputs **480p / 16fps** (the
defaults: `--max_image_size 848`, `--fps 16`).

Each example runs a bundled case in
[`assets/testcases/`](assets/testcases/); replace `<hi>` / `<lo>` with your
high-/low-noise checkpoint paths. The image tasks (`t2i`, `i2i`) are shown on a
single GPU; the video tasks run on 8 GPUs via `torchrun`, where `--ulysses N` gives
N-way Ulysses sequence parallelism per sample and the resulting `world_size / N`
groups run data parallel over the task list. The two scripts take the same
inputs, so any example can be run either way.
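
The relationship between `--ulysses N` and the number of data-parallel groups can be worked through with a short sketch. It is illustrative only: `parallel_layout` and `assign_tasks` are hypothetical helpers, and the contiguous rank grouping, the divisibility check, and the round-robin task split are assumptions rather than the scripts' documented behavior.

```python
def parallel_layout(world_size: int, ulysses: int) -> list[list[int]]:
    """Split ranks into `world_size // ulysses` groups of `ulysses` ranks each.

    Each inner list is one sequence-parallel group that cooperates on a single
    sample. Contiguous rank assignment is an assumption of this sketch.
    """
    if world_size < 1 or ulysses < 1:
        raise ValueError("world_size and ulysses must be positive")
    if world_size % ulysses != 0:
        raise ValueError("world_size must be divisible by the Ulysses degree")
    return [
        list(range(start, start + ulysses))
        for start in range(0, world_size, ulysses)
    ]


def assign_tasks(tasks: list[str], num_groups: int) -> list[list[str]]:
    """Round-robin the task list over the data-parallel groups (assumed policy)."""
    return [tasks[group::num_groups] for group in range(num_groups)]


if __name__ == "__main__":
    groups = parallel_layout(world_size=8, ulysses=2)
    print(len(groups), "data-parallel groups:", groups)
    print(assign_tasks(["case_a", "case_b", "case_c", "case_d", "case_e"], len(groups)))
```

With 8 GPUs, `--ulysses 8` leaves a single group, so cases run one at a time with all ranks working on each sample, while `--ulysses 2` would give four groups that each take a share of the task list.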

Inputs can also be passed directly as flags instead of `--case` (`--prompt`,
`--task_type`, `--guidance_mode`, `--video`, `--image`, `--images`,
`--output`); generation parameters (`--seed`, `--num_frames`, ...) are always
command-line flags.

**Text-to-image** (`t2i`): single GPU; generates one frame, so pass `--num_frames 1`

```bash
python infer_single_gpu.py --high_noise_ckpt <hi> --low_noise_ckpt <lo> \
    --case assets/testcases/t2i/t2i.json --num_frames 1
```

**Image editing** (`i2i`): single GPU; generates one frame, so pass `--num_frames 1`

```bash
python infer_single_gpu.py --high_noise_ckpt <hi> --low_noise_ckpt <lo> \
    --case assets/testcases/i2i/i2i.json --num_frames 1
```

**Text-to-video** (`t2v`)

```bash
torchrun --nproc-per-node 8 infer_multi_gpu.py \
    --high_noise_ckpt <hi> --low_noise_ckpt <lo> --ulysses 8 \
    --case assets/testcases/t2v/t2v.json
```

**Video editing** (`v2v` / `mv2v`): two cases are provided.

For edits where the main subject keeps its ordinary motion (case 1 adds a
snowman to the scene), the `v2v` task type is enough:

```bash
torchrun --nproc-per-node 8 infer_multi_gpu.py \
    --high_noise_ckpt <hi> --low_noise_ckpt <lo> --ulysses 8 \
    --case assets/testcases/v2v/v2v_case1.json
```

For edits that need to change the subject's motion (case 2 makes the person
crouch down), the `mv2v` task type gives better results:

```bash
torchrun --nproc-per-node 8 infer_multi_gpu.py \
    --high_noise_ckpt <hi> --low_noise_ckpt <lo> --ulysses 8 \
    --case assets/testcases/v2v/v2v_case2.json
```

**Reference + video editing** (`rv2v`): two cases are provided.

Case 1 is reference-image-guided video editing, replacing a garment in the
source video with one from a reference image:

```bash
torchrun --nproc-per-node 8 infer_multi_gpu.py \
    --high_noise_ckpt <hi> --low_noise_ckpt <lo> --ulysses 8 \
    --case assets/testcases/rv2v/rv2v_case1.json
```

Case 2 is a video-insertion example that inserts content into the source video.
It is run at 720p / 24fps to show the insertion result more clearly:

```bash
torchrun --nproc-per-node 8 infer_multi_gpu.py \
    --high_noise_ckpt <hi> --low_noise_ckpt <lo> --ulysses 8 \
    --case assets/testcases/rv2v/rv2v_case2.json \
    --num_frames 121 --fps 24 --max_image_size 1280
```

**Reference-to-video** (`r2v`): drives a video from one or more reference images

```bash
torchrun --nproc-per-node 8 infer_multi_gpu.py \
    --high_noise_ckpt <hi> --low_noise_ckpt <lo> --ulysses 8 \
    --case assets/testcases/r2v/r2v.json
```

See `python infer_single_gpu.py --help` for the full argument list.

### Gradio demo

`gradio_demo.py` exposes the same pipeline through a Gradio UI: the task-type
dropdown auto-fills `guidance_mode` (still user-editable), uploaded media is
routed to the matching slot, and the result is rendered inline.

```bash
# Single GPU
python gradio_demo.py --high_noise_ckpt <hi> --low_noise_ckpt <lo> --port 7860

# 8 GPUs, 8-way Ulysses sequence parallel
torchrun --nproc-per-node 8 gradio_demo.py --ulysses 8 \
    --high_noise_ckpt <hi> --low_noise_ckpt <lo> --port 7860 --share
```

Add `--use_pe` (and `export OPENAI_API_KEY=...` / `BERNINI_PE_API_KEY=...`) to
enable GPT prompt enhancement; the in-UI checkbox is a per-request switch on
top of this flag.

## 📑 Citation

If you use Bernini in your research, please cite:

```bibtex
@article{bernini,
  title   = {Bernini: Latent Semantic Planning for Video Diffusion},
  author  = {Chenchen Liu and Junyi Chen and Lei Li and Lu Chi and Mingzhen Sun and Zhuoying Li and Yi Fu and Ruoyu Guo and Yiheng Wu and Ge Bai and Zehuan Yuan},
  journal = {arXiv preprint arXiv:2605.22344},
  year    = {2026}
}
```

## 🙏 Acknowledgements

Bernini builds on several outstanding open-source projects:

- [Wan2.2-T2V-A14B](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B)
- [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- [VeOmni](https://github.com/ByteDance-Seed/VeOmni)

We thank the authors and communities of these projects for their contributions.

## 📄 License

Apache License 2.0. See [LICENSE](LICENSE).