---
license: apache-2.0
pipeline_tag: text-to-image
library_name: transformers
---

<div align='center'>
<h1>Emu3.5: Native Multimodal Models are World Learners</h1>

Emu3.5 Team, BAAI

[Project Page](https://emu.world/) | [🤗HF Models](https://huggingface.co/collections/BAAI/emu35) | [Paper](https://arxiv.org/pdf/2510.26583) | [Code](https://github.com/baaivision/Emu3.5)
</div>


<div align='center'>
<img src="https://github.com/baaivision/Emu3.5/blob/main/assets/arch.png?raw=True" class="interpolation-image" alt="arch." height="100%" width="100%" />
</div>


<div align='center'>
<img src="https://github.com/baaivision/Emu3.5/blob/main/assets/co.png?raw=True" class="interpolation-image" alt="arch." height="90%" width="90%" />
</div>


|  🔹 | **Core Concept**                         | **Description**                                                                                                                            |
| :-: | :--------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------- |
|  🧠 | **Unified World Modeling**               | Predicts the **next state jointly across vision and language**, enabling coherent **world modeling** and **generation**.              |
|  🧩 | **End-to-End Pretraining**               | Trained with a **unified next-token prediction** objective over **interleaved vision–language sequences**.                                 |
|  📚 | **10T+ Multimodal Tokens**               | Pretrained on **over 10 trillion interleaved tokens** from **video frames** and **transcripts**, capturing **spatiotemporal structure**.       |
|  🔄 | **Native Multimodal I/O**                | Processes and generates **interleaved visual–text sequences** without **modality adapters** or **task-specific heads**.                    |
|  🎯 | **RL Post-Training**                     | Large-scale **reinforcement learning** enhances **reasoning**, **compositionality**, and **generation quality**.                           |
|  ⚡  | **Discrete Diffusion Adaptation (DiDA)** | Converts **sequential decoding → bidirectional parallel prediction**, achieving **≈20× faster inference without performance loss**.      |
| 🖼️ | **Versatile Generation**                 | Excels in **long-horizon vision–language generation**, **any-to-image (X2I)** synthesis, and **text-rich image creation**.                 |
|  🌐 | **Generalizable World Modeling**         | Enables **spatiotemporally consistent world exploration** and **open-world embodied manipulation** across diverse scenarios.          |
|  🏆 | **Performance Benchmark**                | Matches **Gemini 2.5 Flash Image (Nano Banana)** on **image generation/editing** and **outperforms it** on **interleaved generation tasks**. |



## Table of Contents

1. [Model & Weights](#1-model--weights)
2. [Quick Start](#2-quick-start)
3. [Schedule](#3-schedule)
4. [Citation](#4-citation)

## 1. Model & Weights

| Model name               | HF Weight |
| ------------------------ | --------- |
| Emu3.5               | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5/tree/main) |
| Emu3.5-Image                | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5-Image/tree/main) |
| Emu3.5-VisionTokenizer     | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer/tree/main) |

**Emu3.5** handles general tasks (including interleaved generation and image generation/editing), while **Emu3.5-Image** focuses on high-quality image generation/editing.
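
As an illustration, the released checkpoints can be fetched locally with `huggingface_hub` (a minimal sketch; the local directory layout below is an assumption, not part of the official setup):

```python
# Minimal sketch: download the released checkpoints with huggingface_hub.
# The local directory names below are illustrative, not prescribed by the repo.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="BAAI/Emu3.5", local_dir="weights/Emu3.5")
snapshot_download(repo_id="BAAI/Emu3.5-VisionTokenizer", local_dir="weights/Emu3.5-VisionTokenizer")
# Use BAAI/Emu3.5-Image instead of BAAI/Emu3.5 for image-focused generation/editing.
```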


## 2. Quick Start

### Environment Setup

```bash
git clone https://github.com/baaivision/Emu3.5
cd Emu3.5
pip install -r requirements.txt
pip install flash_attn==2.8.3 --no-build-isolation
```
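
After installation, a quick import check (an illustrative snippet of ours, not from the repository) can confirm that PyTorch, CUDA, and FlashAttention are usable:

```python
# Sanity check (illustrative, not part of the official setup): verify that the
# core dependencies import correctly and that a CUDA device is visible.
import torch
import flash_attn

print("torch:", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("flash_attn:", flash_attn.__version__)
```
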
### Configuration

Edit `configs/config.py` to set:

- Paths: `model_path`, `vq_path`
- Task template: `task_type`, one of `{t2i, x2i, howto, story, explore, vla}`
- Input image: `use_image` (set to `True` to provide reference images; controls the `<|IMAGE|>` token), with `reference_image` in each prompt specifying the image path
- Sampling: `sampling_params` (`classifier_free_guidance`, `temperature`, `top_k`/`top_p`, etc.)

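For orientation, the settings above might look roughly like the following inside `configs/config.py` (a hypothetical sketch; the actual field names, structure, and default values in the repository may differ):

```python
# Hypothetical sketch of the settings listed above; check configs/config.py
# in the repository for the authoritative names and structure.
model_path = "weights/Emu3.5"                # or BAAI/Emu3.5-Image for image generation/editing
vq_path = "weights/Emu3.5-VisionTokenizer"   # vision tokenizer weights
task_type = "t2i"                            # one of: t2i, x2i, howto, story, explore, vla
use_image = False                            # True to pass reference images (adds the <|IMAGE|> token)
sampling_params = dict(
    classifier_free_guidance=3.0,            # illustrative value
    temperature=1.0,
    top_k=2048,
    top_p=1.0,
)
```
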
### Run Inference

```bash
python inference.py --cfg configs/config.py
```

Protobuf outputs are written to `outputs/<exp_name>/proto/`. For better throughput, we recommend ≥2 GPUs.

### Visualize Protobuf Outputs

To visualize generated protobuf files:

```bash
python src/utils/vis_proto.py --input <input_proto_file> --output <output_dir>
```

## 3. Schedule

- [x] Inference Code (auto-regressive version)
- [ ] Advanced Image Decoder
- [ ] Discrete Diffusion Adaptation (DiDA) Inference & Weights


## 4. Citation

```bibtex
@misc{cui2025emu35nativemultimodalmodels,
      title={Emu3.5: Native Multimodal Models are World Learners}, 
      author={Yufeng Cui and Honghao Chen and Haoge Deng and Xu Huang and Xinghang Li and Jirong Liu and Yang Liu and Zhuoyan Luo and Jinsheng Wang and Wenxuan Wang and Yueze Wang and Chengyuan Wang and Fan Zhang and Yingli Zhao and Ting Pan and Xianduo Li and Zecheng Hao and Wenxuan Ma and Zhuo Chen and Yulong Ao and Tiejun Huang and Zhongyuan Wang and Xinlong Wang},
      year={2025},
      eprint={2510.26583},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.26583}, 
}
```