---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
- MSALab/LLaDA-8B-Instruct-HF
tags:
- multimodal
- diffusion-language-model
- dllm
- vision-language-model
- perception
---

# PerceptionDLM-Base

**PerceptionDLM-Base** is a strong open **multimodal diffusion language model (DLM)** that adapts a large language diffusion backbone (LLaDA-8B) to vision-language tasks through visual instruction tuning. It sets a new state-of-the-art baseline among open discrete-diffusion VLMs, outperforming LLaDA-V on **15 of 16** standard multimodal benchmarks while remaining competitive with same-scale autoregressive (AR) VLMs.

It serves as the foundation model for [**PerceptionDLM**](https://huggingface.co/MSALab/PerceptionDLM), our parallel region-perception model.

<p align="center">
  📄 <a href="https://arxiv.org/abs/2606.19534">Paper</a> &nbsp;|&nbsp;
  💻 <a href="https://github.com/MSALab-PKU/PerceptionDLM">Code</a> &nbsp;|&nbsp;
  🤗 <a href="https://huggingface.co/collections/MSALab/perceptiondlm-model-zoo">Model Collection</a>
</p>

## Highlights

- 🧠 **Diffusion-based VLM.** Non-autoregressive masked-denoising generation with intrinsic token-level parallelism.
- 🏗️ **LLaVA-style architecture.** SigLIP-2 vision encoder + 2-layer MLP connector + LLaDA-8B diffusion decoder, with dynamic-resolution tiling for high-resolution inputs.
- 🏆 **Strong baseline.** Outperforms LLaDA-V on 15/16 benchmarks; especially strong on fine-grained perception (e.g. MMVP, BLINK) and hallucination robustness (e.g. HallusionBench).
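
To make the masked-denoising idea in the first highlight concrete, here is a minimal toy sketch. It is not the model's sampler: `toy_predict` is a hypothetical stand-in that always "predicts" a fixed target sentence and attaches random confidences, and the even per-step unmasking rule is an assumption chosen for illustration. The point is only that several positions are committed in the same step rather than one token at a time.

```python
import random

MASK = "<mask>"

def toy_predict(tokens, target, rng):
    """Hypothetical stand-in for the diffusion decoder: for every masked
    position, return a (predicted_token, confidence) pair."""
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }

def masked_denoise(target, steps, seed=0):
    """Start from a fully masked sequence and reveal the most confident
    predictions at each step (illustrative schedule, not the real one)."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    history = []
    for step in range(steps):
        preds = toy_predict(tokens, target, rng)
        if not preds:
            break
        # Spread the remaining masked positions evenly over the remaining steps.
        remaining_steps = steps - step
        k = -(-len(preds) // remaining_steps)  # ceiling division
        # Commit the k highest-confidence predictions in parallel.
        chosen = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in chosen:
            tokens[i] = preds[i][0]
        history.append(list(tokens))
    return tokens, history

target = "the man is wearing a red shirt .".split()
final, history = masked_denoise(target, steps=4)
for row in history:
    print(" ".join(row))
```

With 8 target tokens and 4 steps, this toy loop fills in two positions per step, in whatever order the random confidences dictate.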

## Model Details

| Item | Details |
| :--- | :--- |
| Vision encoder | `google/siglip2-so400m-patch16-512` (frozen) |
| Connector | 2-layer MLP with GELU |
| Language backbone | LLaDA-8B-Instruct (diffusion) |
| Parameters | ~8B |
| Training | 4-stage visual instruction tuning, 32× H100 (~3 weeks) |
| Precision | bfloat16 |
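
The card does not spell out how dynamic-resolution tiling is done, so the following is only a rough sketch of one common scheme: pick the tile grid whose aspect ratio best matches the image, resize the image to that grid, and cut it into encoder-sized crops. The tile size of 512 is taken from the encoder name above; `choose_tile_grid`, `tile_boxes`, the `max_tiles` budget, and the tie-breaking rule are all hypothetical and may differ from the released preprocessing.

```python
def choose_tile_grid(width, height, tile=512, max_tiles=9):
    """Return (cols, rows) for the grid that best matches the image.

    Primary criterion: closest aspect ratio. Tie-break (an assumption):
    the grid whose pixel area is closest to the image's own area.
    """
    image_ratio = width / height
    image_area = width * height
    candidates = []
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles // rows + 1):
            ratio_diff = abs(cols / rows - image_ratio)
            area_diff = abs(cols * rows * tile * tile - image_area)
            candidates.append(((ratio_diff, area_diff), (cols, rows)))
    return min(candidates)[1]

def tile_boxes(cols, rows, tile=512):
    """Crop boxes (left, top, right, bottom) in row-major order, for an
    image already resized to (cols * tile, rows * tile)."""
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]

for size in [(512, 512), (1024, 1024), (1536, 512)]:
    grid = choose_tile_grid(*size)
    print(size, "->", grid, "tiles:", len(tile_boxes(*grid)))
```

Each crop would then be encoded separately by the vision encoder, so the number of visual tokens grows with the number of tiles.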

## Results

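PerceptionDLM-Base compared with an open diffusion VLM (LLaDA-V) and two same-scale AR VLMs (Qwen2.5-VL-7B, InternVL3-8B) on selected benchmarks. Bold marks the best score in each row: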

| Benchmark | PerceptionDLM-Base | LLaDA-V | Qwen2.5-VL-7B | InternVL3-8B |
| :--- | :---: | :---: | :---: | :---: |
| MMBench | **85.0** | 82.9 | 83.5 | 83.4 |
| SeedBench | **78.9** | 74.8 | 77.0 | 77.1 |
| ChartQA | **91.6** | 78.3 | 86.2 | 86.6 |
| MMVP | **82.0** | 76.7 | 73.3 | 80.0 |
| BLINK | **60.3** | 50.9 | 55.3 | 55.5 |
| RealWorldQA | **73.7** | 63.2 | 68.4 | 70.8 |
| HallusionBench | **58.4** | 50.9 | 51.9 | 49.9 |

See the [paper](https://arxiv.org/abs/2606.19534) for the full 16-benchmark comparison.

## Usage

Full inference scripts are provided in the [GitHub repository](https://github.com/MSALab-PKU/PerceptionDLM).

```bash
python demo/infer_dmllm.py \
  --model-path MSALab/PerceptionDLM-Base \
  --image assets/demo.jpg \
  --prompt "What color shirt is the man in the picture wearing?" \
  --gen-length 64 --block-length 64 --steps 64
```
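
The three generation flags interact, and a small calculation can help when tuning them. The sketch below assumes the demo script follows the block-wise schedule of the reference LLaDA sampler (the response is split into blocks of `--block-length` tokens, and `--steps` is divided evenly across blocks); this is an assumption about `demo/infer_dmllm.py`, not something stated here, and `unmask_schedule` is a hypothetical helper.

```python
def unmask_schedule(gen_length, block_length, steps):
    """Tokens unmasked at each step of each block, under the assumed
    LLaDA-style semi-autoregressive block schedule."""
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    num_blocks = gen_length // block_length
    if steps % num_blocks != 0:
        raise ValueError("steps must be a multiple of the number of blocks")
    steps_per_block = steps // num_blocks
    # Split the block evenly; the first `extra` steps take one more token.
    base, extra = divmod(block_length, steps_per_block)
    per_step = [base + (1 if i < extra else 0) for i in range(steps_per_block)]
    return [list(per_step) for _ in range(num_blocks)]

# The settings from the command above: one block, one token per step.
schedule = unmask_schedule(gen_length=64, block_length=64, steps=64)
print(len(schedule), "block(s),", schedule[0][0], "token(s) per step")

# Fewer steps means more tokens committed in parallel per step.
faster = unmask_schedule(gen_length=64, block_length=32, steps=16)
print(len(faster), "block(s),", faster[0][0], "token(s) per step")
```

Under this assumption, the example command unmasks one token per step, and lowering `--steps` would commit more tokens in parallel at each step.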

```python
import torch
from transformers import AutoModel, AutoProcessor

model_path = "MSALab/PerceptionDLM-Base"

# trust_remote_code=True lets transformers run the custom code shipped in the model repo.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Load the weights in bfloat16, move the model to the GPU, and switch to eval mode.
model = AutoModel.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()

# See demo/infer_dmllm.py for the full preprocessing + generation pipeline.
```

## Citation

```bibtex
@article{sun2026perceptiondlm,
  title   = {PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models},
  author  = {Sun, Yueyi and Wang, Yuhao and Li, Jason and Tian, Ye and Zhang, Tao and Mai, Jacky and Wang, Yihan and Wang, Haochen and Bai, Jinbin and Yang, Ling and Tong, Yunhai},
  journal = {arXiv preprint arXiv:2606.19534},
  year    = {2026}
}
```

## License

Released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).