Anran-MLLM
posted an update 3 days ago
🚀 Introducing PerceptionDLM, the first multimodal diffusion LLM for parallel region perception!

Most MLLMs are autoregressive, so captioning N regions costs N sequential passes. PerceptionDLM instead describes ALL masked regions in a single denoising process. 🧩
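To make the scaling argument concrete, here is a toy step-count sketch. It is not PerceptionDLM code: the function names, the fixed caption length, and the fixed denoising budget are all hypothetical assumptions, and it counts sequential steps only, not wall-clock latency.

```python
# Toy model only: counts sequential decoding steps, not real latency.
# All names and numbers here are illustrative assumptions, not the paper's method.

def autoregressive_sequential_steps(num_regions: int, tokens_per_caption: int) -> int:
    """Steps when regions are captioned one after another, one token per step."""
    return num_regions * tokens_per_caption


def parallel_denoising_steps(num_regions: int, denoising_steps: int) -> int:
    """Steps when all region captions are denoised together."""
    # num_regions is deliberately unused: in this simplified model the number
    # of denoising iterations is a fixed budget shared by every region.
    return denoising_steps


if __name__ == "__main__":
    for n in (1, 4, 16):
        ar = autoregressive_sequential_steps(n, tokens_per_caption=32)
        par = parallel_denoising_steps(n, denoising_steps=32)
        print(f"regions={n:2d}  autoregressive={ar:4d}  parallel={par:3d}")
```

In this sketch the autoregressive count grows linearly with the number of regions while the parallel count stays flat. Real speedups are smaller than the step ratio suggests, since each denoising step processes more tokens as regions are added.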

✨ Highlights
• ⚡ Up to 3.4× faster on dense multi-region captioning, with stable per-image latency
• 🏆 PerceptionDLM-Base beats LLaDA-V on 15/16 multimodal benchmarks (new SOTA among open diffusion VLMs)
• 📊 New benchmark: ParaDLC-Bench, which jointly evaluates caption quality AND inference efficiency
• 🔓 Code, models & benchmark all open-sourced

🤖 Models
MSALab/PerceptionDLM-Base
MSALab/PerceptionDLM

📊 Benchmark
MSALab/ParaDLC-Bench

📄 Paper: PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models (2606.19534)
💻 Code: https://github.com/MSALab-PKU/PerceptionDLM

Diffusion LLMs aren't just for text: they unlock efficient, parallel visual perception. 👁️✨

#multimodal #diffusion #VLM #perception