PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks
Abstract
The PlanViz benchmark evaluates unified multimodal models' capabilities on computer-use planning tasks through three sub-tasks (route planning, work diagramming, and web&UI displaying), scored with a task-adaptive system.
Unified multimodal models (UMMs) have shown impressive capabilities in generating natural images and supporting multimodal reasoning. However, their potential for supporting computer-use planning tasks, which are closely related to our daily lives, remains underexplored. Image generation and editing for computer-use tasks require capabilities such as spatial reasoning and procedural understanding, and it is still unknown whether UMMs possess the capabilities needed to complete these tasks. Therefore, we propose PlanViz, a new benchmark designed to evaluate image generation and editing for computer-use tasks. To this end, we focus on sub-tasks that frequently arise in daily life and require planning steps. Specifically, we design three new sub-tasks: route planning, work diagramming, and web&UI displaying. We address data-quality challenges by curating human-annotated questions and reference images and by applying a quality-control process. To enable comprehensive and precise evaluation, we propose a task-adaptive score, PlanScore, which assesses the correctness, visual quality, and efficiency of generated images. Through experiments, we highlight key limitations and opportunities for future research on this topic.
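The abstract does not spell out how PlanScore combines its three dimensions. As a rough illustration only, here is a minimal Python sketch of what a task-adaptive aggregate over correctness, visual quality, and efficiency could look like; the `TASK_WEIGHTS` values, the `Judgement` type, and the `plan_score` helper are all hypothetical and are not the authors' definition.

```python
from dataclasses import dataclass

# Hypothetical per-task weights; the actual PlanScore weighting is
# defined in the paper, not here. "Task-adaptive" is modeled as a
# different weight profile per sub-task.
TASK_WEIGHTS = {
    "route_planning":   {"correctness": 0.6, "visual_quality": 0.2, "efficiency": 0.2},
    "work_diagramming": {"correctness": 0.5, "visual_quality": 0.3, "efficiency": 0.2},
    "web_ui_display":   {"correctness": 0.4, "visual_quality": 0.4, "efficiency": 0.2},
}

@dataclass
class Judgement:
    correctness: float     # in [0, 1], e.g. from a rubric or judge model
    visual_quality: float  # in [0, 1]
    efficiency: float      # in [0, 1], e.g. penalizing redundant steps

def plan_score(task: str, j: Judgement) -> float:
    """Weighted, task-adaptive aggregate of the three dimensions."""
    w = TASK_WEIGHTS[task]
    return (w["correctness"] * j.correctness
            + w["visual_quality"] * j.visual_quality
            + w["efficiency"] * j.efficiency)

# Example: a route-planning output that is correct but visually rough.
print(plan_score("route_planning", Judgement(0.9, 0.5, 0.7)))  # 0.78
```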
Community
PlanViz code is available at: https://github.com/lijunxian111/PlanViz (to be completed soon)
Supplementary material is available at: https://github.com/lijunxian111/PlanViz/releases/tag/v1
Automated message from the Librarian Bot: the following similar papers were recommended by the Semantic Scholar API.
- How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing (2026)
- Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image (2025)
- RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing (2025)
- Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs (2026)
- Unified Thinker: A General Reasoning Modular Core for Image Generation (2026)
- TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering (2025)
- Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models (2026)