VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents
Abstract
Modern vision-language models struggle with multi-step visual interaction tasks, particularly long-horizon perception-memory-action integration, and their performance declines when they must handle unbounded interaction histories.
Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to leverage long context effectively, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and supervised finetuning on exploratory demonstrations from partially observable or unknown-dynamics settings yield consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: https://visgym.github.io/.
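The truncated-window versus unbounded-history comparison above is easy to picture as a small agent loop. The sketch below is a minimal, self-contained illustration under assumed interfaces, not the actual VisGym API: `DummyEnv` and `dummy_policy` stand in for a VisGym environment and a VLM call, and `max_history` toggles between the truncated and unbounded history settings the abstract compares.

```python
from collections import deque


class DummyEnv:
    """Stand-in for a VisGym-style environment (hypothetical interface)."""

    def __init__(self, horizon: int = 5):
        self.horizon = horizon
        self._t = 0

    def reset(self) -> dict:
        self._t = 0
        return {"obs": f"frame_{self._t}", "feedback": "episode started"}

    def step(self, action: str) -> tuple[dict, bool]:
        self._t += 1
        done = self._t >= self.horizon
        return {"obs": f"frame_{self._t}", "feedback": f"took {action}"}, done


def dummy_policy(context: list[dict]) -> str:
    """Stand-in for a VLM call; a real agent would prompt the model with `context`."""
    return f"action_after_{len(context)}_turns"


def run_episode(max_history: int | None = 4) -> int:
    """Roll out one episode, keeping at most `max_history` past turns in context.

    `max_history=None` reproduces the unbounded-history setting reported as
    harmful; a small window mimics the truncated-window setting.
    """
    env = DummyEnv()
    history: deque = deque(maxlen=max_history)  # maxlen=None keeps everything
    obs = env.reset()
    steps, done = 0, False
    while not done:
        # Context given to the policy: truncated (or full) history + current obs.
        context = list(history) + [obs]
        action = dummy_policy(context)
        history.append({"obs": obs, "action": action})
        obs, done = env.step(action)
        steps += 1
    return steps


if __name__ == "__main__":
    print("truncated window:", run_episode(max_history=4), "steps")
    print("unbounded history:", run_episode(max_history=None), "steps")
```

In a real run the policy would send the rendered observation and the windowed history to a VLM; the only point here is that bounding `history` with `deque(maxlen=...)` is one simple way to implement the truncated-window setting.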
Community
We released VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents.
We systematically study the brittleness of vision-language models in multi-step visual interaction, analyze how training choices shape behavior, and open-source the full benchmark, models, and trajectories.
X: https://x.com/zwcolin/status/2015812327338287227
Project: https://visgym.github.io/
Paper: https://arxiv.org/abs/2601.16973
Code: https://github.com/visgym/VisGym
Data & models: https://huggingface.co/VisGym
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- VirtualEnv: A Platform for Embodied AI Research (2026)
- RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics (2025)
- FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation (2026)
- SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL (2025)
- InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search (2025)
- VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression? (2025)
- GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents (2026)
Models citing this paper: 1
Datasets citing this paper: 2
Spaces citing this paper: 0