Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Abstract
Molmo2 is a new open-source video-language model family that achieves state-of-the-art performance through novel datasets and training methods, particularly excelling in video grounding tasks without relying on proprietary models.
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding, either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding across single-image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and a new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data that uses an efficient packing and message-tree encoding scheme, and show that bi-directional attention over vision tokens and a novel token-weighting strategy improve performance. Our best-in-class 8B model outperforms other models in the open-weight, open-data class on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
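The abstract mentions bi-directional attention over vision tokens within an otherwise causal language model. The sketch below is a minimal illustration of that general technique, not the authors' implementation: it builds an attention mask in which text tokens attend causally while vision tokens from the same image or frame chunk attend to each other bidirectionally. The inputs `is_vision` and `chunk_id` are hypothetical names used only for this example.

```python
# Minimal sketch (not the authors' code): a mixed causal/bidirectional attention mask,
# assuming vision tokens are grouped into per-image or per-frame chunks.
import numpy as np

def build_attention_mask(is_vision: np.ndarray, chunk_id: np.ndarray) -> np.ndarray:
    """
    is_vision: bool array of shape (L,), True where the token is a vision token.
    chunk_id:  int array of shape (L,), grouping vision tokens by image/frame chunk
               (ignored for text tokens).
    Returns a bool array of shape (L, L); mask[i, j] == True means token i may attend to token j.
    """
    L = is_vision.shape[0]
    # Start from a standard causal mask: each token sees itself and everything before it.
    mask = np.tril(np.ones((L, L), dtype=bool))
    # Additionally allow bidirectional attention among vision tokens in the same chunk.
    same_chunk = chunk_id[:, None] == chunk_id[None, :]
    both_vision = is_vision[:, None] & is_vision[None, :]
    mask |= same_chunk & both_vision
    return mask

# Toy example: 3 vision tokens from one frame followed by 2 text tokens.
is_vision = np.array([True, True, True, False, False])
chunk_id = np.array([0, 0, 0, -1, -1])
print(build_attention_mask(is_vision, chunk_id).astype(int))
```

In this toy example the three vision tokens can all see one another regardless of position, while the two text tokens keep ordinary left-to-right causal attention.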