Abstract
Videos are continuous 2D projections of 3D worlds. After training on large-scale video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs along multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.
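The "shallow read-out" idea above can be sketched in a few lines: freeze the video model, take its per-patch features, and fit a small (here, linear) probe that predicts a 3D property such as depth. The sketch below is illustrative only and makes assumptions not in the paper: random features stand in for real VidFM activations, the 3D target is a synthetic noisy linear function of the features, and the probe is closed-form ridge regression rather than the paper's actual read-out architecture.

```python
import numpy as np

# Hypothetical setup: random matrices stand in for frozen per-patch
# VidFM features X and a per-patch 3D target y (e.g., metric depth).
rng = np.random.default_rng(0)
n_patches, feat_dim = 2000, 64
X = rng.normal(size=(n_patches, feat_dim))
true_w = rng.normal(size=feat_dim)
y = X @ true_w + 0.1 * rng.normal(size=n_patches)  # noisy synthetic depth

# Split so we measure generalization, not memorization, of the 3D signal.
X_tr, X_te = X[:1500], X[1500:]
y_tr, y_te = y[:1500], y[1500:]

# Shallow read-out = ridge-regularized linear map, solved in closed form:
# w = (X^T X + lam * I)^{-1} X^T y
lam = 1e-2
w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(feat_dim), X_tr.T @ y_tr)

# Held-out R^2 quantifies how much of the 3D property the frozen
# features linearly encode; this is the probe's "3D awareness" score.
pred = X_te @ w
r2 = 1 - np.sum((y_te - pred) ** 2) / np.sum((y_te - y_te.mean()) ** 2)
print(f"read-out R^2 on held-out patches: {r2:.3f}")
```

In practice the same recipe is repeated per model and per 3D property, and the held-out probe performance is what gets compared across VidFMs.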
Community
After training on large collections of 2D videos, will video foundation models naturally encode 3D structure and ego-motion? Our study finds that state-of-the-art video generators develop strong, generalizable 3D understanding that can even surpass expert 3D models, despite being trained only on 2D video data.
Project page: https://vidfm-3d-probe.github.io
The following papers were recommended by the Semantic Scholar API
- Muskie: Multi-view Masked Image Modeling for 3D Vision Pre-training (2025)
- Unified Semantic Transformer for 3D Scene Understanding (2025)
- E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training (2025)
- WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling (2025)
- View-Consistent Diffusion Representations for 3D-Consistent Video Generation (2025)
- Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment (2025)
- Emergent Extreme-View Geometry in 3D Foundation Models (2025)