BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries Paper • 2601.15197 • Published 24 days ago • 54
Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights Paper • 2512.01816 • Published Dec 1, 2025 • 93
Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs Paper • 2510.01954 • Published Oct 2, 2025 • 14
Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning Paper • 2410.06373 • Published Oct 8, 2024 • 36
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs Paper • 2406.14544 • Published Jun 20, 2024 • 35
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding Paper • 2406.14515 • Published Jun 20, 2024 • 33