Abstract
EasyVideoR1 is an efficient reinforcement learning framework for video understanding that improves training throughput, supports diverse video tasks, enables joint image-video training, and provides comprehensive evaluation across multiple benchmarks.
Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important, yet it remains largely unexplored due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored to the video modality. In this work, we present EasyVideoR1, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47× throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, which benefits learning on more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.
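The tensor-caching idea behind contribution (1) can be sketched in a few lines: decode and preprocess each video once, persist the resulting tensors on disk keyed by the video's identity, and let every subsequent RL rollout load from the cache instead of re-decoding. This is a minimal illustrative sketch, not EasyVideoR1's actual API; the names (`TensorCache`, `decode_video`) and the pickle-based storage are assumptions for demonstration.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

def decode_video(path: str) -> list[list[float]]:
    """Stand-in for expensive decoding/preprocessing (frame sampling, resizing).

    A real pipeline would call a decoder such as ffmpeg here; we return a
    dummy "frame tensor" so the sketch stays self-contained.
    """
    return [[0.0, 1.0, 2.0]]

class TensorCache:
    """Hypothetical offline tensor cache: one decode per video, many reuses."""

    def __init__(self, cache_dir: str):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.decode_calls = 0  # tracks how often we actually decode

    def _key(self, video_path: str) -> Path:
        # Key the cache entry by a hash of the video identifier.
        digest = hashlib.sha256(video_path.encode()).hexdigest()
        return self.cache_dir / f"{digest}.pkl"

    def get(self, video_path: str) -> list[list[float]]:
        key = self._key(video_path)
        if key.exists():                 # cache hit: skip decoding entirely
            return pickle.loads(key.read_bytes())
        self.decode_calls += 1           # cache miss: decode exactly once
        frames = decode_video(video_path)
        key.write_bytes(pickle.dumps(frames))
        return frames

cache = TensorCache(tempfile.mkdtemp())
for _ in range(8):                       # eight rollouts over the same clip
    frames = cache.get("clip_001.mp4")
print(cache.decode_calls)  # 1
```

Because RLVR generates many rollouts per prompt, the same clip is revisited repeatedly during training, which is why amortizing decoding this way can plausibly account for the reported throughput gain.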
Community
EasyVideoR1 is an easier RL framework for Video Understanding
The following similar papers were recommended by the Semantic Scholar API:
- Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale (2026)
- STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering (2026)
- EVA: Efficient Reinforcement Learning for End-to-End Video Agent (2026)
- VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning (2026)
- Thinking in Streaming Video (2026)
- POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs (2026)
- Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos (2026)