OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning
Abstract
OmniCap-IF is introduced as the first comprehensive benchmark for evaluating instruction-following capabilities in omni-modal captioning, revealing significant performance disparities and a format-content tradeoff in multi-modal reasoning.
While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical "format-content tradeoff", demonstrating that increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.
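The two-dimensional assessment described in the abstract (format correctness and content correctness) can be pictured with a small sketch. This is not the benchmark's actual scoring code: the `ChecklistItem` class, the `score_caption` function, the example caption, and the string-based checks are all hypothetical, and the abstract does not say how individual items are judged. Simple predicates stand in for whatever judging procedure the authors use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChecklistItem:
    """One hypothetical checklist entry; names are illustrative, not from the paper."""
    dimension: str                 # "format" or "content"
    description: str
    check: Callable[[str], bool]   # stand-in for the real judging procedure


def score_caption(caption: str, checklist: list[ChecklistItem]) -> dict[str, float]:
    """Return the fraction of passed checklist items per dimension."""
    totals = {"format": [0, 0], "content": [0, 0]}  # [passed, total]
    for item in checklist:
        passed, total = totals[item.dimension]
        totals[item.dimension] = [passed + int(item.check(caption)), total + 1]
    return {dim: (p / n if n else 0.0) for dim, (p, n) in totals.items()}


# Made-up caption and constraints, purely for illustration.
caption = "- A dog barks at 0:03.\n- A door slams shut."
checklist = [
    ChecklistItem("format", "uses exactly two bullet points",
                  lambda c: sum(line.startswith("- ") for line in c.splitlines()) == 2),
    ChecklistItem("format", "stays under 20 words",
                  lambda c: len(c.split()) < 20),
    ChecklistItem("content", "mentions the barking sound",
                  lambda c: "bark" in c.lower()),
    ChecklistItem("content", "mentions background music",
                  lambda c: "music" in c.lower()),
]

scores = score_caption(caption, checklist)
print(scores)
```

Keeping the two dimensions as separate scores, rather than one blended number, is what makes a tradeoff between formatting and content visible at all: a caption can satisfy every structural constraint while still missing audio or visual details.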
Community
OmniCap-IF targets instruction-following in omni-modal video captioning, where models must not only understand visual and audio streams, but also obey complex user-specified structural, stylistic, temporal, visual, audio, and audio-visual constraints. We introduce the OmniCap-IF benchmark for fine-grained checklist-based evaluation, construct the OmniCap-IF-54K instruction-tuning dataset, and train the OmniCaptioner-IF model family to improve controllable omni-video captioning.
Cool paper - I liked the way "OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning" frames the problem without making it feel too abstract.
Curious if you think this would still work once the setup gets messier in the wild?
I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/4cd4178c-741c-4e12-b8b6-ceca732fe4dd
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding (2026)
- AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs (2026)
- StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering (2026)
- Stage-adaptive Token Selection for Efficient Omni-modal LLMs (2026)
- OmniEncoder: See, Hear, and Feel Continuous Motion Like Humans With One Encoder (2026)
- LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning (2026)
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation (2026)
Models citing this paper 2
NJU-LINK/OmniCaptioner-IF-3B
Datasets citing this paper 2
NJU-LINK/OmniCap-IF
NJU-LINK/OmniCap-IF-54K
