OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
Abstract
A novel video-to-script task is introduced along with OmniScript, an 8B-parameter omni-modal language model trained via a progressive pipeline for long-form narrative comprehension and temporal localization.
Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, aiming to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues. To facilitate this, we construct a first-of-its-kind human-annotated benchmark and propose a temporally-aware hierarchical evaluation framework. Furthermore, we present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension. OmniScript is trained via a progressive pipeline that leverages chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning using temporally segmented rewards. Extensive experiments demonstrate that despite its parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to state-of-the-art proprietary models, including Gemini 3-Pro, in both temporal localization and multi-field semantic accuracy.
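The paper does not spell out the exact form of the temporally segmented reward used during reinforcement learning. As one plausible reading, a per-scene reward could score each reference scene by its best temporal overlap with a predicted scene. The sketch below is a hypothetical illustration of that idea, not the paper's implementation; the function names and the averaging scheme are assumptions.

```python
def temporal_iou(pred, ref):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = max(pred[1], ref[1]) - min(pred[0], ref[0])
    return inter / union if union > 0 else 0.0

def segmented_reward(pred_scenes, ref_scenes):
    """Mean best-match temporal IoU over reference scenes.

    Hypothetical stand-in for a 'temporally segmented reward':
    each ground-truth scene contributes its best overlap with
    any predicted scene, averaged over all ground-truth scenes.
    """
    if not ref_scenes:
        return 0.0
    return sum(
        max((temporal_iou(p, r) for p in pred_scenes), default=0.0)
        for r in ref_scenes
    ) / len(ref_scenes)
```

A perfect segmentation yields a reward of 1.0, while missed or misaligned scenes lower the average; semantic terms (e.g. per-field accuracy) would presumably be combined with this temporal term in practice.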
Community
OmniScript is an 8B-parameter omni-modal LLM for cinematic video understanding and structured script generation. It automatically translates narrative videos into temporally grounded scripts, segmenting them scene-by-scene, identifying precise timestamps, and generating hierarchical details encompassing character actions, dialogues, expressions, and audio cues.
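The hierarchical, scene-by-scene output described above can be pictured as a simple nested record. The schema below is a minimal sketch, assuming one entry per scene with a timestamp span and the four fields named in the paper; the class and field names are illustrative, not the authors' actual format.

```python
from dataclasses import dataclass, field

@dataclass
class SceneEntry:
    """One temporally grounded scene of the generated script."""
    start: float  # scene start time, in seconds
    end: float    # scene end time, in seconds
    actions: list[str] = field(default_factory=list)      # character actions
    dialogues: list[str] = field(default_factory=list)    # spoken lines
    expressions: list[str] = field(default_factory=list)  # facial/emotional cues
    audio_cues: list[str] = field(default_factory=list)   # music, SFX, ambience

@dataclass
class Script:
    """A full script: an ordered list of scene entries."""
    title: str
    scenes: list[SceneEntry] = field(default_factory=list)

# Example: a one-scene script
script = Script(
    title="Demo",
    scenes=[SceneEntry(start=0.0, end=12.5, dialogues=["Hello."])],
)
```

Representing the script this way keeps temporal localization (the `start`/`end` span) and multi-field semantic content in one record, which mirrors how the evaluation framework scores both aspects per scene.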
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding (2026)
- AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction (2026)
- LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs (2026)
- OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning (2026)
- AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation (2026)
- OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering (2026)
- CutClaw: Agentic Hours-Long Video Editing via Music Synchronization (2026)