nielsr (HF Staff) committed · Commit 22a3651 · verified · 1 Parent(s): 963db56

Improve LLaVA-Video-7B-Qwen2-TPO model card with abstract and updated links


This PR improves the model card for LLaVA-Video-7B-Qwen2-TPO by:

- **Adding the paper abstract** for better context and understanding of the model's capabilities.
- **Updating the paper link** within the model description to point to the canonical Hugging Face paper page (`https://huggingface.co/papers/2501.13919`). The existing arXiv badge is retained.
- **Removing the redundant "Project page" and "Code" links**, as these are already prominently displayed via badges at the top of the model card.
- **Updating the BibTeX citation** to include an additional author listed in the project's GitHub README, making the citation as complete and accurate as possible.

Files changed (1): README.md (+9 -7)
README.md CHANGED
```diff
@@ -10,19 +10,20 @@ pipeline_tag: video-text-to-text
 <img src="cvpr_figure_TPO.png"></img>
 # LLaVA-Video-7B-Qwen2-TPO
 
-LLaVA-Video-7B-Qwen2-TPO, introduced by paper [Temporal Preference Optimization for Long-form Video Understanding](https://huggingface.co/papers/2501.13919v1), optimized
+**Abstract**
+
+Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences. By optimizing on these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. Extensive experiments on three long-form video understanding benchmarks--LongVideoBench, MLVU, and Video-MME--demonstrate the effectiveness of TPO across two state-of-the-art video-LMMs. Notably, LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark, underscoring the potential of TPO as a scalable and efficient solution for advancing temporal reasoning in long-form video understanding.
+
+LLaVA-Video-7B-Qwen2-TPO, introduced by paper [Temporal Preference Optimization for Long-form Video Understanding](https://huggingface.co/papers/2501.13919), optimized
 by temporal preference based on LLaVA-Video-7B-Qwen2. The LLaVA-Video-7B-Qwen2-TPO model establishes state-of-the-art performance across a range of
 benchmarks, demonstrating an average performance improvement of 1.5% compared to LLaVA-Video-7B.
 Notably, it emerges as the leading 7B parameter model on the Video-MME benchmark.
 
-Project page: https://ruili33.github.io/tpo_website/
-Code: https://github.com/ruili33/TPO
-
 ## Evaluation Results
 | **Model** | **Size** | **LongVideoBench** | **MLVU** | **VideoMME (Average)** |
 |-------------------------------------|----------|---------------------|----------|----------------------|
 | **NVILA [1]** | 7B | 57.7 | 70.1 | 64.2/70.0 |
-| **LLaVA-Video-7B [2]** | 7B | 58.2 | 70.8 | 63.3/69.7 |
+| **LLaVA-Video-7B [2]** | 7B | 58.2 | 70.8 | 63.3/69.7 |\
 | **LLaVA-Video-7B-Qwen2-TPO** | 7B | **60.1** | **71.1** | **65.6/71.5** |
 
```
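The abstract added in this hunk describes TPO as preference learning over curated pairs of well-grounded versus less accurate temporal responses. For orientation only, here is a minimal sketch of the DPO-style pairwise objective that such preference post-training commonly builds on; the function and argument names are hypothetical, and the paper's exact TPO objective may differ.

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style pairwise loss: rank the well-grounded (chosen)
    temporal response above the less accurate (rejected) one,
    measured relative to a frozen reference model."""
    chosen_margin = logp_chosen - ref_logp_chosen        # policy vs. reference on chosen
    rejected_margin = logp_rejected - ref_logp_rejected  # policy vs. reference on rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy pair: summed token log-probabilities under policy and reference models.
# In training, the policy log-probs carry gradients; these constants do not.
loss = preference_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                       torch.tensor([-13.0]), torch.tensor([-14.9]))
print(loss)
```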
 
```diff
@@ -76,7 +77,8 @@ video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].c
 video = [video]
 conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
 time_instruciton = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}.Please answer the following questions related to this video."
-question = DEFAULT_IMAGE_TOKEN + f"{time_instruciton}\nPlease describe this video in detail."
+question = DEFAULT_IMAGE_TOKEN + f"{time_instruciton}
+Please describe this video in detail."
 conv = copy.deepcopy(conv_templates[conv_template])
 conv.append_message(conv.roles[0], question)
 conv.append_message(conv.roles[1], None)
```
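Note that, as rendered, the two added lines split the f-string across a literal line break, which is not valid Python; the newline should remain the `\n` escape shown in the removed line. Below is a self-contained sketch of the intended prompt construction, assuming the LLaVA-NeXT `llava` package from the project's repository; the `time_instruciton` spelling follows the card's snippet, and the timing string here is a stand-in.

```python
import copy

from llava.constants import DEFAULT_IMAGE_TOKEN  # image placeholder token
from llava.conversation import conv_templates

conv_template = "qwen_1_5"  # chat template matching the Qwen2-based model
# Stand-in for the card's timing string, normally built from the sampled frames.
time_instruciton = "The video lasts for 60.00 seconds, and 64 frames are uniformly sampled from it."

# Keep the newline as an escape inside the f-string:
question = DEFAULT_IMAGE_TOKEN + f"{time_instruciton}\nPlease describe this video in detail."

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)  # leave the assistant turn empty for generation
prompt = conv.get_prompt()
```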
````diff
@@ -108,7 +110,7 @@ This project utilizes certain datasets and checkpoints that are subject to their
 ```
 @article{li2025temporal,
 title={Temporal Preference Optimization for Long-Form Video Understanding},
-author={Li, Rui and Wang, Xiaohan and Zhang, Yuhui and Wang, Zeyu and Yeung-Levy, Serena},
+author={Li, Rui and Wang, Xiaohan and Zhang, Yuhui and Zohar, Orr and Wang, Zeyu and Yeung-Levy, Serena},
 journal={arXiv preprint arXiv:2501.13919},
 year={2025}
 }
````
 