Improve LLaVA-Video-7B-Qwen2-TPO model card with abstract and updated links
This PR improves the model card for LLaVA-Video-7B-Qwen2-TPO by:
- **Adding the paper abstract** for better context and understanding of the model's capabilities.
- **Updating the paper link** within the model description to point to the canonical Hugging Face paper page (`https://huggingface.co/papers/2501.13919`). The existing arXiv badge is retained.
- **Removing redundant "Project page" and "Code" links** as these are already prominently displayed via badges at the top of the model card.
- **Updating the BibTeX citation** to include an additional author found in the project's GitHub README, ensuring the citation is as complete and accurate as possible.
README.md CHANGED
````diff
@@ -10,19 +10,20 @@ pipeline_tag: video-text-to-text
 <img src="cvpr_figure_TPO.png"></img>
 # LLaVA-Video-7B-Qwen2-TPO
 
-LLaVA-Video-7B-Qwen2-TPO, introduced by paper [Temporal Preference Optimization for Long-form Video Understanding](https://arxiv.org/abs/2501.13919), optimized
+**Abstract**
+
+Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences. By optimizing on these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. Extensive experiments on three long-form video understanding benchmarks--LongVideoBench, MLVU, and Video-MME--demonstrate the effectiveness of TPO across two state-of-the-art video-LMMs. Notably, LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark, underscoring the potential of TPO as a scalable and efficient solution for advancing temporal reasoning in long-form video understanding.
+
+LLaVA-Video-7B-Qwen2-TPO, introduced by paper [Temporal Preference Optimization for Long-form Video Understanding](https://huggingface.co/papers/2501.13919), optimized
 by temporal preference based on LLaVA-Video-7B-Qwen2. The LLaVA-Video-7B-Qwen2-TPO model establishes state-of-the-art performance across a range of
 benchmarks, demonstrating an average performance improvement of 1.5% compared to LLaVA-Video-7B.
 Notably, it emerges as the leading 7B parameter model on the Video-MME benchmark.
 
-Project page: https://ruili33.github.io/tpo_website/
-Code: https://github.com/ruili33/TPO
-
 ## Evaluation Results
 | **Model** | **Size** | **LongVideoBench** | **MLVU** | **VideoMME (Average)** |
 |-------------------------------------|----------|---------------------|----------|----------------------|
 | **NVILA [1]** | 7B | 57.7 | 70.1 | 64.2/70.0 |
-| **LLaVA-Video-7B [2]** | 7B | 58.2 | 70.8 | 63.3/69.7
+| **LLaVA-Video-7B [2]** | 7B | 58.2 | 70.8 | 63.3/69.7 |
 | **LLaVA-Video-7B-Qwen2-TPO** | 7B | **60.1** | **71.1** | **65.6/71.5** |
 
 
````
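The abstract added in the hunk above frames TPO as preference learning over pairs of well-grounded versus less accurate temporal responses. As rough orientation only, the sketch below shows the generic DPO-style pairwise objective that this kind of post-training typically builds on; it is not the paper's exact TPO loss, data pipeline, or hyperparameters (see the project code linked from the model card badges for those), and all names here are illustrative.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(policy_chosen_logps, policy_rejected_logps,
                             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Generic DPO-style loss over (preferred, dispreferred) response pairs.

    Inputs are summed log-probabilities that the policy and a frozen reference
    model assign to the preferred (well-grounded) and dispreferred (less
    accurate) responses for the same video + question. Illustrative stand-in,
    not the paper's exact TPO objective.
    """
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to score the preferred response above the dispreferred one.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy batch of 4 preference pairs with random log-probabilities.
torch.manual_seed(0)
loss = pairwise_preference_loss(torch.randn(4), torch.randn(4),
                                torch.randn(4), torch.randn(4))
print(float(loss))
```

Per the abstract, TPO curates such pairs via self-training at two granularities (localized segment-level and comprehensive full-video grounding) rather than relying on manual annotation.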
````diff
@@ -76,7 +77,8 @@ video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].c
 video = [video]
 conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
 time_instruciton = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}.Please answer the following questions related to this video."
-question = DEFAULT_IMAGE_TOKEN + f"{time_instruciton}\nPlease describe this video in detail."
+question = DEFAULT_IMAGE_TOKEN + f"{time_instruciton}
+Please describe this video in detail."
 conv = copy.deepcopy(conv_templates[conv_template])
 conv.append_message(conv.roles[0], question)
 conv.append_message(conv.roles[1], None)
````
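For context on the lines touched in this hunk: `question` carries the image placeholder token plus the timing instruction, and is wrapped by the `qwen_1_5` conversation template before tokenization. Below is a minimal sketch of that assembly, assuming the LLaVA-NeXT package layout the full model-card example uses (`llava.constants`, `llava.conversation`); the surrounding values are placeholders, so check the complete README for the actual code.

```python
import copy

from llava.constants import DEFAULT_IMAGE_TOKEN   # placeholder the vision features replace
from llava.conversation import conv_templates     # chat templates shipped with LLaVA-NeXT

# Illustrative values; in the model card these come from the sampled video.
video_time = 60.0
frame_time = "0.00s,12.00s,24.00s,36.00s,48.00s"
num_frames = 5

# Variable name kept as spelled in the model card.
time_instruciton = (f"The video lasts for {video_time:.2f} seconds, and {num_frames} frames are "
                    f"uniformly sampled from it. These frames are located at {frame_time}."
                    "Please answer the following questions related to this video.")
question = DEFAULT_IMAGE_TOKEN + f"{time_instruciton}\nPlease describe this video in detail."

conv = copy.deepcopy(conv_templates["qwen_1_5"])
conv.append_message(conv.roles[0], question)  # user turn: image token + instruction + request
conv.append_message(conv.roles[1], None)      # empty assistant turn -> generation starts here
prompt = conv.get_prompt()                    # chat-formatted string handed to the tokenizer
print(prompt)
```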
````diff
@@ -108,7 +110,7 @@ This project utilizes certain datasets and checkpoints that are subject to their
 ```
 @article{li2025temporal,
 title={Temporal Preference Optimization for Long-Form Video Understanding},
-author={Li, Rui and Wang, Xiaohan and Zhang, Yuhui and Wang, Zeyu and Yeung-Levy, Serena},
+author={Li, Rui and Wang, Xiaohan and Zhang, Yuhui and Zohar, Orr and Wang, Zeyu and Yeung-Levy, Serena},
 journal={arXiv preprint arXiv:2501.13919},
 year={2025}
 }
````