OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
Abstract
OmniSIFT is a modality-asymmetric token compression framework for Omni-LLMs that reduces computational overhead through spatio-temporal video pruning and vision-guided audio selection, while maintaining, and in some cases surpassing, full-token performance.
Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video pruning module that removes video redundancy arising from both intra-frame structure and inter-frame overlap, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks demonstrate the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT introduces only 4.85M parameters while maintaining lower latency than training-free baselines such as OmniZip. With merely 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the performance of the full-token model on several tasks.
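A minimal sketch of the core mechanism described above: differentiable token selection via a straight-through estimator, applied asymmetrically (video pruned first, audio selected under visual guidance). The module names, scoring functions, and the 25% keep ratio wiring are illustrative assumptions, not the authors' implementation.

```python
# Sketch: straight-through token selection for a two-stage, modality-asymmetric
# compressor. Forward pass keeps a hard top-k subset; backward pass routes
# gradients through a soft (sigmoid) surrogate so the selector trains end-to-end.
import torch
import torch.nn as nn


def ste_topk_mask(scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Hard top-k keep mask in the forward pass; soft gradients in the backward pass."""
    k = max(1, int(scores.shape[-1] * keep_ratio))
    soft = torch.sigmoid(scores)                      # differentiable surrogate
    hard = torch.zeros_like(soft)
    hard.scatter_(-1, soft.topk(k, dim=-1).indices, 1.0)  # binary keep/drop
    return hard + (soft - soft.detach())              # straight-through trick


class VideoPruner(nn.Module):
    """Scores video tokens (a stand-in for spatio-temporal redundancy scoring)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, video_tokens, keep_ratio: float = 0.25):
        s = self.score(video_tokens).squeeze(-1)      # [B, Nv] per-token scores
        return ste_topk_mask(s, keep_ratio)           # [B, Nv] keep mask


class AudioSelector(nn.Module):
    """Scores audio tokens conditioned on pooled visual context (vision-guided)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, audio_tokens, video_tokens, video_mask, keep_ratio: float = 0.25):
        # Pool only the kept video tokens as guidance for audio selection.
        w = video_mask.unsqueeze(-1)
        vis_ctx = (video_tokens * w).sum(1) / w.sum(1).clamp_min(1e-6)   # [B, D]
        ctx = vis_ctx.unsqueeze(1).expand(-1, audio_tokens.shape[1], -1)
        s = self.score(torch.cat([audio_tokens, ctx], dim=-1)).squeeze(-1)
        return ste_topk_mask(s, keep_ratio)


# Toy usage: prune video first, then select audio tokens under visual guidance.
B, Nv, Na, D = 2, 64, 32, 16
video, audio = torch.randn(B, Nv, D), torch.randn(B, Na, D)
vmask = VideoPruner(D)(video)
amask = AudioSelector(D)(audio, video, vmask)
print(vmask.sum(-1), amask.sum(-1))   # roughly 25% of tokens kept per modality
```

The straight-through estimator makes the hard keep/drop decision usable at inference while still providing gradients to the scoring heads during training, which is why the whole compressor can be optimized end-to-end with the backbone frozen or jointly, depending on the setup.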
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs (2025)
- ConsensusDrop: Fusing Visual and Cross-Modal Saliency for Efficient Vision Language Models (2026)
- OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding (2025)
- VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration (2026)
- ViTCoP: Accelerating Large Vision-Language Models via Visual and Textual Semantic Collaborative Pruning (2026)
- FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models (2025)
- FastSLM: Hierarchical Frame Q-Former for Effective Speech Modality Adaptation (2026)
