InternVideo3-8B-Instruct

Introduction

InternVideo3 is a multimodal large language model designed for long-horizon video understanding and agentic reasoning. It introduces Multimodal Contextual Reasoning (MCR), an efficient formulation that unifies perception, planning, tool use, self-reflection, and memory within a single shared context, enabling recursive multi-step reasoning over long videos.

Key Features

  • M²LA (Multimodal Multi-head Latent Attention): A KV-cache-efficient attention architecture that reduces memory footprint via low-rank latent factorization, enabling long-context reasoning (up to 256K tokens) without dropping tokens.
  • Long-Video Understanding: Trained with a short-to-long curriculum (up to 2048 frames at 4fps), supporting hour-long video comprehension.
  • Agentic Video Reasoning: Built-in support for recursive perception-action loops with tool use (temporal grounding, ASR, web search, video segmentation) and self-verification.
  • Advanced Post-Training: Combines rule-based group sequence policy optimization (R-GSPO) and on-policy distillation from Qwen3-235B for improved temporal reasoning.

Architecture

Component        Details
Vision Encoder   27-layer ViT, hidden_size=1152, patch_size=16, temporal_patch_size=2
Language Model   36 layers, hidden_size=4096, 32 attention heads
KV Latent Rank   896 per layer
Max Context      262,144 tokens
Precision        BFloat16
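
To put the KV latent rank in context, the sketch below estimates the cache size for one sequence at the maximum context length. It rests on two assumptions that the figures above do not confirm: that the cache holds exactly one 896-dimensional latent per token per layer and nothing else, and that a useful point of comparison is a hypothetical uncompressed baseline caching full keys and values for all heads (2 × hidden_size values). `kv_cache_bytes` is an illustrative helper, not part of the released code.

```python
def kv_cache_bytes(values_per_token_per_layer: int, num_layers: int,
                   num_tokens: int, bytes_per_value: int = 2) -> int:
    """Total cache size in bytes for a single sequence."""
    return values_per_token_per_layer * num_layers * num_tokens * bytes_per_value


NUM_LAYERS = 36        # language model depth (from the Architecture section)
HIDDEN_SIZE = 4096     # from the Architecture section
LATENT_RANK = 896      # KV latent rank per layer (from the Architecture section)
MAX_CONTEXT = 262_144  # tokens (from the Architecture section)
BF16_BYTES = 2         # BFloat16 stores each value in 2 bytes

# Assumption: only the latent vector is cached per token per layer.
latent_bytes = kv_cache_bytes(LATENT_RANK, NUM_LAYERS, MAX_CONTEXT, BF16_BYTES)

# Assumption: hypothetical baseline that caches full K and V (2 * hidden_size).
baseline_bytes = kv_cache_bytes(2 * HIDDEN_SIZE, NUM_LAYERS, MAX_CONTEXT, BF16_BYTES)

GIB = 1024 ** 3
print(f"latent cache:   {latent_bytes / GIB:.2f} GiB")
print(f"baseline cache: {baseline_bytes / GIB:.2f} GiB")
print(f"reduction:      {baseline_bytes / latent_bytes:.2f}x")
```

Under these assumptions the latent cache is 15.75 GiB at 262,144 tokens, against 144 GiB for the uncompressed baseline. A model that already shares key/value heads would start from a smaller baseline, so treat the ratio as an upper bound rather than a measured saving.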

Quickstart

Requirements

pip install "transformers>=4.57.3" torch qwen-vl-utils

Basic Usage

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "OpenGVLab/InternVideo3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
    trust_remote_code=True,
)

processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
)

Text-only Conversation

messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "Please introduce yourself."}],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = processor(text=text, images=None, videos=None, do_resize=False, return_tensors="pt")
inputs = inputs.to(model.device)

output = model.generate(**inputs, max_new_tokens=1024, use_cache=True)
generated_ids = [o[len(i):] for i, o in zip(inputs.input_ids, output)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

Video Understanding

video_path = "your_video.mp4"

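fps = 1
min_pixels = 128 * 32 * 32
max_pixels = 128 * 32 * 32
# Frame bounds for the total pixel budget set on the video processor below.
# Both values are placeholders (assumptions, not confirmed defaults); 2048 matches
# the maximum frame count listed under Key Features. Tune them to your video
# length and available memory.
min_frames = 4
max_frames = 2048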

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": video_path, "fps": fps},
            {"type": "text", "text": "Please describe this video in detail."},
        ],
    }
]

processor.video_processor.size = {
    "longest_edge": max_pixels * max_frames,
    "shortest_edge": min_pixels * min_frames,
}

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    fps=fps,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

output = model.generate(**inputs, max_new_tokens=1024, use_cache=True)
generated_ids = [o[len(i):] for i, o in zip(inputs.input_ids, output)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
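
The fixed `fps = 1` above samples 3,600 frames from a one-hour video, which exceeds the 2048-frame maximum listed under Key Features. A minimal sketch of choosing the sampling rate from the video duration follows. `pick_fps` is a hypothetical helper, and both the 2048-frame cap and the 4 fps preference are taken from the figures in this card, not from the processor's own defaults; whether the processor also caps frames internally is not stated here.

```python
def pick_fps(duration_s: float, max_frames: int = 2048,
             preferred_fps: float = 4.0) -> float:
    """Return the highest sampling rate <= preferred_fps that fits the frame budget."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return min(preferred_fps, max_frames / duration_s)


for minutes in (5, 30, 60):
    duration_s = minutes * 60
    chosen = pick_fps(duration_s)
    # round() avoids an off-by-one from floating-point error in chosen * duration_s
    print(f"{minutes:>2} min: fps={chosen:.3f}, about {round(chosen * duration_s)} frames")
```

The returned value can be passed as `fps` in both the message entry and the `apply_chat_template` call; the duration itself has to come from a video reader of your choice.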

Image Understanding

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "your_image.jpg"},
            {"type": "text", "text": "Please describe this image in detail."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
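# Load the image referenced in `messages` (process_vision_info ships with qwen-vl-utils).
from qwen_vl_utils import process_vision_info

images, _ = process_vision_info(messages)
inputs = processor(text=text, images=images, videos=None, do_resize=False, return_tensors="pt")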
inputs = inputs.to(model.device)

output = model.generate(**inputs, max_new_tokens=1024, use_cache=True)
generated_ids = [o[len(i):] for i, o in zip(inputs.input_ids, output)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

Training Pipeline

  1. Continued Pretraining (CPT): Recovers language ability and aligns vision features after M²LA conversion, using a mixture of text, image-text pairs, and video captions.
  2. Short-to-Long SFT: Two-stage curriculum. Stage 1 uses 2 fps / 512 frames (32K tokens); Stage 2 uses 4 fps / 2048 frames (256K tokens).
  3. R-GSPO: Rule-based reinforcement learning on temporal grounding (IoU reward) and video QA (correctness reward) to improve temporal reasoning.
  4. On-Policy Distillation: Transfers capabilities from Qwen3-235B on samples where the student underperforms, using reverse-KL on student-sampled trajectories.
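
Two quantities named in steps 3 and 4 can be sketched in a few lines: the temporal IoU that a grounding reward would be based on, and the reverse KL divergence for a single next-token distribution. This is an illustration under assumptions, not the training code: the exact reward shaping in R-GSPO and the way the distillation loss is aggregated over tokens are not described above, and `temporal_iou` and `reverse_kl` are hypothetical helper names.

```python
import math
from typing import Sequence, Tuple


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """IoU of two (start, end) segments in seconds; 0.0 when they do not overlap."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def reverse_kl(student: Sequence[float], teacher: Sequence[float],
               eps: float = 1e-12) -> float:
    """KL(student || teacher) over one vocabulary distribution.

    The expectation is taken under the student, which is why the loss is
    evaluated on trajectories sampled from the student itself.
    """
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student, teacher)
        if s > 0
    )


# Predicted segment 10-20 s against ground truth 15-25 s: 5 s overlap, 15 s union.
print(f"IoU reward: {temporal_iou((10.0, 20.0), (15.0, 25.0)):.3f}")

# A student that is over-confident relative to the teacher gets a positive penalty.
print(f"reverse KL: {reverse_kl([0.9, 0.1], [0.5, 0.5]):.3f}")
```

Reverse KL penalizes the student for placing probability where the teacher places little, so it tends to favor mode-seeking behavior over covering the whole teacher distribution.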

Citation

@misc{yan2026internvideo3agentifyfoundationmodels,
      title={InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning}, 
      author={Ziang Yan and Sheng Xia and Jiashuo Yu and Yue Wu and Tianxiang Jiang and Songze Li and Kanghui Tian and Yicheng Xu and Yinan He and Kai Chen and Limin Wang and Yu Qiao and Yi Wang},
      year={2026},
      eprint={2606.12195},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.12195}, 
}

License

This project is released under the Apache 2.0 License.