TRASER:
TRASER is the video scene graph generation model introduced in Synthetic Visual Genome 2 (SVG2). Given a video and per-object segmentation trajectories, it generates a structured spatio-temporal scene graph describing objects, attributes, and their relations across time.
Paper: Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos
Authors: Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Quan Kong, Rajat Saini, Ranjay Krishna. (Allen Institute for AI · University of Washington · Woven by Toyota)
Model Architecture
TRASER extends Qwen2.5-VL-3B-Instruct with two trainable Perceiver Resampler modules that implement Trajectory-Aligned Token Arrangement:
| Module | Abbrev. | Role |
|---|---|---|
| Object-Trajectory Resampler | OTR | Aggregates all cross-frame tokens for one object into a global summary |
| Temporal-Windows Resampler | TWR | Compresses per-object tokens within each temporal window into a fixed set of latents |
For each tracked object the LLM sees a structured token block:
<obj_traj_start> Object N: <|vision_start|> [OTR: N latents] <t1-t2> [TWR: N latents] <t2-t3> [TWR: N latents] ... <|vision_end|> <obj_traj_end>
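The layout above can be illustrated with a short sketch that assembles the placeholder string for one object. It assumes a fixed number of latents per resampler output and a list of window boundaries. The `<latent>` placeholder, the function name `build_object_block`, and the timestamp formatting are illustrative assumptions, not taken from the TRASER code.

```python
def build_object_block(obj_id, window_bounds, n_latents=2, latent="<latent>"):
    """Return the token layout for one tracked object as a single string.

    window_bounds: increasing timestamps [t1, t2, ..., tk]; each consecutive
    pair (t_i, t_{i+1}) is one temporal window summarized by the TWR.
    n_latents and latent are hypothetical placeholders for illustration.
    """
    if len(window_bounds) < 2:
        raise ValueError("need at least two boundaries to form one window")

    parts = ["<obj_traj_start>", f"Object {obj_id}:", "<|vision_start|>"]
    # One global summary from the Object-Trajectory Resampler (OTR).
    parts.extend([latent] * n_latents)
    # One fixed-size group from the Temporal-Windows Resampler (TWR) per window.
    for start, end in zip(window_bounds, window_bounds[1:]):
        parts.append(f"<{start}-{end}>")
        parts.extend([latent] * n_latents)
    parts += ["<|vision_end|>", "<obj_traj_end>"]
    return " ".join(parts)


block = build_object_block(obj_id=0, window_bounds=[0, 4, 8], n_latents=2)
print(block)
```

Under these assumptions an object with k window boundaries occupies (1 + (k - 1)) × n_latents latent slots, so the per-object sequence length depends on the number of windows rather than on how many raw visual tokens the object covers.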
How to Get Started
Installation
pip install "transformers>=4.54.0" torch pycocotools
Prepare Inputs
Two inputs are required:
- Video: any format supported by qwen_vl_utils (e.g. .mp4)
- Mask JSON: per-frame, per-object RLE segmentation masks in COCO pycocotools format:
[
// frame 0
[{"size": [H, W], "counts": "..."}, {"size": [H, W], "counts": "..."}, ...],
// frame 1
[...]
]
See example/2401075277_rle.json for a complete example.
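Before running inference it can help to check that a mask file follows the nested layout shown above. The sketch below uses only the standard library. The function name `validate_mask_frames` is hypothetical, and the check that every frame lists the same number of objects (so that index i refers to the same object in every frame) is an assumption about the format rather than something the model card states.

```python
import json


def validate_mask_frames(frames):
    """Check the nested list layout and return (num_frames, num_objects).

    Assumes every frame lists the same objects in the same order, so that
    index i refers to one object across all frames (an assumption).
    """
    if not isinstance(frames, list) or not frames:
        raise ValueError("expected a non-empty list of frames")
    num_objects = len(frames[0])
    for f_idx, frame in enumerate(frames):
        if not isinstance(frame, list) or len(frame) != num_objects:
            raise ValueError(f"frame {f_idx}: expected a list of {num_objects} objects")
        for o_idx, rle in enumerate(frame):
            if not isinstance(rle, dict):
                raise ValueError(f"frame {f_idx}, object {o_idx}: expected a dict")
            size, counts = rle.get("size"), rle.get("counts")
            if not (isinstance(size, list) and len(size) == 2):
                raise ValueError(f"frame {f_idx}, object {o_idx}: 'size' must be [H, W]")
            if not isinstance(counts, str):
                raise ValueError(f"frame {f_idx}, object {o_idx}: 'counts' must be a string")
    return len(frames), num_objects


# Tiny in-memory example; the counts strings are dummies, not real RLE data.
example = json.loads(
    '[[{"size": [4, 6], "counts": "abc"}], [{"size": [4, 6], "counts": "def"}]]'
)
print(validate_mask_frames(example))
```

For a real file, load it with `json.load` and pass the result to the function. The `//` comments in the layout above are explanatory only, since JSON does not allow comments. If the masks are produced with pycocotools, note that `pycocotools.mask.encode` returns `counts` as bytes, which needs decoding to a string (for example with `.decode("ascii")`) before it can be written to JSON.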
Run Inference
python inference.py \
--model_path /path/to/vsg_release_model \
--video_path /path/to/video.mp4 \
--mask_path /path/to/masks.json \
--out_dir ./output
CLI Arguments
| Argument | Default | Description |
|---|---|---|
| --model_path | required | Path to this model directory |
| --video_path | required | Input video file |
| --mask_path | required | Per-object RLE mask JSON |
| --out_dir | ./output | Directory to write output.txt |
| --max_objects | 40 | Maximum number of objects to process per video |
Quickstart with the Bundled Example
python inference.py \
--model_path . \
--video_path example/2401075277.mp4 \
--mask_path example/2401075277_rle.json \
--out_dir ./output
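To process several videos, the same command can be assembled programmatically. The sketch below is one possible wrapper, not part of the repository. It assumes that videos and masks sit in one directory and follow the bundled example's naming (`<id>.mp4` next to `<id>_rle.json`), and the helper names `build_command`, `find_pairs`, and `run_all` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_command(model_path, video_path, mask_path, out_dir, max_objects=40):
    """Assemble the inference.py command line as an argument list."""
    return [
        sys.executable, "inference.py",
        "--model_path", str(model_path),
        "--video_path", str(video_path),
        "--mask_path", str(mask_path),
        "--out_dir", str(out_dir),
        "--max_objects", str(max_objects),
    ]


def find_pairs(data_dir):
    """Yield (video, mask) paths, assuming <id>.mp4 sits next to <id>_rle.json."""
    for video in sorted(Path(data_dir).glob("*.mp4")):
        mask = video.with_name(video.stem + "_rle.json")
        if mask.exists():
            yield video, mask


def run_all(model_path, data_dir, out_root):
    """Run inference once per pair, writing each result to its own directory."""
    for video, mask in find_pairs(data_dir):
        out_dir = Path(out_root) / video.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(build_command(model_path, video, mask, out_dir), check=True)


# Reproduces the quickstart command for the bundled example (nothing is executed).
cmd = build_command(".", "example/2401075277.mp4", "example/2401075277_rle.json", "./output")
print(" ".join(cmd))
```

Because the script writes a file named output.txt into the output directory, `run_all` gives each video its own subdirectory so that results are not overwritten.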
Python API
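import torch
from transformers import AutoProcessor, AutoTokenizer

# modeling_traser.py ships with the model directory; run from that directory
# or add it to PYTHONPATH so this import resolves.
from modeling_traser import TRASER

model_path = "/path/to/vsg_release_model"
device = "cuda"

# Load the TRASER weights in bfloat16 and move them to the GPU.
model = TRASER.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
# The processor comes from the base model; the tokenizer is loaded from the
# TRASER directory, which presumably carries the added special tokens.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
processor.tokenizer = AutoTokenizer.from_pretrained(model_path)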
Then follow the preprocessing steps in inference.py: load masks → build object mask tensors → select_tokens → rearrange_token → model.generate.
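The repository structure describes the select_tokens step as mask-based visual token selection with a coverage threshold. The sketch below illustrates that general idea in plain Python: a visual token (one patch of the frame) is kept for an object when enough of the patch lies inside the object's mask. It is not the repository's implementation. The function name `select_tokens_by_coverage`, the patch size, and the threshold value are assumptions chosen for illustration.

```python
def select_tokens_by_coverage(mask, patch, threshold=0.5):
    """Return (row, col) grid indices of patches whose mask coverage >= threshold.

    mask: 2D list of 0/1 values whose height and width are multiples of patch.
    patch and threshold are illustrative; TRASER's actual values are not given here.
    """
    height, width = len(mask), len(mask[0])
    if height % patch or width % patch:
        raise ValueError("mask size must be a multiple of the patch size")
    selected = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            # Count mask pixels that fall inside this patch.
            covered = sum(
                mask[y][x]
                for y in range(r, r + patch)
                for x in range(c, c + patch)
            )
            if covered / (patch * patch) >= threshold:
                selected.append((r // patch, c // patch))
    return selected


# 4x4 mask with 2x2 patches: the object fills the top-left patch,
# half of the top-right patch, and a quarter of the bottom-right patch.
mask = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
print(select_tokens_by_coverage(mask, patch=2, threshold=0.5))
```

A higher threshold keeps only patches that lie mostly inside the object, which reduces background leakage at the cost of dropping tokens along the object boundary. A real implementation would typically operate on tensors decoded from the RLE masks rather than nested lists.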
Repository Structure
├── modeling_traser.py # TRASER model class
├── inference.py # End-to-end inference script
├── config.json # Model configuration
├── generation_config.json # Default generation hyperparameters
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── tokenizer_config.json
├── vocab.json
├── merges.txt
├── added_tokens.json
├── special_tokens_map.json
├── chat_template.jinja
├── resampler_utils/
│   ├── token_selection.py # Mask-based visual token selection (coverage threshold)
│   └── token_arrangement.py # Token sequence rearrangement with OTR/TWR injection
├── qwen_vl_vsg_utils/ # Adapted Qwen-VL video processing utilities
├── static/
│   └── image.png # Architecture diagram
└── example/
    ├── 2401075277.mp4 # Example video
    └── 2401075277_rle.json # Example RLE segmentation masks
Training Data
TRASER is trained on SVG2, a large-scale automatically annotated video scene graph dataset:
- ~636K videos with dense panoptic, per-frame annotations
- ~6.6M objects · ~52M attributes · ~6.7M relations
Citation
@article{gao2026svg2,
author = {Gao, Ziqi and Zhang, Jieyu and others},
title = {Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos},
year = {2026}
}