Event Graph Generation — ViT-G

Model Overview

This model predicts structured event graphs from video. It outputs who did what, from where, and to where in a video as structured JSON.

Rather than generating text, the model emits the event graph directly through DETR-style set prediction with Hungarian matching.
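As a rough illustration of the matching step in set prediction, the sketch below finds the minimum-cost one-to-one assignment between ground-truth events and event queries. It is not the repository's implementation: the function name `match_events` and the toy cost values are made up for this example, and the search is a brute force over permutations that yields the same optimum the Hungarian algorithm would on a small problem. At realistic sizes (M=20 queries) a proper solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations


def match_events(cost):
    """Return (pairs, total) for the cheapest one-to-one assignment.

    cost[g][q] is the cost of assigning ground-truth event g to query q.
    Assumes there are at least as many queries as ground-truth events.
    Brute force over permutations: fine for a toy example only.
    """
    num_gt = len(cost)
    num_queries = len(cost[0]) if cost else 0
    best_total, best_queries = float("inf"), ()
    for queries in permutations(range(num_queries), num_gt):
        total = sum(cost[g][q] for g, q in enumerate(queries))
        if total < best_total:
            best_total, best_queries = total, queries
    return list(enumerate(best_queries)), best_total


# Toy example (hypothetical costs): 2 ground-truth events, 3 event queries.
cost = [
    [0.9, 0.1, 0.5],
    [0.2, 0.3, 0.8],
]
pairs, total = match_events(cost)
print(pairs)  # list of (ground_truth_index, query_index) pairs
```

In DETR-style training, queries left unmatched (query 2 in this toy case) are typically supervised as "no event"; whether this model does so through the interaction head alone is not stated in this card.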

Base Model

V-JEPA 2.1 ViT-G (vjepa2_1_vit_giant_384) serves as the backbone for video feature extraction. V-JEPA is a self-supervised video representation learning model developed by Meta that outputs spatiotemporal tokens.

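This model (the Event Decoder part) takes the V-JEPA output tokens as input and predicts the event graph in two stages: Object Pooling, then the Event Decoder. The V-JEPA weights themselves are not included in this checkpoint (they are loaded separately from PyTorch Hub).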

Model Details

Item	Value
Parameters (Event Decoder)	15.1M
V-JEPA backbone	`vjepa2_1_vit_giant_384`
V-JEPA hidden_size	1408
Object Pooling	Slot Attention (K=24 slots, 3 iterations)
Event Decoder	DETR-style cross-attention (M=20 event queries)
d_model	256
Action classes	13
Best epoch	41

Architecture

Video → V-JEPA 2.1 ViT-G → spatiotemporal tokens (B, S, 1408)
  → ObjectPoolingModule (Slot Attention, K=24 slots)
    → ObjectRepresentation (identity, trajectory, existence, categories)
      → VJEPAEventDecoder (M=20 event queries, cross-attention)
        → 7 Prediction Heads → EventGraph JSON

Prediction Heads

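Head	Shape	Description
interaction	(M, 1)	Whether the event is valid (BCE)
action	(M, 13)	Action classification (13 classes)
agent_ptr	(M, K)	Pointer to the agent slot
target_ptr	(M, K)	Pointer to the target slot
source_ptr	(M, K+1)	Pointer to the source the object is taken from (last index = "none")
dest_ptr	(M, K+1)	Pointer to the destination the object is placed in (last index = "none")
frame	(M, T)	Frame at which the event occurs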
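To make the head shapes concrete, here is a minimal sketch of decoding the outputs of a single event query into a record. The function `decode_event`, the output dictionary keys, and the zero-logit threshold (sigmoid of 0.5) are assumptions for illustration and are not taken from the repository. The only behavior drawn from the table above is that `source_ptr` and `dest_ptr` have K+1 entries, with the last index meaning "none". The toy call uses K=3 slots, T=4 frames, and a shortened action list.

```python
def argmax(values):
    """Index of the largest value in a non-empty sequence."""
    return max(range(len(values)), key=values.__getitem__)


def decode_event(interaction_logit, action_logits, agent_logits, target_logits,
                 source_logits, dest_logits, frame_logits, action_names):
    """Decode one query's head outputs; return None for an inactive query."""
    if interaction_logit <= 0.0:  # assumed threshold: sigmoid(logit) <= 0.5
        return None
    num_slots = len(agent_logits)  # K

    def optional_slot(logits):
        # Heads with K+1 entries reserve the last index for "none".
        idx = argmax(logits)
        return None if idx == num_slots else idx

    return {  # hypothetical field names, not the repository's JSON schema
        "action": action_names[argmax(action_logits)],
        "agent": argmax(agent_logits),
        "target": argmax(target_logits),
        "source": optional_slot(source_logits),
        "destination": optional_slot(dest_logits),
        "frame": argmax(frame_logits),
    }


ACTIONS = ["take_out", "put_in", "no_event"]  # shortened list for the example
event = decode_event(
    interaction_logit=2.0,
    action_logits=[1.5, 0.2, -1.0],
    agent_logits=[0.1, 3.0, 0.0],
    target_logits=[2.0, 0.0, 0.5],
    source_logits=[0.0, 0.1, 4.0, 0.2],   # K+1 entries, slot 2 wins
    dest_logits=[0.0, 0.1, 0.2, 5.0],     # K+1 entries, "none" wins
    frame_logits=[0.0, 0.3, 2.0, 0.1],
    action_names=ACTIONS,
)
print(event)
```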

Training

Fine-tuning

Fine-tuning: yes (the V-JEPA backbone is frozen; only the Event Decoder part is trained)
Training method: full fine-tuning (the entire Event Decoder is trained; LoRA and similar methods are not used)
Optimizer: adamw (lr=0.0001, weight_decay=0.0001)
Scheduler: cosine_warmup (warmup 5 epochs)
Early stopping: patience=15
AMP: bfloat16 mixed precision
Loss function: set prediction loss with Hungarian matching
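The scheduler setting above can be pictured with the following sketch of a warmup-plus-cosine learning-rate curve. Only `lr=0.0001` and the 5 warmup epochs come from this card. The linear warmup shape, the decay to zero, the total of 100 epochs, and the function name `lr_at_epoch` are assumptions, so the repository's `cosine_warmup` may differ in detail.

```python
import math


def lr_at_epoch(epoch, total_epochs, base_lr=1e-4, warmup_epochs=5):
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine decay.

    Assumed shape: ramps up to base_lr over warmup_epochs, then follows half
    a cosine period down to 0 at total_epochs.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical 100-epoch run; print a few points along the curve.
for epoch in (0, 4, 5, 41, 99):
    print(epoch, lr_at_epoch(epoch, total_epochs=100))
```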

Loss weights:

Component	Weight
interaction	2.0
action	1.0
agent_ptr	1.0
target_ptr	1.0
source_ptr	0.5
dest_ptr	0.5
frame	0.5
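Assuming the per-head losses are combined as a plain weighted sum (the usual choice for DETR-style objectives, though not stated explicitly here), the weights above would be applied as in this sketch. The helper `total_loss` and the example loss values are hypothetical.

```python
LOSS_WEIGHTS = {
    "interaction": 2.0,
    "action": 1.0,
    "agent_ptr": 1.0,
    "target_ptr": 1.0,
    "source_ptr": 0.5,
    "dest_ptr": 0.5,
    "frame": 0.5,
}


def total_loss(component_losses, weights=LOSS_WEIGHTS):
    """Weighted sum of per-head losses computed on matched query/event pairs."""
    missing = set(weights) - set(component_losses)
    if missing:
        raise KeyError(f"missing loss components: {sorted(missing)}")
    return sum(weights[name] * component_losses[name] for name in weights)


# Made-up per-head loss values for one training step.
example = {
    "interaction": 0.4, "action": 1.2, "agent_ptr": 0.8, "target_ptr": 0.9,
    "source_ptr": 0.6, "dest_ptr": 0.7, "frame": 1.5,
}
example_total = total_loss(example)
print(example_total)
```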

Training Data

Data: recorded videos of indoor environments (desk, kitchen, room)
Annotation: synthetic annotations generated by the Qwen 3.5 VLM (no human labels)
Frame rate: sampled at 1 FPS, 16 frames per clip, 50% overlap
Object categories: 28 categories + unknown (person, hand, chair, desk, laptop, monitor, etc.)
Action vocabulary: 13 classes (take_out, put_in, place_on, pick_up, hand_over, open, close, use, move, attach, detach, inspect, no_event)
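The clip settings above (16 frames per clip, 50% overlap, frames sampled at 1 FPS) imply a stride of 8 frames between clips. The sketch below computes the resulting windows for a video of a given length. The function name `clip_windows` and the handling of short videos and of the final partial window are assumptions, not the repository's behavior.

```python
def clip_windows(num_frames, clip_len=16, overlap=0.5):
    """Return (start, end) frame ranges (end exclusive) for sliding clips.

    num_frames counts frames after 1 FPS sampling, so it roughly equals the
    video length in seconds.
    """
    stride = max(1, int(clip_len * (1.0 - overlap)))
    if num_frames <= clip_len:
        return [(0, num_frames)]  # assumed: a short video yields one clip
    starts = list(range(0, num_frames - clip_len + 1, stride))
    if starts[-1] + clip_len < num_frames:
        # Assumed tail handling: add one last clip aligned to the video end.
        starts.append(num_frames - clip_len)
    return [(s, s + clip_len) for s in starts]


print(clip_windows(40))
# → [(0, 16), (8, 24), (16, 32), (24, 40)]
```

The number of clips grows linearly with video length, which is where the linear cost of sliding-window inference on long videos comes from.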

Intended Use

Intended Use Cases

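Automatic extraction of work events from videos of manufacturing and assembly tasks
Structured recording of indoor activity (for example, who placed what where)
Automatic generation of structured annotations for video understanding research
Activity log generation in IoT / smart home environments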

Out-of-Scope Use

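Use for surveillance or tracking of individuals
High-accuracy use in domains not covered by the training data, such as outdoor, traffic, or medical settings
Systems that require real-time processing (inference with the V-JEPA backbone is expensive)
Use as a basis for security or legal decisions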

Evaluation

The model is evaluated with the following metrics:

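Metric	Description
event_detection_mAP	Mean average precision of event detection
action_accuracy	Action classification accuracy
pointer_accuracy	Accuracy of the agent/target pointers
frame_mae	Mean absolute error of the predicted event frame
graph_f1	F1 score over the whole EventGraph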

Note: this model is trained on VLM-generated synthetic annotations, and benchmark scores against human annotations have not been measured.
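The exact metric definitions live in the repository and are not reproduced in this card. As one plausible reading, the sketch below computes a frame MAE over already matched event pairs and an F1 score over sets of event tuples. The function names, the `(action, agent, target)` tuple layout, and the exact-match criterion are all assumptions.

```python
def frame_mae(pred_frames, true_frames):
    """Mean absolute frame error over matched (prediction, reference) pairs."""
    if len(pred_frames) != len(true_frames) or not pred_frames:
        raise ValueError("need two non-empty lists of equal length")
    errors = [abs(p - t) for p, t in zip(pred_frames, true_frames)]
    return sum(errors) / len(errors)


def set_f1(predicted, reference):
    """F1 between two collections of hashable event tuples (exact match)."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # assumed convention when both graphs are empty
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Made-up events as (action, agent_slot, target_slot) tuples.
pred = [("pick_up", 0, 3), ("place_on", 0, 3), ("open", 0, 5)]
ref = [("pick_up", 0, 3), ("place_on", 0, 4)]
print(frame_mae([3, 7, 10], [2, 7, 12]))
print(set_f1(pred, ref))
```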

Inference

End-to-End Inference (Recommended)

# Clone the repository
git clone https://github.com/ChanYu1224/event-graph-generation.git
cd event-graph-generation
uv sync

# Run inference
uv run python scripts/6_run_inference.py \
  --video your_video.mp4 \
  --checkpoint path/to/model.pt \
  --config configs/vjepa_training.yaml \
  --vjepa-config configs/vjepa.yaml \
  --output output/event_graph.json

Python API

import torch
from huggingface_hub import hf_hub_download

# Download model
model_path = hf_hub_download(repo_id="Yuchn/event-graph-vitg", filename="model.pt")
config_path = hf_hub_download(repo_id="Yuchn/event-graph-vitg", filename="config.yaml")

# Build model
from event_graph_generation.config import Config
from event_graph_generation.models.base import build_model

config = Config.from_yaml(config_path)
model = build_model(config.model, vjepa_config=config.vjepa)

state_dict = torch.load(model_path, map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

# Forward pass (vjepa_tokens: pre-extracted V-JEPA features)
# vjepa_tokens shape: (batch_size, num_tokens, hidden_size)
with torch.no_grad():
    obj_repr, predictions = model(vjepa_tokens)

Limitations

Bias

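The training data is limited to indoor environments (desk, kitchen, room); accuracy in outdoor or factory environments has not been verified
Because the model depends on VLM-generated synthetic annotations, it may inherit the biases of Qwen 3.5
Object categories are limited to 28 types; objects from unseen categories are treated as "unknown"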

Limitations

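The V-JEPA backbone is frozen, so adaptation to domain-specific video representations is limited
The action vocabulary is limited to 13 classes; actions outside the vocabulary cannot be detected
Because frames are sampled at 1 FPS, fast events shorter than one second may be missed
Inference requires a CUDA-capable GPU (V-JEPA backbone + Event Decoder)
Long videos are processed with a sliding window, so memory usage and processing time grow linearly with video length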

License

MIT License

Citation

@software{event_graph_generation_2026,
  title = {Event Graph Generation: Structured Event Prediction from Video},
  author = {Yuchn},
  year = {2026},
  url = {https://github.com/ChanYu1224/event-graph-generation},
  license = {MIT}
}
