	# VLAlert — Code & Models

	Source code for VLAlert, a vision-language driver-alerting framework that
	produces structured per-frame safety `<|BELIEF|>` tokens from dashcam video and
	maps them to three alert actions: SILENT / OBSERVE / ALERT.
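
As an illustration of the output format, the sketch below pulls belief spans out of a generated string. It assumes the open/close tag pair `<|BELIEF|>` ... `</|BELIEF|>` shown in the architecture diagram; the `extract_beliefs` helper, the sample text, and the belief wording are hypothetical and not taken from the repository.

```python
import re

# Tag pair as written in the architecture diagram, matched here as plain text.
BELIEF_RE = re.compile(r"<\|BELIEF\|>(.*?)</\|BELIEF\|>", re.DOTALL)


def extract_beliefs(generated: str) -> list[str]:
    """Return the stripped content of every belief span, in order of appearance."""
    return [span.strip() for span in BELIEF_RE.findall(generated)]


# Hypothetical model output for two frames (wording invented for illustration).
sample = (
    "[Analysis] lead vehicle braking\n"
    "[Safety Assessment] <|BELIEF|> closing gap ahead </|BELIEF|> <|ACTION|>\n"
    "[Safety Assessment] <|BELIEF|> pedestrian near curb </|BELIEF|> <|ACTION|>\n"
)
beliefs = extract_beliefs(sample)
print(beliefs)
```

The non-greedy `(.*?)` keeps each match inside one tag pair, so several per-frame spans in a single generation are returned separately.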

	This repository contains the training and evaluation code for all model
	variants. Model weights / checkpoints are not included. The benchmark data
	and experimental results are hosted separately at
	[`AsianPlayer/VLAlert-Bench`](https://huggingface.co/datasets/AsianPlayer/VLAlert-Bench).

	## Architecture

	```
	8 dashcam frames
	│
	▼
	Qwen3-VL-4B + LoRA ──► [Analysis] reasoning + [Safety Assessment]
	                       <|BELIEF|> ... </|BELIEF|> <|ACTION|> (per frame)
	│
	├─ belief span (mean-pool layers {20,24,28,32}) → z_t ∈ ℝ^10240 ─► DangerHead (14.8M)
	└─ close-tag hidden state (layer 33) → r_t ∈ ℝ^2560 ─► PolicyHead (7.0M)
	│
	a_{t-1} feedback ◄──── FSM Decoder ──► Action a_t
	```
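
The two feature streams in the diagram can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the 10240-dimensional `z_t` is read as four per-layer mean-pooled vectors of width 2560 concatenated together (4 × 2560 = 10240), the layer indices address a stacked array of hidden states, and `dual_stream_features` with its arguments is a hypothetical helper, not the repository's API.

```python
import numpy as np

HIDDEN = 2560                      # hidden width implied by r_t in R^2560
BELIEF_LAYERS = (20, 24, 28, 32)   # layers pooled for z_t
POLICY_LAYER = 33                  # layer read at the closing belief tag for r_t


def dual_stream_features(hidden_states, span_start, span_end, close_idx):
    """Build (z_t, r_t) from stacked hidden states.

    hidden_states: array of shape (num_layers, seq_len, HIDDEN).
    span_start/span_end: token range of the belief span (end exclusive).
    close_idx: token position of the closing belief tag.
    """
    # Mean over the belief-span tokens, separately for each selected layer.
    pooled = [hidden_states[layer, span_start:span_end].mean(axis=0)
              for layer in BELIEF_LAYERS]
    z_t = np.concatenate(pooled)                    # assumed: concat, 4 * 2560
    r_t = hidden_states[POLICY_LAYER, close_idx]    # single-token state
    return z_t, r_t


# Toy stand-in for VLM hidden states: 34 layers, 12 tokens.
rng = np.random.default_rng(0)
hs = rng.standard_normal((34, 12, HIDDEN)).astype(np.float32)
z_t, r_t = dual_stream_features(hs, span_start=4, span_end=9, close_idx=9)
print(z_t.shape, r_t.shape)  # (10240,) (2560,)
```

In this reading, `z_t` would feed DangerHead and `r_t` would feed PolicyHead; how the actual code indexes layers (for example, whether index 0 is the embedding output) is not specified here.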

	## Repository Structure

	```
	lkalert/
	  models/                       # model architectures
	    danger_head.py              # per-frame + clip danger regressor (PMA aggregator)
	    policy_head_v2.py           # GRU 3-class policy head (SILENT/OBSERVE/ALERT)
	    adaptive_window.py          # adaptive temporal-window selection (VLAlert-X)
	    components.py               # MultiQueryPMA aggregator, legacy heads
	    belief_vlm.py               # integrated VLM + belief/action heads
	    multichannel_belief.py      # LKAlert-MCB gated multi-channel fusion
	    lora.py                     # LoRA implementation
	  utils/, data/                 # core library

	training/
	  VLA/                          # belief-token SFT on Qwen3-VL-4B
	    train_cot_belief_v2.py      # v2 SFT (belief + action per frame)
	    train_vlalert_sft_v3.py     # v3 SFT (reasoning → belief, embedding loss option)
	    cot_belief_dataset_v2.py
	  Policy/                       # downstream head training
	    train_danger_head.py        # DangerHead (5-seed)
	    train_policy_head_v2.py     # PolicyHead (5-seed)
	    train_vlalert_x.py          # VLAlert-X adaptive-window end-to-end
	    train_head_dpo.py           # DPO preference fine-tuning
	    train_head_kto.py           # KTO fine-tuning
	    train_head_ppo.py           # PPO fine-tuning
	  SFT/                          # Qwen2.5-VL-3B monolithic SFT (VLAlert-2.5)
	  DPO/                          # preference-pair training
	  pretrain*/                    # 2-stage vision-language pretraining
	  Nexar/                        # CNN baselines (ResNet50-LSTM, R3D-18, MViT-V2-S)

	tools/
	  # data preparation
	  relabel_dada_nexar.py         # action labels via risky_time + 2s rule
	  relabel_dota_corpus.py        # BADAS-gated OBSERVE labels
	  generate_beliefs.py           # rule-based belief content
	  run_v1_gpt5_cot.py            # GPT-4o belief generation
	  build_v5_benchmark.py         # unified benchmark builder
	  # belief cache extraction
	  make_cache_x_v2.py            # dual-stream cache (belief_content + policy_position)
	  run_qwen3_cache_fast.py       # cache extraction with Conv3d→Linear patch
	  # evaluation
	  demo_compare_pipeline.py      # multi-model demo scoring + visualization
	  score_*.py, compute_daus_v6.py
	  # figures
	  render_modelarchi_v4.py, render_belief_span.py

	PATCH_conv3d_linear.md          # Conv3d→Linear acceleration (64× on Blackwell GPUs)
	requirements.txt
	```

	## The Conv3d → Linear Patch

	`PATCH_conv3d_linear.md` documents a 64× end-to-end speedup of Qwen3-VL vision
	patch embedding on Blackwell GPUs (RTX 5090). The patch replaces the degenerate
	`nn.Conv3d(kernel=stride)` patchification with a mathematically equivalent
	`nn.Linear`: because the kernel size equals the stride, patches never overlap,
	so each output vector is one matrix product with a flattened patch. This makes
	large-scale belief-cache extraction feasible (6 days → ~2 hours). Equivalence
	is proven and verified (`tools/verify_patch_embed_correctness.py`).
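
The equivalence can be checked on toy shapes without a GPU. The NumPy sketch below is not the repository's patch or its verification script; it compares a naive stride-equals-kernel 3-D convolution (no padding) against a single matrix product over flattened patches. Function names and tensor sizes are chosen for illustration only.

```python
import numpy as np


def conv3d_kernel_eq_stride(x, w, b):
    """Naive 3-D convolution with stride == kernel size and no padding.

    x: (C, T, H, W), w: (D, C, kt, kh, kw), b: (D,)
    returns: (D, T // kt, H // kh, W // kw)
    """
    D, C, kt, kh, kw = w.shape
    _, T, H, W = x.shape
    out = np.empty((D, T // kt, H // kh, W // kw))
    for i in range(T // kt):
        for j in range(H // kh):
            for k in range(W // kw):
                patch = x[:, i*kt:(i+1)*kt, j*kh:(j+1)*kh, k*kw:(k+1)*kw]
                # Contract all of (C, kt, kh, kw) against each of the D filters.
                out[:, i, j, k] = np.tensordot(w, patch, axes=4) + b
    return out


def patchify_linear(x, w, b):
    """Same result via one matrix product over non-overlapping patches."""
    D, C, kt, kh, kw = w.shape
    _, T, H, W = x.shape
    nt, nh, nw = T // kt, H // kh, W // kw
    # Split each axis into (blocks, within-block), then bring block axes first.
    patches = x.reshape(C, nt, kt, nh, kh, nw, kw).transpose(1, 3, 5, 0, 2, 4, 6)
    patches = patches.reshape(nt * nh * nw, C * kt * kh * kw)
    y = patches @ w.reshape(D, -1).T + b          # the "Linear" layer
    return y.reshape(nt, nh, nw, D).transpose(3, 0, 1, 2)


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4, 8, 8))        # toy video: C=3, T=4, H=W=8
w = rng.standard_normal((16, 3, 2, 4, 4))    # 16 filters, kernel (2, 4, 4)
b = rng.standard_normal(16)

ref = conv3d_kernel_eq_stride(x, w, b)
fast = patchify_linear(x, w, b)
print(np.allclose(ref, fast))
```

The flattening order of each patch, `(C, kt, kh, kw)`, has to match the order used when the convolution weight is reshaped into the linear weight; otherwise the two paths disagree.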

	## Reproduction

	1. Prepare benchmark annotations from
	[`AsianPlayer/VLAlert-Bench`](https://huggingface.co/datasets/AsianPlayer/VLAlert-Bench).
	2. Stage 1 (SFT): `training/VLA/train_vlalert_sft_v3.py`
	3. Stage 2 (cache extraction): `tools/make_cache_x_v2.py`
	4. Stage 3 (heads): `training/Policy/train_danger_head.py`, `training/Policy/train_policy_head_v2.py`
	5. Evaluation: `tools/score_*.py`, `tools/compute_daus_v6.py`

	Paths in scripts use `PROJECT_ROOT` as a placeholder for the repository root.
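
One way to turn such placeholder paths into real ones is sketched below. How the scripts themselves consume the placeholder is not specified here, so the `resolve_project_path` helper, the `VLALERT_ROOT` environment variable, and the `/opt/vlalert` checkout location are all hypothetical.

```python
import os
from pathlib import Path

PLACEHOLDER = "PROJECT_ROOT"


def resolve_project_path(raw: str, root=None) -> Path:
    """Swap a leading PROJECT_ROOT component for a real directory.

    The root comes from the `root` argument, else the (hypothetical)
    VLALERT_ROOT environment variable, else the current directory.
    """
    base = Path(root or os.environ.get("VLALERT_ROOT", Path.cwd()))
    parts = Path(raw).parts
    if parts and parts[0] == PLACEHOLDER:
        return base.joinpath(*parts[1:])
    return Path(raw)  # no placeholder: leave the path untouched


print(resolve_project_path("PROJECT_ROOT/tools/make_cache_x_v2.py", root="/opt/vlalert"))
```

Paths that do not start with the placeholder are returned unchanged, so the helper is safe to apply to every path in a config.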

	## License

	Code released for research review. The benchmark builds on Nexar, DADA-2000,
	DoTA, and DAD source datasets; see the dataset repository for source licenses
	and citations.