| # VLAlert: Code & Models |
|
|
| Source code for **VLAlert**, a vision-language driver-alerting framework that |
| produces structured per-frame safety `<|BELIEF|>` tokens from dashcam video and |
| maps them to three alert actions: **SILENT / OBSERVE / ALERT**. |
|
|
| This repository contains the **training and evaluation code** for all model |
| variants. Model weights / checkpoints are **not** included. The benchmark data |
| and experimental results are hosted separately at |
| [`AsianPlayer/VLAlert-Bench`](https://huggingface.co/datasets/AsianPlayer/VLAlert-Bench). |
|
|
| ## Architecture |
|
|
| ``` |
| 8 dashcam frames |
|        │ |
|        ▼ |
| Qwen3-VL-4B + LoRA ──► [Analysis] reasoning + [Safety Assessment] |
|                        <|BELIEF|> ... </|BELIEF|> <|ACTION|> (per frame) |
|        │ |
|        ├─ belief span (mean-pool layers {20,24,28,32}) → z_t ∈ ℝ^10240 ─► DangerHead (14.8M) |
|        └─ close-tag hidden state (layer 33) → r_t ∈ ℝ^2560 ─► PolicyHead (7.0M) |
|        │ |
| a_{t-1} feedback ───── FSM Decoder ──► Action a_t |
| ``` |
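The two feature streams in the diagram can be sketched as follows. This is a minimal NumPy illustration, not the repository's implementation: it assumes that "mean-pool" means averaging over the tokens of the belief span within each selected layer and then concatenating the four pooled vectors, which is the reading consistent with 4 × 2560 = 10240. The dictionary of hidden states, the span indices, and the function name `pool_belief_span` are hypothetical stand-ins.

```python
import numpy as np

HIDDEN = 2560                      # width implied by r_t ∈ ℝ^2560 in the diagram
BELIEF_LAYERS = (20, 24, 28, 32)   # layers named in the diagram
POLICY_LAYER = 33                  # layer read at the closing belief tag


def pool_belief_span(hidden_states, start, end):
    """Mean-pool tokens [start, end) in each selected layer, then concatenate.

    hidden_states: mapping layer index -> array of shape (seq_len, HIDDEN).
    Returns a vector of shape (len(BELIEF_LAYERS) * HIDDEN,).
    """
    pooled = [hidden_states[layer][start:end].mean(axis=0) for layer in BELIEF_LAYERS]
    return np.concatenate(pooled)


# Hypothetical stand-in for one frame's per-layer hidden states.
rng = np.random.default_rng(0)
seq_len = 16
hidden_states = {
    layer: rng.standard_normal((seq_len, HIDDEN))
    for layer in (*BELIEF_LAYERS, POLICY_LAYER)
}

start, end = 3, 9   # assumed token positions of the belief span content
close_idx = 9       # assumed position of the closing belief tag

z_t = pool_belief_span(hidden_states, start, end)   # input to DangerHead
r_t = hidden_states[POLICY_LAYER][close_idx]        # input to PolicyHead

print(z_t.shape, r_t.shape)
```

Keeping the two streams separate means the danger regressor sees content pooled across several depths, while the policy head reads a single late-layer state at the point where the belief has been fully emitted.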
|
|
| ## Repository Structure |
|
|
| ``` |
| lkalert/ |
|   models/                        # model architectures |
|     danger_head.py               # per-frame + clip danger regressor (PMA aggregator) |
|     policy_head_v2.py            # GRU 3-class policy head (SILENT/OBSERVE/ALERT) |
|     adaptive_window.py           # adaptive temporal-window selection (VLAlert-X) |
|     components.py                # MultiQueryPMA aggregator, legacy heads |
|     belief_vlm.py                # integrated VLM + belief/action heads |
|     multichannel_belief.py       # LKAlert-MCB gated multi-channel fusion |
|     lora.py                      # LoRA implementation |
|   utils/, data/                  # core library |
| |
| training/ |
|   VLA/                           # belief-token SFT on Qwen3-VL-4B |
|     train_cot_belief_v2.py       # v2 SFT (belief + action per frame) |
|     train_vlalert_sft_v3.py      # v3 SFT (reasoning → belief, embedding loss option) |
|     cot_belief_dataset_v2.py |
|   Policy/                        # downstream head training |
|     train_danger_head.py         # DangerHead (5-seed) |
|     train_policy_head_v2.py      # PolicyHead (5-seed) |
|     train_vlalert_x.py           # VLAlert-X adaptive-window end-to-end |
|     train_head_dpo.py            # DPO preference fine-tuning |
|     train_head_kto.py            # KTO fine-tuning |
|     train_head_ppo.py            # PPO fine-tuning |
|   SFT/                           # Qwen2.5-VL-3B monolithic SFT (VLAlert-2.5) |
|   DPO/                           # preference-pair training |
|   pretrain*/                     # 2-stage vision-language pretraining |
|   Nexar/                         # CNN baselines (ResNet50-LSTM, R3D-18, MViT-V2-S) |
| |
| tools/ |
|   # data preparation |
|   relabel_dada_nexar.py          # action labels via risky_time + 2s rule |
|   relabel_dota_corpus.py         # BADAS-gated OBSERVE labels |
|   generate_beliefs.py            # rule-based belief content |
|   run_v1_gpt5_cot.py             # GPT-4o belief generation |
|   build_v5_benchmark.py          # unified benchmark builder |
|   # belief cache extraction |
|   make_cache_x_v2.py             # dual-stream cache (belief_content + policy_position) |
|   run_qwen3_cache_fast.py        # cache extraction with Conv3d→Linear patch |
|   # evaluation |
|   demo_compare_pipeline.py       # multi-model demo scoring + visualization |
|   score_*.py, compute_daus_v6.py |
|   # figures |
|   render_modelarchi_v4.py, render_belief_span.py |
| |
| PATCH_conv3d_linear.md           # Conv3d→Linear acceleration (64× on Blackwell GPUs) |
| requirements.txt |
| ``` |
|
|
| ## The Conv3d → Linear Patch |
|
|
| `PATCH_conv3d_linear.md` documents a 64× end-to-end speedup of Qwen3-VL vision |
| patch embedding on Blackwell GPUs (RTX 5090), obtained by replacing the degenerate |
| `nn.Conv3d(kernel=stride)` patchification with a mathematically equivalent |
| `nn.Linear`. Because the kernel size equals the stride, patches never overlap, |
| so each output vector is a single matrix product over one flattened patch. |
| This makes large-scale belief-cache extraction feasible |
| (6 days → ~2 hours). Equivalence is proven and verified |
| (`tools/verify_patch_embed_correctness.py`). |
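The equivalence claimed above can be checked numerically on a toy example. The following is a framework-agnostic NumPy sketch, not the repository's patch or its verification script: it implements a naive convolution whose stride equals its kernel size, then reproduces the same output with one reshape and one matrix multiply. The function names and the small tensor sizes are assumptions chosen for illustration.

```python
import numpy as np


def conv3d_kernel_eq_stride(x, weight, bias):
    """Naive 3D convolution with stride == kernel size (non-overlapping patches).

    x: (C, T, H, W); weight: (D, C, kt, kh, kw); bias: (D,)
    Returns an array of shape (D, T // kt, H // kh, W // kw).
    """
    D, C, kt, kh, kw = weight.shape
    _, T, H, W = x.shape
    out = np.empty((D, T // kt, H // kh, W // kw))
    for t in range(T // kt):
        for i in range(H // kh):
            for j in range(W // kw):
                patch = x[:, t*kt:(t+1)*kt, i*kh:(i+1)*kh, j*kw:(j+1)*kw]
                # Contract all four patch axes against the kernel: one dot product per filter.
                out[:, t, i, j] = np.tensordot(weight, patch, axes=4) + bias
    return out


def patchify_linear(x, weight, bias):
    """Same result via flatten-then-matmul, i.e. what a linear layer computes."""
    D, C, kt, kh, kw = weight.shape
    _, T, H, W = x.shape
    nt, nh, nw = T // kt, H // kh, W // kw
    # Split each axis into (patch index, offset in patch), then move offsets last.
    patches = x.reshape(C, nt, kt, nh, kh, nw, kw).transpose(1, 3, 5, 0, 2, 4, 6)
    patches = patches.reshape(nt * nh * nw, C * kt * kh * kw)
    w = weight.reshape(D, -1)        # flatten order (C, kt, kh, kw) matches the patches
    y = patches @ w.T + bias         # (num_patches, D)
    return y.reshape(nt, nh, nw, D).transpose(3, 0, 1, 2)


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4, 8, 8))          # toy input: C=3, T=4, H=W=8
weight = rng.standard_normal((5, 3, 2, 4, 4))  # D=5 filters, kernel (2, 4, 4)
bias = rng.standard_normal(5)

conv_out = conv3d_kernel_eq_stride(x, weight, bias)
linear_out = patchify_linear(x, weight, bias)
print(np.allclose(conv_out, linear_out))
```

The only thing that has to line up is the flatten order: the patch is flattened as (channel, time, height, width), which is the same order in which the convolution kernel is reshaped into the linear weight matrix.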
|
|
| ## Reproduction |
|
|
| 1. Prepare benchmark annotations from |
| [`AsianPlayer/VLAlert-Bench`](https://huggingface.co/datasets/AsianPlayer/VLAlert-Bench). |
| 2. **Stage 1: SFT**: `training/VLA/train_vlalert_sft_v3.py` |
| 3. **Stage 2: cache extraction**: `tools/make_cache_x_v2.py` |
| 4. **Stage 3: heads**: `training/Policy/train_danger_head.py`, `training/Policy/train_policy_head_v2.py` |
| 5. **Evaluation**: `tools/score_*.py`, `tools/compute_daus_v6.py` |
|
|
| Paths in scripts use `PROJECT_ROOT` as a placeholder for the repository root. |
|
|
| ## License |
|
|
| Code released for research review. The benchmark builds on Nexar, DADA-2000, |
| DoTA, and DAD source datasets; see the dataset repository for source licenses |
| and citations. |
|
|