# VehicleDINO
Unified multi-task vehicle recognition model: detection, type classification, make/model identification, re-identification, and license plate OCR in a single forward pass.
## Architecture
- Backbone: DINOv2 ViT-B/14 (frozen, with LoRA adapters)
- Neck: SimpleFPN (768 -> 256) + HybridEncoder (AIFI + CCFM)
- Decoder: RT-DETR-style with 300 detection queries + 1 global attribute query
- Heads: 6 task-specific heads (det, type, make, model, Re-ID, OCR)
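The backbone stays frozen; only the low-rank LoRA adapters are trained. A minimal NumPy sketch of the LoRA idea applied to one linear layer (the rank, scaling, and shapes here are illustrative, not the model's actual configuration):

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=16):
    """Frozen weight W plus a scaled trainable low-rank update B @ A."""
    rank = A.shape[0]
    # Base projection (frozen) + low-rank correction (trainable)
    return x @ W.T + (x @ A.T @ B.T) * (alpha / rank)

rng = np.random.default_rng(0)
d = 768                       # ViT-B hidden size
W = rng.normal(size=(d, d))   # frozen pretrained weight
A = rng.normal(size=(8, d))   # trainable down-projection (rank 8, illustrative)
B = np.zeros((d, 8))          # trainable up-projection, zero-initialized

x = rng.normal(size=(1, d))
y = lora_linear(x, W, A, B)
# With B initialized to zero, the adapter starts as a no-op:
assert np.allclose(y, x @ W.T)
```

Because `B` starts at zero, training begins from the frozen backbone's exact behavior and only gradually deviates.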
## Model Variants
| File | Format | Size | Notes |
|---|---|---|---|
| `vehicledino_dinov2.onnx` | FP32 | 450 MB | Full precision |
| `vehicledino_dinov2_int8.onnx` | INT8 | 139 MB | Quantized, 3.2x smaller |
## Input / Output
Input: `images`, a float32 tensor of shape (1, 3, 560, 560), ImageNet-normalized RGB.
Outputs:

| Tensor | Shape | Description |
|---|---|---|
| `det_boxes` | (1, 300, 4) | Detection boxes (cx, cy, w, h, normalized) |
| `det_classes` | (1, 300, 5) | Detection class logits (car, suv, truck, bus, van) |
| `vehicle_types` | (1, 1, 8) | Vehicle type logits |
| `makes` | (1, 1, 42) | Make classification logits |
| `models` | (1, 1, 323) | Model classification logits |
| `reid_embeds` | (1, 1, 256) | L2-normalized Re-ID embedding |
| `ocr_logits` | (1, 1, 8, 37) | License plate OCR logits (8 positions, 37 characters) |
## Performance (Test Set)
| Task | Metric | Score |
|---|---|---|
| Type Classification | Top-1 Accuracy | 95.6% |
| Make Classification | Top-1 Accuracy | 98.4% |
| Model Classification | Top-1 Accuracy | 87.7% |
| Re-ID (VeRi-776) | mAP | 61.1% |
| Re-ID (VeRi-776) | Rank-1 | 86.1% |
## Training Data
- Detection + Type + Re-ID: VeRi-776 (776 vehicles, 49,360 images)
- Make/Model: CompCars (42 makes, 323 models)
- OCR: CCPD-Green (Chinese license plates)
## Usage with ONNX Runtime (Python)
```python
import onnxruntime as ort
import numpy as np
from PIL import Image

session = ort.InferenceSession("vehicledino_dinov2_int8.onnx")

# Preprocess: force RGB, resize to 560x560, scale to [0, 1], ImageNet normalize
img = Image.open("car.jpg").convert("RGB").resize((560, 560))
arr = np.asarray(img, dtype=np.float32) / 255.0
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
arr = (arr - mean) / std

# HWC -> CHW, add batch dimension: (1, 3, 560, 560)
tensor = arr.transpose(2, 0, 1)[np.newaxis]

outputs = session.run(None, {"images": tensor})
```
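The raw outputs need light post-processing. A hedged sketch of decoding, assuming the output order matches the table above, a 0.5 score threshold, and a 36-character alphanumeric OCR charset with index 36 as a blank token (the model's actual charset and threshold are not specified here). Random arrays stand in for `outputs`:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Stand-ins shaped like the model outputs (replace with session.run results)
rng = np.random.default_rng(0)
det_boxes = rng.uniform(size=(1, 300, 4))    # (cx, cy, w, h), normalized
det_classes = rng.normal(size=(1, 300, 5))   # car, suv, truck, bus, van
ocr_logits = rng.normal(size=(1, 1, 8, 37))  # 8 positions, 37 classes

# Keep detections whose best class probability clears a threshold
probs = softmax(det_classes[0])              # (300, 5)
scores = probs.max(axis=1)
keep = scores > 0.5
boxes_cxcywh = det_boxes[0][keep]

# Convert (cx, cy, w, h) to (x1, y1, x2, y2), still normalized to [0, 1]
cx, cy, w, h = boxes_cxcywh.T
boxes_xyxy = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)

# Greedy per-position OCR decode; charset order is an assumption
CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
idx = ocr_logits[0, 0].argmax(axis=-1)       # (8,)
plate = "".join(CHARSET[i] for i in idx if i < 36)
```

Multiply `boxes_xyxy` by the original image width/height to get pixel coordinates.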
## Usage in Browser (ONNX Runtime Web)
The INT8 model runs in the browser via ONNX Runtime Web with WebGPU or WASM backend.
Live demo: https://yolov11-plate-recognition.swmengappdev.workers.dev
## Citation
```bibtex
@article{vehicledino2026,
  title={VehicleDINO: Unified Multi-Task Vehicle Recognition via DINOv2 Features},
  author={Soh, Wei Meng},
  year={2026}
}
```
## License
Apache 2.0