aircop-8b — Qwen3-VL-8B fine-tuned on AirCopBench

LoRA adapter for Qwen/Qwen3-VL-8B-Instruct, supervised fine-tuned on the training split of AirCopBench, a multi-UAV collaborative aerial perception VQA benchmark.

Paper: https://arxiv.org/pdf/2511.11025

Task

Each sample shows the same scene, captured at the same moment by 2–6 UAV cameras from different viewpoints, and poses one 4-way multiple-choice question (object grounding, counting, matching, causal/collaboration assessment, etc.). The model answers with a single option letter.
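
As an illustration of the input format, the sketch below builds the chat message for an arbitrary number of views. The helper name `build_messages` is hypothetical, and the layout (a `UAVn:` label before each image slot, then the question and lettered options) is assumed to extend the two-view example in the Usage section; the card does not state the exact training prompt.

```python
def build_messages(num_views, question, options):
    """Build a single-turn chat message with one labelled image slot per UAV view."""
    if not 2 <= num_views <= 6:
        raise ValueError("expected 2 to 6 UAV views")
    if len(options) != 4:
        raise ValueError("expected exactly 4 options")

    content = []
    for i in range(1, num_views + 1):
        # Each view gets a text label followed by an empty image slot.
        content.append({"type": "text", "text": f"UAV{i}:"})
        content.append({"type": "image"})

    option_lines = "\n".join(
        f"{letter}. {text}" for letter, text in zip("ABCD", options)
    )
    prompt = (
        f"Question: {question}\nOptions:\n{option_lines}\n"
        "Answer with only the letter."
    )
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]
```

The images themselves are passed separately to the processor, in the same order as the slots.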

Results (AirCopBench test, 1025 questions)

Subset               Accuracy
Overall              0.7532 (772/1025)
Real2 (2 real UAVs)  0.6099
Sim3 (3 sim UAVs)    0.8206
Sim5 (5 sim UAVs)    0.7415
Sim6 (6 sim UAVs)    0.7405

Parse failures: 0.
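
The card does not describe the evaluation script, so the following is only a sketch of how per-subset accuracy and parse failures could be computed. The names `parse_choice` and `score`, the strict single-letter pattern, and the `(subset, generated_text, gold_letter)` record layout are all assumptions.

```python
import re
from collections import defaultdict

# Accept a lone option letter, optionally wrapped as "(A)" or followed by "." or ":".
CHOICE_RE = re.compile(r"\s*\(?([A-D])\)?[.:]?\s*")


def parse_choice(generated):
    """Return the option letter, or None if the output is not a single letter."""
    match = CHOICE_RE.fullmatch(generated)
    return match.group(1) if match else None


def score(records):
    """records: iterable of (subset, generated_text, gold_letter) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    parse_failures = 0
    for subset, generated, gold in records:
        choice = parse_choice(generated)
        if choice is None:
            parse_failures += 1  # counted as wrong and reported separately
        for key in (subset, "Overall"):
            total[key] += 1
            correct[key] += int(choice == gold)
    accuracy = {key: correct[key] / total[key] for key in total}
    return accuracy, parse_failures
```

Under this scheme an unparseable output counts as an incorrect answer, so a parse-failure count of 0 means every reported error is a genuinely wrong choice.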

Training

  • Method: LoRA SFT (rank 16, lora_target: all), 1 epoch, bf16, flash-attn 2
  • Effective batch size 16 (per-device 8 × grad-accum 2), lr 1e-4 cosine, image_max_pixels 262144
  • Framework: LLaMA-Factory, template qwen3_vl_nothink
  • ~12.7k multi-image samples (Real2 / Sim3 / Sim5 / Sim6)
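
To make the arithmetic behind these settings explicit, the sketch below collects them in a dictionary and derives a few quantities. The key names follow LLaMA-Factory-style training arguments but are an assumption, since the card does not include the actual config file; the single-device setup is also assumed, because 8 × 2 = 16 only holds for one device.

```python
import math

# Hyperparameters as stated above; key names are assumed, not taken from the card.
config = {
    "model_name_or_path": "Qwen/Qwen3-VL-8B-Instruct",
    "stage": "sft",
    "finetuning_type": "lora",
    "lora_rank": 16,
    "lora_target": "all",
    "template": "qwen3_vl_nothink",
    "num_train_epochs": 1.0,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "learning_rate": 1e-4,
    "lr_scheduler_type": "cosine",
    "bf16": True,
    "flash_attn": "fa2",
    "image_max_pixels": 262144,
}

num_devices = 1       # assumption: single device
num_samples = 12_700  # approximate, from "~12.7k"

effective_batch = (
    config["per_device_train_batch_size"]
    * config["gradient_accumulation_steps"]
    * num_devices
)
steps_per_epoch = math.ceil(num_samples / effective_batch)
# Side length of a square image that exactly fills the pixel budget.
square_side = math.isqrt(config["image_max_pixels"])

print(effective_batch, steps_per_epoch, square_side)
```

With these assumptions one epoch is roughly 794 optimizer steps, and the pixel cap corresponds to a 512 × 512 image per view.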

Usage

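import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel

base = "Qwen/Qwen3-VL-8B-Instruct"
# Load the base model in bf16, then attach the LoRA adapter.
model = AutoModelForImageTextToText.from_pretrained(base, dtype=torch.bfloat16, device_map="cuda")
model = PeftModel.from_pretrained(model, "EasonFan/aircop-8b")
processor = AutoProcessor.from_pretrained(base)

# One {"type": "image"} slot per UAV view, each preceded by its label.
messages = [{"role": "user", "content": [
    {"type": "text", "text": "UAV1:"}, {"type": "image"},
    {"type": "text", "text": "UAV2:"}, {"type": "image"},
    {"type": "text", "text": "Question: ...\nOptions:\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer with only the letter."},
]}]

# Placeholder paths (assumption): supply one image per slot, in the same order.
images = [Image.open("uav1.jpg"), Image.open("uav2.jpg")]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=images, return_tensors="pt").to(model.device)

# Greedy decoding; a few new tokens are enough for a single option letter.
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=8, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
answer = processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(answer.strip())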