---
datasets:
- letxbe/BoundingDocs
language:
- en
pipeline_tag: visual-question-answering
tags:
- Visual-Question-Answering
- Question-Answering
- Document
license: apache-2.0
---

<div align="center">

<h1>DocExplainer: Document VQA with Bounding Box Localization</h1>

</div>

DocExplainer is an approach to Document Visual Question Answering (Document VQA) with bounding box localization.
Unlike standard VLMs that only provide text-based answers, DocExplainer adds **visual evidence through bounding boxes**, making model predictions more interpretable.
It is designed as a **plug-and-play module** to be combined with existing Vision-Language Models (VLMs), decoupling answer generation from spatial grounding.

- **Authors:** Alessio Chen, Simone Giovannini, Andrea Gemelli, Fabio Coppini, Simone Marinai
- **Affiliations:** [Letxbe AI](https://letxbe.ai/), [University of Florence](https://www.unifi.it/it)
- **License:** apache-2.0
- **Paper:** ["Towards Reliable and Interpretable Document Question Answering via VLMs"](https://arxiv.org/abs/2509.10129) by Alessio Chen et al.

<div align="center">
<img src="https://cdn.prod.website-files.com/655f447668b4ad1dd3d4b3d9/664cc272c3e176608bc14a4c_LOGO%20v0%20-%20LetXBebicolore.svg" alt="letxbe ai logo" width="200">
<img src="https://www.dinfo.unifi.it/upload/notizie/Logo_Dinfo_web%20(1).png" alt="Logo Unifi" width="200">
</div>

## Model Details

DocExplainer is a fine-tuned [SigLIP2 Giant](https://huggingface.co/google/siglip2-giant-opt-patch16-384)-based regressor that predicts bounding box coordinates for answer localization in document images. The system operates in a two-stage process:

1. **Question Answering**: Any VLM is used as a black-box component to generate a textual answer, given a document image and a question as input.
2. **Bounding Box Explanation**: DocExplainer takes the image, question, and generated answer, and predicts the coordinates of the supporting evidence.

## Model Architecture
DocExplainer builds on [SigLIP2 Giant](https://huggingface.co/google/siglip2-giant-opt-patch16-384) visual and text embeddings.

![]()

## Training Procedure
- Visual and textual embeddings from SigLIP2 are projected into a shared latent space and fused via fully connected layers.
- A regression head outputs normalized coordinates `[x1, y1, x2, y2]`.
- **Backbone**: SigLIP2 Giant (frozen).
- **Loss Function**: Smooth L1 (Huber) loss applied to normalized coordinates in [0, 1].
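
The projection-fuse-regress design described above can be sketched in PyTorch as follows. This is a minimal illustration, not the released architecture: the layer names and sizes (`FusionRegressor`, `embed_dim`, `hidden_dim`) are assumptions for the example.

```python
import torch
import torch.nn as nn

class FusionRegressor(nn.Module):
    """Illustrative sketch: project frozen SigLIP2 image/text embeddings
    into a shared space, fuse them, and regress a normalized box."""
    def __init__(self, embed_dim=1536, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(embed_dim, hidden_dim)
        self.txt_proj = nn.Linear(embed_dim, hidden_dim)
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 4),  # [x1, y1, x2, y2]
        )

    def forward(self, img_emb, txt_emb):
        fused = torch.cat([self.img_proj(img_emb), self.txt_proj(txt_emb)], dim=-1)
        return torch.sigmoid(self.fusion(fused))  # keep coordinates in [0, 1]

model = FusionRegressor()
img_emb, txt_emb = torch.randn(2, 1536), torch.randn(2, 1536)
pred = model(img_emb, txt_emb)  # shape: (2, 4), all values in [0, 1]
target = torch.tensor([[0.1, 0.2, 0.5, 0.6], [0.3, 0.1, 0.9, 0.2]])
loss = nn.SmoothL1Loss()(pred, target)  # Smooth L1 on normalized coordinates
```

Because the backbone is frozen, only the projection and fusion layers receive gradients during training.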

#### Training Setup
- **Dataset**: [BoundingDocs v2.0](https://huggingface.co/datasets/letxbe/BoundingDocs)
- **Epochs**: 20
- **Optimizer**: AdamW
- **Hardware**: 1 × NVIDIA L40S-1-48G GPU
- **Model Selection**: Best checkpoint chosen by highest mean IoU on the validation split.
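
The mean-IoU criterion used for model selection can be computed with a small helper like the one below. This is a plain implementation of box IoU over normalized `[x1, y1, x2, y2]` coordinates, not code from the released training pipeline.

```python
def box_iou(a, b):
    """IoU of two boxes given as normalized [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, targets):
    """Average IoU over a validation split."""
    return sum(box_iou(p, t) for p, t in zip(preds, targets)) / len(preds)

print(box_iou([0.1, 0.1, 0.5, 0.5], [0.1, 0.1, 0.5, 0.5]))  # identical boxes -> 1.0
print(box_iou([0.0, 0.0, 0.2, 0.2], [0.5, 0.5, 0.9, 0.9]))  # disjoint boxes -> 0.0
```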

## Quick Start

Here is a simple example of how to use `DocExplainer` to get an answer and its corresponding bounding box from a document image.

```python
from PIL import Image
import requests
import torch
from transformers import AutoModel, AutoModelForImageTextToText, AutoProcessor
import json

url = "https://i.postimg.cc/BvftyvS3/image-1d100e9.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
question = "What is the invoice number?"

# -----------------------
# 1. Load SmolVLM2-2.2B for answer generation
# -----------------------
vlm_model = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")

PROMPT = """Based only on the document image, answer the following question:
Question: {QUESTION}
Provide ONLY a JSON response in the following format (no trailing commas!):
{{
    "content": "answer"
}}
"""

prompt_text = PROMPT.format(QUESTION=question)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt_text},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(vlm_model.device, dtype=torch.bfloat16)

input_length = inputs["input_ids"].shape[1]
generated_ids = vlm_model.generate(**inputs, do_sample=False, max_new_tokens=2056)

# Decode only the newly generated tokens
output_ids = generated_ids[:, input_length:]
generated_texts = processor.batch_decode(
    output_ids,
    skip_special_tokens=True,
)

decoded_output = generated_texts[0].replace("Assistant:", "", 1).strip()
answer = json.loads(decoded_output)["content"]

print(f"Answer: {answer}")

# -----------------------
# 2. Load DocExplainer for bounding box prediction
# -----------------------
explainer = AutoModel.from_pretrained("letxbe/DocExplainer", trust_remote_code=True)
bbox = explainer.predict(image, answer)
print(f"Predicted bounding box (normalized): {bbox}")
```
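
Since the predicted box is in normalized coordinates, drawing it on the original document only requires scaling by the image size. A minimal sketch using Pillow, where `bbox` is assumed to be `[x1, y1, x2, y2]` in `[0, 1]`:

```python
from PIL import Image, ImageDraw

def draw_bbox(image, bbox, color="red", width=3):
    """Scale a normalized [x1, y1, x2, y2] box to pixels and draw its outline."""
    img = image.copy()
    w, h = img.size
    x1, y1, x2, y2 = bbox
    ImageDraw.Draw(img).rectangle(
        [x1 * w, y1 * h, x2 * w, y2 * h], outline=color, width=width
    )
    return img

# Example with a blank page and a box near the top-right region:
page = Image.new("RGB", (800, 1000), "white")
annotated = draw_bbox(page, [0.64, 0.04, 0.86, 0.06])
annotated.save("invoice_explained.png")
```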

<table>
<tr>
<td width="50%" valign="top">
Example Output:

**Question**: What is the invoice number? <br>
**Answer**: 3Y8M2d-846<br><br>
**Predicted BBox**: [0.6353235244750977, 0.03685223311185837, 0.8617828488349915, 0.058749228715896606] <br>
</td>
<td width="50%" valign="top">
Visualized Answer Location:
<img src="https://i.postimg.cc/0NmBM0b1/invoice-explained.png" alt="Invoice with predicted bounding box" width="100%">
</td>
</tr>
</table>

## Performance

| Architecture | Prompting | ANLS | MeanIoU |
|--------------------------------|------------|-------|---------|
| Smolvlm-2.2B | Zero-shot | 0.527 | 0.011 |
| | Anchors | 0.543 | 0.026 |
| | CoT | 0.561 | 0.011 |
| Qwen2-vl-7B | Zero-shot | 0.691 | 0.048 |
| | Anchors | 0.694 | 0.051 |
| | CoT | <ins>0.720</ins> | 0.038 |
| Claude Sonnet 4 | Zero-shot | **0.737** | 0.031 |
| Smolvlm-2.2B + DocExplainer | Zero-shot | 0.572 | 0.175 |
| Qwen2-vl-7B + DocExplainer | Zero-shot | 0.689 | 0.188 |
| Smol + Naive OCR | Zero-shot | 0.556 | <ins>0.405</ins> |
| Qwen + Naive OCR | Zero-shot | 0.690 | **0.494** |

Document VQA performance of different models and prompting strategies on the [BoundingDocs v2.0 dataset](https://huggingface.co/datasets/letxbe/BoundingDocs). <br>
The best value is shown in **bold**; the second-best is <ins>underlined</ins>.

## Citation

If you use `DocExplainer`, please cite:

```bibtex
@misc{chen2025reliableinterpretabledocumentquestion,
      title={Towards Reliable and Interpretable Document Question Answering via VLMs},
      author={Alessio Chen and Simone Giovannini and Andrea Gemelli and Fabio Coppini and Simone Marinai},
      year={2025},
      eprint={2509.10129},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.10129},
}
```

## Limitations
- **Prototype only**: Intended as a first approach, not a production-ready solution.
- **Dataset constraints**: Current evaluation is limited to cases where an answer fits in a single bounding box. Answers requiring reasoning over multiple regions, or not fully captured by OCR, cannot be properly localized.