PDF Invoice Parser: Fine-tuned LayoutLMv3

A fine-tuned LayoutLMv3 model for named entity recognition (NER) on PDF invoices. It extracts structured fields such as invoice number, dates, vendor/customer details, and financial totals directly from document pages using text, layout (bounding boxes), and visual features.

Model Details

  • Base model: microsoft/layoutlmv3-base
  • Architecture: LayoutLMv3ForTokenClassification
  • Task: Token classification (NER)
  • Fine-tuned on: Labeled PDF invoice pages

Labels

Label               Description
B/I-INVOICE_NUM     Invoice number
B/I-INVOICE_DATE    Invoice date
B/I-DUE_DATE        Payment due date
B/I-VENDOR_NAME     Vendor / seller name
B/I-VENDOR_ADDR     Vendor address
B/I-CUST_NAME       Customer / buyer name
B/I-CUST_ADDR       Customer address
B/I-TOTAL           Total amount
B/I-SUBTOTAL        Subtotal amount
B/I-TAX             Tax amount
O                   Outside / no entity
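
The B/I prefixes follow the usual BIO tagging scheme: B- marks the first word of an entity and I- marks each word that continues it. As a minimal sketch of how word-level tags could be grouped into field values, assuming one label per word, the hypothetical helper decode_bio (not part of this repository) below collects consecutive tags of the same type:

```python
def decode_bio(words, labels):
    """Group (word, BIO label) pairs into (entity_type, text) spans."""
    spans = []
    current_type, current_words = None, []

    def flush():
        # Close the entity being built, if any, and reset the buffer.
        nonlocal current_type, current_words
        if current_type is not None:
            spans.append((current_type, " ".join(current_words)))
        current_type, current_words = None, []

    for word, label in zip(words, labels):
        if label == "O":
            flush()
            continue
        prefix, _, entity = label.partition("-")
        # A B- tag, or an I- tag of a different type, starts a new entity.
        if prefix == "B" or entity != current_type:
            flush()
            current_type = entity
        current_words.append(word)
    flush()
    return spans


words = ["Invoice", "INV-2024-0042", "Acme", "Corp"]
labels = ["O", "B-INVOICE_NUM", "B-VENDOR_NAME", "I-VENDOR_NAME"]
spans = decode_bio(words, labels)
```

This sketch is lenient: an I- tag with no preceding B- tag simply opens a new entity instead of being dropped.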

Quick Start

pip install transformers torch Pillow

from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
import torch
from PIL import Image

processor = LayoutLMv3Processor.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser")
model.eval()

# `image`, `words`, and `boxes` must be defined before this call.
# words and boxes come from your OCR tool (e.g. pytesseract), one box per word.
encoding = processor(
    image,          # PIL.Image of the invoice page
    words,          # list of word strings
    boxes=boxes,    # list of [x0, y0, x1, y1] normalized to 0–1000
    return_tensors="pt",
    truncation=True,
    padding="max_length",
    max_length=512,
)

with torch.no_grad():
    outputs = model(**encoding)

# One predicted label id per token; this includes sub-word pieces, special tokens, and padding.
predictions = outputs.logits.argmax(-1).squeeze(0).tolist()
id2label = model.config.id2label
predicted_labels = [id2label[p] for p in predictions]
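
Two details of the snippet above are easy to get wrong: boxes have to be scaled to the 0 to 1000 range, and the model predicts one label per token instead of one per word. The helpers below are a minimal sketch of both steps. normalize_box and word_level_labels are hypothetical names, not part of this repository, and keeping the first token's label for each word is one common convention, not necessarily the one used by invoice_parser.py.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space [x0, y0, x1, y1] box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


def word_level_labels(word_ids, token_labels):
    """Keep the label of the first token of each word.

    `word_ids` holds one entry per token: the index of the word the token
    belongs to, or None for special and padding tokens.
    """
    labels = []
    previous = None
    for word_id, label in zip(word_ids, token_labels):
        if word_id is not None and word_id != previous:
            labels.append(label)
        previous = word_id
    return labels


box = normalize_box([100, 50, 200, 100], page_width=1000, page_height=500)
merged = word_level_labels(
    [None, 0, 0, 1, None, None],
    ["O", "B-TOTAL", "I-TOTAL", "O", "O", "O"],
)
```

With a fast tokenizer, the word index list can usually be obtained from encoding.word_ids(batch_index=0), and the page size from image.size.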

Full Pipeline (PDF → JSON)

from invoice_parser import InvoiceParser

parser = InvoiceParser(strategy="finetuned")
result = parser.parse("invoice.pdf")
print(result.to_json())

Output Format

{
  "invoice_number": "INV-2024-0042",
  "invoice_date": "March 15, 2024",
  "due_date": "April 15, 2024",
  "vendor_name": "Acme Corp",
  "vendor_address": "123 Business St, City",
  "customer_name": "Client LLC",
  "customer_address": "456 Client Ave, Town",
  "subtotal": 1200.00,
  "tax": 216.00,
  "total": 1416.00
}
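
In the sample above, subtotal plus tax equals total (1200.00 + 216.00 = 1416.00). A cheap sanity check on extracted values is to verify that relationship. The sketch below assumes the three amounts are present and numeric; totals_consistent is a hypothetical helper, and the pipeline is not documented to run such a check itself.

```python
import json
from decimal import Decimal


def totals_consistent(record, tolerance=Decimal("0.01")):
    """Return True if subtotal + tax matches total within `tolerance`."""
    # Convert through str() so float values become exact decimal amounts.
    subtotal = Decimal(str(record["subtotal"]))
    tax = Decimal(str(record["tax"]))
    total = Decimal(str(record["total"]))
    return abs(subtotal + tax - total) <= tolerance


sample = json.loads('{"subtotal": 1200.00, "tax": 216.00, "total": 1416.00}')
ok = totals_consistent(sample)
```

Invoices with discounts, shipping, or several tax lines will not satisfy this identity, so a failed check is better treated as a flag for review than as an error.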

Extraction Strategies (invoice_parser.py)

Strategy      Speed      Accuracy    Best For
pdfplumber    Fast       Good        Digital/typed PDFs
ocr           Moderate   Good        Scanned PDFs
finetuned     Moderate   Very Good   Complex layouts (this model)
claude        Moderate   Excellent   Any PDF (needs API key)
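
When the type of PDF is not known in advance, one option is to try the cheaper strategies first and fall back to the others. The sketch below is an assumption-heavy illustration: parse_with_fallback is a hypothetical helper, and it assumes a failed strategy either raises an exception or returns None, which the source does not document.

```python
def parse_with_fallback(path, make_parser,
                        strategies=("pdfplumber", "ocr", "finetuned")):
    """Try each strategy in order and return (strategy, result) for the first success.

    `make_parser` is any callable that maps a strategy name to an object
    with a .parse(path) method.
    """
    errors = {}
    for strategy in strategies:
        try:
            result = make_parser(strategy).parse(path)
        except Exception as exc:  # sketch only; catch narrower errors in real code
            errors[strategy] = exc
            continue
        if result is not None:
            return strategy, result
    raise RuntimeError(f"all strategies failed for {path}: {errors}")
```

With the parser from this repository, the call would look like parse_with_fallback("invoice.pdf", lambda s: InvoiceParser(strategy=s)).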

Training

The model was fine-tuned with train_model.py on labeled invoice annotations produced by label_invoices.py.

python train_model.py --annotations annotations/ --output trained_model/ --epochs 15
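
Token classification training needs a mapping between label strings and integer ids. The sketch below builds one from the label table: 10 entity types with B- and I- variants plus O gives 21 labels. The ordering here is an assumption for illustration; the mapping actually used by train_model.py is not shown in this document, and the authoritative one for the published checkpoint is model.config.id2label.

```python
# Entity types taken from the label table; the id ordering is illustrative only.
ENTITY_TYPES = [
    "INVOICE_NUM", "INVOICE_DATE", "DUE_DATE",
    "VENDOR_NAME", "VENDOR_ADDR",
    "CUST_NAME", "CUST_ADDR",
    "TOTAL", "SUBTOTAL", "TAX",
]

# "O" first, then a B-/I- pair for each entity type.
LABELS = ["O"] + [
    f"{prefix}-{entity}" for entity in ENTITY_TYPES for prefix in ("B", "I")
]

label2id = {label: index for index, label in enumerate(LABELS)}
id2label = {index: label for label, index in label2id.items()}
```

Maps like these are typically passed as id2label and label2id, together with num_labels=len(LABELS), to LayoutLMv3ForTokenClassification.from_pretrained when starting from microsoft/layoutlmv3-base.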

License

MIT
