PDF Invoice Parser — Fine-tuned LayoutLMv3

A fine-tuned LayoutLMv3 model for named entity recognition (NER) on PDF invoices. It extracts structured fields such as invoice number, dates, vendor/customer details, and financial totals directly from document pages using text, layout (bounding boxes), and visual features.

Model Details

Base model: microsoft/layoutlmv3-base
Architecture: LayoutLMv3ForTokenClassification
Task: Token classification (NER)
Fine-tuned on: Labeled PDF invoice pages

Labels

Label	Description
`B/I-INVOICE_NUM`	Invoice number
`B/I-INVOICE_DATE`	Invoice date
`B/I-DUE_DATE`	Payment due date
`B/I-VENDOR_NAME`	Vendor / seller name
`B/I-VENDOR_ADDR`	Vendor address
`B/I-CUST_NAME`	Customer / buyer name
`B/I-CUST_ADDR`	Customer address
`B/I-TOTAL`	Total amount
`B/I-SUBTOTAL`	Subtotal amount
`B/I-TAX`	Tax amount
`O`	Outside / no entity
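
The `B/I-` prefix in the table is shorthand for two tags per field under the BIO scheme: `B-` marks the first word of an entity and `I-` marks the words that continue it. The following is a minimal sketch of expanding the table into the full tag set. The ordering of the labels is an assumption made for illustration; the authoritative mapping is whatever `model.config.id2label` contains.

```python
# Field names taken from the label table above.
FIELDS = [
    "INVOICE_NUM", "INVOICE_DATE", "DUE_DATE",
    "VENDOR_NAME", "VENDOR_ADDR",
    "CUST_NAME", "CUST_ADDR",
    "TOTAL", "SUBTOTAL", "TAX",
]

# "O" plus a B- and an I- tag for every field.
# NOTE: this ordering is an assumption; read the real mapping from
# model.config.id2label instead of relying on it.
LABELS = ["O"] + [f"{prefix}-{field}" for field in FIELDS for prefix in ("B", "I")]

label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

print(len(LABELS))  # 21 (10 fields x 2 tags + "O")
```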

Quick Start

pip install transformers torch Pillow

from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
import torch
from PIL import Image

processor = LayoutLMv3Processor.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser")
model.eval()

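# Load the page as an RGB image ("invoice_page.png" is a placeholder path)
image = Image.open("invoice_page.png").convert("RGB")

# words and boxes come from your OCR tool (e.g. pytesseract)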
encoding = processor(
    image,          # PIL.Image of the invoice page
    words,          # list of word strings
    boxes=boxes,    # list of [x0, y0, x1, y1] normalized to 0–1000
    return_tensors="pt",
    truncation=True,
    padding="max_length",
    max_length=512,
)

with torch.no_grad():
    outputs = model(**encoding)

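# One prediction per token: this includes special tokens, padding, and
# sub-word pieces, so the list has max_length (512) entries, not len(words).
predictions = outputs.logits.argmax(-1).squeeze(0).tolist()
id2label = model.config.id2label
predicted_labels = [id2label[p] for p in predictions]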
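
The snippet above leaves two practical steps to the caller: scaling OCR pixel boxes to the 0-1000 range, and reducing the per-token predictions to one label per word. The sketch below shows one way to do both. The helper names are hypothetical, and keeping the label of each word's first sub-token is a common convention, not something this model card confirms.

```python
from typing import List, Optional, Sequence


def normalize_box(box: Sequence[float], width: int, height: int) -> List[int]:
    """Scale a pixel box [x0, y0, x1, y1] to the 0-1000 range."""
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]
    # Clamp in case OCR reports coordinates slightly outside the page.
    return [max(0, min(1000, v)) for v in scaled]


def word_level_labels(
    word_ids: Sequence[Optional[int]], token_labels: Sequence[str]
) -> List[str]:
    """Keep the label of the first sub-token of each word.

    word_ids holds None for special and padding tokens, and repeats the
    same index for every sub-token of a word.
    """
    labels = []
    previous = None
    for word_id, label in zip(word_ids, token_labels):
        if word_id is not None and word_id != previous:
            labels.append(label)
        previous = word_id
    return labels


# A box on an 850 x 1100 pixel page.
print(normalize_box([85, 110, 255, 132], width=850, height=1100))
# [100, 100, 300, 120]

# Two words; the first is split into two sub-tokens.
word_ids = [None, 0, 0, 1, None, None]
token_labels = ["O", "B-INVOICE_NUM", "I-INVOICE_NUM", "O", "O", "O"]
print(word_level_labels(word_ids, token_labels))
# ['B-INVOICE_NUM', 'O']
```

With the Quick Start variables, `word_ids` would come from `encoding.word_ids(batch_index=0)`, which assumes the processor loaded a fast tokenizer.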

Full Pipeline (PDF → JSON)

from invoice_parser import InvoiceParser

parser = InvoiceParser(strategy="finetuned")
result = parser.parse("invoice.pdf")
print(result.to_json())

Output Format

{
  "invoice_number": "INV-2024-0042",
  "invoice_date": "March 15, 2024",
  "due_date": "April 15, 2024",
  "vendor_name": "Acme Corp",
  "vendor_address": "123 Business St, City",
  "customer_name": "Client LLC",
  "customer_address": "456 Client Ave, Town",
  "subtotal": 1200.00,
  "tax": 216.00,
  "total": 1416.00
}
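
One way to get from word-level BIO tags to the JSON above is to merge consecutive `B-`/`I-` words per field, rename the fields to the output keys, and parse the amounts as numbers. The sketch below illustrates that idea; it is not the code in invoice_parser.py. The key mapping is inferred from the label table and the sample output, and the helper names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Label field -> output key (inferred from the label table and sample JSON).
FIELD_TO_KEY = {
    "INVOICE_NUM": "invoice_number", "INVOICE_DATE": "invoice_date",
    "DUE_DATE": "due_date", "VENDOR_NAME": "vendor_name",
    "VENDOR_ADDR": "vendor_address", "CUST_NAME": "customer_name",
    "CUST_ADDR": "customer_address", "SUBTOTAL": "subtotal",
    "TAX": "tax", "TOTAL": "total",
}
AMOUNT_KEYS = {"subtotal", "tax", "total"}


def group_entities(words: List[str], labels: List[str]) -> Dict[str, str]:
    """Merge B-/I- tagged words into one string per field (first span wins)."""
    spans: List[Tuple[str, List[str]]] = []
    previous_field = None  # field of the immediately preceding word, if any
    for word, label in zip(words, labels):
        prefix, _, field = label.partition("-")
        if prefix not in ("B", "I"):
            previous_field = None
            continue
        if prefix == "I" and field == previous_field:
            spans[-1][1].append(word)       # continue the open span
        else:
            spans.append((field, [word]))   # B- tag, or a stray I- tag
        previous_field = field
    entities: Dict[str, str] = {}
    for field, span_words in spans:
        entities.setdefault(field, " ".join(span_words))
    return entities


def to_output(entities: Dict[str, str]) -> Dict[str, object]:
    """Rename fields to output keys and convert amounts to floats."""
    result: Dict[str, object] = {}
    for field, text in entities.items():
        key = FIELD_TO_KEY.get(field)
        if key is None:
            continue
        if key in AMOUNT_KEYS:
            # Drop currency symbols and thousands separators.
            # Assumes "." is the decimal separator.
            cleaned = re.sub(r"[^0-9.\-]", "", text)
            try:
                result[key] = float(cleaned)
            except ValueError:
                result[key] = None
        else:
            result[key] = text
    return result


words = ["Invoice", "INV-2024-0042", "Acme", "Corp", "Total", "$1,416.00"]
labels = ["O", "B-INVOICE_NUM", "B-VENDOR_NAME", "I-VENDOR_NAME", "O", "B-TOTAL"]
result = to_output(group_entities(words, labels))
print(result)
# {'invoice_number': 'INV-2024-0042', 'vendor_name': 'Acme Corp', 'total': 1416.0}
```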

Extraction Strategies (invoice_parser.py)

Strategy	Speed	Accuracy	Best For
`pdfplumber`	Fast	Good	Digital/typed PDFs
`ocr`	Moderate	Good	Scanned PDFs
`finetuned`	Moderate	Very Good	Complex layouts (this model)
`claude`	Moderate	Excellent	Any PDF (needs API key)
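
The table suggests an escalation order: try a cheap strategy first and move to a stronger one only when required fields are missing. The sketch below illustrates that idea and is not a documented feature of invoice_parser.py. The strategy order and the required keys are assumptions, and `parse_fn` is a hypothetical stand-in for a call such as `InvoiceParser(strategy=name).parse(path)` that is assumed to return a dict.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

REQUIRED_KEYS = ("invoice_number", "total")  # assumption: minimum useful output
DEFAULT_ORDER = ("pdfplumber", "ocr", "finetuned", "claude")  # assumption


def parse_with_fallback(
    path: str,
    parse_fn: Callable[[str, str], Dict[str, object]],
    order: Sequence[str] = DEFAULT_ORDER,
    required: Sequence[str] = REQUIRED_KEYS,
) -> Tuple[Optional[str], Dict[str, object]]:
    """Return (strategy, result) for the first strategy that fills all
    required keys, or the last result obtained if none does."""
    last: Tuple[Optional[str], Dict[str, object]] = (None, {})
    for strategy in order:
        try:
            result = parse_fn(strategy, path)
        except Exception:
            # A failing strategy (missing OCR binary, no API key, ...) is skipped.
            continue
        last = (strategy, result)
        if all(result.get(key) not in (None, "") for key in required):
            return strategy, result
    return last


# Stand-in parser for demonstration: only "finetuned" finds both fields.
def fake_parse(strategy: str, path: str) -> Dict[str, object]:
    if strategy == "pdfplumber":
        return {"invoice_number": "INV-1"}
    if strategy == "ocr":
        raise RuntimeError("OCR engine not installed")
    return {"invoice_number": "INV-1", "total": 10.0}


print(parse_with_fallback("invoice.pdf", fake_parse))
# ('finetuned', {'invoice_number': 'INV-1', 'total': 10.0})
```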

Training

Fine-tuned using train_model.py on labeled invoice annotations produced by label_invoices.py.

python train_model.py --annotations annotations/ --output trained_model/ --epochs 15
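
Annotations are per word, while the model is trained on sub-word tokens, so word labels have to be aligned to tokens before training. A common convention, assumed here because train_model.py is not shown, labels only the first sub-token of each word and sets every other position to -100 so the loss ignores it. The helper name is hypothetical.

```python
from typing import List, Optional, Sequence

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(
    word_ids: Sequence[Optional[int]], word_label_ids: Sequence[int]
) -> List[int]:
    """Expand per-word label ids to per-token label ids.

    Special tokens, padding (word id None), and non-initial sub-tokens
    receive IGNORE_INDEX.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(word_label_ids[word_id])
        previous = word_id
    return aligned


# Three words; the first and third are each split into two sub-tokens.
word_ids = [None, 0, 0, 1, 2, 2, None]
word_label_ids = [1, 0, 5]
print(align_labels(word_ids, word_label_ids))
# [-100, 1, -100, 0, 5, -100, -100]
```

The LayoutLMv3 processor can also perform this alignment itself when word-level labels are passed through its `word_labels` argument.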

License

MIT

Safetensors

Model size: 0.1B params
Tensor type: F32
