PDF Invoice Parser: Fine-tuned LayoutLMv3
A fine-tuned LayoutLMv3 model for named entity recognition (NER) on PDF invoices. It extracts structured fields such as invoice number, dates, vendor/customer details, and financial totals directly from document pages using text, layout (bounding boxes), and visual features.
Model Details
- Base model: microsoft/layoutlmv3-base
- Architecture: LayoutLMv3ForTokenClassification
- Task: Token classification (NER)
- Fine-tuned on: Labeled PDF invoice pages
Labels
| Label | Description |
|---|---|
| `B/I-INVOICE_NUM` | Invoice number |
| `B/I-INVOICE_DATE` | Invoice date |
| `B/I-DUE_DATE` | Payment due date |
| `B/I-VENDOR_NAME` | Vendor / seller name |
| `B/I-VENDOR_ADDR` | Vendor address |
| `B/I-CUST_NAME` | Customer / buyer name |
| `B/I-CUST_ADDR` | Customer address |
| `B/I-TOTAL` | Total amount |
| `B/I-SUBTOTAL` | Subtotal amount |
| `B/I-TAX` | Tax amount |
| `O` | Outside / no entity |
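Each `B/I-` row stands for two tags, a `B-` tag for the first word of a field and an `I-` tag for each following word, so the table expands to 21 labels in total. The sketch below builds that list; the ordering is an assumption made for illustration, and the authoritative mapping is whatever `model.config.id2label` contains in the published checkpoint.

```python
# Entity types taken from the label table.
ENTITY_TYPES = [
    "INVOICE_NUM", "INVOICE_DATE", "DUE_DATE",
    "VENDOR_NAME", "VENDOR_ADDR",
    "CUST_NAME", "CUST_ADDR",
    "TOTAL", "SUBTOTAL", "TAX",
]

# Assumption: "O" first, then B-/I- pairs in table order.
# The checkpoint's real id order may differ; read model.config.id2label.
LABELS = ["O"] + [
    f"{prefix}-{name}" for name in ENTITY_TYPES for prefix in ("B", "I")
]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

print(len(LABELS))  # 10 entity types * 2 prefixes + "O"
```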
Quick Start
pip install transformers torch Pillow
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
import torch
from PIL import Image
processor = LayoutLMv3Processor.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained("Kapilydv6/layoutlmv3-invoice-parser")
model.eval()
# words and boxes come from your OCR tool (e.g. pytesseract)
encoding = processor(
    image,                 # PIL.Image of the invoice page
    words,                 # list of word strings
    boxes=boxes,           # list of [x0, y0, x1, y1] normalized to 0-1000
    return_tensors="pt",
    truncation=True,
    padding="max_length",
    max_length=512,
)
with torch.no_grad():
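    outputs = model(**encoding)
# One prediction per token (512 entries, including special, padding, and sub-word tokens), not per word
predictions = outputs.logits.argmax(-1).squeeze().tolist()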
id2label = model.config.id2label
predicted_labels = [id2label[p] for p in predictions]
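Two details of the snippet above are easy to get wrong: OCR tools usually return boxes in pixels rather than on the 0-1000 scale, and the model predicts one label per token rather than per word. The following is a minimal sketch of both steps. `normalize_box` and `word_level_labels` are hypothetical helper names, not part of this repository, and the second helper assumes the processor loaded a fast tokenizer so that `encoding.word_ids(0)` is available.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space [x0, y0, x1, y1] box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # Clamp so OCR boxes that spill past the page edge stay valid.
    return [min(1000, max(0, v)) for v in scaled]


def word_level_labels(word_ids, token_labels):
    """Keep the label of the first sub-token of each word.

    word_ids: list such as encoding.word_ids(0); None marks special
    and padding tokens.
    """
    labels = []
    previous = None
    for word_id, label in zip(word_ids, token_labels):
        if word_id is not None and word_id != previous:
            labels.append(label)
        previous = word_id
    return labels
```

With these, `boxes = [normalize_box(b, *image.size) for b in pixel_boxes]` prepares the input and `word_level_labels(encoding.word_ids(0), predicted_labels)` yields one label per word. Because of `truncation=True`, words beyond the 512-token limit receive no label, so the result can be shorter than `words`.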
Full Pipeline (PDF → JSON)
from invoice_parser import InvoiceParser
parser = InvoiceParser(strategy="finetuned")
result = parser.parse("invoice.pdf")
print(result.to_json())
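The internals of `InvoiceParser` are not shown here. One step any such pipeline needs is merging consecutive `B-`/`I-` tags into whole field values, and a minimal sketch of that step follows. It assumes one label per OCR word; `group_entities` is a hypothetical name, not a function from `invoice_parser.py`.

```python
def group_entities(words, labels):
    """Merge B-/I- tagged words into (entity_type, text) pairs."""
    entities = []
    current_type = None
    current_words = []
    for word, label in zip(words, labels):
        prefix, _, entity_type = label.partition("-")
        # A B- tag, or an I- tag of a different type, starts a new entity.
        starts_new = prefix == "B" or (
            prefix == "I" and entity_type != current_type
        )
        if label == "O" or starts_new:
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        if label != "O":
            current_type = entity_type
            current_words.append(word)
    # Flush the entity that runs to the end of the page, if any.
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities
```

A stray `I-` tag with no preceding `B-` tag is treated leniently as the start of a new entity, which tends to be more forgiving of model errors than discarding it.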
Output Format
{
  "invoice_number": "INV-2024-0042",
  "invoice_date": "March 15, 2024",
  "due_date": "April 15, 2024",
  "vendor_name": "Acme Corp",
  "vendor_address": "123 Business St, City",
  "customer_name": "Client LLC",
  "customer_address": "456 Client Ave, Town",
  "subtotal": 1200.00,
  "tax": 216.00,
  "total": 1416.00
}
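The JSON keys correspond one-to-one with the entity types in the label table, and the three amount fields are numbers rather than strings. The sketch below shows one way to get from a list of `(entity_type, text)` pairs to this shape. `FIELD_KEYS`, `parse_amount`, and `to_fields` are hypothetical names, and the amount parsing assumes a period as the decimal separator and a comma as the thousands separator.

```python
import re

# Entity type -> JSON key, following the label table and the output above.
FIELD_KEYS = {
    "INVOICE_NUM": "invoice_number",
    "INVOICE_DATE": "invoice_date",
    "DUE_DATE": "due_date",
    "VENDOR_NAME": "vendor_name",
    "VENDOR_ADDR": "vendor_address",
    "CUST_NAME": "customer_name",
    "CUST_ADDR": "customer_address",
    "SUBTOTAL": "subtotal",
    "TAX": "tax",
    "TOTAL": "total",
}
AMOUNT_FIELDS = {"subtotal", "tax", "total"}


def parse_amount(text):
    """Drop currency symbols and thousands separators, then parse a float."""
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None  # nothing numeric could be recovered


def to_fields(entities):
    """Build the output dict, keeping the first occurrence of each field."""
    fields = {}
    for entity_type, text in entities:
        key = FIELD_KEYS.get(entity_type)
        if key is None or key in fields:
            continue
        fields[key] = parse_amount(text) if key in AMOUNT_FIELDS else text
    return fields
```

Passing the result to `json.dumps(fields, indent=2)` gives output in the format shown above, though Python prints `1416.0` where the example shows `1416.00`.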
Extraction Strategies (invoice_parser.py)
| Strategy | Speed | Accuracy | Best For |
|---|---|---|---|
| `pdfplumber` | Fast | Good | Digital/typed PDFs |
| `ocr` | Moderate | Good | Scanned PDFs |
| `finetuned` | Moderate | Very Good | Complex layouts (this model) |
| `claude` | Moderate | Excellent | Any PDF (needs API key) |
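The "Best For" column can be read as a simple decision rule. The helper below is a hypothetical illustration of that rule and is not part of `invoice_parser.py`; it also assumes the other strategy names are passed to `InvoiceParser(strategy=...)` the same way as `"finetuned"`, which only the `"finetuned"` example confirms.

```python
def choose_strategy(has_text_layer, complex_layout=False, use_llm=False):
    """Hypothetical helper: map document traits to a strategy name."""
    if use_llm:
        return "claude"      # any PDF, but requires an API key
    if complex_layout:
        return "finetuned"   # this model
    # Digital PDFs carry an extractable text layer; scans do not.
    return "pdfplumber" if has_text_layer else "ocr"
```

For example, `choose_strategy(has_text_layer=False)` selects `"ocr"` for a plain scanned invoice.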
Training
Fine-tuned using train_model.py on labeled invoice annotations produced by label_invoices.py.
python train_model.py --annotations annotations/ --output trained_model/ --epochs 15
License
MIT