InvoiceGuard -- Three-Way Invoice Matching Environment

An OpenEnv environment that simulates accounts payable exception resolution. An AI agent investigates multi-document business cases -- invoices, purchase orders, goods receipt notes, vendor profiles, and company policies -- to detect discrepancies, classify exception types, and render correct decisions.

Motivation

Three-way invoice matching is one of the most common and error-prone tasks in enterprise finance. Accounts payable teams manually compare invoices against purchase orders and goods receipt notes to detect overbilling, partial shipments, duplicate submissions, and price variances. This environment turns that real-world workflow into a structured evaluation benchmark where an AI agent must gather evidence through sequential investigation actions and reach a correct, policy-compliant decision.

Tasks

Task ID	Description	Difficulty	Expected Decision	Exception Type
`task_1_clean_match`	All documents align within tolerance	Easy	`approve_for_payment`	`clean_match`
`task_2_partial_receipt`	Billed quantity exceeds received quantity	Moderate	`place_on_hold`	`partial_receipt`
`task_3_price_variance`	Unit price exceeds PO price beyond tolerance	Moderate	`escalate_for_supervisor_review`	`price_mismatch`
`task_4_duplicate_invoice`	Previously processed invoice resubmitted	Hard	`reject_invoice`	`duplicate_invoice`
`task_5_mixed_discrepancy`	Invoice with both price variance and partial receipt; conflicting signals	Hard	`escalate_for_supervisor_review`	`price_mismatch`
`task_6_false_positive_duplicate`	Invoice looks like a duplicate but is a legitimate recurring order for a different PO	Hard	`approve_for_payment`	`clean_match`
`task_7_retroactive_price`	Vendor applied a price increase retroactively; PO predates the effective date	Hard	`escalate_for_supervisor_review`	`price_mismatch`
`task_8_split_invoice_pattern`	Supplier splits large order into sub-threshold invoices to dodge auto-approval	Hard	`escalate_for_supervisor_review`	`policy_violation`
`task_9_clean_from_risky_vendor`	Clean invoice from high-risk vendor with 5 prior incidents -- false-positive trap	Hard	`approve_for_payment`	`clean_match`
`task_10_rounding_false_alarm`	Invoice total off by $0.01 due to line-item rounding -- all else matches perfectly	Hard	`approve_for_payment`	`clean_match`
`task_11_authorized_overship`	GRN shows 110 received vs 100 ordered, but PO amendment authorized 10% overship	Hard	`approve_for_payment`	`clean_match`
`task_12_corrected_resubmission`	Corrected invoice (INV-R1) looks like a duplicate of rejected original	Hard	`approve_for_payment`	`clean_match`

Each task includes fully synthetic business documents with deterministic ground truth and a multi-criteria grader. Tasks 5-8 test ambiguity, temporal reasoning, and cross-case pattern detection. Tasks 9-12 are false-positive traps: surface signals point toward rejection or escalation, but deeper investigation shows that approval is the correct decision.

Action Space

The agent has 12 available actions divided into investigation, proposal, and terminal categories.

Investigation Actions (provide `action_type` only)

Action	Description
`inspect_invoice_line_items`	Reveal detailed invoice line items (codes, quantities, prices, totals)
`inspect_purchase_order`	Reveal purchase order details (ordered quantities, agreed prices)
`inspect_goods_receipt_note`	Reveal goods receipt note (received/accepted/rejected quantities)
`inspect_vendor_profile`	Reveal vendor risk tier, duplicate history, escalation thresholds
`inspect_policy_rules`	Reveal company matching tolerances and escalation rules
`check_for_duplicate_invoice`	Search case history for similar/processed invoices
`compare_quantity`	Compare billed vs ordered vs received quantities per line item
`compare_price`	Compare billed unit prices vs PO-agreed prices per line item
`compare_totals`	Verify subtotal consistency, PO total match, tax, and grand total
`summarize_findings`	Get a numbered summary of all collected findings
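
As an illustration of the kind of check `compare_quantity` describes, the following is a minimal sketch of a per-line three-way quantity comparison. The function name `compare_quantities`, the input layout, and the wording of the findings are assumptions made for this example; the environment's own implementation lives in `server/invoice_guard_environment.py` and may differ.

```python
def compare_quantities(lines):
    """Compare billed vs ordered vs received quantities per line item.

    `lines` maps an item code to a dict with integer keys
    "billed", "ordered", and "received" (a hypothetical layout).
    Returns a list of human-readable findings; an empty list means
    no quantity discrepancy was detected.
    """
    findings = []
    for item_code, qty in lines.items():
        billed, ordered, received = qty["billed"], qty["ordered"], qty["received"]
        if billed > received:
            findings.append(
                f"{item_code}: billed {billed} exceeds received {received}"
            )
        if billed > ordered:
            findings.append(
                f"{item_code}: billed {billed} exceeds ordered {ordered}"
            )
    return findings


# Hypothetical partial-receipt case: 100 ordered and billed, 80 received.
partial = {"ITEM-001": {"billed": 100, "ordered": 100, "received": 80}}
clean = {"ITEM-001": {"billed": 100, "ordered": 100, "received": 100}}
```

A sketch like this deliberately ignores cases such as an authorized overship (task 11), where the policy and PO documents have to be consulted before a quantity difference counts as an exception.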

Proposal Action

Action	Description
`propose_exception_type`	Declare the suspected exception type (with `exception_type` field)

Terminal Action

Action	Required Fields	Description
`submit_final_resolution`	`final_decision`, `exception_type`, `evidence_references`, `explanation`	End the episode with a decision

Action JSON Format

{"action_type": "inspect_purchase_order"}

{
  "action_type": "submit_final_resolution",
  "final_decision": "escalate_for_supervisor_review",
  "exception_type": "price_mismatch",
  "evidence_references": ["inspect_purchase_order", "compare_price", "inspect_policy_rules"],
  "explanation": "Price variance of 10% exceeds 5% tolerance, requiring supervisor escalation per company policy."
}
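
A small helper can catch malformed actions before they are sent. The sketch below builds the JSON payloads shown above and checks the required fields listed in the action tables. The helper name `build_action` and its validation rules are assumptions for illustration, not part of the InvoiceGuard client.

```python
import json

# Action names copied from the tables above.
INVESTIGATION_ACTIONS = {
    "inspect_invoice_line_items",
    "inspect_purchase_order",
    "inspect_goods_receipt_note",
    "inspect_vendor_profile",
    "inspect_policy_rules",
    "check_for_duplicate_invoice",
    "compare_quantity",
    "compare_price",
    "compare_totals",
    "summarize_findings",
}
REQUIRED_TERMINAL_FIELDS = (
    "final_decision",
    "exception_type",
    "evidence_references",
    "explanation",
)


def build_action(action_type, **fields):
    """Return the JSON string for one action (hypothetical helper)."""
    if action_type in INVESTIGATION_ACTIONS:
        # Investigation actions carry `action_type` only.
        if fields:
            raise ValueError(f"{action_type} takes no extra fields")
    elif action_type == "propose_exception_type":
        if set(fields) != {"exception_type"}:
            raise ValueError("propose_exception_type needs exactly 'exception_type'")
    elif action_type == "submit_final_resolution":
        missing = [f for f in REQUIRED_TERMINAL_FIELDS if f not in fields]
        if missing:
            raise ValueError(f"missing fields: {missing}")
    else:
        raise ValueError(f"unknown action_type: {action_type}")
    return json.dumps({"action_type": action_type, **fields})
```

For example, `build_action("inspect_purchase_order")` produces the one-field payload shown above, while a `submit_final_resolution` call without an `explanation` raises `ValueError`.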

Observation Space

Each step returns an `InvoiceGuardObservation` with these fields:

Field	Type	Description
`case_id`	`str`	Unique case identifier
`task_id`	`str`	Which task is being evaluated
`difficulty`	`str`	`easy`, `moderate`, or `hard`
`invoice_summary`	`str`	One-line invoice overview (supplier, amount, PO ref)
`goal`	`str`	Natural language description of the agent's objective
`available_actions`	`list[str]`	Actions the agent can take
`revealed_documents`	`list[str]`	Documents the agent has already inspected
`findings`	`list[str]`	Accumulated investigation findings
`remaining_steps`	`int`	Steps left before timeout
`last_action_result`	`str`	Detailed output from the most recent action
`last_action_error`	`bool`	Whether the last action had an error
`warnings`	`list[str]`	System warnings (e.g., low steps remaining)
`reward`	`float`	Reward signal for the last action
`done`	`bool`	Whether the episode has ended
`metadata`	`dict`	Grader results (on episode end)
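
For agents that receive observations as JSON, it can help to parse them into a typed container. The sketch below mirrors the field table above in a plain dataclass. The class name `ObservationSketch` and the `from_dict` helper are assumptions for this example; the actual `InvoiceGuardObservation` is defined in `models.py` and may be structured differently.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class ObservationSketch:
    """Illustrative mirror of the observation fields (not the real model)."""

    case_id: str
    task_id: str
    difficulty: str
    invoice_summary: str
    goal: str
    available_actions: list[str]
    revealed_documents: list[str]
    findings: list[str]
    remaining_steps: int
    last_action_result: str
    last_action_error: bool
    warnings: list[str]
    reward: float
    done: bool
    metadata: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, payload: dict[str, Any]) -> "ObservationSketch":
        # Ignore keys this sketch does not know about.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})
```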

Reward Design

The environment provides dense, per-step rewards:

Event	Reward
Reveal a new document	+0.05
Useful comparison finding discrepancy	+0.10
Confirm no issue (clean comparison)	+0.02
Propose correct exception type	+0.15
Propose wrong exception type	-0.05
Summarize findings	+0.03
Repeat an already-seen action	-0.02
Submit correct final decision	+0.30
Submit wrong final decision	-0.20
Correct exception type on resolution	+0.15
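
To see how these per-step rewards add up over an episode, the sketch below sums them for a sequence of events. The event keys are hypothetical labels for the rows in the table above, and the example assumes that the "correct final decision" and "correct exception type on resolution" rewards stack on the same terminal step, which the table does not state explicitly.

```python
# Reward values copied from the table above; the keys are hypothetical labels.
STEP_REWARDS = {
    "reveal_new_document": 0.05,
    "comparison_found_discrepancy": 0.10,
    "comparison_clean": 0.02,
    "propose_correct_type": 0.15,
    "propose_wrong_type": -0.05,
    "summarize_findings": 0.03,
    "repeat_action": -0.02,
    "submit_correct_decision": 0.30,
    "submit_wrong_decision": -0.20,
    "resolution_correct_type": 0.15,
}


def episode_reward(events):
    """Sum the per-step rewards for a list of event labels."""
    return sum(STEP_REWARDS[event] for event in events)


# Example trace: three documents revealed, one discrepancy found,
# a correct proposal, then a fully correct final resolution.
trace = [
    "reveal_new_document",
    "reveal_new_document",
    "reveal_new_document",
    "comparison_found_discrepancy",
    "propose_correct_type",
    "submit_correct_decision",
    "resolution_correct_type",
]
total = round(episode_reward(trace), 2)  # 0.15 + 0.10 + 0.15 + 0.30 + 0.15 = 0.85
```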

Grading

Episodes are scored by a deterministic grader on six weighted criteria (total = 1.0):

Criterion	Weight	Description
Decision correctness	0.35	Exact match = 1.0, partial credit for related decisions
Exception type	0.20	Correct classification of the exception
Evidence sufficiency	0.15	Did the agent inspect the right documents?
Investigation quality	0.10	Breadth of document review and findings
Explanation quality	0.10	Cites specific numbers, references policy, uses correct terminology
Efficiency	0.10	Completing within step budget without waste
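
The final score is a weighted sum of the six criteria. The sketch below shows that arithmetic, assuming each criterion is itself scored in the range 0.0 to 1.0. The dictionary keys and the `weighted_score` function are names chosen for this example; the actual grader is in `graders/scoring.py`.

```python
# Weights copied from the table above; key names are hypothetical.
WEIGHTS = {
    "decision_correctness": 0.35,
    "exception_type": 0.20,
    "evidence_sufficiency": 0.15,
    "investigation_quality": 0.10,
    "explanation_quality": 0.10,
    "efficiency": 0.10,
}


def weighted_score(sub_scores):
    """Combine per-criterion scores (each assumed in [0, 1]) into one score."""
    missing = set(WEIGHTS) - set(sub_scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


# An episode that is perfect except for a wrong decision with no partial credit:
# 1.0 - 0.35 = 0.65
example = dict.fromkeys(WEIGHTS, 1.0)
example["decision_correctness"] = 0.0
score = round(weighted_score(example), 2)
```

Because decision correctness carries the largest weight, a wrong decision with no partial credit caps the score at 0.65 under these assumptions, however thorough the investigation was.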

Decisions

Decision	When to use
`approve_for_payment`	All matches are clean and within tolerance
`place_on_hold`	Billed quantity exceeds received quantity
`reject_invoice`	Duplicate invoice or fraudulent submission
`escalate_for_supervisor_review`	Price or total variance exceeds tolerance, or the invoice is high-value
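
The table above can be read as a simple precedence of checks. The sketch below encodes one such reading. The function `choose_decision`, its parameters, and the ordering (duplicate first, then price variance, then quantity) are assumptions: the ordering of escalation over hold is inferred from task 5, where an invoice with both a price variance and a partial receipt is expected to be escalated. It does not handle the false-positive traps, which require reading the underlying documents.

```python
def choose_decision(
    is_duplicate,
    price_variance_pct,
    price_tolerance_pct,
    billed_qty,
    received_qty,
):
    """Map basic findings to a decision (hypothetical baseline heuristic)."""
    if is_duplicate:
        return "reject_invoice"
    if price_variance_pct > price_tolerance_pct:
        return "escalate_for_supervisor_review"
    if billed_qty > received_qty:
        return "place_on_hold"
    return "approve_for_payment"


# A 10% price variance against a 5% tolerance, as in the JSON example earlier.
decision = choose_decision(
    is_duplicate=False,
    price_variance_pct=10.0,
    price_tolerance_pct=5.0,
    billed_qty=100,
    received_qty=100,
)
```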

Setup & Usage

Prerequisites

Python 3.10+
uv (recommended) or pip
Docker (for containerized deployment)

Local Development

cd invoice_guard

# Install dependencies with uv
uv sync

# Start the server
uv run uvicorn server.app:app --host 0.0.0.0 --port 8000

# Validate
openenv validate

Running the Baseline Agent

cd invoice_guard

# Create .env from the example
cp .env.example .env
# Edit .env with your API key and model

# Run inference
uv run python inference.py

Docker

cd invoice_guard

# Build
docker build -t invoiceguard .

# Run
docker run -p 8000:8000 invoiceguard

# Run with hackathon resource constraints
docker run --cpus=2 --memory=8g -p 8000:8000 invoiceguard

# Validate against running container
openenv validate --url http://localhost:8000

Deploy to Hugging Face Spaces

cd invoice_guard
openenv push
# or: openenv push --namespace my-org --private

Baseline Scores (12 tasks)

Model	Avg Score	task_1	task_2	task_3	task_4	task_5	task_6	task_7	task_8	task_9	task_10	task_11	task_12
gpt-4.1-mini	0.87	0.95	0.78	0.75	0.95	0.78	0.95	0.75	0.75	0.95	0.95	0.98	0.95
gpt-5.4-mini	0.87	0.98	0.95	0.73	0.98	0.75	0.98	0.75	0.50	0.95	0.98	0.98	0.95
gpt-4.1	0.79	0.95	0.75	0.75	0.47	0.78	0.95	0.78	0.75	0.40	0.95	0.95	0.95
gpt-5.4	0.78	0.95	0.75	0.70	0.47	0.75	0.95	0.78	0.75	0.40	0.95	0.95	0.95

Key observations:

Task 9 (clean invoice from risky vendor) is a strong false-positive trap: both gpt-4.1 and gpt-5.4 escalated instead of approving, scoring only 0.40.
Task 8 (split invoice pattern) tripped gpt-5.4-mini, which rejected instead of escalating (0.50).
Task 4 (duplicate invoice) tripped both full-size models, which escalated instead of rejecting (0.47).
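Mini models outperform their larger counterparts on average (0.87 vs. 0.78-0.79). The gap comes mainly from tasks 4 and 9, where both full-size models escalated instead of taking the expected decision, which suggests the tasks reward focused analysis over verbose reasoning.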
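
The averages in the table can be recomputed from the per-task scores. The snippet below does so with the values copied from the table above; it assumes the reported average is the unweighted mean over the 12 tasks, rounded to two decimals.

```python
from statistics import mean

# Per-task scores (task_1 .. task_12) copied from the baseline table.
SCORES = {
    "gpt-4.1-mini": [0.95, 0.78, 0.75, 0.95, 0.78, 0.95, 0.75, 0.75, 0.95, 0.95, 0.98, 0.95],
    "gpt-5.4-mini": [0.98, 0.95, 0.73, 0.98, 0.75, 0.98, 0.75, 0.50, 0.95, 0.98, 0.98, 0.95],
    "gpt-4.1": [0.95, 0.75, 0.75, 0.47, 0.78, 0.95, 0.78, 0.75, 0.40, 0.95, 0.95, 0.95],
    "gpt-5.4": [0.95, 0.75, 0.70, 0.47, 0.75, 0.95, 0.78, 0.75, 0.40, 0.95, 0.95, 0.95],
}

# Unweighted mean per model, rounded to two decimals.
averages = {model: round(mean(scores), 2) for model, scores in SCORES.items()}

for model, avg in averages.items():
    print(f"{model}: {avg:.2f}")
```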

Project Structure

invoice_guard/
|---- openenv.yaml              # OpenEnv manifest
|---- pyproject.toml             # Dependencies (managed by uv)
|---- uv.lock                   # Locked dependencies
|---- Dockerfile                # Container image definition
|---- models.py                 # All data models (Action, Observation, State, entities)
|---- client.py                 # InvoiceGuardEnv client (EnvClient subclass)
|---- inference.py              # Baseline LLM agent script
|---- .env.example              # Environment variable template
|---- tasks/
|   |---- __init__.py
|   |---- definitions.py        # Synthetic case templates and ground truth
|---- graders/
|   |---- __init__.py
|   |---- scoring.py            # Deterministic multi-criteria grader
|---- server/
    |---- __init__.py
    |---- app.py                 # FastAPI application (HTTP + WebSocket)
    |---- invoice_guard_environment.py  # Core Environment implementation
