---
title: DataCleanEnv
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
tags:
  - openenv
  - rl-environment
  - data-cleaning
  - evaluation
  - trl
---

# DataCleanEnv: Data Quality Analysis & Cleaning Environment

A real-world OpenEnv environment where AI agents learn to identify and fix data quality issues through iterative inspection, correction, and validation.

## Motivation

Data cleaning is one of the most common and time-consuming tasks in data engineering. Analysts reportedly spend up to 80% of their time cleaning data before analysis. This environment trains and evaluates LLM agents on their ability to detect and fix real-world data quality problems: invalid formats, missing values, duplicates, outliers, referential integrity violations, and more.

## Action Space

Agents interact via string commands:

| Command | Description |
|---------|-------------|
| `inspect("column_name")` | View column statistics, sample values, and issue hints |
| `fix(row, "column", "value")` | Correct a specific cell value |
| `delete(row)` | Remove a duplicate or invalid row |
| `submit()` | Finalize work and receive the final score |
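For example, a short episode might issue the following command strings (row indices and values are illustrative):

```python
# Illustrative command strings; rows and values are made up.
actions = [
    'inspect("email")',                         # gather column stats and issue hints
    'fix(3, "email", "jane.doe@example.com")',  # correct an invalid email in row 3
    'delete(7)',                                # remove a duplicate row
    'submit()',                                 # finalize and receive the final score
]
```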

## Observation Space

Each observation includes:

| Field | Type | Description |
|-------|------|-------------|
| `task_id` | `str` | Active task identifier |
| `task_description` | `str` | What the data represents and its quality rules |
| `difficulty` | `str` | `"easy"`, `"medium"`, or `"hard"` |
| `data_preview` | `str` | Current dataset as a formatted text table |
| `column_info` | `str` | Column names, types, and descriptions |
| `feedback` | `str` | Result of the last action |
| `actions_remaining` | `int` | Steps left before auto-submit |
| `issues_fixed` | `int` | Count of resolved issues |
| `total_issues` | `int` | Total known issues in the dataset |
| `current_score` | `float` | Running score (0.0–1.0) |
| `action_history` | `list` | Last 10 commands executed |
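An observation therefore looks roughly like the following. The field set matches the table above; all values are illustrative:

```python
observation = {
    "task_id": "customer_contacts",
    "task_description": "Customer contact list; emails must be valid, dates YYYY-MM-DD, ...",
    "difficulty": "easy",
    "data_preview": "row | name | email | phone | city | signup_date\n...",
    "column_info": "name (str), email (str), phone (str), city (str), signup_date (str)",
    "feedback": "Fixed email in row 3.",
    "actions_remaining": 11,
    "issues_fixed": 2,
    "total_issues": 6,
    "current_score": 0.33,
    "action_history": ['inspect("email")', 'fix(3, "email", "jane.doe@example.com")'],
}
```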

## Tasks

### Easy: Customer Contacts

- 15 rows, 5 columns: `name`, `email`, `phone`, `city`, `signup_date`
- 6 issues: invalid emails, phone numbers with letters, empty city, wrong date format, duplicate row
- 15 max steps

### Medium: Sales Records

- 30 rows, 7 columns: `order_id`, `customer_name`, `product`, `quantity`, `unit_price`, `order_date`, `region`
- 12 issues: mixed date formats, negative quantities/prices, price outliers, inconsistent region names, duplicates, missing IDs, excess whitespace
- 25 max steps

### Hard: Employee Records

- 40 rows, 9 columns: `emp_id`, `name`, `email`, `department`, `hire_date`, `termination_date`, `salary`, `manager_id`, `performance_score`
- 18 issues: referential integrity violations (`manager_id`), temporal inconsistencies (termination before hire), salary outliers, invalid performance scores, department name inconsistencies, semantic duplicates, invalid dates, excess whitespace
- 35 max steps

## Reward Design

- Each correctly fixed issue: `+1/total_issues`
- Damaging good data (fixing a cell that had no issue): `-0.05`
- Deleting a non-duplicate row: `-0.05`
- Inspect actions: no reward change (information gathering)
- Final score clamped to `[0.0, 1.0]`
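As a sketch of the arithmetic (not the server's actual implementation in `server/environment.py`):

```python
def reward_delta(action: str, resolved_real_issue: bool, total_issues: int) -> float:
    """Sketch of the per-step reward rules listed above."""
    if action == "inspect":
        return 0.0                     # information gathering is free
    if action in ("fix", "delete"):
        if resolved_real_issue:
            return 1.0 / total_issues  # each real issue is worth an equal share
        return -0.05                   # damaged good data or deleted a valid row
    return 0.0                         # submit() just finalizes the episode

def final_score(raw_score: float) -> float:
    return max(0.0, min(1.0, raw_score))  # clamp to [0.0, 1.0]
```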

Grading is validation-based (not exact match):

- Emails validated by regex pattern
- Dates checked for `YYYY-MM-DD` format and validity
- Numbers checked against allowed ranges
- Canonical values checked against defined sets
- Referential integrity checked against existing IDs
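For instance, the email and date checks could look like this (the actual patterns live in `graders.py` and may differ):

```python
import re
from datetime import datetime

# Illustrative email pattern; the real regex in graders.py may be stricter.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[A-Za-z]{2,}$")

def is_valid_email(value: str) -> bool:
    return bool(EMAIL_RE.match(value))

def is_valid_date(value: str) -> bool:
    """Accept only real calendar dates in YYYY-MM-DD format."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False
```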

## Setup

### Local Development

```bash
# Install dependencies
pip install openenv-core fastapi uvicorn requests openai

# Run the server
uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
```

### Docker

```bash
docker build -t data-clean-env:latest .
docker run -d -p 8000:8000 data-clean-env:latest
```

### Running Inference

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="your-model-name"
export HF_TOKEN="your-token"
export ENV_URL="http://localhost:8000"

python inference.py
```

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/reset` | POST | Start a new episode: `{"task_id": "customer_contacts"}` |
| `/step` | POST | Execute an action: `{"action": {"command": "inspect(\"email\")"}}` |
| `/state` | GET | Get the current environment state |
| `/docs` | GET | OpenAPI documentation |
| `/web/` | GET | Interactive Gradio web UI |
| `/ws` | WS | WebSocket for stateful agent sessions |
| `/mcp` | POST/WS | MCP tool support for compatible agents |
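A minimal client round-trip with `requests`, assuming the JSON response mirrors the observation fields listed above (the exact schema may nest them differently):

```python
import requests

ENV_URL = "http://localhost:8000"

# Start a new episode on the easy task.
obs = requests.post(f"{ENV_URL}/reset", json={"task_id": "customer_contacts"}).json()

# Inspect a column, then submit to end the episode.
for command in ['inspect("email")', 'submit()']:
    obs = requests.post(
        f"{ENV_URL}/step",
        json={"action": {"command": command}},
    ).json()
    print(obs.get("feedback"), obs.get("current_score"))
```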

## Benchmark Results

Tested with the plan-then-execute inference strategy across four models:

| Model | Easy | Medium | Hard | Expert | Average |
|-------|------|--------|------|--------|---------|
| Llama-3.3-70B-Instruct | 1.00 | 1.00 | 0.73 | 0.75 | 0.87 |
| Qwen2.5-72B-Instruct | 0.78 | 1.00 | 0.52 | 0.82 | 0.78 |
| DeepSeek-V3 | 1.00 | 0.87 | 0.33 | 0.00 | 0.55 |
| Llama-3.1-8B-Instruct | 0.73 | 0.00 | 0.00 | 0.00 | 0.18 |

Key findings:

- 70B+ models achieve near-perfect scores on easy/medium tasks
- Hard/expert tasks require strong multi-column reasoning
- The plan-then-execute strategy scales well with model capability

## Seed-Based Data Variation

Each task supports reproducible randomized episodes via the `seed` parameter:

```
# Deterministic (original data):
POST /reset {"task_id": "customer_contacts"}

# Randomized variant (same issue types, different corrupted rows):
POST /reset {"task_id": "customer_contacts", "seed": 42}
```

This enables RL training with diverse episodes: the agent must learn data cleaning skills, not memorize fixed answers.
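A training loop can simply reset with a fresh seed per episode. A sketch (the rollout itself is elided):

```python
import requests

ENV_URL = "http://localhost:8000"

for seed in range(1000):
    # Same issue types each episode, but different corrupted rows and values.
    obs = requests.post(
        f"{ENV_URL}/reset",
        json={"task_id": "customer_contacts", "seed": seed},
    ).json()
    # ... roll out the policy against this variant ...
```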

## Training with TRL (GRPO)

The environment integrates with TRL's `GRPOTrainer` via the `DataCleanToolEnv` class in `train.py`:

```bash
# Start the server
uvicorn server.app:app --host 0.0.0.0 --port 8000

# Run training
python train.py --model "Qwen/Qwen3-0.6B"
```

The tool environment exposes `inspect()`, `fix()`, `delete()`, and `submit()` as individual methods with docstrings that TRL auto-discovers for function calling.
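The real class lives in `train.py`; a minimal sketch of the shape it could take, assuming it proxies commands to the HTTP API, is:

```python
import requests

class DataCleanToolEnv:
    """Sketch only: tool methods whose docstrings TRL can auto-discover."""

    def __init__(self, env_url: str = "http://localhost:8000"):
        self.env_url = env_url

    def _step(self, command: str) -> str:
        resp = requests.post(
            f"{self.env_url}/step", json={"action": {"command": command}}
        )
        return resp.json().get("feedback", "")

    def inspect(self, column_name: str) -> str:
        """View column statistics, sample values, and issue hints."""
        return self._step(f'inspect("{column_name}")')

    def fix(self, row: int, column: str, value: str) -> str:
        """Correct a specific cell value."""
        return self._step(f'fix({row}, "{column}", "{value}")')

    def delete(self, row: int) -> str:
        """Remove a duplicate or invalid row."""
        return self._step(f"delete({row})")

    def submit(self) -> str:
        """Finalize work and receive the final score."""
        return self._step("submit()")
```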

## Benchmarking

Evaluate any model across all tasks:

```bash
# Single evaluation
python eval.py --model "meta-llama/Llama-3.1-8B-Instruct"

# Multi-seed evaluation (measures variance)
python eval.py --seeds 5 --json

# Specific tasks only
python eval.py --tasks customer_contacts sales_records
```

## Architecture

```
┌──────────────────────────────────────────────────┐
│                  DataCleanEnv                    │
├──────────┬──────────┬───────────┬────────────────┤
│ /reset   │ /step    │ /ws       │ /web/          │
│ /state   │ /health  │ /mcp      │ /docs          │
├──────────┴──────────┴───────────┴────────────────┤
│  server/environment.py - State Machine           │
│ ┌──────────┐  ┌──────────────┐  ┌──────────────┐ │
│ │ tasks.py │  │ graders.py   │  │ action_parse │ │
│ │ 4 tasks  │  │ 12 validators│  │ robust parse │ │
│ │ + seeds  │  │              │  │              │ │
│ └──────────┘  └──────────────┘  └──────────────┘ │
├──────────────────────────────────────────────────┤
│  inference.py - Plan-Then-Execute Agent          │
│  train.py     - TRL GRPO Training Pipeline       │
│  eval.py      - Model Benchmarking               │
└──────────────────────────────────────────────────┘
```

## Technical Details

- Framework: OpenEnv (`openenv-core` 0.2.3)
- Server: FastAPI + Uvicorn
- Data storage: in-memory Python dicts (no database required)
- Runtime: under 20 minutes of inference on 2 vCPU / 8 GB RAM
- Python: 3.10+