---
title: DataCleanEnv
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
tags:
  - openenv
  - rl-environment
  - data-cleaning
  - evaluation
  - trl
---

# DataCleanEnv: Data Quality Analysis & Cleaning Environment

A real-world OpenEnv environment where AI agents learn to identify and fix data quality issues through iterative inspection, correction, and validation.

## Motivation

Data cleaning is one of the most common and time-consuming tasks in data engineering. Analysts reportedly spend up to 80% of their time cleaning data before analysis. This environment trains and evaluates LLM agents on their ability to detect and fix real-world data quality problems: invalid formats, missing values, duplicates, outliers, referential integrity violations, and more.

## Action Space

Agents interact via string commands:

| Command | Description |
|---------|-------------|
| `inspect("column_name")` | View column statistics, sample values, and issue hints |
| `fix(row, "column", "value")` | Correct a specific cell value |
| `delete(row)` | Remove a duplicate or invalid row |
| `submit()` | Finalize work and receive the final score |
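For example, a short episode might issue the following command strings (row indices and values are illustrative):

```python
# Illustrative command strings; rows and values are made up.
actions = [
    'inspect("email")',                         # gather column stats and issue hints
    'fix(3, "email", "jane.doe@example.com")',  # correct an invalid email in row 3
    'delete(7)',                                # remove a duplicate row
    'submit()',                                 # finalize and receive the final score
]
```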

## Observation Space

Each observation includes:

| Field | Type | Description |
|-------|------|-------------|
| `task_id` | `str` | Active task identifier |
| `task_description` | `str` | What the data represents and its quality rules |
| `difficulty` | `str` | `"easy"`, `"medium"`, or `"hard"` |
| `data_preview` | `str` | Current dataset as a formatted text table |
| `column_info` | `str` | Column names, types, and descriptions |
| `feedback` | `str` | Result of the last action |
| `actions_remaining` | `int` | Steps left before auto-submit |
| `issues_fixed` | `int` | Count of resolved issues |
| `total_issues` | `int` | Total known issues in the dataset |
| `current_score` | `float` | Running score (0.0–1.0) |
| `action_history` | `list` | Last 10 commands executed |
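An observation therefore looks roughly like the following. The field set matches the table above; all values are illustrative:

```python
observation = {
    "task_id": "customer_contacts",
    "task_description": "Customer contact list; emails must be valid, dates YYYY-MM-DD, ...",
    "difficulty": "easy",
    "data_preview": "row | name | email | phone | city | signup_date\n...",
    "column_info": "name (str), email (str), phone (str), city (str), signup_date (str)",
    "feedback": "Fixed email in row 3.",
    "actions_remaining": 11,
    "issues_fixed": 2,
    "total_issues": 6,
    "current_score": 0.33,
    "action_history": ['inspect("email")', 'fix(3, "email", "jane.doe@example.com")'],
}
```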

## Tasks

### Easy: Customer Contacts

- 15 rows, 5 columns: `name`, `email`, `phone`, `city`, `signup_date`
- 6 issues: invalid emails, phone numbers with letters, empty city, wrong date format, duplicate row
- 15 max steps

### Medium: Sales Records

- 30 rows, 7 columns: `order_id`, `customer_name`, `product`, `quantity`, `unit_price`, `order_date`, `region`
- 12 issues: mixed date formats, negative quantities/prices, price outliers, inconsistent region names, duplicates, missing IDs, excess whitespace
- 25 max steps

### Hard: Employee Records

- 40 rows, 9 columns: `emp_id`, `name`, `email`, `department`, `hire_date`, `termination_date`, `salary`, `manager_id`, `performance_score`
- 18 issues: referential integrity violations (`manager_id`), temporal inconsistencies (termination before hire), salary outliers, invalid performance scores, department name inconsistencies, semantic duplicates, invalid dates, excess whitespace
- 35 max steps

## Reward Design

- Each correctly fixed issue: `+1/total_issues`
- Damaging good data (fixing a cell that had no issue): `-0.05`
- Deleting a non-duplicate row: `-0.05`
- Inspect actions: no reward change (information gathering)
- Final score clamped to `[0.0, 1.0]`
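As a sketch of the arithmetic (not the server's actual implementation in `server/environment.py`):

```python
def reward_delta(action: str, resolved_real_issue: bool, total_issues: int) -> float:
    """Sketch of the per-step reward rules listed above."""
    if action == "inspect":
        return 0.0                     # information gathering is free
    if action in ("fix", "delete"):
        if resolved_real_issue:
            return 1.0 / total_issues  # each real issue is worth an equal share
        return -0.05                   # damaged good data or deleted a valid row
    return 0.0                         # submit() just finalizes the episode

def final_score(raw_score: float) -> float:
    return max(0.0, min(1.0, raw_score))  # clamp to [0.0, 1.0]
```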

Grading is validation-based (not exact match):

- Emails validated by regex pattern
- Dates checked for `YYYY-MM-DD` format and validity
- Numbers checked against allowed ranges
- Canonical values checked against defined sets
- Referential integrity checked against existing IDs
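For instance, the email and date checks could look like this (the actual patterns live in `graders.py` and may differ):

```python
import re
from datetime import datetime

# Illustrative email pattern; the real regex in graders.py may be stricter.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[A-Za-z]{2,}$")

def is_valid_email(value: str) -> bool:
    return bool(EMAIL_RE.match(value))

def is_valid_date(value: str) -> bool:
    """Accept only real calendar dates in YYYY-MM-DD format."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False
```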

## Setup

### Local Development

```bash
# Install dependencies
pip install openenv-core fastapi uvicorn requests openai

# Run the server
uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
```

### Docker

```bash
docker build -t data-clean-env:latest .
docker run -d -p 8000:8000 data-clean-env:latest
```

### Running Inference

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="your-model-name"
export HF_TOKEN="your-token"
export ENV_URL="http://localhost:8000"

python inference.py
```

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/reset` | POST | Start a new episode: `{"task_id": "customer_contacts"}` |
| `/step` | POST | Execute an action: `{"action": {"command": "inspect(\"email\")"}}` |
| `/state` | GET | Get the current environment state |
| `/docs` | GET | OpenAPI documentation |
| `/web/` | GET | Interactive Gradio web UI |
| `/ws` | WS | WebSocket for stateful agent sessions |
| `/mcp` | POST/WS | MCP tool support for compatible agents |
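A minimal client round-trip with `requests`, assuming the JSON response mirrors the observation fields listed above (the exact schema may nest them differently):

```python
import requests

ENV_URL = "http://localhost:8000"

# Start a new episode on the easy task.
obs = requests.post(f"{ENV_URL}/reset", json={"task_id": "customer_contacts"}).json()

# Inspect a column, then submit to end the episode.
for command in ['inspect("email")', 'submit()']:
    obs = requests.post(
        f"{ENV_URL}/step",
        json={"action": {"command": command}},
    ).json()
    print(obs.get("feedback"), obs.get("current_score"))
```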

## Benchmark Results

Tested with the plan-then-execute inference strategy across four models:

| Model | Easy | Medium | Hard | Expert | Average |
|-------|------|--------|------|--------|---------|
| Llama-3.3-70B-Instruct | 1.00 | 1.00 | 0.73 | 0.75 | 0.87 |
| Qwen2.5-72B-Instruct | 0.78 | 1.00 | 0.52 | 0.82 | 0.78 |
| DeepSeek-V3 | 1.00 | 0.87 | 0.33 | 0.00 | 0.55 |
| Llama-3.1-8B-Instruct | 0.73 | 0.00 | 0.00 | 0.00 | 0.18 |

Key findings:

- 70B+ models achieve near-perfect scores on easy/medium tasks
- Hard/expert tasks require strong multi-column reasoning
- The plan-then-execute strategy scales well with model capability

## Seed-Based Data Variation

Each task supports reproducible randomized episodes via the `seed` parameter:

```
# Deterministic (original data):
POST /reset {"task_id": "customer_contacts"}

# Randomized variant (same issue types, different corrupted rows):
POST /reset {"task_id": "customer_contacts", "seed": 42}
```

This enables RL training with diverse episodes: the agent must learn data cleaning skills, not memorize fixed answers.
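A training loop can simply reset with a fresh seed per episode. A sketch (the rollout itself is elided):

```python
import requests

ENV_URL = "http://localhost:8000"

for seed in range(1000):
    # Same issue types each episode, but different corrupted rows and values.
    obs = requests.post(
        f"{ENV_URL}/reset",
        json={"task_id": "customer_contacts", "seed": seed},
    ).json()
    # ... roll out the policy against this variant ...
```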

## Training with TRL (GRPO)

The environment integrates with TRL's `GRPOTrainer` via the `DataCleanToolEnv` class in `train.py`:

```bash
# Start the server
uvicorn server.app:app --host 0.0.0.0 --port 8000

# Run training
python train.py --model "Qwen/Qwen3-0.6B"
```

The tool environment exposes `inspect()`, `fix()`, `delete()`, and `submit()` as individual methods with docstrings that TRL auto-discovers for function calling.
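The real class lives in `train.py`; a minimal sketch of the shape it could take, assuming it proxies commands to the HTTP API, is:

```python
import requests

class DataCleanToolEnv:
    """Sketch only: tool methods whose docstrings TRL can auto-discover."""

    def __init__(self, env_url: str = "http://localhost:8000"):
        self.env_url = env_url

    def _step(self, command: str) -> str:
        resp = requests.post(
            f"{self.env_url}/step", json={"action": {"command": command}}
        )
        return resp.json().get("feedback", "")

    def inspect(self, column_name: str) -> str:
        """View column statistics, sample values, and issue hints."""
        return self._step(f'inspect("{column_name}")')

    def fix(self, row: int, column: str, value: str) -> str:
        """Correct a specific cell value."""
        return self._step(f'fix({row}, "{column}", "{value}")')

    def delete(self, row: int) -> str:
        """Remove a duplicate or invalid row."""
        return self._step(f"delete({row})")

    def submit(self) -> str:
        """Finalize work and receive the final score."""
        return self._step("submit()")
```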

## Benchmarking

Evaluate any model across all tasks:

```bash
# Single evaluation
python eval.py --model "meta-llama/Llama-3.1-8B-Instruct"

# Multi-seed evaluation (measures variance)
python eval.py --seeds 5 --json

# Specific tasks only
python eval.py --tasks customer_contacts sales_records
```

## Architecture

```
┌──────────────────────────────────────────────────┐
│                  DataCleanEnv                    │
├──────────┬──────────┬───────────┬────────────────┤
│ /reset   │ /step    │ /ws       │ /web/          │
│ /state   │ /health  │ /mcp      │ /docs          │
├──────────┴──────────┴───────────┴────────────────┤
│  server/environment.py - State Machine           │
│ ┌──────────┐  ┌──────────────┐  ┌──────────────┐ │
│ │ tasks.py │  │ graders.py   │  │ action_parse │ │
│ │ 4 tasks  │  │ 12 validators│  │ robust parse │ │
│ │ + seeds  │  │              │  │              │ │
│ └──────────┘  └──────────────┘  └──────────────┘ │
├──────────────────────────────────────────────────┤
│  inference.py - Plan-Then-Execute Agent          │
│  train.py     - TRL GRPO Training Pipeline       │
│  eval.py      - Model Benchmarking               │
└──────────────────────────────────────────────────┘
```

## Technical Details

- Framework: OpenEnv (`openenv-core` 0.2.3)
- Server: FastAPI + Uvicorn
- Data storage: in-memory Python dicts (no database required)
- Runtime: under 20 minutes of inference on 2 vCPU / 8 GB RAM
- Python: 3.10+