---
title: DataCleanEnv
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
tags:
  - openenv
  - rl-environment
  - data-cleaning
  - evaluation
  - trl
---

# DataCleanEnv — Data Quality Analysis & Cleaning Environment

A real-world OpenEnv environment where AI agents learn to identify and fix data quality issues through iterative inspection, correction, and validation.

## Motivation

Data cleaning is one of the most common and time-consuming tasks in data engineering; analysts reportedly spend up to 80% of their time cleaning data before analysis. This environment trains and evaluates LLM agents on their ability to detect and fix real-world data quality problems: invalid formats, missing values, duplicates, outliers, referential integrity violations, and more.

## Action Space

Agents interact via string commands:

| Command | Description |
|---------|-------------|
| `inspect("column_name")` | View column statistics, sample values, and issue hints |
| `fix(row, "column", "value")` | Correct a specific cell value |
| `delete(row)` | Remove a duplicate or invalid row |
| `submit()` | Finalize work and receive the final score |

## Observation Space

Each observation includes:

| Field | Type | Description |
|-------|------|-------------|
| `task_id` | str | Active task identifier |
| `task_description` | str | What the data represents and its quality rules |
| `difficulty` | str | "easy", "medium", or "hard" |
| `data_preview` | str | Current dataset as a formatted text table |
| `column_info` | str | Column names, types, and descriptions |
| `feedback` | str | Result of the last action |
| `actions_remaining` | int | Steps left before auto-submit |
| `issues_fixed` | int | Count of resolved issues |
| `total_issues` | int | Total known issues in the dataset |
| `current_score` | float | Running score (0.0–1.0) |
| `action_history` | list | Last 10 commands executed |

## Tasks

### Easy: Customer Contacts

- **15 rows, 5 columns**: name, email, phone, city, signup_date
- **6 issues**: invalid emails, phone with letters, empty city, wrong date format, duplicate row
- **15 max steps**

### Medium: Sales Records

- **30 rows, 7 columns**: order_id, customer_name, product, quantity, unit_price, order_date, region
- **12 issues**: mixed date formats, negative quantities/prices, price outliers, inconsistent region names, duplicates, missing IDs, excess whitespace
- **25 max steps**

### Hard: Employee Records

- **40 rows, 9 columns**: emp_id, name, email, department, hire_date, termination_date, salary, manager_id, performance_score
- **18 issues**: referential integrity violations (manager_id), temporal inconsistencies (termination before hire), salary outliers, invalid performance scores, department name inconsistencies, semantic duplicates, invalid dates, excess whitespace
- **35 max steps**

## Reward Design

- Each correctly fixed issue: **+1/total_issues**
- Damaging good data (fixing a cell that had no issue): **-0.05**
- Deleting a non-duplicate row: **-0.05**
- Inspect actions: no reward change (information gathering)
- Final score clamped to **[0.0, 1.0]**

Grading is **validation-based** (not exact match):

- Emails validated by regex pattern
- Dates checked for YYYY-MM-DD format and validity
- Numbers checked against allowed ranges
- Canonical values checked against defined sets
- Referential integrity checked against existing IDs

## Setup

### Local Development

```bash
# Install dependencies
pip install openenv-core fastapi uvicorn requests openai

# Run the server
uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
```

### Docker

```bash
docker build -t data-clean-env:latest .
docker run -d -p 8000:8000 data-clean-env:latest
```

### Running Inference

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="your-model-name"
export HF_TOKEN="your-token"
export ENV_URL="http://localhost:8000"

python inference.py
```

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/reset` | POST | Start new episode: `{"task_id": "customer_contacts"}` |
| `/step` | POST | Execute action: `{"action": {"command": "inspect(\"email\")"}}` |
| `/state` | GET | Get current environment state |
| `/docs` | GET | OpenAPI documentation |
| `/web/` | GET | Interactive Gradio web UI |
| `/ws` | WS | WebSocket for stateful agent sessions |
| `/mcp` | POST/WS | MCP tool support for compatible agents |

## Benchmark Results

Tested with a plan-then-execute inference strategy across 4 models:

| Model | Easy | Medium | Hard | Expert | Average |
|-------|------|--------|------|--------|---------|
| Llama-3.3-70B-Instruct | **1.00** | **1.00** | **0.73** | **0.75** | **0.87** |
| Qwen2.5-72B-Instruct | 0.78 | 1.00 | 0.52 | 0.82 | 0.78 |
| DeepSeek-V3 | 1.00 | 0.87 | 0.33 | 0.00 | 0.55 |
| Llama-3.1-8B-Instruct | 0.73 | 0.00 | 0.00 | 0.00 | 0.18 |

Key findings:

- 70B+ models achieve near-perfect scores on easy/medium tasks
- Hard/expert tasks require strong multi-column reasoning
- The plan-then-execute strategy scales well with model capability

## Seed-Based Data Variation

Each task supports reproducible randomized episodes via the `seed` parameter:

```
# Deterministic (original data):
POST /reset {"task_id": "customer_contacts"}

# Randomized variant (same issue types, different corrupted rows):
POST /reset {"task_id": "customer_contacts", "seed": 42}
```

This enables RL training with diverse episodes: the agent must learn data cleaning *skills*, not memorize fixed answers.
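As a quick orientation, a minimal episode loop over the `/reset` and `/step` endpoints can be sketched with only the standard library. This is an illustrative sketch, not the project's client: the helper names (`step_payload`, `post`, `run_episode`) are our own, and treating the response as a flat dict of the observation fields listed above is an assumption about the exact response shape.

```python
import json
import urllib.request

ENV_URL = "http://localhost:8000"  # matches the ENV_URL used by inference.py


def step_payload(command: str) -> dict:
    """Wrap a string command in the body shape POST /step expects."""
    return {"action": {"command": command}}


def post(path: str, body: dict) -> dict:
    """POST a JSON body to the environment and decode the JSON reply."""
    req = urllib.request.Request(
        ENV_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_episode() -> None:
    """Seeded episode: inspect one column, then submit."""
    post("/reset", {"task_id": "customer_contacts", "seed": 42})
    obs = post("/step", step_payload('inspect("email")'))
    print(obs.get("feedback"), obs.get("actions_remaining"))
    obs = post("/step", step_payload("submit()"))
    print("final score:", obs.get("current_score"))


# run_episode()  # requires the server from the Setup section to be running
```

The same `{"action": {"command": ...}}` shape applies to `fix(...)`, `delete(...)`, and `submit()`; only the command string changes.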
## Training with TRL (GRPO)

The environment integrates with TRL's `GRPOTrainer` via the `DataCleanToolEnv` class in `train.py`:

```bash
# Start the server
uvicorn server.app:app --host 0.0.0.0 --port 8000

# Run training
python train.py --model "Qwen/Qwen3-0.6B"
```

The tool environment exposes `inspect()`, `fix()`, `delete()`, and `submit()` as individual methods with docstrings that TRL auto-discovers for function calling.

## Benchmarking

Evaluate any model across all tasks:

```bash
# Single evaluation
python eval.py --model "meta-llama/Llama-3.1-8B-Instruct"

# Multi-seed evaluation (measures variance)
python eval.py --seeds 5 --json

# Specific tasks only
python eval.py --tasks customer_contacts sales_records
```

## Architecture

```
┌─────────────────────────────────────────────────┐
│                  DataCleanEnv                   │
├──────────┬──────────┬───────────┬───────────────┤
│ /reset   │ /step    │ /ws       │ /web/         │
│ /state   │ /health  │ /mcp      │ /docs         │
├──────────┴──────────┴───────────┴───────────────┤
│ server/environment.py — State Machine           │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐  │
│ │ tasks.py │ │ graders.py   │ │ action_parse │  │
│ │ 4 tasks  │ │ 12 validators│ │ robust parse │  │
│ │ + seeds  │ │              │ │              │  │
│ └──────────┘ └──────────────┘ └──────────────┘  │
├─────────────────────────────────────────────────┤
│ inference.py — Plan-Then-Execute Agent          │
│ train.py    — TRL GRPO Training Pipeline        │
│ eval.py     — Model Benchmarking                │
└─────────────────────────────────────────────────┘
```

## Technical Details

- **Framework**: OpenEnv (openenv-core 0.2.3)
- **Server**: FastAPI + Uvicorn
- **Data storage**: In-memory Python dicts (no database required)
- **Runtime**: < 20 min inference on 2 vCPU / 8GB RAM
- **Python**: 3.10+
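To make the reward design concrete, here is a rough sketch of validation-based grading and score clamping. The actual validators live in `graders.py` and are not shown here; the regex, the function names, and the single `penalties` counter are assumptions for illustration. Only the +1/total_issues credit, the -0.05 penalty, and the [0.0, 1.0] clamp come from the reward design above.

```python
import re
from datetime import date

# Illustrative email pattern (assumption; the real grader's regex may differ).
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")


def valid_email(value: str) -> bool:
    """Validation-based check: any well-formed email passes, not one exact string."""
    return EMAIL_RE.match(value) is not None


def valid_date(value: str) -> bool:
    """Accept only real calendar dates written as YYYY-MM-DD."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return len(value) == 10  # reject short forms like "2024-1-5"


def final_score(issues_fixed: int, total_issues: int, penalties: int) -> float:
    """+1/total_issues per fixed issue, -0.05 per harmful edit, clamped to [0, 1]."""
    raw = issues_fixed / total_issues - 0.05 * penalties
    return max(0.0, min(1.0, raw))
```

For example, fixing all 6 issues in the easy task with no harmful edits yields 1.0, while a run that damages good data can never drop below 0.0 thanks to the clamp.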