---
title: DataCleanEnv
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
tags:
  - openenv
  - rl-environment
  - data-cleaning
  - evaluation
  - trl
---
# DataCleanEnv – Data Quality Analysis & Cleaning Environment
A real-world OpenEnv environment where AI agents learn to identify and fix data quality issues through iterative inspection, correction, and validation.
## Motivation

Data cleaning is one of the most common and time-consuming tasks in data engineering. Analysts spend up to 80% of their time cleaning data before analysis. This environment trains and evaluates LLM agents on their ability to detect and fix real-world data quality problems: invalid formats, missing values, duplicates, outliers, referential integrity violations, and more.
## Action Space
Agents interact via string commands:
| Command | Description |
|---|---|
| `inspect("column_name")` | View column statistics, sample values, and issue hints |
| `fix(row, "column", "value")` | Correct a specific cell value |
| `delete(row)` | Remove a duplicate or invalid row |
| `submit()` | Finalize work and receive the final score |
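For instance, a short episode might emit actions like these (row indices and values are hypothetical):

```python
# Illustrative action strings; row indices and values are made up.
episode = [
    'inspect("email")',                         # gather column stats first
    'fix(3, "email", "jane.doe@example.com")',  # correct an invalid address
    'delete(7)',                                # drop a duplicate row
    'submit()',                                 # lock in the final score
]
```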
## Observation Space
Each observation includes:
| Field | Type | Description |
|---|---|---|
| `task_id` | str | Active task identifier |
| `task_description` | str | What the data represents and its quality rules |
| `difficulty` | str | `"easy"`, `"medium"`, or `"hard"` |
| `data_preview` | str | Current dataset as a formatted text table |
| `column_info` | str | Column names, types, and descriptions |
| `feedback` | str | Result of the last action |
| `actions_remaining` | int | Steps left before auto-submit |
| `issues_fixed` | int | Count of resolved issues |
| `total_issues` | int | Total known issues in the dataset |
| `current_score` | float | Running score (0.0 to 1.0) |
| `action_history` | list | Last 10 commands executed |
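Put together, a single observation might look like this (all field values below are illustrative, not actual server output):

```python
# Illustrative observation; values are made up for readability.
observation = {
    "task_id": "customer_contacts",
    "task_description": "Customer contact list; emails must be valid, dates YYYY-MM-DD.",
    "difficulty": "easy",
    "data_preview": "| row | name | email | phone | city | signup_date |\n| ... |",
    "column_info": "name (str), email (str), phone (str), city (str), signup_date (str)",
    "feedback": "Fixed row 3, column 'email'.",
    "actions_remaining": 11,
    "issues_fixed": 2,
    "total_issues": 6,
    "current_score": 0.33,
    "action_history": ['inspect("email")', 'fix(3, "email", "jane.doe@example.com")'],
}
```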
## Tasks

### Easy: Customer Contacts
- 15 rows, 5 columns: name, email, phone, city, signup_date
- 6 issues: invalid emails, phone with letters, empty city, wrong date format, duplicate row
- 15 max steps

### Medium: Sales Records
- 30 rows, 7 columns: order_id, customer_name, product, quantity, unit_price, order_date, region
- 12 issues: mixed date formats, negative quantities/prices, price outliers, inconsistent region names, duplicates, missing IDs, excess whitespace
- 25 max steps

### Hard: Employee Records
- 40 rows, 9 columns: emp_id, name, email, department, hire_date, termination_date, salary, manager_id, performance_score
- 18 issues: referential integrity violations (manager_id), temporal inconsistencies (termination before hire), salary outliers, invalid performance scores, department name inconsistencies, semantic duplicates, invalid dates, excess whitespace
- 35 max steps
## Reward Design

- Each correctly fixed issue: `+1/total_issues`
- Damaging good data (fixing a cell that had no issue): `-0.05`
- Deleting a non-duplicate row: `-0.05`
- Inspect actions: no reward change (information gathering)
- Final score clamped to `[0.0, 1.0]` (see the sketch below)
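As a minimal sketch, the scoring rule above works out to (not the server's actual implementation):

```python
def episode_score(correct_fixes: int, bad_fixes: int, bad_deletes: int,
                  total_issues: int) -> float:
    """Score an episode according to the reward rules above (sketch)."""
    raw = correct_fixes / total_issues - 0.05 * (bad_fixes + bad_deletes)
    return max(0.0, min(1.0, raw))  # clamp to [0.0, 1.0]
```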
Grading is validation-based (not exact match):
- Emails validated by regex pattern
- Dates checked for YYYY-MM-DD format and validity
- Numbers checked against allowed ranges
- Canonical values checked against defined sets
- Referential integrity checked against existing IDs
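Checks in this style might look like the following (illustrative regex and format strings, not the exact patterns in graders.py):

```python
import re
from datetime import datetime

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[A-Za-z]{2,}$")  # illustrative pattern

def valid_email(value: str) -> bool:
    """Validate an email against a simple regex (illustrative, not graders.py's)."""
    return bool(EMAIL_RE.fullmatch(value.strip()))

def valid_date(value: str) -> bool:
    """Accept only real calendar dates in YYYY-MM-DD."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False
```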
## Setup

### Local Development

```bash
# Install dependencies
pip install openenv-core fastapi uvicorn requests openai

# Run the server
uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
```
### Docker

```bash
docker build -t data-clean-env:latest .
docker run -d -p 8000:8000 data-clean-env:latest
```
## Running Inference

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="your-model-name"
export HF_TOKEN="your-token"
export ENV_URL="http://localhost:8000"

python inference.py
```
## API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/reset` | POST | Start a new episode: `{"task_id": "customer_contacts"}` |
| `/step` | POST | Execute an action: `{"action": {"command": "inspect(\"email\")"}}` |
| `/state` | GET | Get current environment state |
| `/docs` | GET | OpenAPI documentation |
| `/web/` | GET | Interactive Gradio web UI |
| `/ws` | WS | WebSocket for stateful agent sessions |
| `/mcp` | POST/WS | MCP tool support for compatible agents |
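A minimal client loop against these endpoints might look like this (a sketch using the `requests` library; the exact response shape may differ from what is shown):

```python
import requests

ENV_URL = "http://localhost:8000"

# Start an episode on the easy task.
obs = requests.post(f"{ENV_URL}/reset",
                    json={"task_id": "customer_contacts"}).json()

# Take one action and read the resulting feedback.
obs = requests.post(f"{ENV_URL}/step",
                    json={"action": {"command": 'inspect("email")'}}).json()
print(obs.get("feedback"))
```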
## Benchmark Results
Tested with a plan-then-execute inference strategy across four models:
| Model | Easy | Medium | Hard | Expert | Average |
|---|---|---|---|---|---|
| Llama-3.3-70B-Instruct | 1.00 | 1.00 | 0.73 | 0.75 | 0.87 |
| Qwen2.5-72B-Instruct | 0.78 | 1.00 | 0.52 | 0.82 | 0.78 |
| DeepSeek-V3 | 1.00 | 0.87 | 0.33 | 0.00 | 0.55 |
| Llama-3.1-8B-Instruct | 0.73 | 0.00 | 0.00 | 0.00 | 0.18 |
Key findings:
- 70B+ models achieve near-perfect scores on easy/medium tasks
- Hard/expert tasks require strong multi-column reasoning
- Plan-then-execute strategy scales well with model capability
## Seed-Based Data Variation
Each task supports reproducible randomized episodes via the `seed` parameter:

```
# Deterministic (original data):
POST /reset {"task_id": "customer_contacts"}

# Randomized variant (same issue types, different corrupted rows):
POST /reset {"task_id": "customer_contacts", "seed": 42}
```
This enables RL training with diverse episodes: the agent must learn data cleaning skills, not memorize fixed answers.
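In a training loop, one way to exploit this (a sketch; `ENV_URL` and the payload shape follow the examples above) is to draw a fresh seed per episode:

```python
import random
import requests

ENV_URL = "http://localhost:8000"

def new_episode(task_id: str) -> dict:
    """Reset with a random seed so each episode sees a different corruption pattern."""
    seed = random.randrange(1_000_000)
    return requests.post(f"{ENV_URL}/reset",
                         json={"task_id": task_id, "seed": seed}).json()
```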
## Training with TRL (GRPO)

The environment integrates with TRL's `GRPOTrainer` via the `DataCleanToolEnv` class in `train.py`:
```bash
# Start the server
uvicorn server.app:app --host 0.0.0.0 --port 8000

# Run training
python train.py --model "Qwen/Qwen3-0.6B"
```
The tool environment exposes `inspect()`, `fix()`, `delete()`, and `submit()` as individual methods with docstrings that TRL auto-discovers for function calling.
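The shape of such a tool class might look roughly like the following sketch (the actual `DataCleanToolEnv` in `train.py` may differ):

```python
import requests

class DataCleanToolEnv:
    """Sketch of a tool wrapper whose method docstrings TRL can auto-discover."""

    def __init__(self, env_url: str = "http://localhost:8000"):
        self.env_url = env_url

    def _step(self, command: str) -> dict:
        return requests.post(f"{self.env_url}/step",
                             json={"action": {"command": command}}).json()

    def inspect(self, column: str) -> dict:
        """View statistics, sample values, and issue hints for a column."""
        return self._step(f'inspect("{column}")')

    def fix(self, row: int, column: str, value: str) -> dict:
        """Correct the value of a specific cell."""
        return self._step(f'fix({row}, "{column}", "{value}")')

    def delete(self, row: int) -> dict:
        """Remove a duplicate or invalid row."""
        return self._step(f"delete({row})")

    def submit(self) -> dict:
        """Finalize the cleaned dataset and receive the final score."""
        return self._step("submit()")
```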
## Benchmarking
Evaluate any model across all tasks:
```bash
# Single evaluation
python eval.py --model "meta-llama/Llama-3.1-8B-Instruct"

# Multi-seed evaluation (measures variance)
python eval.py --seeds 5 --json

# Specific tasks only
python eval.py --tasks customer_contacts sales_records
```
## Architecture
```
┌─────────────────────────────────────────────────┐
│                  DataCleanEnv                   │
├──────────┬──────────┬───────────┬───────────────┤
│  /reset  │  /step   │  /ws      │  /web/        │
│  /state  │  /health │  /mcp     │  /docs        │
├──────────┴──────────┴───────────┴───────────────┤
│  server/environment.py - State Machine          │
│  ┌──────────┐ ┌─────────────┐ ┌──────────────┐  │
│  │ tasks.py │ │ graders.py  │ │ action_parse │  │
│  │ 4 tasks  │ │12 validators│ │ robust parse │  │
│  │ + seeds  │ │             │ │              │  │
│  └──────────┘ └─────────────┘ └──────────────┘  │
├─────────────────────────────────────────────────┤
│  inference.py - Plan-Then-Execute Agent         │
│  train.py     - TRL GRPO Training Pipeline      │
│  eval.py      - Model Benchmarking              │
└─────────────────────────────────────────────────┘
```
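The action-parsing box above hints at tolerant command parsing; a simplified sketch of that idea (the real parser is likely more robust, e.g. to commas inside quoted values):

```python
import re

# Tolerates surrounding whitespace around the four supported commands.
PATTERN = re.compile(r"^\s*(inspect|fix|delete|submit)\s*\((.*)\)\s*$", re.DOTALL)

def parse_action(command: str):
    """Return (name, args) for a command string, or None if unparseable."""
    m = PATTERN.match(command)
    if not m:
        return None
    name, raw = m.groups()
    if not raw.strip():
        return name, []
    # Naive top-level split; commas inside quoted values would need real parsing.
    args = [a.strip().strip("\"'") for a in raw.split(",")]
    return name, args
```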
## Technical Details
- Framework: OpenEnv (openenv-core 0.2.3)
- Server: FastAPI + Uvicorn
- Data storage: In-memory Python dicts (no database required)
- Runtime: < 20 min inference on 2 vCPU / 8GB RAM
- Python: 3.10+