---
title: DataCleanEnv
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
tags:
  - openenv
  - rl-environment
  - data-cleaning
  - evaluation
  - trl
---

# DataCleanEnv — Data Quality Analysis & Cleaning Environment

A real-world OpenEnv environment where AI agents learn to identify and fix data quality issues through iterative inspection, correction, and validation.

## Motivation

Data cleaning is one of the most common and time-consuming tasks in data engineering; analysts reportedly spend up to 80% of their time cleaning data before analysis. This environment trains and evaluates LLM agents on their ability to detect and fix real-world data quality problems: invalid formats, missing values, duplicates, outliers, referential integrity violations, and more.

## Action Space

Agents interact via string commands:

| Command | Description |
|---------|-------------|
| `inspect("column_name")` | View column statistics, sample values, and issue hints |
| `fix(row, "column", "value")` | Correct a specific cell value |
| `delete(row)` | Remove a duplicate or invalid row |
| `submit()` | Finalize work and receive the final score |

## Observation Space

Each observation includes:

| Field | Type | Description |
|-------|------|-------------|
| `task_id` | str | Active task identifier |
| `task_description` | str | What the data represents and its quality rules |
| `difficulty` | str | "easy", "medium", or "hard" |
| `data_preview` | str | Current dataset as a formatted text table |
| `column_info` | str | Column names, types, and descriptions |
| `feedback` | str | Result of the last action |
| `actions_remaining` | int | Steps left before auto-submit |
| `issues_fixed` | int | Count of resolved issues |
| `total_issues` | int | Total known issues in the dataset |
| `current_score` | float | Running score (0.0–1.0) |
| `action_history` | list | Last 10 commands executed |

## Tasks

### Easy: Customer Contacts

- **15 rows, 5 columns**: name, email, phone, city, signup_date
- **6 issues**: invalid emails, phone with letters, empty city, wrong date format, duplicate row
- **15 max steps**

### Medium: Sales Records

- **30 rows, 7 columns**: order_id, customer_name, product, quantity, unit_price, order_date, region
- **12 issues**: mixed date formats, negative quantities/prices, price outliers, inconsistent region names, duplicates, missing IDs, excess whitespace
- **25 max steps**

### Hard: Employee Records

- **40 rows, 9 columns**: emp_id, name, email, department, hire_date, termination_date, salary, manager_id, performance_score
- **18 issues**: referential integrity violations (manager_id), temporal inconsistencies (termination before hire), salary outliers, invalid performance scores, department name inconsistencies, semantic duplicates, invalid dates, excess whitespace
- **35 max steps**

## Reward Design

- Each correctly fixed issue: **+1/total_issues**
- Damaging good data (fixing a cell that had no issue): **-0.05**
- Deleting a non-duplicate row: **-0.05**
- Inspect actions: no reward change (information gathering)
- Final score clamped to **[0.0, 1.0]**

Grading is **validation-based** (not exact match):

- Emails validated by regex pattern
- Dates checked for YYYY-MM-DD format and validity
- Numbers checked against allowed ranges
- Canonical values checked against defined sets
- Referential integrity checked against existing IDs

## Setup

### Local Development

```bash
# Install dependencies
pip install openenv-core fastapi uvicorn requests openai

# Run the server
uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
```

### Docker

```bash
docker build -t data-clean-env:latest .
docker run -d -p 8000:8000 data-clean-env:latest
```

### Running Inference

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="your-model-name"
export HF_TOKEN="your-token"
export ENV_URL="http://localhost:8000"

python inference.py
```

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/reset` | POST | Start new episode: `{"task_id": "customer_contacts"}` |
| `/step` | POST | Execute action: `{"action": {"command": "inspect(\"email\")"}}` |
| `/state` | GET | Get current environment state |
| `/docs` | GET | OpenAPI documentation |
| `/web/` | GET | Interactive Gradio web UI |
| `/ws` | WS | WebSocket for stateful agent sessions |
| `/mcp` | POST/WS | MCP tool support for compatible agents |

## Benchmark Results

Tested with a plan-then-execute inference strategy across 4 models:

| Model | Easy | Medium | Hard | Expert | Average |
|-------|------|--------|------|--------|---------|
| Llama-3.3-70B-Instruct | **1.00** | **1.00** | **0.73** | **0.75** | **0.87** |
| Qwen2.5-72B-Instruct | 0.78 | 1.00 | 0.52 | 0.82 | 0.78 |
| DeepSeek-V3 | 1.00 | 0.87 | 0.33 | 0.00 | 0.55 |
| Llama-3.1-8B-Instruct | 0.73 | 0.00 | 0.00 | 0.00 | 0.18 |

Key findings:

- 70B+ models achieve near-perfect scores on easy/medium tasks
- Hard/expert tasks require strong multi-column reasoning
- The plan-then-execute strategy scales well with model capability

## Seed-Based Data Variation

Each task supports reproducible randomized episodes via the `seed` parameter:

```
# Deterministic (original data):
POST /reset {"task_id": "customer_contacts"}

# Randomized variant (same issue types, different corrupted rows):
POST /reset {"task_id": "customer_contacts", "seed": 42}
```

This enables RL training with diverse episodes: the agent must learn data cleaning *skills*, not memorize fixed answers.
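As a quick orientation, a minimal episode loop over the `/reset` and `/step` endpoints can be sketched with only the standard library. This is an illustrative sketch, not the project's client: the helper names (`step_payload`, `post`, `run_episode`) are our own, and treating the response as a flat dict of the observation fields listed above is an assumption about the exact response shape.

```python
import json
import urllib.request

ENV_URL = "http://localhost:8000"  # matches the ENV_URL used by inference.py


def step_payload(command: str) -> dict:
    """Wrap a string command in the body shape POST /step expects."""
    return {"action": {"command": command}}


def post(path: str, body: dict) -> dict:
    """POST a JSON body to the environment and decode the JSON reply."""
    req = urllib.request.Request(
        ENV_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_episode() -> None:
    """Seeded episode: inspect one column, then submit."""
    post("/reset", {"task_id": "customer_contacts", "seed": 42})
    obs = post("/step", step_payload('inspect("email")'))
    print(obs.get("feedback"), obs.get("actions_remaining"))
    obs = post("/step", step_payload("submit()"))
    print("final score:", obs.get("current_score"))


# run_episode()  # requires the server from the Setup section to be running
```

The same `{"action": {"command": ...}}` shape applies to `fix(...)`, `delete(...)`, and `submit()`; only the command string changes.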
## Training with TRL (GRPO)

The environment integrates with TRL's `GRPOTrainer` via the `DataCleanToolEnv` class in `train.py`:

```bash
# Start the server
uvicorn server.app:app --host 0.0.0.0 --port 8000

# Run training
python train.py --model "Qwen/Qwen3-0.6B"
```

The tool environment exposes `inspect()`, `fix()`, `delete()`, and `submit()` as individual methods with docstrings that TRL auto-discovers for function calling.

## Benchmarking

Evaluate any model across all tasks:

```bash
# Single evaluation
python eval.py --model "meta-llama/Llama-3.1-8B-Instruct"

# Multi-seed evaluation (measures variance)
python eval.py --seeds 5 --json

# Specific tasks only
python eval.py --tasks customer_contacts sales_records
```

## Architecture

```
┌─────────────────────────────────────────────────┐
│                  DataCleanEnv                   │
├──────────┬──────────┬───────────┬───────────────┤
│ /reset   │ /step    │ /ws       │ /web/         │
│ /state   │ /health  │ /mcp      │ /docs         │
├──────────┴──────────┴───────────┴───────────────┤
│ server/environment.py — State Machine           │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐  │
│ │ tasks.py │ │ graders.py   │ │ action_parse │  │
│ │ 4 tasks  │ │ 12 validators│ │ robust parse │  │
│ │ + seeds  │ │              │ │              │  │
│ └──────────┘ └──────────────┘ └──────────────┘  │
├─────────────────────────────────────────────────┤
│ inference.py — Plan-Then-Execute Agent          │
│ train.py    — TRL GRPO Training Pipeline        │
│ eval.py     — Model Benchmarking                │
└─────────────────────────────────────────────────┘
```

## Technical Details

- **Framework**: OpenEnv (openenv-core 0.2.3)
- **Server**: FastAPI + Uvicorn
- **Data storage**: In-memory Python dicts (no database required)
- **Runtime**: < 20 min inference on 2 vCPU / 8GB RAM
- **Python**: 3.10+
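To make the reward design concrete, here is a rough sketch of validation-based grading and score clamping. The actual validators live in `graders.py` and are not shown here; the regex, the function names, and the single `penalties` counter are assumptions for illustration. Only the +1/total_issues credit, the -0.05 penalty, and the [0.0, 1.0] clamp come from the reward design above.

```python
import re
from datetime import date

# Illustrative email pattern (assumption; the real grader's regex may differ).
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")


def valid_email(value: str) -> bool:
    """Validation-based check: any well-formed email passes, not one exact string."""
    return EMAIL_RE.match(value) is not None


def valid_date(value: str) -> bool:
    """Accept only real calendar dates written as YYYY-MM-DD."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return len(value) == 10  # reject short forms like "2024-1-5"


def final_score(issues_fixed: int, total_issues: int, penalties: int) -> float:
    """+1/total_issues per fixed issue, -0.05 per harmful edit, clamped to [0, 1]."""
    raw = issues_fixed / total_issues - 0.05 * penalties
    return max(0.0, min(1.0, raw))
```

For example, fixing all 6 issues in the easy task with no harmful edits yields 1.0, while a run that damages good data can never drop below 0.0 thanks to the clamp.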