
ML Debug Env: Teaching AI Agents to Debug Like Engineers, Not Oracles

Built for the Meta × PyTorch × Scaler OpenEnv Hackathon, April 2026


📝 This is the submission blog post for ML Debug Env (Meta × PyTorch × Scaler OpenEnv Hackathon 2026).

The Problem

Every ML engineer has stared at a broken training script. Sometimes it crashes with an explicit error. Sometimes loss quietly explodes to NaN. Sometimes the model trains perfectly, reports 96% accuracy, and the evaluation is completely invalid.

Debugging is not about knowing the answer. It's about gathering evidence: running the code, reading tracebacks, checking gradient norms, forming a hypothesis, and then fixing the bug.

We built an environment that trains AI agents to do exactly this.


What We Built

ML Debug Env is a partially observable reinforcement learning environment where agents debug broken PyTorch training scripts. Built on OpenEnv.

The key design decision: the agent starts blind.

On reset(), the agent receives only a minimal alert, the kind of message an on-call engineer sees at 2am:

"Training job failed. Final loss: nan."

No buggy code. No traceback. No hints. Just a failure notice and a set of diagnostic tools.

The agent must then decide what to investigate using a 5-step budget:

| Tool | What it returns |
|---|---|
| run_code | Runs the buggy script, returns stdout/stderr |
| get_traceback | Returns full traceback if code crashed |
| inspect_gradients | Injects gradient norm logging, runs one batch |
| print_shapes | Injects shape hooks, returns tensor dims at each layer |
| view_source | Reveals the full buggy script (costs 1 step) |
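The inspect phase can be pictured as a small loop over named tools with a hard step cap. The sketch below is a toy stand-in, not the real ML Debug Env API: the class name `ToyDebugEpisode`, its methods, and the canned tool outputs are all hypothetical, and it assumes every tool call costs one step (the post only states this explicitly for view_source).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

STEP_BUDGET = 5  # from the post: the agent works within a 5-step budget


@dataclass
class ToyDebugEpisode:
    """Hypothetical stand-in for one episode; not the real environment API."""

    tools: Dict[str, Callable[[], str]]
    alert: str = "Training job failed. Final loss: nan."
    steps_used: int = 0
    history: List[str] = field(default_factory=list)

    def reset(self) -> str:
        # The agent starts blind: only the alert is returned, never the code.
        self.steps_used = 0
        self.history.clear()
        return self.alert

    def inspect(self, tool_name: str) -> str:
        # Assumption: each tool call consumes exactly one step of the budget.
        if self.steps_used >= STEP_BUDGET:
            raise RuntimeError("step budget exhausted")
        if tool_name not in self.tools:
            raise KeyError(f"unknown tool: {tool_name}")
        self.steps_used += 1
        self.history.append(tool_name)
        return self.tools[tool_name]()


# Canned outputs, made up purely for illustration.
episode = ToyDebugEpisode(
    tools={
        "run_code": lambda: "epoch 0 loss=nan",
        "inspect_gradients": lambda: "fc1.weight grad_norm=inf",
        "view_source": lambda: "<full buggy script>",
    }
)
alert = episode.reset()
evidence = [episode.inspect("run_code"), episode.inspect("inspect_gradients")]
print(alert)
print(evidence, f"steps used: {episode.steps_used}/{STEP_BUDGET}")
```

Keeping the budget check inside inspect() means an agent cannot overspend by accident; whether submitting the fix also counts as a step is not stated in the post, so the sketch leaves it out.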

After gathering evidence, the agent submits a fix: a complete corrected Python script. The grader actually executes the fixed code in a subprocess. No regex matching. No shortcuts. The code has to run.
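As a rough illustration of execution-based grading, the sketch below writes a submitted script to a temporary directory and runs it with the current Python interpreter. The function name `run_candidate`, the timeout value, and the returned dictionary shape are assumptions made for this example; the post does not describe the real grader's internals, and a production grader would also need proper sandboxing.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(fixed_code: str, timeout_s: float = 30.0) -> dict:
    """Execute a submitted script in a child process and report what happened.

    Hypothetical helper: the real grader's interface is not shown in the post.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(fixed_code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,  # collect stdout/stderr for later checks
                text=True,
                timeout=timeout_s,    # a hung training loop must not block grading
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return {"ran": False, "timed_out": True, "stdout": "", "stderr": ""}
        return {
            "ran": proc.returncode == 0,
            "timed_out": False,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        }


good = run_candidate("print('training finished')")
bad = run_candidate("raise ValueError('shape mismatch')")
print(good["ran"], bad["ran"])
```

Running the candidate in a separate process keeps a crash or infinite loop in the submission from taking the grader down with it.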


The Tasks

Eight tasks, ranging from Easy to Expert, cover the most common classes of real PyTorch bugs:

| Task | Difficulty | What's Broken |
|---|---|---|
| shape_mismatch | Easy | nn.Linear input dim wrong → explicit crash |
| training_collapse | Medium | Bad LR → NaN loss, or wrong loss fn → plateau |
| wrong_device | Medium | Model on GPU, data on CPU → explicit crash |
| gradient_not_zeroed | Medium-Hard | Missing zero_grad() → loss explodes silently |
| data_leakage | Hard | Normalized before split → metrics look great but invalid |
| missing_eval_mode | Hard | No model.eval() → non-deterministic metrics |
| compound_shape_device | Medium-Hard | TWO bugs: shape mismatch + device mismatch |
| compound_leakage_eval | Expert | TWO bugs: data leakage + missing eval mode |

The compound tasks require the agent to find and fix two independent bugs in a single submission. compound_leakage_eval is the hardest task: both of its bugs are silent, and neither causes a crash.
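To make one of the silent bug classes concrete, here is a minimal, framework-free sketch of the data_leakage pattern: normalization statistics computed before the train/test split versus after it. The toy data and the helper names `fit_scaler` and `transform` are illustrative assumptions; the actual task scripts are PyTorch training scripts and are not reproduced here.

```python
from statistics import mean, pstdev

# Toy, deterministic data; a real script would load and shuffle a dataset.
data = [float(i) for i in range(100)]
train, test = data[:80], data[80:]


def fit_scaler(values):
    """Return (mean, std) for standardization; std falls back to 1.0 if zero."""
    return mean(values), (pstdev(values) or 1.0)


def transform(values, mu, sigma):
    return [(v - mu) / sigma for v in values]


# Leaky version: statistics are computed on ALL rows, so held-out rows
# influence the preprocessing applied to the training data.
leaky_mu, leaky_sigma = fit_scaler(data)

# Fixed version: statistics come from the training split only and are then
# reused unchanged on the test split.
mu, sigma = fit_scaler(train)
train_n = transform(train, mu, sigma)
test_n = transform(test, mu, sigma)

print(f"leaky mean={leaky_mu}, train-only mean={mu}")
```

Nothing crashes in the leaky version, which is what makes this class of bug hard: the only evidence is that the evaluation is no longer a fair estimate.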


Scoring

Six-stage partial credit ladder:

0.01 → Wrong bug type identified
0.20 → Right type, fixed code crashes
0.40 → Code runs, training doesn't complete
0.60 → Training completes, root cause not fixed
0.80 → Root cause fixed, success signal not confirmed
0.99 → Perfect fix: code runs, training finishes, signal confirmed

Plus an efficiency multiplier: fix correctly in ≤2 steps → score ×1.2. This rewards agents that learn to inspect efficiently rather than brute-forcing view_source on every task.

An LLM judge (Groq / llama-3.3-70b) additionally scores the agent's diagnosis on root cause correctness, mechanistic explanation, and specificity, adding up to 0.15 reasoning reward on top of the execution reward.
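One way to read how these pieces combine is sketched below. The stage values, the ×1.2 multiplier, the 2-step limit, and the 0.15 cap come from the text above; everything else is an assumption made for illustration: the stage key names are hypothetical, "fix correctly" is taken to mean reaching the top stage, the reasoning reward is added after the multiplier, and no clamping is applied because the post does not mention any.

```python
# Stage scores from the six-stage ladder; the key names are hypothetical labels.
STAGE_SCORES = {
    "wrong_bug_type": 0.01,
    "right_type_code_crashes": 0.20,
    "runs_training_incomplete": 0.40,
    "training_completes_root_cause_unfixed": 0.60,
    "root_cause_fixed_signal_unconfirmed": 0.80,
    "perfect_fix": 0.99,
}
EFFICIENCY_MULTIPLIER = 1.2   # from the post
EFFICIENT_STEP_LIMIT = 2      # from the post: "in <=2 steps"
MAX_REASONING_REWARD = 0.15   # from the post


def episode_reward(stage: str, steps_used: int, reasoning_reward: float = 0.0) -> float:
    """Combine execution score, efficiency multiplier, and judge reward.

    Assumptions: the multiplier applies only to a perfect fix, the judge
    reward is additive, and the total is not clamped.
    """
    if not 0.0 <= reasoning_reward <= MAX_REASONING_REWARD:
        raise ValueError("reasoning_reward must lie in [0, 0.15]")
    base = STAGE_SCORES[stage]
    efficient = stage == "perfect_fix" and steps_used <= EFFICIENT_STEP_LIMIT
    multiplier = EFFICIENCY_MULTIPLIER if efficient else 1.0
    return base * multiplier + reasoning_reward


print(episode_reward("perfect_fix", steps_used=2))
print(episode_reward("perfect_fix", steps_used=4, reasoning_reward=0.10))
print(episode_reward("right_type_code_crashes", steps_used=1, reasoning_reward=0.15))
```

Under these assumptions an efficient perfect fix can score above 1.0, so a real implementation may well clamp or rescale the total.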


Adaptive Curriculum

An AdversarialScheduler tracks per-task performance across episodes. Bug types where the agent consistently scores below 0.6 are marked "weak." Future reset() calls serve weak tasks 70% of the time with random seeds (novel code variants), and strong tasks only 30% of the time. The environment gets harder as the agent improves, exactly like the adversarial designer pattern from Kube SRE Gym.
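The scheduling rule described above can be sketched in a few lines. This is a toy version under stated assumptions, not the project's AdversarialScheduler: the class name `ToyAdversarialScheduler` and its methods are hypothetical, "consistently scores below 0.6" is approximated by the mean score so far, and tasks with no recorded episodes are treated as strong.

```python
import random
from collections import defaultdict
from statistics import mean

WEAK_THRESHOLD = 0.6       # from the post
WEAK_SAMPLING_PROB = 0.7   # from the post: weak tasks are served 70% of the time


class ToyAdversarialScheduler:
    """Hypothetical sketch of weak-task-biased sampling."""

    def __init__(self, task_names, rng=None):
        self.task_names = list(task_names)
        self.scores = defaultdict(list)
        self.rng = rng or random.Random()

    def record(self, task, score):
        self.scores[task].append(score)

    def weak_tasks(self):
        # Assumption: "weak" means the running mean score is below the threshold.
        return [
            t for t in self.task_names
            if self.scores[t] and mean(self.scores[t]) < WEAK_THRESHOLD
        ]

    def next_task(self):
        weak = self.weak_tasks()
        strong = [t for t in self.task_names if t not in weak]
        use_weak = bool(weak) and (
            not strong or self.rng.random() < WEAK_SAMPLING_PROB
        )
        task = self.rng.choice(weak if use_weak else strong)
        seed = self.rng.randrange(2**31)  # fresh seed selects a new code variant
        return task, seed


scheduler = ToyAdversarialScheduler(
    ["shape_mismatch", "data_leakage", "missing_eval_mode"],
    rng=random.Random(0),
)
scheduler.record("shape_mismatch", 0.99)
scheduler.record("data_leakage", 0.20)
scheduler.record("missing_eval_mode", 0.40)
print(scheduler.weak_tasks())
print(scheduler.next_task())
```

Passing in a seeded random.Random makes the sampling reproducible, which helps when comparing training runs.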


Training with GRPO

We trained Qwen2.5-1.5B-Instruct on the environment using GRPO (Group Relative Policy Optimization) with LoRA adapters (rank 16) on a 4-bit quantized base model.
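For readers new to GRPO, its core idea is that each sampled completion is scored relative to the other completions drawn for the same prompt, rather than against a learned value function. The sketch below shows that group-relative normalization on made-up reward values. It uses the population standard deviation and a small epsilon, which are choices made for this example; the post does not say which implementation, group size, or normalization details were used in training.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for four sampled fixes to the same broken script.
group_rewards = [0.01, 0.20, 0.60, 0.99]
advantages = group_relative_advantages(group_rewards)
print([round(a, 3) for a in advantages])
```

Completions that beat the group average get a positive advantage and are reinforced; when every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.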

Baseline (untrained model):

  • Immediately calls view_source on almost every task, a brute-force pattern
  • Scores ~0.15 average across all tasks
  • Scores 0.0 on compound_leakage_eval: tries to inspect but can't complete the workflow

After GRPO training (venue compute: A100, 500 steps):

  • T4 baseline run (200 steps): initial reward 0.024 → final reward 0.190 (+0.166 improvement). Figure: Reward Curve.

At venue on H100 compute: 500 steps, full reward curve to be added here.

  • Agent learns to call run_code → inspect_gradients before viewing source
  • Compound tasks show a steeper improvement curve, suggesting that harder tasks provide a stronger gradient signal

The emergent shift in inspection strategy, from brute-force view_source to evidence-based run_code + inspect_gradients, is the behavior the reward signal shaped. The agent learned it without being told.


Architecture

Agent
  │
  │ reset() → alert only (no code)
  ▼
FastAPI Server (OpenEnv)
  │
  ├── inspect action → execute_tool() → tool output
  │       run_code, get_traceback, inspect_gradients,
  │       print_shapes, view_source
  │
  ├── fix action → Grader.grade()
  │       subprocess.run(fixed_code)
  │       6-stage scoring + LLM judge
  │       efficiency multiplier (1.0–1.2×)
  │
  └── AdversarialScheduler
          tracks weak tasks → skews future resets
          random seeds for novel variants

Links


What's Next

The inspection tool system is the foundation for a much harder environment. Future work:

  • Multi-file debugging (bug spans across data pipeline + model definition)
  • Runtime tool calls mid-training (agent can inject print statements and rerun)
  • Compound bugs with 3+ independent failures
  • Agent-generated bug variants for truly infinite curriculum