TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents
Abstract
TermiGen introduces a pipeline for generating verifiable terminal environments and resilient trajectories to improve open-weight LLMs' ability to execute complex tasks and recover from runtime errors.
Executing complex terminal tasks remains a significant challenge for open-weight LLMs, constrained by two fundamental limitations. First, high-fidelity, executable training environments are scarce: environments synthesized from real-world repositories lack diversity and scalability, while trajectories synthesized by LLMs suffer from hallucinations. Second, standard instruction tuning uses expert trajectories that rarely exhibit the simple mistakes common to smaller models. This creates a distributional mismatch, leaving student models ill-equipped to recover from their own runtime failures. To bridge these gaps, we introduce TermiGen, an end-to-end pipeline for synthesizing verifiable environments and resilient expert trajectories. TermiGen first generates functionally valid tasks and Docker containers via an iterative multi-agent refinement loop. Subsequently, we employ a Generator-Critic protocol that actively injects errors during trajectory collection, synthesizing data rich in error-correction cycles. Fine-tuned on this TermiGen-generated dataset, our TermiGen-Qwen2.5-Coder-32B achieves a 31.3% pass rate on TerminalBench. This establishes a new open-weights state of the art, outperforming existing baselines and notably surpassing capable proprietary models such as o4-mini. The dataset is available at https://github.com/ucsb-mlsec/terminal-bench-env.
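The iterative multi-agent refinement loop for environment synthesis can be sketched as follows. This is a minimal toy illustration only: the function names (`propose_task`, `build_container`, `verify_environment`, `critique`), the dict-based task format, and the simulated verification are all assumptions, not the paper's actual implementation.

```python
# Toy sketch of an iterative multi-agent refinement loop for synthesizing
# verifiable terminal environments. All names and data formats here are
# hypothetical; real verification would execute checks inside Docker.

def propose_task(seed):
    # Generator agent drafts an initial task spec (simulated as a dict).
    return {"spec": seed, "patches": 0}

def build_container(task):
    # Stand-in for `docker build`; returns an opaque image handle.
    return {"image": f"env:{task['spec']}-v{task['patches']}"}

def verify_environment(task, container):
    # The real pipeline would run the task's checker inside the container.
    # Here we simulate: the task only verifies after two refinement rounds.
    ok = task["patches"] >= 2
    return ok, (None if ok else "checker failed")

def critique(task, report):
    # Refiner agent patches the task based on the failure report.
    return {**task, "patches": task["patches"] + 1}

def synthesize_environment(seed, max_rounds=5):
    """Refine a (task, container) pair until it verifies, else discard it."""
    task = propose_task(seed)
    for _ in range(max_rounds):
        container = build_container(task)
        ok, report = verify_environment(task, container)
        if ok:
            return task, container  # functionally valid pair
        task = critique(task, report)
    return None  # unverifiable within the budget; drop the task

result = synthesize_environment("grep-logs")
```

The key design point the loop illustrates is that only environments passing an executable check are kept, which is what makes the synthesized tasks verifiable rather than merely plausible.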
Community
This paper introduces TermiGen, an end-to-end pipeline designed to enhance the performance of open-weight Large Language Models (LLMs) in executing complex terminal tasks. To address the scarcity of high-fidelity training data and the distributional mismatch where models struggle to recover from their own mistakes, the framework employs two key phases:
Environment Synthesis: It uses a multi-agent refinement loop to generate diverse, functionally valid tasks and verifiable Docker containers.
Trajectory Collection: It utilizes a Generator-Critic protocol that actively injects errors into trajectories to teach models how to diagnose and recover from runtime failures.
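The Generator-Critic error-injection idea can be sketched as below. This is a toy simulation under stated assumptions: the command list, the `corrupt`/`run` helpers, and the `(command, error)` trajectory format are all illustrative inventions, not the paper's protocol; a real critic would perturb an agent's actions inside a live container.

```python
import random

# Toy sketch of a Generator-Critic protocol that injects errors into an
# expert trajectory so the resulting data contains error-recovery cycles.
# All names and the trajectory format are assumptions for illustration.

EXPERT_STEPS = ["mkdir build", "cd build", "cmake ..", "make", "make test"]

def corrupt(cmd):
    # Critic perturbs a command to mimic a small-model slip (a typo or a
    # bogus flag); every corrupted command is guaranteed to fail in run().
    return cmd.replace("make", "maek") if "make" in cmd else cmd + " --bad-flag"

def run(cmd):
    # Stand-in for executing the command in the container.
    ok = "maek" not in cmd and "--bad-flag" not in cmd
    return ok, ("" if ok else f"command failed: {cmd}")

def collect_trajectory(steps, inject_prob=0.4, seed=0):
    """Interleave injected failures (plus error messages) with expert steps."""
    rng = random.Random(seed)
    trajectory = []
    for step in steps:
        if rng.random() < inject_prob:
            bad = corrupt(step)
            _, err = run(bad)
            trajectory.append((bad, err))  # failed attempt + error message
        _, err = run(step)                 # expert step recovers the state
        trajectory.append((step, err))
    return trajectory

traj = collect_trajectory(EXPERT_STEPS)
```

Because each injected failure is immediately followed by the correct expert action, the trajectory demonstrates diagnosis-and-recovery rather than an unbroken expert run, which is the distributional gap the protocol targets.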
The resulting model, TermiGen-Qwen2.5-Coder-32B, achieves a state-of-the-art 31.3% pass rate on TerminalBench, outperforming existing open-weight baselines and even surpassing proprietary models like o4-mini.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Endless Terminals: Scaling RL Environments for Terminal Agents (2026)
- Training Versatile Coding Agents in Synthetic Environments (2025)
- EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience (2026)
- Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing (2025)
- ScaleEnv: Scaling Environment Synthesis from Scratch for Generalist Interactive Tool-Use Agent Training (2026)
- From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents (2026)
- Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors (2026)