Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces Paper • 2601.11868 • Published Jan 17 • 32
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks Paper • 2602.12670 • Published 8 days ago • 46