REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites Paper • 2504.11543 • Published Apr 15, 2025 • 2
h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning Paper • 2510.07312 • Published Oct 8, 2025
HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification Paper • 2603.15617 • Published Mar 16, 2026 • 6
LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning Paper • 2604.14140 • Published 8 days ago • 1