HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam • Paper • 2602.13964 • Published Feb 15 • 11
SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration • Paper • 2603.03823 • Published Mar 4 • 7
DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning • Paper • 2602.16742 • Published Feb 18 • 12
Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem • Paper • 2512.24873 • Published Dec 31, 2025 • 108