Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents Paper • 2606.19704 • Published 4 days ago • 28
Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds Paper • 2605.18827 • Published May 12 • 7
MCP-Cosmos: World Model-Augmented Agents for Complex Task Execution in MCP Environments Paper • 2605.09131 • Published May 9 • 59