CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows? Paper • 2605.16679 • Published 9 days ago • 50
Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels Paper • 2510.06499 • Published Oct 7, 2025 • 33
SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs Paper • 2411.13547 • Published Nov 20, 2024
ActionStudio: A Lightweight Framework for Data and Training of Large Action Models Paper • 2503.22673 • Published Mar 28, 2025 • 12
APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay Paper • 2504.03601 • Published Apr 4, 2025 • 18
PersonaBench: Evaluating AI Models on Understanding Personal Information through Accessing (Synthetic) Private User Data Paper • 2502.20616 • Published Feb 28, 2025
LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering Paper • 2509.09614 • Published Sep 11, 2025 • 7
UserRL: Training Interactive User-Centric Agent via Reinforcement Learning Paper • 2509.19736 • Published Sep 24, 2025 • 12
HardTests: Synthesizing High-Quality Test Cases for LLM Coding Paper • 2505.24098 • Published May 30, 2025 • 43
Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding Paper • 2411.04282 • Published Nov 6, 2024 • 37