SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking Paper • 2605.25160 • Published 6 days ago • 3
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality? Paper • 2605.22109 • Published 9 days ago • 169
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence Paper • 2605.12882 • Published 17 days ago • 269
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise Paper • 2602.12783 • Published Feb 13 • 246
GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers Paper • 2604.02648 • Published Apr 3 • 47
ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers Paper • 2603.24414 • Published Mar 25 • 183
ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling Paper • 2603.25746 • Published Mar 26 • 155
SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models Paper • 2603.16859 • Published Mar 17 • 248