Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs Paper • 2601.08763 • Published 4 days ago • 116
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning Paper • 2601.09667 • Published 3 days ago • 67
PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles Paper • 2510.06475 • Published Oct 7, 2025 • 1 • 2
FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering Paper • 2510.06426 • Published Oct 7, 2025 • 2 • 2
KnowledgeMath: Knowledge-Intensive Math Word Problem Solving in Finance Domains Paper • 2311.09797 • Published Nov 16, 2023 • 1
DocMath-Eval: Evaluating Numerical Reasoning Capabilities of LLMs in Understanding Long Documents with Tabular Data Paper • 2311.09805 • Published Nov 16, 2023 • 3
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper • 2501.12380 • Published Jan 21, 2025 • 84
SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks Paper • 2507.01001 • Published Jul 1, 2025 • 46