BrokenMath: A Benchmark for Sycophancy in Theorem Proving with LLMs Paper • 2510.04721 • Published Oct 6, 2025
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models Paper • 2505.02735 • Published May 5, 2025 • 34
PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts Paper • 2504.18428 • Published Apr 25, 2025
MathConstruct: Challenging LLM Reasoning with Constructive Proofs Paper • 2502.10197 • Published Feb 14, 2025
TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts Paper • 2407.03203 • Published Jul 3, 2024 • 12
APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning Paper • 2505.05758 • Published May 9, 2025
MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark Paper • 2405.12209 • Published May 20, 2024
DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning Paper • 2505.23754 • Published May 29, 2025 • 15
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data Paper • 2405.14333 • Published May 23, 2024 • 43
GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models Paper • 2511.11134 • Published Nov 2025 • 31