Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory Paper • 2505.15055 • Published May 21, 2025 • 1
An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers Paper • 2403.02839 • Published Mar 5, 2024 • 2
Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs Paper • 2406.10216 • Published Jun 14, 2024 • 2