Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge Paper • 2510.18196 • Published Oct 21
Summarization Metrics for Spanish and Basque: Do Automatic Scores and LLM-Judges Correlate with Humans? Paper • 2503.17039 • Published Mar 21
JuStRank: Benchmarking LLM Judges for System Ranking Paper • 2412.09569 • Published Dec 12, 2024
When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity Paper • 2509.20293 • Published Sep 24