arxiv:2603.01562

RubricBench: Aligning Model-Generated Rubrics with Human Standards

Published on Mar 2 · Submitted by Qiyuan Zhang on Mar 3
#3 Paper of the day

Abstract

RubricBench is introduced as a benchmark for evaluating rubric-guided reward models in large language model alignment, addressing the lack of discriminative complexity and ground-truth rubric annotations in existing benchmarks.

AI-generated summary

As Large Language Model (LLM) alignment evolves from simple completions to highly sophisticated generation, Reward Models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark for assessing this evaluation paradigm: existing benchmarks lack both the discriminative complexity and the ground-truth rubric annotations required for rigorous analysis. To bridge this gap, we introduce RubricBench, a curated benchmark of 1,147 pairwise comparisons specifically designed to assess the reliability of rubric-based evaluation. Our construction employs a multi-dimensional filtration pipeline to target hard samples featuring nuanced input complexity and misleading surface bias, augmenting each with expert-annotated, atomic rubrics derived strictly from the instructions. Comprehensive experiments reveal a substantial capability gap between human-annotated and model-generated rubrics, indicating that even state-of-the-art models struggle to autonomously specify valid evaluation criteria and lag considerably behind human-guided performance.

Community

Paper submitter

🚀 Are LLM Judges actually evaluating what matters, or just being distracted by surface-level polish?

As LLM alignment shifts from scalar Reward Models to Generative RMs (LLM-as-a-Judge), the community has increasingly relied on rubric-guided evaluation to prevent reward hacking. But this raises a critical, unanswered question: can state-of-the-art models autonomously figure out the right rubric to evaluate against in the first place? Our new paper introduces RubricBench, and the short answer is: No. They miss the point entirely.

🔥 What is RubricBench?

We built a curated, highly discriminative benchmark of 1,147 preference pairs across 5 domains (Chat, Code, STEM, etc.). We aggressively filtered out "easy" comparisons and targeted hard samples with input complexity, output surface bias (e.g., long but wrong answers), and process failures. Crucially, every single pair is augmented with expert-annotated, atomic, instruction-derived rubrics.
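
For concreteness, here is a minimal sketch of what a single RubricBench record could look like, assuming a pairwise-preference layout with an attached gold rubric. All field names are hypothetical and chosen for illustration; the dataset's actual format may differ.

```python
from dataclasses import dataclass, field

@dataclass
class RubricItem:
    criterion: str       # one atomic, instruction-derived requirement
    weight: float = 1.0  # relative importance of the criterion

@dataclass
class PreferencePair:
    instruction: str  # the original task prompt
    response_a: str   # candidate response A
    response_b: str   # candidate response B
    preferred: str    # gold label: "A" or "B"
    domain: str       # e.g. "Chat", "Code", "STEM"
    # Expert-annotated gold rubric attached to every pair
    rubric: list[RubricItem] = field(default_factory=list)
```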

💡 Key Discoveries that Challenge the Status Quo:

  • 📉 The 27% "Rubric Gap":

When evaluating responses, SOTA models (like DeepSeek-v3.2, GPT-4o-mini, and Gemini-3-Flash) jump by an average of ~27% in preference accuracy when you swap their self-generated rubrics for human gold-standard rubrics (see the first sketch after this list). The models have the reasoning power to judge correctly, but they fail to specify the right rules.

  • 🧠 Cognitive Misalignment ("Attention Displacement"):

We found that models generate "checklist bloat": they obsess over surface-level traits (e.g., formatting, verbosity, using specific libraries) but fail to enforce core implicit constraints, like recognizing when a task is impossible or when a response violates safety boundaries. This leads to Value Inversion: rewarding confident hallucinations over honest refusals.

  • 🛑 Test-Time Compute Does NOT Fix It:

Simply generating more rubric items or using iterative self-refinement hits diminishing returns almost immediately (see the second sketch below). More compute just accumulates noise, suggesting that the bottleneck is value alignment, not generation capacity.
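
First sketch: a minimal illustration of the two-condition comparison behind the rubric gap, reusing the hypothetical record schema above. `preference_accuracy` scores a judge against the gold labels; `generate_rubric` and `judge_pair` are stand-ins for LLM calls, not the paper's actual harness.

```python
from collections.abc import Callable, Sequence

def preference_accuracy(
    pairs: Sequence[PreferencePair],
    rubric_source: Callable[[PreferencePair], list[RubricItem]],
    judge: Callable[[str, str, str, list[RubricItem]], str],
) -> float:
    """Fraction of pairs where the judge's verdict matches the gold label."""
    correct = 0
    for pair in pairs:
        rubric = rubric_source(pair)  # self-generated or human gold rubric
        verdict = judge(pair.instruction, pair.response_a,
                        pair.response_b, rubric)  # returns "A" or "B"
        correct += verdict == pair.preferred
    return correct / len(pairs)

# Swapping self-generated rubrics for the gold ones isolates rubric quality
# (generate_rubric and judge_pair are hypothetical model calls):
# acc_self = preference_accuracy(pairs, lambda p: generate_rubric(p.instruction), judge_pair)
# acc_gold = preference_accuracy(pairs, lambda p: p.rubric, judge_pair)
# print(f"rubric gap: {acc_gold - acc_self:+.1%}")  # the post reports ~27% on average
```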
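
Second sketch: the iterative self-refinement baseline in its simplest form. `generate` and `refine` are again hypothetical model calls; the finding is that raising `rounds` quickly stops helping, because revisions pile on surface-level items rather than the missing core constraints.

```python
from collections.abc import Callable  # already imported in the first sketch

def self_refine_rubric(
    instruction: str,
    generate: Callable[[str], list[RubricItem]],
    refine: Callable[[str, list[RubricItem]], list[RubricItem]],
    rounds: int = 3,
) -> list[RubricItem]:
    """Let the model repeatedly critique and revise its own rubric."""
    rubric = generate(instruction)            # initial self-generated rubric
    for _ in range(rounds):
        rubric = refine(instruction, rubric)  # critique + revise pass
    return rubric
```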

Why this matters:

Future research must move beyond merely scaling synthesis to address "rubric alignment"—developing methods that enable models to internalize human priority hierarchies. The ultimate goal is to transition models from simply expanding outputs to autonomously identifying the specific, high-value constraints that actually drive human judgments.
