New Benchmark Dataset

#2
by burtenshaw - opened

Are you maintaining an evaluation benchmark, and would you like it to be included in the eval results shortlist so that reported results appear as a leaderboard?

⭐️ Comment with a link to your dataset repo and to sources using the benchmark.

I'm not sure what the specific requirements for inclusion are, but we would like to have this functionality for these language-specific benchmarks that we've built. They're quite recent, so we don't have many sources yet beyond our own benchmarking efforts and EuroEval.

Manually translated and culturally adapted IFEval for Estonian.
https://huggingface.co/datasets/tartuNLP/ifeval_et

Manually translated and culturally adapted WinoGrande for Estonian.
https://huggingface.co/datasets/tartuNLP/winogrande_et

I'm not completely sure yet how to port the configs from the LM Evaluation Harness to eval.yaml, though.

Hi, we maintain Encyclo-K, a benchmark for evaluating LLMs with dynamically composed knowledge statements.

Dataset: https://huggingface.co/datasets/m-a-p/Encyclo-K
Paper: https://arxiv.org/abs/2512.24867
Leaderboard: https://encyclo-k.github.io/

We've added the eval.yaml file and would like to be included in the shortlist.

OpenEvals org

Hey @yimingliang! Everything looks great; we will add you to the shortlist and all should be set. Very impressive work on the evals. Do you think it would be possible to open PRs on the models you evaluated with the results from your leaderboard?

OpenEvals org

Hey @adorkin! Thanks for reaching out. IFEval would require custom code to run; this feature is not available yet, but it will be in the future. For WinoGrande, you could absolutely make an eval.yaml file and turn it into a benchmark. You would need a small modification, though: the answer field should be either A or B instead of 1 or 2, and instead of having two columns for the choices, it would be easier to use one column with a list of choices. Then your benchmark would simply be a multichoice benchmark :)
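For reference, the column changes described above could be sketched as a plain row-mapping function. This assumes the original WinoGrande schema (`sentence`, `option1`, `option2`, `answer` with gold labels `"1"`/`"2"`); the actual column names in winogrande_et may differ.

```python
# Sketch: convert a WinoGrande-style row to the multichoice layout
# described above. Column names are assumed from the original
# WinoGrande schema and may need adapting for winogrande_et.

def to_multichoice(row: dict) -> dict:
    """Merge the two option columns into a list and map 1/2 -> A/B."""
    return {
        "question": row["sentence"],
        "choices": [row["option1"], row["option2"]],
        # WinoGrande stores the gold answer as "1" or "2"
        "answer": "A" if str(row["answer"]) == "1" else "B",
    }

row = {
    "sentence": "The trophy didn't fit in the suitcase because the _ was too big.",
    "option1": "trophy",
    "option2": "suitcase",
    "answer": "1",
}
print(to_multichoice(row))  # -> answer "A", choices ["trophy", "suitcase"]
```

With the `datasets` library, the same function could be applied to the whole split via `dataset.map(to_multichoice, remove_columns=dataset.column_names)`.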

SaylorTwift pinned discussion

@SaylorTwift I see, thanks! Is the yaml expected to contain the prompt itself? I mean, it works well as a multiple-choice problem, but the formulation is nonetheless a bit non-standard, because you're filling a gap rather than answering a question.

OpenEvals org

@adorkin yes, you can set the prompt in the yaml file like so: https://huggingface.co/datasets/cais/hle/blob/main/eval.yaml, using the multiple_choice solver instead of the system prompt. Here are the docs from Inspect.
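To make the fill-the-gap concern concrete, here is one way the cloze sentence could be rendered as a standard multiple-choice prompt. This is only an illustrative template, not the exact prompt the multiple_choice solver produces:

```python
# Sketch: render a cloze-style WinoGrande item as an ordinary
# multiple-choice prompt. The wording of the template is an
# assumption, chosen only to illustrate the gap-filling framing.
from string import ascii_uppercase

def render_prompt(sentence: str, choices: list[str]) -> str:
    lines = [
        "Fill in the blank (_) with the most plausible option.",
        "",
        sentence,
        "",
    ]
    # Label each option A, B, C, ... in order
    for letter, choice in zip(ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    lines += ["", "Answer with the letter of the correct option."]
    return "\n".join(lines)

print(render_prompt(
    "The trophy didn't fit in the suitcase because the _ was too big.",
    ["trophy", "suitcase"],
))
```

Framed this way, the non-standard cloze task reduces to ordinary letter-choice scoring, which is what makes the multichoice route workable.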

@SaylorTwift I've added the eval.yaml and a custom dataset config to work with it. The dataset viewer seems to be stuck now which may or may not be related.
https://huggingface.co/datasets/tartuNLP/winogrande_et/blob/main/eval.yaml

📋 New Benchmark: FINAL Bench — Functional Metacognitive Reasoning

Dataset: https://huggingface.co/datasets/FINAL-Bench/Metacognitive

Paper: FINAL Bench: Measuring Functional Metacognitive Reasoning in Large Language Models
(Taebong Kim, Minsik Kim, Sunyoung Choi, Jaewon Jang — currently under review)

Blog: https://huggingface.co/blog/FINAL-Bench/metacognitive

Leaderboard: https://huggingface.co/spaces/FINAL-Bench/Leaderboard

What it measures

FINAL Bench is the first benchmark for evaluating functional metacognition in LLMs: the ability to detect and correct one's own reasoning errors. Unlike MMLU/GPQA, which measure final-answer accuracy, FINAL Bench asks: "What did you do when you got it wrong?"

Key specs

  • 100 tasks | 15 domains | 8 TICOS metacognitive types | 3 difficulty grades
  • 5-axis rubric: MA (Metacognitive Accuracy), ER (Error Recovery), FA (Factual Accuracy), CO (Coherence), SP (Specificity)
  • Hidden cognitive traps (confirmation bias, anchoring, base-rate neglect) embedded in every task
  • 9 SOTA models evaluated: GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, DeepSeek-V3.2, Kimi K2.5, etc.
  • DOI: 10.57967/hf/7873

eval.yaml

eval.yaml has been added to the dataset repo.

We would love to be included in the benchmark shortlist! 🚀

OpenEvals org

Hey @SeaWolf-AI! Sorry, I missed your message. We are limiting the number of benchmarks on the hub for now so that we can grow the ones we already have before adding more. However, we just noticed your "All bench leaderboard," and it's great! This is exactly what we have in mind when pushing leaderboards on the hub. Would you be up for a quick chat?

Hi @SaylorTwift ,

Thank you for the kind message. I’d be very happy to have a quick chat.

Just to clarify our setup: FINAL Bench is our standalone benchmark for functional metacognitive reasoning, while ALL Bench is our unified leaderboard that brings FINAL Bench together with other major benchmarks in one comparable view.

I’m glad to hear that ALL Bench resonates with your vision for leaderboards on the Hub. I’d love to discuss how it could fit with the OpenEvals / community evals direction, and also whether FINAL Bench itself might eventually be considered for the shortlist as the ecosystem expands.

Happy to coordinate here or by email, whichever is easier for you.

OpenEvals org

Email is best! What email can I reach you at?

kimminsik1116@gmail.com
