New Benchmark Dataset

#2
by burtenshaw - opened

Are you maintaining an evaluation benchmark, and would you like it to be included in the eval results shortlist so that reported results appear as a leaderboard?

⭐️ Comment with a link to your dataset repo and to sources using the benchmark.

I'm not sure what the specific requirements for inclusion are, but we would like to have this functionality for these language-specific benchmarks that we've built. They're quite recent, so we don't have many sources yet beyond our own benchmarking efforts and EuroEval.

Manually translated and culturally adapted IFEval for Estonian.
https://huggingface.co/datasets/tartuNLP/ifeval_et

Manually translated and culturally adapted WinoGrande for Estonian.
https://huggingface.co/datasets/tartuNLP/winogrande_et

I'm not completely sure yet how to port the configs from the LM Evaluation Harness to eval.yaml, though.

Hi, we maintain Encyclo-K, a benchmark for evaluating LLMs with dynamically composed knowledge statements.

Dataset: https://huggingface.co/datasets/m-a-p/Encyclo-K
Paper: https://arxiv.org/abs/2512.24867
Leaderboard: https://encyclo-k.github.io/

We've added the eval.yaml file and would like to be included in the shortlist.

OpenEvals org

Hey @yimingliang! Everything looks great; we will add you to the shortlist and all should be set. Very impressive work on the evals. Do you think it would be possible to open PRs on the models you evaluated with the results from your leaderboard?

OpenEvals org

Hey @adorkin! Thanks for reaching out. IFEval would require custom code to run; this feature is not available yet, but it will be in the future. For WinoGrande, you could absolutely make an eval.yaml file and turn it into a benchmark. You would need a small modification, though: the answer field should be either A or B instead of 1 or 2, and instead of having two columns for the choices, it would be easier to use one column with a list of choices. Then your benchmark would simply be a multichoice benchmark :)
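For reference, the column changes described above could be sketched as a plain row-mapping function. This assumes the original WinoGrande schema (`sentence`, `option1`, `option2`, `answer` with gold labels `"1"`/`"2"`); the actual column names in winogrande_et may differ.

```python
# Sketch: convert a WinoGrande-style row to the multichoice layout
# described above. Column names are assumed from the original
# WinoGrande schema and may need adapting for winogrande_et.

def to_multichoice(row: dict) -> dict:
    """Merge the two option columns into a list and map 1/2 -> A/B."""
    return {
        "question": row["sentence"],
        "choices": [row["option1"], row["option2"]],
        # WinoGrande stores the gold answer as "1" or "2"
        "answer": "A" if str(row["answer"]) == "1" else "B",
    }

row = {
    "sentence": "The trophy didn't fit in the suitcase because the _ was too big.",
    "option1": "trophy",
    "option2": "suitcase",
    "answer": "1",
}
print(to_multichoice(row))  # -> answer "A", choices ["trophy", "suitcase"]
```

With the `datasets` library, the same function could be applied to the whole split via `dataset.map(to_multichoice, remove_columns=dataset.column_names)`.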

SaylorTwift pinned discussion

@SaylorTwift I see, thanks! Is the yaml expected to contain the prompt itself? I mean, it works well as a multiple-choice problem, but the formulation is nonetheless a bit non-standard, because you're filling a gap rather than answering a question.

OpenEvals org

@adorkin yes, you can set the prompt in the yaml file like so: https://huggingface.co/datasets/cais/hle/blob/main/eval.yaml, using the multiple_choice solver instead of the system prompt. Here are the docs from Inspect.
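To make the fill-the-gap concern concrete, here is one way the cloze sentence could be rendered as a standard multiple-choice prompt. This is only an illustrative template, not the exact prompt the multiple_choice solver produces:

```python
# Sketch: render a cloze-style WinoGrande item as an ordinary
# multiple-choice prompt. The wording of the template is an
# assumption, chosen only to illustrate the gap-filling framing.
from string import ascii_uppercase

def render_prompt(sentence: str, choices: list[str]) -> str:
    lines = [
        "Fill in the blank (_) with the most plausible option.",
        "",
        sentence,
        "",
    ]
    # Label each option A, B, C, ... in order
    for letter, choice in zip(ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    lines += ["", "Answer with the letter of the correct option."]
    return "\n".join(lines)

print(render_prompt(
    "The trophy didn't fit in the suitcase because the _ was too big.",
    ["trophy", "suitcase"],
))
```

Framed this way, the non-standard cloze task reduces to ordinary letter-choice scoring, which is what makes the multichoice route workable.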

@SaylorTwift I've added the eval.yaml and a custom dataset config to work with it. The dataset viewer seems to be stuck now which may or may not be related.
https://huggingface.co/datasets/tartuNLP/winogrande_et/blob/main/eval.yaml

📋 New Benchmark: FINAL Bench — Functional Metacognitive Reasoning

Dataset: https://huggingface.co/datasets/FINAL-Bench/Metacognitive

Paper: FINAL Bench: Measuring Functional Metacognitive Reasoning in Large Language Models
(Taebong Kim, Minsik Kim, Sunyoung Choi, Jaewon Jang — currently under review)

Blog: https://huggingface.co/blog/FINAL-Bench/metacognitive

Leaderboard: https://huggingface.co/spaces/FINAL-Bench/Leaderboard

What it measures

FINAL Bench is the first benchmark for evaluating functional metacognition in LLMs: the ability to detect and correct one's own reasoning errors. Unlike MMLU/GPQA, which measure final-answer accuracy, FINAL Bench asks: "What did you do when you got it wrong?"

Key specs

  • 100 tasks | 15 domains | 8 TICOS metacognitive types | 3 difficulty grades
  • 5-axis rubric: MA (Metacognitive Accuracy), ER (Error Recovery), FA (Factual Accuracy), CO (Coherence), SP (Specificity)
  • Hidden cognitive traps (confirmation bias, anchoring, base-rate neglect) embedded in every task
  • 9 SOTA models evaluated: GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, DeepSeek-V3.2, Kimi K2.5, etc.
  • DOI: 10.57967/hf/7873

eval.yaml

eval.yaml has been added to the dataset repo.

We would love to be included in the benchmark shortlist! 🚀

OpenEvals org

Hey @SeaWolf-AI! Sorry, I missed your message. We are limiting the number of benchmarks on the hub for now so that we can grow the ones we already have before adding more. However, we just noticed your "All bench leaderboard," and it's great! This is exactly what we have in mind when pushing leaderboards on the hub. Would you be up for a quick chat?

Hi @SaylorTwift ,

Thank you for the kind message. I’d be very happy to have a quick chat.

Just to clarify our setup: FINAL Bench is our standalone benchmark for functional metacognitive reasoning, while ALL Bench is our unified leaderboard that brings FINAL Bench together with other major benchmarks in one comparable view.

I’m glad to hear that ALL Bench resonates with your vision for leaderboards on the Hub. I’d love to discuss how it could fit with the OpenEvals / community evals direction, and also whether FINAL Bench itself might eventually be considered for the shortlist as the ecosystem expands.

Happy to coordinate here or by email, whichever is easier for you.

OpenEvals org

Email is best! What email can I reach you at?

kimminsik1116@gmail.com
