Open LLM Leaderboard • 13.7k • Track, rank and evaluate open LLMs and chatbots
PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications Paper • 2509.23879 • Published Sep 28 • 20
Aligning LLMs for Multilingual Consistency in Enterprise Applications Paper • 2509.23659 • Published Sep 28 • 20
RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks Paper • 2509.23673 • Published Sep 28 • 20
AccessEval: Benchmarking Disability Bias in Large Language Models Paper • 2509.22703 • Published Sep 22 • 20
FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding Paper • 2505.17330 • Published May 22 • 22
Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems Paper • 2505.18366 • Published May 23 • 25
SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use Paper • 2505.17332 • Published May 22 • 31