ArmBench-LLM 1.0: Benchmarking LLMs on Armenian Language Tasks

Community Article Published April 2, 2026

Following our recent release of ArmBench-TextEmbed (and the ATE-2 models), as well as our latest publication at the LoResLM 2026 EACL workshop, Metric AI Lab is thrilled to announce the next major leap in our non-commercial Armenian AI initiative: ArmBench-LLM 1.0.

Last year, we introduced ArmBench-LLM 0.1 (legacy), which laid the groundwork by evaluating LLM knowledge using translated MMLU-PRO subsets and Armenian University Exams. Today, version 1.0 drastically expands the scope, tasks, and models evaluated to give the most comprehensive picture of Armenian language understanding to date.

What’s New in 1.0?

The new dataset is significantly larger and meticulously designed to evaluate a wider array of capabilities:

  • Diverse Tasks: Text classification, Multiple Choice QA (MCQA), grammar correction, space fixing (highly useful for OCR-extracted texts), summarization, translation, and reading comprehension.
  • Legacy Tests Included: The original University Exams and MMLU-PRO translations remain integral to the benchmark.
  • Wider Model Coverage: We evaluated almost all major proprietary models via OpenRouter, alongside popular open-source models (including Qwen, GLM, Mistral, and Gemma). Note: Open-source models under 30B parameters were evaluated locally to minimize API spend.
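For readers who want to reproduce part of the setup: OpenRouter exposes a standard OpenAI-compatible chat-completions endpoint, so a single request builder works across all the proprietary models we tested. The sketch below is illustrative only; the model ID and prompt wording are placeholders, not the exact prompts used in the benchmark.

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_mcqa_request(model: str, question: str, choices: list[str]) -> dict:
    """Build an OpenAI-compatible chat-completions payload for one MCQA item."""
    lettered = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    prompt = (
        f"{question}\n{lettered}\n"
        "Answer with the letter of the correct choice only."
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # deterministic output makes scoring reproducible
    }

def send_request(payload: dict, api_key: str) -> dict:
    """POST the payload to OpenRouter and return the parsed JSON response."""
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Because the same payload shape works for every OpenRouter model, swapping models in an evaluation loop is just a change of the `model` string.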

📊 Leaderboard (Top 10)

Note: size is not reported for proprietary models.

[Image: leaderboard table]

Full leaderboard + per-task breakdown available in the HF Space.

Key Findings & Leaderboard Insights

  1. Gemini 3 is still the king. Google's gemini-3-flash-preview secured the #1 rank with an Average score of 0.6350. Surprisingly, it is also one of the most cost-effective models on the market. For comparison, openai/gpt-5.2-pro took the #2 spot (0.6171) while being almost 50x more expensive 🤯

  2. Open Source is catching up. Unlike last year, on-prem Armenian AI is no longer a significant bottleneck. The standout star of the open-source ecosystem is qwen/qwen3.5-27b. With a score of 0.5767, this 27B-parameter model comfortably beat massive 600B+ parameter giants like GLM-5 and Mistral-Large. This means you can now run a strong LLM that also speaks Armenian on a single-GPU machine.

  3. Better Globally ≠ Better for Armenian. Global rankings don't always translate to Armenian mastery, even within the same model family. A prime example is the Gemini 3 family: while Gemini 3 Pro is widely considered superior to Gemini 3 Flash on global leaderboards (e.g. Arena.AI, MMLU-Pro), our results show that the Flash version actually outperforms the Pro version in Armenian proficiency. This underscores the importance of language-specific benchmarking: you cannot simply rely on general-purpose rankings to choose the best model for Armenian tasks.

The Cost of Armenian AI (Spend Report)

[Image: spend report chart]

We also release a spend report based on our (cumulative) API costs to evaluate different models on the benchmark. The report is not all-inclusive as it does not cover the smaller open source models (<30B) which we evaluated locally. Nonetheless, based on our OpenRouter API budget report:

  • Best Budget Option: Grok 4 Fast (Score: 0.6037, Cost: $2.71)
  • Best Overall Value: Gemini 3 Flash (Score: 0.6350, Cost: $3.28)
  • Premium Tier: Claude 3.7 Sonnet offers fantastic reasoning at a moderate price (Score: 0.6071, Cost: $16.49), while GPT-5.2 Pro commands a heavy premium ($160.20) for its #2 rank.

When analyzing our Spend Report, it is important to note that the total cost is driven by three primary factors beyond just the model's performance:

  • Unit Pricing: The base cost per 1 million tokens charged by the provider.
  • Tokenizer Efficiency: This is a "hidden" factor for Armenian. Different models use different tokenizers, and some are significantly more efficient at encoding Armenian script than others. A model with a more efficient tokenizer will process the same Armenian text using fewer tokens, leading to lower costs.
  • Reasoning Verbosity: Advanced "reasoning" models tend to generate a significant amount of internal "Chain of Thought" text before arriving at a final answer. While this often boosts accuracy, the increased output volume results in a higher total token count and, consequently, a higher price tag for the evaluation.
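These three factors compound: a pricier model with an inefficient Armenian tokenizer and verbose reasoning pays on all fronts. A back-of-the-envelope sketch (the prices and token counts below are illustrative, not figures from our report):

```python
def estimate_cost_usd(
    prompt_tokens: int,
    completion_tokens: int,
    price_in_per_mtok: float,   # USD per 1M input tokens
    price_out_per_mtok: float,  # USD per 1M output tokens
) -> float:
    """Estimate the API cost of one request from token counts and unit prices."""
    return (
        prompt_tokens / 1_000_000 * price_in_per_mtok
        + completion_tokens / 1_000_000 * price_out_per_mtok
    )

# Same Armenian input, two hypothetical models at identical unit prices:
# one tokenizer needs ~3x the tokens to encode Armenian script, and the
# "reasoning" model also emits a long chain of thought before answering.
efficient = estimate_cost_usd(1_000, 200, 0.50, 1.50)
inefficient_reasoning = estimate_cost_usd(3_000, 2_000, 0.50, 1.50)
```

Even with identical per-token prices, the second request costs several times more, which is why two models with similar list prices can end up far apart in the spend report.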

Notable Mentions

Beyond the main leaderboard, here are specialized insights that can be crucial if you are building a product with a very specific focus:

  • The Math Skill: If you are looking for quantitative reasoning, the Grok 4.1 and Grok 4 Fast versions are the ones to watch. Both achieved a score of 18.75 on the Math subset.
  • Armenian History and Literature: The Google Gemini family remains the gold standard for Armenian knowledge. Specifically, Gemini 3 Flash takes the lead in Armenian Literature with a score of 12.5, while Gemini 3 Pro is the top performer with a score of 14.5 in Armenian History.
  • Summarization & Generation: For condensing long Armenian texts into concise summaries, as well as for other text generation tasks, Gemini 2.5 Flash is our top recommendation for both speed and quality.
  • Translation: Gemini 3 Flash is the most accurate model for translation.
  • Named Entity Recognition (NER): While GPT-5.2 Pro technically holds the #1 spot for NER, Grok 4 Fast is the 2nd best model in this category. Given the massive price gap between the two, we consider Grok 4 Fast the superior choice for production-scale NER tasks.

Explore the Data

ArmBench-LLM 1.0 is fully open-sourced for the community: the dataset and the full leaderboard are available in the HF Space.

Limitations & Technical Considerations

While ArmBench-LLM 1.0 provides a comprehensive overview, transparency is key to understanding these results. Here are the important caveats regarding our evaluation:

  • Reliability-Based Exclusions (Claude 4.5/4.6): To maintain the highest standards of benchmark integrity, we have excluded Claude 4.5 and 4.6 models from the current leaderboard. During our evaluation phase, we observed significant consistency and reliability issues that compromised the validity of the scores for these specific versions. For a detailed technical breakdown of these observations, please refer to this GitHub Issue.
  • The "Simple Prompt" Constraint for Smaller Models: To ensure a fair and unbiased comparison across the board, we utilized a uniform, simplified prompt for every model. However, smaller models (such as Qwen 3.5 9B) are often more sensitive to prompt structure and typically gain the most from specialized prompt engineering. Because we avoided model-specific optimization, the scores for these compact models are likely lower than what they are capable of achieving with a more tailored approach.
  • Grok 4.20 API vs. UI Experience: During our API-based evaluation, Grok 4.20 exhibited unreliable behavior similar to the issues noted with Claude. Interestingly, this inconsistency was not present in the Grok web interface. If you are using Grok via the UI, you will likely find the Armenian language experience to be significantly more robust and reliable than the current benchmark scores might suggest.

Happy benchmarking, and feel free to contribute new tasks or models!
Metric AI Lab - Building Armenian AI, non-commercially. 🇦🇲
