arxiv:2606.05405

Agents' Last Exam

Published on Jun 3 · Submitted by Han on Jun 9 · #3 Paper of the day

Abstract

Agents' Last Exam (ALE) is a benchmark that evaluates AI agents on long-horizon, economically valuable real-world tasks: 1K+ tasks spanning 13 industry clusters. Its results reveal a significant gap between benchmark performance and practical deployment.

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.

Community

Paper author Paper submitter

Agents' Last Exam (ALE): can AI agents genuinely do the work of human experts in real-world settings?

A living benchmark built with 300+ experts across 55 industries, yielding 1,500+ real-world tasks. Three things set it apart:

  1. Real origins: every task comes from actual projects experts completed on the job, mapped to the U.S. occupational taxonomy (O*NET).
  2. Unconstrained: generalist computer-use agents get full GUI + CLI and solve tasks however they want, judged on results, not method.
  3. Objective: scored by reproducible, deterministic code evaluators, with no human judge.

Frontier agents average a full pass rate of only 2.6% on the hardest "last-exam" tier, a sobering reality check on the timeline for AI workplace automation. We call it the "Last Exam" because saturating it would mean agents can genuinely power real industries.
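
To make the scoring idea concrete, here is a minimal sketch of how deterministic code evaluators and a full pass rate could fit together. It is not ALE's actual harness: the `Task` structure, the artifact dictionary, the task names, and the rule that a task fully passes only when every check succeeds are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types: this page does not describe ALE's real task or evaluator format.
Artifacts = Dict[str, str]           # assumed: output file name -> file contents
Check = Callable[[Artifacts], bool]  # a pure function, so reruns give the same verdict


@dataclass
class Task:
    task_id: str
    checks: List[Check]


def full_pass(task: Task, artifacts: Artifacts) -> bool:
    """Assumed rule: a task is a full pass only if every check succeeds."""
    return all(check(artifacts) for check in task.checks)


def full_pass_rate(tasks: List[Task], runs: Dict[str, Artifacts]) -> float:
    """Fraction of tasks fully passed; a task with no recorded run counts as failed."""
    if not tasks:
        return 0.0
    passed = sum(full_pass(t, runs[t.task_id]) for t in tasks if t.task_id in runs)
    return passed / len(tasks)


# Two made-up tasks judged on their results only, not on how the agent produced them.
tasks = [
    Task("invoice-total", [lambda a: a.get("total.txt", "").strip() == "1250.00"]),
    Task("report-sections", [
        lambda a: "Summary" in a.get("report.md", ""),
        lambda a: "Risks" in a.get("report.md", ""),
    ]),
]
runs = {
    "invoice-total": {"total.txt": "1250.00\n"},
    "report-sections": {"report.md": "# Summary\nAll good."},  # no "Risks" section
}

rate = full_pass_rate(tasks, runs)
print(f"full pass rate: {rate:.1%}")  # prints "full pass rate: 50.0%"
```

Because each check only inspects the final artifacts, the same run always gets the same score, which is the property the "no human judge" point relies on.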


Get this paper in your agent:

hf papers read 2606.05405
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
