Papers
arxiv:2601.08806

APEX-SWE

Published on Jan 13
Authors:
,
,
,
,
,
,
,
,
,
,
,
,

Abstract

APEX-SWE benchmark evaluates AI models on complex software engineering tasks involving system integration and observability, with performance linked to epistemic discipline and verification capabilities.

AI-generated summary

We introduce the AI Productivity Index for Software Engineering (APEX-SWE), a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. Unlike existing evaluations that focus on narrow, well-defined tasks, APEX-SWE assesses two novel task types that reflect real-world software engineering: (1) Integration tasks (n=100), which require constructing end-to-end systems across heterogeneous cloud primitives, business applications, and infrastructure-as-code services, and (2) Observability tasks (n=100), which require debugging production failures using telemetry signals such as logs and dashboards, as well as unstructured context. We evaluated eleven frontier models for the APEX-SWE leaderboard. Claude Opus 4.6 leads the APEX-SWE leaderboard with 40.5% Pass@1, followed by Claude Opus 4.5 at 38.7%. Our analysis shows that strong performance is primarily driven by epistemic discipline, defined as the capacity to distinguish between assumptions and verified facts. It is often combined with systematic verification prior to acting. We open-source the APEX-SWE evaluation harness and a dev set (n=50).

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2601.08806 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.08806 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.