arxiv:2606.02060

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Published on Jun 1 · Submitted by Jiaheng Liu on Jun 4 · #3 Paper of the day
Abstract

Deep-research agents can be audited with a claim-centric framework that identifies error spans in their trajectories, giving a reliability assessment that final-answer evaluation alone cannot provide.

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based on final answers shows whether an agent succeeds, but not which parts of the trajectory make the answer unreliable. We study span-level error localization for deep-research agents. We collect 2,790 real trajectories from two agent frameworks, three backbone models, and three benchmarks, convert raw logs into semantic spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations, we build TELBench, a 1,000-instance benchmark for identifying error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise. We further propose DRIFT, a claim-centric auditing framework that tracks agent claims, checks their support in trajectory evidence, and marks spans where unsupported or conflicting claims affect the answer path. Experiments across model families and auditing frameworks show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. Our work provides a process-level view of reliability in deep-research agents.
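
The abstract names two quantities, span-level error localization and first-error accuracy, without defining them here. As a rough illustration only, the sketch below assumes that localization is scored as set overlap between predicted and annotated error-span indices, and that a first-error hit means the earliest predicted span equals the earliest annotated span. The function names `span_prf` and `first_error_hit` and both definitions are assumptions, not TELBench's official metrics.

```python
from typing import Set, Tuple


def span_prf(pred: Set[int], gold: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over error-span indices (assumed metric)."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def first_error_hit(pred: Set[int], gold: Set[int]) -> bool:
    """True when the earliest predicted span is the earliest gold span."""
    return bool(pred) and bool(gold) and min(pred) == min(gold)


# Toy trajectory: spans 3 and 7 are annotated as harmful errors.
gold = {3, 7}
pred = {3, 5, 7, 9}  # an auditor that over-flags

precision, recall, f1 = span_prf(pred, gold)
hit = first_error_hit(pred, gold)
late_hit = first_error_hit({5, 7}, gold)  # misses span 3, so the first error is wrong
```

Under these assumed definitions an auditor can find every error span and still miss the first error, which is why the two numbers are worth reporting separately.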

Community

The claim-to-evidence attribution is the right granularity — moving from "was the answer right" to "which span made it unreliable" is exactly the process-level view that final-answer eval throws away. One boundary worth naming about what DRIFT can certify, though.

Checking a claim against the trajectory's own evidence catches two of the three failure shapes cleanly: the unsupported claim (no backing span) and the conflicting claim (contradicts another span). Both are internal-consistency failures, and span localization nails them. The one it's structurally blind to is the supported-but-wrong claim — where a search returned a confident-but-false snippet and the agent's claim faithfully rests on it. The support check passes, because the claim really is grounded in the trajectory; the trajectory is just wrong about the world. Auditing claims against the evidence the agent itself gathered is still auditing its account against its account, one level up from the final answer.
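
To make the three failure shapes concrete, here is a toy sketch. It is not DRIFT's implementation: the `Evidence` and `Claim` types, the `audit_claim` function, the exact-match comparison, and the `world` dictionary standing in for ground truth are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Evidence:
    span_id: int
    key: str    # what the snippet is about (hypothetical normalization)
    value: str  # what the snippet asserts


@dataclass(frozen=True)
class Claim:
    span_id: int
    key: str
    value: str


def audit_claim(claim: Claim, evidence: List[Evidence]) -> str:
    """Classify a claim using only evidence gathered earlier in the trajectory."""
    relevant = [e for e in evidence
                if e.key == claim.key and e.span_id < claim.span_id]
    if not relevant:
        return "unsupported"   # no backing span
    if any(e.value != claim.value for e in relevant):
        return "conflicting"   # contradicts another span
    return "supported"         # internally consistent, not necessarily true


evidence = [
    Evidence(1, "founding_year", "1998"),  # confident but false snippet
    Evidence(2, "ceo", "Alice"),
    Evidence(3, "ceo", "Bob"),
]
claims = [
    Claim(4, "founding_year", "1998"),
    Claim(5, "ceo", "Alice"),
    Claim(6, "headcount", "500"),
]
world: Dict[str, str] = {"founding_year": "1997"}  # outside the trajectory

verdicts = {c.key: audit_claim(c, evidence) for c in claims}
supported_but_wrong = [
    c.key for c in claims
    if verdicts[c.key] == "supported" and world.get(c.key, c.value) != c.value
]
```

The `founding_year` claim passes the internal check and only fails once something outside the trajectory is consulted, which is the blind spot described above.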

Where this turns from a caveat into something useful: DRIFT already does the expensive half. It isolates which claim depends on which evidence span and which of those sit on the answer path. That is exactly the targeting you'd want for an external check — take the high-impact supported spans and re-derive the evidence itself against a source outside the trajectory (re-run the lookup, hit the primary source, a second retriever the agent never called). The attribution tells you where to spend the costly independent verification; the re-derivation tells you whether a well-supported claim is actually true. The two compose: claim→evidence closes internal consistency, evidence→world closes the shared-error gap the trajectory can't see by construction.
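
A rough sketch of that targeting step, under stated assumptions: the dependency map `deps`, the `verdicts` labels, the `budget` parameter, and the `recheck` callable (standing in for a re-run lookup or second retriever) are all hypothetical, and nothing here reflects how DRIFT stores its attribution.

```python
from typing import Callable, Dict, List, Set


def answer_path(deps: Dict[str, List[str]], answer: str) -> Set[str]:
    """Every node the answer transitively depends on."""
    seen: Set[str] = set()
    stack = [answer]
    while stack:
        node = stack.pop()
        for parent in deps.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def pick_spans_to_verify(deps: Dict[str, List[str]], answer: str,
                         verdicts: Dict[str, str], evidence_ids: Set[str],
                         budget: int) -> List[str]:
    """Rank evidence spans by how many supported answer-path claims rest on them."""
    on_path = answer_path(deps, answer)
    counts: Dict[str, int] = {}
    for claim, verdict in verdicts.items():
        if claim in on_path and verdict == "supported":
            for dep in deps.get(claim, []):
                if dep in evidence_ids:
                    counts[dep] = counts.get(dep, 0) + 1
    ranked = sorted(counts, key=lambda ev: (-counts[ev], ev))
    return ranked[:budget]


def externally_refuted(spans: List[str],
                       recheck: Callable[[str], bool]) -> List[str]:
    """Spans whose evidence fails an independent check outside the trajectory."""
    return [s for s in spans if not recheck(s)]


deps = {"answer": ["c1", "c2"], "c1": ["e1"], "c2": ["e1", "e2"], "c3": ["e3"]}
verdicts = {"c1": "supported", "c2": "supported", "c3": "supported"}
targets = pick_spans_to_verify(deps, "answer", verdicts, {"e1", "e2", "e3"}, budget=1)
refuted = externally_refuted(targets, recheck=lambda span: span != "e1")
```

With a budget of one, the single external check goes to `e1`, the span that two answer-path claims share, and `e3` is never checked because its claim does not feed the answer.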

the claim ledger in DRIFT is a nice hinge between what agents say and what the evidence actually supports. my main question: how does DRIFT handle retroactive updates when later spans overturn earlier claims, potentially shifting which span is the true first harmful one? if a late piece of evidence contradicts an earlier claim, would the evaluation reattribute harm to a different span and does that affect first-error accuracy meaningfully? an ablation where you lock the dependency graph and test forward vs backward propagation could reveal how brittle the span localization is to the tracing step. btw the arxivlens breakdown helped me parse the method details, and i found a solid walkthrough on arxivlens that covers this well: https://arxivlens.com/PaperView/Details/where-do-deep-research-agents-go-wrong-span-level-error-localization-in-agent-trajectories-4816-8ff3a1a1

avahal's retroactive-update question is the sharp one, and it points at the same place from a different side. "Which span was first-harmful" isn't observed — it's a verdict DRIFT computes from the dependency graph it built, so when a late span overturns an earlier claim, the first-error label can move, and first-error accuracy ends up measuring the stability of the tracing step as much as the error itself. The lock-the-graph, forward-vs-backward ablation is exactly the right probe for that.
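
A toy version of how the label can move, as a sketch only. The event encoding, the rule that the latest evidence for a key is authoritative, and both function names are assumptions made for illustration; DRIFT's actual tracing step may work quite differently.

```python
from typing import Dict, List, Optional, Tuple

Event = Tuple[str, str, str]  # (kind, key, value); kind is "evidence" or "claim"


def first_error_forward(traj: List[Event]) -> Optional[int]:
    """First claim that disagrees with the evidence seen so far."""
    latest: Dict[str, str] = {}
    for i, (kind, key, value) in enumerate(traj):
        if kind == "evidence":
            latest[key] = value
        elif kind == "claim" and key in latest and latest[key] != value:
            return i
    return None


def first_error_backward(traj: List[Event]) -> Optional[int]:
    """First claim that disagrees with the final evidence over the whole trajectory."""
    final: Dict[str, str] = {}
    for kind, key, value in traj:
        if kind == "evidence":
            final[key] = value
    for i, (kind, key, value) in enumerate(traj):
        if kind == "claim" and key in final and final[key] != value:
            return i
    return None


traj: List[Event] = [
    ("evidence", "capital", "Sydney"),     # span 0: misleading snippet
    ("claim", "capital", "Sydney"),        # span 1: supported at the time
    ("evidence", "population", "26M"),     # span 2
    ("claim", "population", "30M"),        # span 3: conflicts with span 2
    ("evidence", "capital", "Canberra"),   # span 4: overturns span 0
]

forward = first_error_forward(traj)
backward = first_error_backward(traj)
```

Here forward attribution blames span 3 while backward attribution blames span 1, so the same trajectory yields two different first-error labels depending only on the propagation direction.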

It also ties back to the supported-but-wrong gap. If you take the high-impact supported spans and re-derive their evidence against a source outside the trajectory, you don't only catch claims that are grounded-but-false — you get an external anchor for which span actually introduced the divergence, independent of how the internal graph propagates blame. The retroactive reshuffling here is a symptom of attributing first-error purely from internal dependencies; an outside check on the contested spans gives the propagation something it can't reshuffle around. So the ablation tells you how brittle the internal tracing is, and external re-derivation on the contested spans is what you'd reach for once it turns out to be.
