arXiv:2602.06075

MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments

Published on Feb 3 · Submitted by Guangyi Liu on Feb 9

Abstract

Current mobile GUI agent benchmarks systematically fail to assess memory capabilities, with only 5.2-11.8% of tasks being memory-related and no evaluation of cross-session learning. We introduce MemGUI-Bench, a comprehensive memory-centric benchmark with pass@k and staged LLM-as-judge evaluation. Our contributions include: (1) a systematic memory taxonomy analyzing 11 agents across 5 architectures; (2) 128 tasks across 26 applications, 89.8% of which challenge memory through cross-temporal and cross-spatial retention; (3) MemGUI-Eval, an automated pipeline with Progressive Scrutiny and 7 hierarchical metrics; and (4) RQ-driven assessment of 11 state-of-the-art agents. Our experiments reveal significant memory deficits across all evaluated systems, identify 5 distinct failure modes, and synthesize 5 actionable design implications. All resources, including code, benchmark, and evaluation results, will be fully open-sourced and continuously maintained at https://lgy0404.github.io/MemGUI-Bench/.

AI-generated summary

A comprehensive memory-focused benchmark for mobile GUI agents reveals significant memory capability gaps and provides systematic evaluation methods and design insights.
