arXiv:2601.08118

MirrorBench: An Extensible Framework to Evaluate User-Proxy Agents for Human-Likeness

Published on Jan 13 · Submitted by Ashutosh Hathidara on Jan 23

AI-generated summary

Large language models are evaluated as user simulators through a reproducible benchmark framework that assesses their ability to generate human-like conversational responses across diverse tasks.

Abstract
Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, underscoring the need for principled evaluation of so-called user proxy agents. We present MirrorBench, a reproducible, extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational tasks, explicitly decoupled from downstream task success. MirrorBench features a modular execution engine with typed interfaces, metadata-driven registries, multi-backend support, caching, and robust observability. The system supports pluggable user proxies, datasets, tasks, and metrics, enabling researchers to evaluate arbitrary simulators under a uniform, variance-aware harness. We include three lexical-diversity metrics (MATTR, Yule's K, and HD-D) and three LLM-judge-based metrics (GTEval, Pairwise Indistinguishability, and Rubric-and-Reason). Across four open datasets, MirrorBench yields variance-aware results and reveals systematic gaps between user proxies and real human users. The framework is open source and includes a simple command-line interface for running experiments, managing configurations and caching, and generating reports. The framework can be accessed at https://github.com/SAP/mirrorbench.
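To make the lexical-diversity side of the metric suite concrete, here is a minimal sketch of two of the reported metrics, MATTR (moving-average type-token ratio) and Yule's K, computed over a tokenized utterance. It is written from the standard definitions of these metrics, not from MirrorBench's code; the window size of 50 is a common default and an assumption here.

```python
from collections import Counter


def mattr(tokens: list[str], window: int = 50) -> float:
    """Moving-average type-token ratio: mean TTR over all sliding windows."""
    if len(tokens) < window:
        # Fall back to a plain type-token ratio for short utterances.
        return len(set(tokens)) / max(len(tokens), 1)
    ttrs = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ttrs) / len(ttrs)


def yules_k(tokens: list[str]) -> float:
    """Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2, where V_i is the number
    of types occurring exactly i times and N is the token count.
    Higher K means more repetition, i.e. lower lexical diversity."""
    n = len(tokens)
    if n == 0:
        return 0.0
    freq_of_freqs = Counter(Counter(tokens).values())
    s2 = sum(i * i * v for i, v in freq_of_freqs.items())
    return 1e4 * (s2 - n) / (n * n)


if __name__ == "__main__":
    utterance = "i would like to book a table for two tonight if that works".split()
    print(f"MATTR:    {mattr(utterance, window=5):.3f}")
    print(f"Yule's K: {yules_k(utterance):.1f}")
```

In a benchmark like the one described above, such scores would be aggregated per user proxy and compared against the distribution observed for real human utterances on the same tasks.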

Community

The framework is open-sourced at https://github.com/SAP/mirrorbench

Paper submitter

MirrorBench is an automatic, extensible framework for evaluating user-proxy agents for human-likeness. It provides a modular architecture for benchmarking different user-proxy agents against a variety of realism metrics, and it is designed so that researchers and developers can bring their own agents and metrics into the framework (an illustrative sketch of such an extension point follows below).

GitHub Repo: https://github.com/SAP/mirrorbench
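For a sense of what "bring your own metrics" could look like in practice, here is a hypothetical registry-based extension point. The names RealismMetric, register_metric, and AverageWordLength are inventions for this sketch, not MirrorBench's actual API; the real interfaces are documented in the repository linked above.

```python
from abc import ABC, abstractmethod

# Hypothetical extension point, for illustration only; MirrorBench's real
# interfaces live in the repository linked above and may differ.


class RealismMetric(ABC):
    """A metric that scores how human-like a simulated user utterance is."""

    name: str = "base"

    @abstractmethod
    def score(self, utterance: str) -> float:
        """Return a scalar realism score for a single utterance."""


METRIC_REGISTRY: dict[str, type[RealismMetric]] = {}


def register_metric(cls: type[RealismMetric]) -> type[RealismMetric]:
    """Class decorator that makes a metric discoverable by name."""
    METRIC_REGISTRY[cls.name] = cls
    return cls


@register_metric
class AverageWordLength(RealismMetric):
    """Toy example: unusually long words can signal unnatural phrasing."""

    name = "avg_word_length"

    def score(self, utterance: str) -> float:
        words = utterance.split()
        return sum(len(w) for w in words) / max(len(words), 1)


if __name__ == "__main__":
    metric = METRIC_REGISTRY["avg_word_length"]()
    print(metric.score("could you move my reservation to 7 pm"))
```

A metadata-driven registry of this kind is one common way to let a harness discover and run user-supplied proxies, datasets, and metrics by name from a configuration file.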
