This paper introduces VISTA, a Versatile Interactive user Simulation Toolkit for Agent evaluation. It targets two limitations of existing user-simulation frameworks: they offer few ways to measure the quality of simulated interactions, and most are restricted to either UI-only or API-only actions. VISTA pairs a suite of six metrics covering realism, capability coverage, and interaction effectiveness with a hybrid simulator that combines UI and API interactions. In e-commerce shopping and education customer service settings, the authors report that VISTA produces more realistic and comprehensive evaluations than existing methods.
Key takeaway: combining UI and API interactions in a single user simulator yields more realistic and comprehensive agent evaluations than existing UI-only or API-only methods.
Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to expose meaningful failure modes. While user-simulation-based evaluation offers a promising alternative, existing simulation frameworks suffer from two major limitations. First, they provide limited mechanisms for evaluating the quality and comprehensiveness of simulated interactions, making it difficult to assess whether a simulator sufficiently explores an agent's capabilities and failure modes. Second, most frameworks are restricted to either UI-only actions or API-only actions, limiting their ability to model the full range of realistic user behaviors. To address these limitations, we propose VISTA, a Versatile Interactive user Simulation Toolkit for Agent evaluation. Our toolkit includes a suite of six metrics for measuring the realism, capability coverage, and interaction effectiveness of simulated interactions. In addition, we develop a hybrid user simulator that integrates both UI-based interactions and API-based interactions, enabling more realistic and comprehensive evaluation across diverse interactive environments. We evaluate VISTA in e-commerce shopping and education customer service settings and demonstrate that it produces more realistic and comprehensive evaluations than existing methods.
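To make the idea of a hybrid action space concrete, the following is a minimal Python sketch and not VISTA's actual interface. Every name in it (`UIAction`, `APIAction`, `HybridUserSimulator`, `capability_coverage`, the endpoint and field names) is hypothetical, the simulator merely replays a fixed script, and the coverage function is a simple stand-in that should not be read as one of the paper's six metrics.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass(frozen=True)
class UIAction:
    """A simulated user gesture on the interface (hypothetical schema)."""
    kind: str      # e.g. "click" or "type"
    target: str    # e.g. a button or input field name
    text: str = ""


@dataclass(frozen=True)
class APIAction:
    """A simulated direct call to a backend endpoint (hypothetical schema)."""
    endpoint: str
    params: tuple = ()  # tuple of (key, value) pairs, kept hashable


Action = Union[UIAction, APIAction]


@dataclass
class HybridUserSimulator:
    """Replays a fixed script that mixes UI and API actions (assumption:
    a real simulator would choose actions dynamically)."""
    script: list
    trace: list = field(default_factory=list)

    def step(self):
        """Return the next action, or None when the script is exhausted."""
        if len(self.trace) >= len(self.script):
            return None
        action = self.script[len(self.trace)]
        self.trace.append(action)
        return action


def capability_coverage(trace, capability_of, all_capabilities):
    """Fraction of known agent capabilities touched by at least one action."""
    touched = {capability_of(a) for a in trace} & set(all_capabilities)
    return len(touched) / len(all_capabilities)


def capability_of(action):
    """Toy mapping from an action to the capability it exercises."""
    return "search" if isinstance(action, UIAction) else "cart"


# One session that searches through the UI, then adds to cart via an API call.
script = [
    UIAction("type", "search_box", "running shoes"),
    UIAction("click", "search_button"),
    APIAction("/cart/add", (("item_id", "sku-1"),)),
]
sim = HybridUserSimulator(script)
while sim.step() is not None:
    pass

coverage = capability_coverage(sim.trace, capability_of,
                               ["search", "cart", "checkout"])
```

In this toy session the trace touches two of the three listed capabilities, so a UI-only or API-only simulator replaying just its own subset of the script would score lower on the same measure.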