Mar 10, 2026arXiv:2603.09643

MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings

AI Summary

The paper introduces MM-tau-p$^2$, a new benchmark for evaluating multi-modal LLM agents in dual-control settings, emphasizing persona adaptation and user input during planning. It highlights the limitations of current evaluation frameworks that overlook user persona and multi-modal robustness. The benchmark uses 12 novel metrics and LLM-as-judge approach to evaluate agents in telecom and retail domains, revealing challenges even with advanced models like GPT-5 and GPT 4.1.

Key Contribution

Even GPT-5 struggles with multi-modal robustness and turn overhead when user personas and multi-modal inputs are considered in agent evaluation, revealing critical gaps in current LLM agent capabilities.

Abstract

Current evaluation frameworks and benchmarks for LLM powered agents focus on text chat driven agents, these frameworks do not expose the persona of user to the agent, thus operating in a user agnostic environment. Importantly, in customer experience management domain, the agent's behaviour evolves as the agent learns about user personality. With proliferation of real time TTS and multi-modal language models, LLM based agents are gradually going to become multi-modal. Towards this, we propose the MM-tau-p$^2$ benchmark with metrics for evaluating the robustness of multi-modal agents in dual control setting with and without persona adaption of user, while also taking user inputs in the planning process to resolve a user query. In particular, our work shows that even with state of-the-art frontier LLMs like GPT-5, GPT 4.1, there are additional considerations measured using metrics viz. multi-modal robustness, turn overhead while introducing multi-modality into LLM based agents. Overall, MM-tau-p$^2$ builds on our prior work FOCAL and provides a holistic way of evaluating multi-modal agents in an automated way by introducing 12 novel metrics. We also provide estimates of these metrics on the telecom and retail domains by using the LLM-as-judge approach using carefully crafted prompts with well defined rubrics for evaluating each conversation.

Eval Frameworks & Benchmarks Multimodal Models Tool Use & Agents

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings

Related Papers