This paper introduces ALMANAC, a dataset of action-level mental model annotations built from the Map Task, a well-established dyadic routing task. It captures 2,987 collaboration actions, each annotated with the participants' self-reasoning, perceived partner intent, and perceived team goal, and so addresses the lack of authentic human collaboration data needed to train more competent collaborative agents. Benchmarking six LLMs on predicting humans' next-turn behavior and mental models shows that ALMANAC is useful for evaluating how well models simulate human collaborative behavior and infer the mental models behind it.
ALMANAC provides action-level human mental model annotations for evaluating agents' collaborative competence and, potentially, for guiding agents toward it.
Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi-step reasoning, planning, and tool use, that increasingly position these agents as human collaborators. Effective collaboration, however, requires collaborators to continuously maintain and align mental models of their own reasoning, partners' intentions, and shared goals during the collaborative process. Today's agents rarely develop such capabilities since they are primarily optimized for task completion, and the community lacks authentic human collaboration data with action-level mental model annotations that could guide agents toward process-level collaborative competence. To bridge this gap, we present ALMANAC, a dataset of Action-Level Mental model ANnotations for Agent Collaboration built from the Map Task, a classic dyadic routing task from social science. ALMANAC contains 2,987 collaboration actions, each paired with theory-informed mental model annotations that record the participants' self-reasoning, perceived partner intent, and perceived team goal. We benchmark six LLMs on predicting humans' next-turn behavior and mental models. Our results demonstrate ALMANAC's utility in evaluating models' ability to simulate human collaborative behaviors and infer their underlying mental models.
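To make the record structure described above concrete, here is a minimal Python sketch of how one annotated action and a next-turn prediction example might be represented. It is illustrative only: the class name `AnnotatedAction`, its field names, the role labels, the helper `next_turn_examples`, and the toy dialogue are all assumptions made for this sketch, not the dataset's published schema or the authors' benchmark code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AnnotatedAction:
    """One collaboration action plus its three annotations (hypothetical schema)."""
    speaker: str                   # hypothetical role label, e.g. "giver" or "follower"
    action: str                    # the surface action (utterance or move)
    self_reasoning: str            # the participant's stated reasoning
    perceived_partner_intent: str  # what the participant thinks the partner intends
    perceived_team_goal: str       # what the participant thinks the team is aiming for


def next_turn_examples(
    dialogue: List[AnnotatedAction],
) -> List[Tuple[List[str], AnnotatedAction]]:
    """Pair every action after the first with the visible history before it.

    Only surface actions enter the history; the annotations stay in the
    target, so a model has to infer them rather than read them.
    """
    examples = []
    for i in range(1, len(dialogue)):
        history = [f"{a.speaker}: {a.action}" for a in dialogue[:i]]
        examples.append((history, dialogue[i]))
    return examples


# Invented toy dialogue, not taken from the dataset.
toy_dialogue = [
    AnnotatedAction("giver", "Start at the top left.", "Anchor the route.",
                    "Wants a starting point.", "Reproduce the route."),
    AnnotatedAction("follower", "Okay, I am there.", "Confirm position.",
                    "Will give the next step.", "Reproduce the route."),
    AnnotatedAction("giver", "Go down past the lake.", "Use a landmark.",
                    "Is ready to move.", "Reproduce the route."),
]

examples = next_turn_examples(toy_dialogue)
```

Each element of `examples` holds the dialogue history as plain strings and the next action as the prediction target. Keeping the annotations out of the history is a design choice of this sketch, so that predicting the mental model stays a separate inference step from predicting the behavior.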