Centaur AI | May 28, 2026 | arXiv:2605.30058

HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?

Wei Peng, Chenxu Zhang, Qian'ao Wang, Heng Lian, Qihong Mao, Jiahao Pang, Chunliang Feng, Bowen Li, Xiaodong Gu

AI Summary

The paper introduces HEART-Bench, a benchmark for evaluating whether LLM agents can simulate coherent, human-like psychology. It comprises 11 diverse human characters grounded in the Big Five personality traits, each with 1,000 autobiographical memories, and 64 decision-making scenarios guided by the DIAMONDS taxonomy. The benchmark tests whether agents can consolidate personality traits and memories to make behavioral decisions consistent with their psychological profiles. After human validation and filtering, it contains 673 multiple-choice questions.

Key Contribution

LLM agents struggle to consistently reflect human-like psychology, even when provided with extensive personality profiles and autobiographical memories, suggesting current models lack a deeper understanding of human behavior.

Abstract

While LLM agents have demonstrated remarkable task-oriented abilities such as planning, reasoning, and action, few works have treated them as complete human personalities where emotional dimensions hold equal importance. In this paper, we introduce a novel benchmark to systematically assess whether LLM agents can simulate coherent, human-like psychology. Specifically, our benchmark constructs 11 diverse human characters grounded in orthogonal Big Five personality traits, with each profile deeply integrated with 1,000 structured autobiographical-style episodic memories distributed across theory-grounded developmental life stages. To rigorously evaluate the psychological manifestations of LLMs, we designed a curated suite of 64 decision-making scenarios, guided by the DIAMONDS taxonomy, a psychological framework that characterizes situations along eight dimensions: Duty, Intellect, Adversity, Mating, pOsitivity, Negativity, Deception, and Sociality. By subjecting agents to varying scenarios, the benchmark evaluates whether they can consolidate their innate personality traits and autobiographical memories to make behavioral decisions that are consistent with their specific psychological profiles. After systematic human validation and filtering, we obtained a benchmark consisting of 673 multiple-choice questions (MCQs). We believe this benchmark provides a principled and scalable testbed for studying human-like emotions, personality consistency, and value-consistent behavioural decision-making in LLM-based agents.
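The benchmark structure described in the abstract can be pictured with a small data-model sketch. This is only an illustration under stated assumptions: the class names (`Memory`, `Character`, `MCQ`), the field names, the "high"/"low" trait encoding, and the use of plain accuracy as the score are all hypothetical and are not taken from the paper. Only the eight DIAMONDS dimension names come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable

# The eight DIAMONDS situation dimensions named in the abstract.
DIAMONDS = (
    "Duty", "Intellect", "Adversity", "Mating",
    "pOsitivity", "Negativity", "Deception", "Sociality",
)

# Standard Big Five trait names (the abstract does not list them).
BIG_FIVE = (
    "openness", "conscientiousness", "extraversion",
    "agreeableness", "neuroticism",
)


@dataclass
class Memory:
    life_stage: str  # hypothetical label, e.g. "adolescence"
    text: str


@dataclass
class Character:
    name: str
    traits: dict[str, str]  # hypothetical encoding: trait -> "high" / "low"
    memories: list[Memory] = field(default_factory=list)


@dataclass
class MCQ:
    character: str      # name of the character the question targets
    dimension: str      # one of DIAMONDS
    scenario: str
    options: list[str]
    answer: int         # index of the profile-consistent option


# An agent maps (character, question) to the index of its chosen option.
Agent = Callable[[Character, MCQ], int]


def accuracy(agent: Agent, character: Character, questions: list[MCQ]) -> float:
    """Fraction of this character's questions the agent answers consistently.

    Plain accuracy is an assumed metric, not necessarily the paper's.
    """
    relevant = [q for q in questions if q.character == character.name]
    if not relevant:
        return 0.0
    correct = sum(agent(character, q) == q.answer for q in relevant)
    return correct / len(relevant)
```

In this sketch, an agent under test would be any callable that reads the character's traits and memories plus the scenario text and returns an option index; `accuracy` then compares those choices against the profile-consistent answers.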

Eval Frameworks & Benchmarks, Natural Language Processing, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
