CAS | Apr 9, 2026 | arXiv:2604.08362

Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

Jiawei Chen, Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang, Yifei Hu, Yong Du, Tingting Gao, Yaojie Lu, Yingfei Sun, Xianpei Han, Leilei Sun, Le Sun, Xiangyu Wu, Xian Wu, Hongyu Lin

AI Summary

The paper introduces OmniBehavior, a novel benchmark for user simulation built from real-world, long-horizon, cross-scenario behavioral data. It demonstrates that existing benchmarks suffer from "tunnel vision" and that current LLMs struggle to simulate complex real-world human behavior, even with large context windows. The analysis reveals a structural bias in LLMs, causing them to converge towards a positive average person, leading to hyper-activity, persona homogenization, and a Utopian bias.

Key Contribution

LLMs exhibit a "Utopian bias" when simulating human behavior, converging towards an unrealistic "positive average person" and failing to capture individual differences and long-tail behaviors.

Abstract

The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.

Topics: Eval Frameworks & Benchmarks; Tool Use & Agents; World Models & Planning

Citation Metrics

Citations: 0
Influential citations: 0
References: 56
Year: 2026
Venue: N/A
