The authors introduce demonstratives (e.g., "this/that") as a probe to evaluate whether LLMs capture embodied cognition and cultural conventions. They establish human baselines for English and Chinese speakers, revealing cross-cultural asymmetries in demonstrative interpretation related to perspective-taking and distal ambiguity. In contrast to human performance, five state-of-the-art LLMs failed to demonstrate an understanding of proximal-distal contrasts or cultural differences, exhibiting an English-centric bias.
LLMs fail to grasp basic spatial concepts and cultural nuances encoded in demonstratives like "this" and "that," revealing a surprising lack of embodied cognition despite their vast training data.
Do large language models (LLMs) truly acquire embodied cognition and cultural conventions from text? We introduce demonstratives, fundamental spatial expressions like "this/that" in English and "zhè/nà" in Chinese, as a novel probe for grounded knowledge. Using 6,400 responses from 320 native speakers, we establish a human baseline: English speakers reliably distinguish proximal-distal referents but struggle with perspective-taking, while Chinese speakers switch perspectives fluently but tolerate distal ambiguity. In contrast, five state-of-the-art LLMs fail to inherently understand the proximal-distal contrast and show no cultural differences, defaulting to English-centric reasoning. Our study contributes (i) a new task, based on demonstratives, as a new lens for evaluating embodied cognition and cultural conventions; (ii) empirical evidence of cross-cultural asymmetries in human interpretation; (iii) a new perspective on the egocentric-sociocentric debate, showing both orientations coexist but vary across languages; and (iv) a call to address individual variation in future model design.