The paper identifies a systematic failure in LLM reasoning, termed "heuristic override," in which salient surface cues override implicit feasibility constraints. Through causal-behavioral analysis and a new benchmark (HOB), the authors show that LLMs rely on context-independent sigmoid heuristics, with surface cues such as distance significantly outweighing goals. Minimal hints improve performance, suggesting the issue lies in constraint inference rather than knowledge gaps, and goal-decomposition prompting offers a partial remedy.
LLMs are surprisingly bad at reasoning about everyday scenarios, consistently choosing nonsensical actions (like walking to a car wash) because they are overly influenced by simple heuristics such as distance, even when following the heuristic violates an obvious feasibility constraint.
Large language models systematically fail when a salient surface cue conflicts with an unstated feasibility constraint. We study this through a diagnose-measure-bridge-treat framework. Causal-behavioral analysis of the "car wash problem" across six models reveals approximately context-independent sigmoid heuristics: the distance cue exerts 8.7 to 38 times more influence than the goal, and token-level attribution shows patterns more consistent with keyword associations than compositional inference. The Heuristic Override Benchmark (HOB) -- 500 instances spanning 4 heuristic families × 5 constraint families with minimal pairs and explicitness gradients -- demonstrates generality across 14 models: under strict evaluation (10/10 correct), no model exceeds 75%, and presence constraints are hardest (44%). A minimal hint (e.g., emphasizing the key object) recovers +15 pp on average, suggesting the failure lies in constraint inference rather than missing knowledge; 12/14 models perform worse when the constraint is removed (up to -39 pp), revealing a conservative bias. Parametric probes confirm that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics; goal-decomposition prompting recovers +6 to +9 pp by forcing models to enumerate preconditions before answering. Together, these results characterize heuristic override as a systematic reasoning vulnerability and provide a benchmark for measuring progress toward resolving it.
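The "context-independent sigmoid heuristic" claim can be illustrated with a minimal sketch: model the probability of choosing the cue-favored (but infeasible) option as a logistic function of a weighted sum of cue and goal strength. The function name, cue/goal encoding, and specific weights below are hypothetical illustrations, chosen only so that the cue-to-goal weight ratio (about 13x) falls inside the 8.7-38x range the abstract reports; they are not fitted values from the paper.

```python
import math

def sigmoid(z: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def p_surface_choice(cue: float, goal: float,
                     w_cue: float = 4.0, w_goal: float = 0.3) -> float:
    """Illustrative model: probability of picking the cue-favored option
    as a sigmoid over normalized cue strength (e.g. relative distance
    advantage, in [0, 1]) minus goal strength (0 = goal unstated,
    1 = goal fully explicit). w_cue / w_goal ~ 13 mirrors the reported
    dominance of surface cues over goals; the weights are made up."""
    return sigmoid(w_cue * cue - w_goal * goal)

# Making the goal fully explicit barely shifts the choice probability,
# while halving the cue advantage shifts it far more -- the signature
# of a cue-dominated, near context-independent sigmoid.
print(p_surface_choice(cue=0.5, goal=0.0))
print(p_surface_choice(cue=0.5, goal=1.0))
print(p_surface_choice(cue=0.25, goal=1.0))
```

Under this toy model, "heuristic override" is simply the regime where the cue term swamps the goal term, so stating the goal more explicitly moves the sigmoid only marginally.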