This paper investigates the role of anthropomorphic reflection markers (e.g., "wait," "hmm") in LLM reasoning by suppressing these markers at both the prompt and token levels. The study evaluates the impact of this suppression on performance across four benchmarks and two model scales. The key finding is that these markers are not uniformly necessary for reasoning and their suppression can even improve performance, suggesting they are surface cues rather than reliable indicators of genuine reflection.
LLMs don't need "wait, let me think..." to reason. In fact, dropping the cutesy anthropomorphic markers can actually *improve* their performance.
Large Language Models (LLMs) often produce explicit reflective traces during complex reasoning, accompanied by anthropomorphic markers such as "wait", "hmm", and "alternatively". Although these markers are commonly treated as visible indicators of reflection, their mechanisms remain unclear, and redundant or repetitive markers carry a risk of overthinking. In this work, we revisit anthropomorphic reflection markers, examining their necessity for reasoning and their role in reflection. We suppress these markers through prompt-level and token-level interventions and analyze the effects on task performance across four benchmarks and two model scales. Our results show that anthropomorphic markers are not uniformly necessary for reasoning performance: suppressing them can preserve or improve performance in several settings, especially under larger sampling budgets. Meanwhile, marker suppression does not necessarily remove reflection behavior, as models can still perform marker-free verification. These findings suggest that anthropomorphic markers tend to be surface cues rather than reliable proxies for reflection itself, and they motivate future research on reasoning mechanisms beyond explicit marker patterns.
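The abstract does not say how the token-level intervention is implemented. One common way to suppress specific tokens at decoding time is logit masking: the scores of the unwanted tokens are set to negative infinity before the next token is chosen. The sketch below illustrates that idea on a toy vocabulary. It is an assumption for illustration, not the authors' method, and the names `VOCAB`, `MARKERS`, `suppress_markers`, `softmax`, and `greedy_pick` are all hypothetical.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer would supply the token list.
VOCAB = ["the", "answer", "is", "wait", "hmm", "alternatively", "4", "."]
# Marker words named in the abstract.
MARKERS = {"wait", "hmm", "alternatively"}


def suppress_markers(logits, vocab=VOCAB, markers=MARKERS):
    """Return a copy of `logits` with every marker token set to -inf."""
    return [
        -math.inf if tok.lower() in markers else score
        for tok, score in zip(vocab, logits)
    ]


def softmax(logits):
    """Numerically stable softmax; -inf entries get probability 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits, vocab=VOCAB):
    """Return the token with the highest score."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best]


# Made-up scores in which the model strongly prefers "wait".
logits = [0.1, 0.2, 0.3, 5.0, 1.0, 0.5, 2.0, 0.0]
masked = suppress_markers(logits)
unmasked_choice = greedy_pick(logits)
masked_choice = greedy_pick(masked)
```

With these made-up scores, the unmasked choice is the marker "wait", while the masked choice falls through to the best non-marker token. A prompt-level intervention, by contrast, would leave decoding untouched and only instruct the model not to use such words.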