The paper introduces Fragile, a benchmark for evaluating the framing sensitivity of LLMs along three dimensions: value-tinted narration, temporal slice, and narrative vividness. Experiments on Fragile show that LLMs are highly sensitive to framing, with an average decision flip rate of 28.6%, and that simple prompt-level and activation-level interventions amplify the problem rather than suppress it. To mitigate this, the authors propose Valign, a representation-level method that anchors decisions to a stable value prior and projects framing-sensitive directions out of the model's hidden states, reducing framing-induced decision flips.
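To make the headline metric concrete, here is a minimal sketch of one way to compute a decision flip rate. It assumes each case has a base decision (the model's answer on a neutral version) and a list of decisions on fact-preserving reframings, and counts a flip whenever a reframed decision differs from the base one. The function name `decision_flip_rate` and this pooled definition are illustrative assumptions; the paper's exact formula (for example, whether it averages per dimension or per model) is not given here.

```python
from typing import Sequence


def decision_flip_rate(base: Sequence[str], framed: Sequence[Sequence[str]]) -> float:
    """Pooled fraction of reframed decisions that differ from the base decision.

    Hypothetical definition: base[i] is the decision on the neutral version of
    case i; framed[i] holds decisions on fact-preserving reframings of case i.
    """
    if len(base) != len(framed):
        raise ValueError("base and framed must cover the same cases")
    flips = 0
    total = 0
    for base_decision, variants in zip(base, framed):
        for decision in variants:
            # A flip is any reframed decision that disagrees with the base decision.
            flips += int(decision != base_decision)
            total += 1
    if total == 0:
        raise ValueError("no reframed decisions to compare")
    return flips / total
```

For example, with base decisions `["a", "b"]` and reframed decisions `[["a", "c"], ["b", "b"]]`, one of four reframed decisions disagrees, giving a flip rate of 0.25.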
LLMs are surprisingly susceptible to irrelevant framing details, flipping decisions nearly 30% of the time, and naive attempts to fix it only make things worse.
Large Language Models (LLMs) are increasingly deployed in high-stakes decision-making settings such as legal reasoning, where consistency under factually equivalent inputs is critical. However, we find that inputs that preserve the facts but frame them differently can significantly destabilize LLM decisions. To systematically investigate this problem, we introduce Fragile, a large-scale benchmark that isolates fact-preserving semantic framing across three controlled dimensions: value-tinted narration, temporal slice, and narrative vividness. Our experiments show that LLMs are highly susceptible to framing, with an average decision flip rate of 28.6%. We find that existing simple prompt-level and activation-level interventions not only fail to suppress framing sensitivity but actively amplify it. We therefore propose Valign, a representation-level method that explicitly targets these framing dimensions: it anchors decisions to a stable value prior, steers hidden states toward the model's value-consistent direction, and projects directions sensitive to temporal slice and narrative vividness out of the model's hidden states. Valign consistently reduces framing-induced decision flips, demonstrating that robust mitigation requires directly targeting the internal pathways through which framing operates.
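The two representation-level operations named above, steering toward a direction and projecting directions out, can be illustrated generically. The sketch below is not Valign's implementation: it assumes framing-sensitive directions and a value direction are already given as plain vectors (how Valign estimates them is not described here), orthonormalizes the framing directions with Gram-Schmidt, removes the hidden state's component in their span, and then adds a scaled unit value direction. The names `project_out`, `steer`, and `alpha` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(directions: Sequence[Sequence[float]], eps: float = 1e-12) -> List[Vector]:
    """Gram-Schmidt: build an orthonormal basis for span(directions)."""
    basis: List[Vector] = []
    for d in directions:
        v = list(d)
        for u in basis:
            c = dot(v, u)
            v = [x - c * y for x, y in zip(v, u)]
        norm = math.sqrt(dot(v, v))
        if norm > eps:  # skip directions already spanned by the basis
            basis.append([x / norm for x in v])
    return basis


def project_out(h: Sequence[float], directions: Sequence[Sequence[float]]) -> Vector:
    """Return h minus its component in span(directions): h - sum_i (h . u_i) u_i."""
    out = list(h)
    for u in orthonormalize(directions):
        c = dot(out, u)
        out = [x - c * y for x, y in zip(out, u)]
    return out


def steer(h: Sequence[float], value_direction: Sequence[float], alpha: float) -> Vector:
    """Add alpha times the unit-normalized value direction to h (hypothetical steering rule)."""
    norm = math.sqrt(dot(value_direction, value_direction))
    if norm == 0:
        raise ValueError("value_direction must be non-zero")
    return [x + alpha * v / norm for x, v in zip(h, value_direction)]


# Usage: remove the framing-sensitive z-axis component, then steer along x.
h = [1.0, 2.0, 3.0]
cleaned = project_out(h, [[0.0, 0.0, 2.0]])  # [1.0, 2.0, 0.0]
adjusted = steer(cleaned, [2.0, 0.0, 0.0], alpha=0.5)  # [1.5, 2.0, 0.0]
```

Applying the projection before steering is itself an assumption: if the value direction overlaps the framing-sensitive subspace, steering afterwards reintroduces part of that component, so the order and any interaction between the two steps would need to follow the paper's actual formulation.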