Feb 26, 2026 | arXiv:2602.22831

Moral Preferences of LLMs Under Directed Contextual Influence

Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie, Dmitrii Krasheninnikov

AI Summary

This paper investigates how contextual cues in prompts influence LLMs' moral decisions in trolley-problem scenarios, challenging the assumption of stable preferences in context-free benchmarks. The authors introduce a novel evaluation harness using direction-flipped contextual influences related to demographic factors to measure directional response. They find that LLMs' moral choices are significantly swayed by even superficially relevant contextual cues, that baseline preferences poorly predict steerability, and that reasoning can amplify biases from few-shot examples.

Key Contribution

LLMs' apparent moral neutrality is a mirage: even when they claim impartiality, subtle contextual cues can dramatically sway their decisions in moral dilemmas, sometimes even in the opposite direction of the cue.

Abstract

Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences. In deployment, however, prompts routinely include contextual signals such as user requests, cues on social norms, etc. that may steer decisions. We study how directed contextual influences reshape decisions in trolley-problem-style moral triage settings. We introduce a pilot evaluation harness for directed contextual influence in trolley-problem-style moral triage: for each demographic factor, we apply matched, direction-flipped contextual influences that differ only in which group they favor, enabling systematic measurement of directional response. We find that: (i) contextual influences often significantly shift decisions, even when only superficially relevant; (ii) baseline preferences are a poor predictor of directional steerability, as models can appear baseline-neutral yet exhibit systematic steerability asymmetry under influence; (iii) influences can backfire: models may explicitly claim neutrality or discount the contextual cue, yet their choices still shift, sometimes in the opposite direction; and (iv) reasoning reduces average sensitivity, but amplifies the effect of biased few-shot examples. Our findings motivate extending moral evaluations with controlled, direction-flipped context manipulations to better characterize model behavior.
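The matched, direction-flipped design described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `ChoiceRates`, `directional_shifts`, `steerability_asymmetry`, and `backfired` are hypothetical, the shift and asymmetry definitions are one plausible way to quantify directional response rather than the paper's own metrics, and the numbers in the usage example are invented, not reported results.

```python
from dataclasses import dataclass


@dataclass
class ChoiceRates:
    """Fraction of trials in which a model favors group A (hypothetical container)."""
    baseline: float  # no contextual influence in the prompt
    favor_a: float   # contextual influence that favors group A
    favor_b: float   # matched influence, flipped to favor group B


def directional_shifts(r: ChoiceRates) -> tuple[float, float]:
    """Return how far each cue moves choices in its own intended direction.

    A positive value means the model moved toward the group the cue favors;
    a negative value means it moved away from that group.
    """
    shift_toward_a = r.favor_a - r.baseline
    shift_toward_b = r.baseline - r.favor_b
    return shift_toward_a, shift_toward_b


def steerability_asymmetry(r: ChoiceRates) -> float:
    """Assumed asymmetry measure: difference between the two directional shifts.

    Zero means the model is equally steerable in both directions; a nonzero
    value can occur even when the baseline rate is a neutral 0.5.
    """
    toward_a, toward_b = directional_shifts(r)
    return toward_a - toward_b


def backfired(r: ChoiceRates) -> bool:
    """True if at least one cue moved choices opposite to its intended direction."""
    toward_a, toward_b = directional_shifts(r)
    return toward_a < 0 or toward_b < 0


if __name__ == "__main__":
    # Invented rates: neutral at baseline, but easier to steer toward A than B.
    rates = ChoiceRates(baseline=0.5, favor_a=0.8, favor_b=0.45)
    toward_a, toward_b = directional_shifts(rates)
    print(f"shift toward A: {toward_a:.2f}, shift toward B: {toward_b:.2f}")
    print(f"asymmetry: {steerability_asymmetry(rates):.2f}")
    print(f"backfired: {backfired(rates)}")
```

Because the two prompts differ only in which group the cue favors, any gap between the two shifts is attributable to direction rather than to wording. This is why a model with a neutral baseline can still show a nonzero asymmetry under this kind of measure.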

Constitutional AI & AI Ethics | Eval Frameworks & Benchmarks | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 23

Year: 2026

Venue: N/A
