This study introduces a framework for probing and steering the cultural values of large language models (LLMs) using scenario-based behavioral dilemmas, addressing a limitation of survey-style evaluations, which often elicit neutral responses. Using token-level probabilities to measure implicit values and activation steering to shift them, the authors move model behavior toward specific cultural values without retraining. The findings show substantial variation in steerability across LLMs and a latent entanglement between cultural dimensions: an intervention along one dimension can shift another, although general task performance is largely preserved.
Cultural values in LLMs can be probed through scenario-based dilemmas and shifted with activation steering, but the cultural dimensions are entangled, which constrains axis-independent alignment.
Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions, which frequently elicit neutral or safety-aligned responses and fail to capture underlying model preferences. We propose a framework for probing and steering latent cultural representations in LLMs along the two Inglehart–Welzel axes of the World Values Survey (WVS). By translating social value questions into scenario-based behavioral dilemmas, we extract token-level probabilities to measure implicit values and apply activation steering, optionally combined with country-conditioned prompting, to shift model behavior without retraining. Across three open-source LLMs and four target cultures, we find substantial variation in steerability and identify latent entanglement, where interventions along one cultural dimension induce shifts along another. This coupling mirrors correlations in human WVS data and persists across activation, prompt, and hybrid steering. It constrains axis-independent alignment, though general task performance is largely preserved.
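The two mechanisms named in the abstract, reading an implicit preference from token-level probabilities and shifting it with activation steering, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the abstract does not say how the steering vector is built, so a mean-difference vector is used as one common choice, and the function names, the two-option "A"/"B" dilemma format, and the two-dimensional toy readout are hypothetical stand-ins for a real model.

```python
import math


def option_preference(logits, option_a="A", option_b="B"):
    """P(option_a), renormalised over the two answer tokens of a dilemma.

    `logits` maps token strings to next-token logits (hypothetical format).
    """
    la, lb = logits[option_a], logits[option_b]
    m = max(la, lb)  # subtract the max for numerical stability
    ea, eb = math.exp(la - m), math.exp(lb - m)
    return ea / (ea + eb)


def mean_difference_vector(pos_acts, neg_acts):
    """Steering vector: mean activation of one group minus mean of the other.

    Assumption: this is one common construction, not necessarily the paper's.
    """
    dim = len(pos_acts[0])

    def mean(rows):
        return [sum(r[i] for r in rows) / len(rows) for i in range(dim)]

    return [p - n for p, n in zip(mean(pos_acts), mean(neg_acts))]


def steer(hidden, vector, alpha):
    """Add the scaled steering vector to a hidden state; no weights change."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


def toy_readout(hidden):
    """Hypothetical stand-in for the model head: one logit per option token."""
    return {"A": hidden[0], "B": hidden[1]}


def demo():
    # Toy activations for prompts expressing each pole of a value axis.
    pos = [[1.0, 0.0], [3.0, 0.0]]
    neg = [[0.0, 1.0], [0.0, 3.0]]
    vector = mean_difference_vector(pos, neg)

    hidden = [0.0, 0.0]  # neutral state: both options equally likely
    before = option_preference(toy_readout(hidden))
    after = option_preference(toy_readout(steer(hidden, vector, alpha=1.0)))
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print(f"P(A) before steering: {before:.3f}, after: {after:.3f}")
```

In a real setting, the logits would come from the model's next-token distribution after the dilemma prompt, and the vector would be added to a chosen layer's activations during the forward pass. The scale `alpha` then controls how strongly behavior is pushed along the axis.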