Aalborg | KU | Jun 9, 2026 | arXiv:2606.11399

Scenario-based Probing and Steering Cultural Values in Large Language Models--Extended Version

Trung Duc Anh Dang, Tung Kieu, Sarah Masud

AI Summary

This study introduces a framework for probing and steering the cultural values of large language models (LLMs) using scenario-based behavioral dilemmas, addressing a limitation of survey-style evaluations, which often elicit neutral or safety-aligned responses. Using token-level probabilities to measure implicit values and activation steering to shift them, the authors move model behavior toward specific cultural values without retraining. The findings show substantial variation in how steerable different LLMs are and a latent entanglement between cultural dimensions: interventions along one dimension induce shifts along another. General task performance is largely preserved.

Key Contribution

Cultural values in LLMs can be probed through scenario-based dilemmas and shifted through activation steering, revealing an entanglement between cultural dimensions that constrains axis-independent alignment.

Abstract

Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions, which frequently elicit neutral or safety-aligned responses and fail to capture underlying model preferences. We propose a framework for probing and steering latent cultural representations in LLMs along the two Inglehart-Welzel axes of the World Values Survey (WVS). By translating social value questions into scenario-based behavioral dilemmas, we extract token-level probabilities to measure implicit values and apply activation steering, optionally combined with country-conditioned prompting, to shift model behavior without retraining. Across three open-source LLMs and four target cultures, we find substantial variation in steerability and identify latent entanglement, where interventions along one cultural dimension induce shifts along another. This coupling mirrors correlations in human WVS data and persists across activation, prompt, and hybrid steering. It constrains axis-independent alignment, though general task performance is largely preserved.
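The two mechanisms named in the abstract, reading a value preference from token-level probabilities and shifting behavior with activation steering, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names (`option_probabilities`, `value_score`, `steer`), the two-option "A"/"B" dilemma format, the signed score, and the unit-normalization of the steering direction are all assumptions made for illustration; in practice the logits and hidden states would come from a real model rather than hand-written lists.

```python
import math


def option_probabilities(option_logits):
    """Renormalize the logits of the answer tokens (e.g. "A", "B") into probabilities.

    Hypothetical helper: assumes the dilemma is posed as a choice between
    labeled options and that only the logits of those labels are compared.
    """
    top = max(option_logits.values())  # subtract the max for numerical stability
    exps = {opt: math.exp(logit - top) for opt, logit in option_logits.items()}
    total = sum(exps.values())
    return {opt: e / total for opt, e in exps.items()}


def value_score(option_logits, positive_option="A"):
    """Signed preference in [-1, 1]: P(positive option) minus P(all other options)."""
    p = option_probabilities(option_logits)[positive_option]
    return p - (1.0 - p)


def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalized direction to a hidden-state vector.

    Assumption: steering is a simple additive shift along one direction;
    the source does not specify how the direction is derived or scaled.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("steering direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]


if __name__ == "__main__":
    # Made-up logits for the two answer tokens of one dilemma.
    print(value_score({"A": 2.0, "B": 1.0}))
    # Made-up 2-dimensional hidden state and steering direction.
    print(steer([1.0, 1.0], [3.0, 4.0], alpha=5.0))
```

A positive score means the model leans toward the option labeled as the positive pole of the value axis, and a larger `alpha` pushes the hidden state further along the chosen direction. Measuring the score on one axis while steering along another is one way to observe the kind of cross-dimension shift the abstract calls latent entanglement.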

Constitutional AI & AI Ethics | RLHF & Preference Learning

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
