This paper evaluates LLMs' ability to identify and rank human values expressed in ethnographic interviews, using the Schwartz Theory of Basic Values as a framework. The authors compare LLM outputs to expert annotations, examining both performance metrics (F1, Jaccard, RBO) and how well the models' uncertainty patterns align with those of the experts. Results indicate that while LLMs achieve near-human performance on set-based metrics, they struggle to recover the experts' value rankings and exhibit uncertainty structures that diverge from the experts', suggesting potential value biases.
LLMs identify the *presence* of human values in qualitative data almost as well as experts, but their *ranking* of those values and their uncertainty patterns still diverge from the experts', hinting at hidden biases.
Qualitative analysis of open-ended interviews plays a central role in ethnographic and economic research by uncovering individuals' values, motivations, and culturally embedded financial behaviors. While large language models (LLMs) offer promising support for automating and enriching such interpretive work, their ability to produce nuanced, reliable interpretations under inherent task ambiguity remains unclear. In our work, we evaluate LLMs on the task of identifying the top three human values expressed in long-form interviews based on the Schwartz Theory of Basic Values framework. We compare their outputs to expert annotations, analyzing both performance and uncertainty patterns relative to the experts. Results show that LLMs approach the human ceiling on set-based metrics (F1, Jaccard) but struggle to recover exact value rankings, as reflected in lower RBO scores. While the average Schwartz value distributions of most models closely match those of human analysts, their uncertainty structures across the Schwartz values diverge from expert uncertainty patterns. Among the evaluated models, Qwen performs closest to expert-level agreement and exhibits the strongest alignment with expert Schwartz value distributions. LLM ensemble methods yield consistent gains across metrics, with Majority Vote and Borda Count performing best. Notably, systematic overemphasis on certain Schwartz values, like Security, suggests both the potential of LLMs to provide complementary perspectives and the need to further investigate model-induced value biases. Overall, our findings highlight both the promise and the limitations of LLMs as collaborators in inherently ambiguous qualitative value analysis.
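To make the difference between set-based and rank-sensitive agreement concrete, here is a minimal Python sketch of the metrics and one of the ensemble rules named in the abstract. It is an illustration under stated assumptions, not the paper's implementation. The function names, the persistence parameter `p=0.9`, the renormalized truncation of RBO to short lists, the alphabetical tie-break in the Borda count, and the example annotations are all hypothetical choices made here.

```python
from collections import defaultdict


def jaccard(pred, gold):
    """Set overlap: |intersection| / |union| (order is ignored)."""
    pred, gold = set(pred), set(gold)
    union = pred | gold
    return len(pred & gold) / len(union) if union else 1.0


def f1(pred, gold):
    """Set-based F1 from precision and recall (order is ignored)."""
    pred, gold = set(pred), set(gold)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def rbo_truncated(pred, gold, p=0.9):
    """Rank-biased overlap truncated to the shorter list's depth.

    Assumption: the weights p**(d-1) are renormalized so that two
    identical lists score exactly 1.0. The paper may use another variant.
    """
    depth = min(len(pred), len(gold))
    if depth == 0:
        return 0.0
    weighted_sum = total_weight = 0.0
    for d in range(1, depth + 1):
        weight = p ** (d - 1)
        agreement = len(set(pred[:d]) & set(gold[:d])) / d
        weighted_sum += weight * agreement
        total_weight += weight
    return weighted_sum / total_weight


def borda_count(rankings, k=3):
    """Combine ranked lists: position i in a list of length n earns n - i points.

    Assumption: ties are broken alphabetically to keep the output deterministic.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        for position, value in enumerate(ranking):
            scores[value] += len(ranking) - position
    ordered = sorted(scores, key=lambda v: (-scores[v], v))
    return ordered[:k]


# Made-up annotations for one interview (not data from the paper).
expert = ["Security", "Benevolence", "Tradition"]
model = ["Benevolence", "Security", "Tradition"]

print(jaccard(model, expert), f1(model, expert))  # same set, so both are 1.0
print(round(rbo_truncated(model, expert), 3))     # below 1.0: the order differs

model_rankings = [
    ["Security", "Benevolence", "Tradition"],
    ["Security", "Tradition", "Achievement"],
    ["Benevolence", "Security", "Tradition"],
]
print(borda_count(model_rankings))
```

In this toy case the model names the same three values as the expert but swaps the top two, so Jaccard and F1 are perfect while the rank-sensitive score drops. That is the pattern the abstract reports: strong set-based agreement alongside weaker RBO.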