This paper investigates the challenges of aligning LLMs with expert human judgment in subjective evaluation tasks. Through expert evaluations and questionnaires, the authors identify four key patterns: alignment difficulty that varies across experts, limited impact of explicit criteria, sensitivity of editing to example selection, and alignment difficulty that depends on the evaluation dimension. The findings highlight that the difficulty of expert alignment stems not only from model limitations but also from the inherent heterogeneity and tacit nature of subjective human evaluation.
Expert alignment is hard not just because of model limitations, but because human subjective evaluation is a moving target.
Aligning large language models with expert judgment is especially difficult in subjective evaluation tasks, where experts may disagree, rely on tacit criteria, and change their judgments over time. In this paper, we study expert alignment as a way to understand this difficulty. Using expert evaluations and follow-up questionnaires, we examine how different forms of expert information affect alignment and what this reveals about subjective judgment. Our findings show four consistent patterns. First, alignment difficulty varies substantially across experts, suggesting that expert evaluation styles differ widely in their distance from a model's prior behavior. Second, explicit criteria and reasoning do not always improve alignment, indicating that expert judgment is not fully captured by verbalized rules. Third, editing is sensitive to both the number and the identity of examples, with small numbers of edits providing useful but unstable gains. Fourth, alignment difficulty differs across evaluation dimensions: dimensions grounded more directly in proposal content are easier to align, while dimensions requiring external knowledge or value-based judgment remain harder. Taken together, these results suggest that expert alignment is difficult not only because of model limitations, but also because subjective evaluation is inherently heterogeneous, partly tacit, dimension-dependent, and temporally unstable.