The paper shows that holistic LLM judges conflate utility estimation with aggregation in multi-stakeholder tasks, which produces unstable implicit stakeholder weights and score shifts, especially when stakeholder satisfaction is dispersed. The authors report that this "weighting noise" grows with stakeholder count in their experiments. To address it, they propose \textsc{DecompR}, which decouples utility estimation from aggregation: counterfactual-calibrated weights are fixed from query structure before candidate scoring, and per-role utilities are estimated independently.
LLM judges in multi-stakeholder settings suffer from "weighting noise" that gets *worse* as stakeholders are added; fixing the weights before scoring removes the candidate-dependent weight drift.
Multi-stakeholder tasks require one output to satisfy users with conflicting preferences. Holistic LLM judges conflate utility estimation and utility aggregation, yielding unstable implicit weights. We show empirically and theoretically that this aggregation-specific \emph{weighting noise} can create large score shifts when stakeholder satisfaction is dispersed; in our experiments, these weight-induced shifts also increase with stakeholder count. We propose \textsc{DecompR}: counterfactual-calibrated weights are fixed from query structure before candidate scoring, while per-role utilities are estimated independently, removing candidate-dependent weight drift and reducing estimation noise.
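
The link between weight instability and dispersed satisfaction can be illustrated with a small simulation. This is a minimal sketch, not the \textsc{DecompR} implementation: it assumes a judge's score is a weighted sum of per-role utilities, models the unstable implicit weights as Gaussian perturbations of uniform weights, and uses uniform weights as a stand-in for the counterfactual-calibrated ones (the calibration procedure is not described here). All function names and parameter values are hypothetical.

```python
import random
import statistics


def aggregate(weights, utilities):
    """Score a candidate as the weighted sum of per-role utilities."""
    return sum(w * u for w, u in zip(weights, utilities))


def noisy_weights(base, sigma, rng):
    """Perturb base weights with Gaussian noise, then renormalize to sum to 1.

    Assumption: this stands in for the unstable implicit weights of a
    holistic judge; the real noise distribution is unknown.
    """
    raw = [max(1e-9, w + rng.gauss(0.0, sigma)) for w in base]
    total = sum(raw)
    return [r / total for r in raw]


def score_spread(utilities, sigma=0.1, trials=2000, seed=0):
    """Standard deviation of the score when only the weights fluctuate."""
    rng = random.Random(seed)
    n = len(utilities)
    base = [1.0 / n] * n  # placeholder for fixed, pre-calibrated weights
    scores = [
        aggregate(noisy_weights(base, sigma, rng), utilities)
        for _ in range(trials)
    ]
    return statistics.pstdev(scores)


# Same stakeholder count, different dispersion of satisfaction.
uniform_utilities = [0.6, 0.6, 0.6, 0.6]
dispersed_utilities = [0.1, 0.9, 0.2, 0.95]

spread_uniform = score_spread(uniform_utilities)
spread_dispersed = score_spread(dispersed_utilities)

# With weights fixed before scoring, the score is a deterministic
# function of the per-role utilities.
fixed_score = aggregate([0.25] * 4, dispersed_utilities)
```

Under these assumptions, `spread_uniform` is zero up to floating-point rounding, because any weights that sum to one give the same score when every stakeholder is equally satisfied, while `spread_dispersed` is clearly positive. Fixing the weights before scoring removes that source of variance entirely; what remains is the error in the per-role utility estimates, which the sketch does not model.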