This paper evaluates political bias in multi-document news summarisation across 13 LLMs using the FairNews dataset and five fairness metrics. The study challenges the assumption that larger models are fairer, finding that mid-sized models often achieve a better balance of fairness and efficiency. Furthermore, the effectiveness of debiasing interventions, particularly prompt-based methods, is highly model-dependent, and entity sentiment bias proves especially resistant to mitigation.
Mid-sized LLMs can actually be *more* fair in news summarisation than their larger counterparts, challenging the common wisdom of "bigger is better."
Multi-document news summarisation systems are increasingly adopted for their convenience in processing vast daily news content, making fairness across diverse political perspectives critical. However, these systems can exhibit political bias through unequal representation of viewpoints, disproportionate emphasis on certain perspectives, and systematic underrepresentation of minority voices. This study presents a comprehensive evaluation of such bias in multi-document news summarisation using FairNews, a dataset of complete news articles with political orientation labels, examining how 13 large language models (LLMs) handle sources with varying political leanings across five fairness metrics. We investigate both baseline model performance and the effectiveness of various debiasing interventions, including prompt-based and judge-based approaches. Our findings challenge the assumption that larger models yield fairer outputs, as mid-sized variants consistently outperform their larger counterparts, offering the best balance of fairness and efficiency. Prompt-based debiasing proves highly model-dependent, while entity sentiment emerges as the most stubborn fairness dimension, resisting all intervention strategies tested. These results demonstrate that fairness in multi-document news summarisation requires multi-dimensional evaluation frameworks and targeted, architecture-aware debiasing rather than simply scaling up.
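To make the prompt-based intervention concrete, here is a minimal sketch of how a baseline and a fairness-instructed summarisation prompt might differ. The instruction wording, the `BASELINE_PROMPT` and `DEBIASED_PROMPT` templates, and the `build_prompt` helper are illustrative assumptions, not the prompts or code used in the paper.

```python
# Minimal sketch of a prompt-based debiasing intervention for
# multi-document news summarisation. All prompt text below is a
# hypothetical example, not the paper's actual prompts.

BASELINE_PROMPT = (
    "Summarise the following news articles in a single paragraph:\n\n{articles}"
)

# Debiasing variant: the prompt explicitly asks for balanced coverage
# of political perspectives instead of relying on model defaults.
DEBIASED_PROMPT = (
    "Summarise the following news articles in a single paragraph. "
    "Give equal weight to left-leaning, centrist, and right-leaning "
    "viewpoints, and avoid favouring any political perspective:\n\n{articles}"
)


def build_prompt(articles: list[str], debias: bool = False) -> str:
    """Join the source articles and wrap them in the chosen prompt template."""
    template = DEBIASED_PROMPT if debias else BASELINE_PROMPT
    joined = "\n\n---\n\n".join(articles)
    return template.format(articles=joined)


if __name__ == "__main__":
    docs = ["Article from outlet A ...", "Article from outlet B ..."]
    print(build_prompt(docs, debias=True))
```

Under this framing, the paper's finding that prompt-based debiasing is highly model-dependent means that the same debiased template can shift fairness metrics substantially for one model while leaving another's outputs largely unchanged.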