This paper investigates language-of-study (LoS) bias in NLP peer reviews, where papers are evaluated differently based on the language(s) they study rather than on scientific merit. The authors introduce LOBSTER, a human-annotated dataset for detecting LoS bias, and a detection method that reaches 87.37 macro F1. Their analysis of 15,645 reviews shows that non-English papers experience substantially higher rates of both negative and positive LoS bias than English-only papers, with negative bias being more prevalent, especially in the form of unjustified demands for cross-lingual generalization.
Non-English NLP papers face substantially higher rates of language-of-study bias in peer review than English-only papers, most often in the form of reviewers demanding unjustified cross-lingual generalization.
Peer review plays a central role in the NLP publication process, but is susceptible to various biases. Here, we study language-of-study (LoS) bias: the tendency for reviewers to evaluate a paper differently based on the language(s) it studies, rather than its scientific merit. Despite being explicitly flagged in reviewing guidelines, such biases are poorly understood. Prior work treats such comments as part of broader categories of weak or unconstructive reviews without defining them as a distinct form of bias. We present the first systematic characterization of LoS bias, distinguishing negative and positive forms, and introduce the human-annotated dataset LOBSTER (Language-Of-study Bias in ScienTific pEer Review) and a method achieving 87.37 macro F1 for detection. We analyze 15,645 reviews to estimate how negative and positive biases differ with respect to the LoS, and find that non-English papers face substantially higher bias rates than English-only ones, with negative bias consistently outweighing positive bias. Finally, we identify four subcategories of negative bias, and find that demanding unjustified cross-lingual generalization is the most dominant form. We publicly release all resources to support work on fairer reviewing practices in NLP and beyond.
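For readers unfamiliar with the reported metric, the sketch below shows how a macro F1 score is computed: F1 is calculated separately for each class and the per-class scores are averaged without weighting, so rare classes count as much as frequent ones. The label names (`none`, `negative`, `positive`) and the toy data are assumptions made for illustration only; the abstract does not specify the label set, the detection method, or the evaluation code.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one example is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # true label g was missed

    scores = []
    for label in sorted(set(gold) | set(pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for illustration only (not taken from LOBSTER).
gold = ["none", "none", "none", "none",
        "negative", "negative", "positive", "positive"]
pred = ["none", "none", "none", "negative",
        "negative", "negative", "positive", "none"]

print(f"macro F1 = {100 * macro_f1(gold, pred):.2f}")
```

Scores such as 87.37 are this quantity expressed on a 0 to 100 scale. Because each class contributes equally, a detector cannot reach a high macro F1 simply by predicting the majority class.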