Apr 28, 2026 · arXiv:2604.25349

Stop Using the Wilcoxon Test: Myth, Misconception and Misuse in IR Research

AI Summary

This paper critically examines the widespread use of the Wilcoxon signed-rank test in Information Retrieval (IR) research, arguing that its perceived safety as a non-parametric alternative to the t-test is fundamentally flawed. Through a systematic literature review and empirical analysis using TREC data, the authors demonstrate that the Wilcoxon test often fails to maintain control over its Type I error rate in IR contexts, leading to misleading conclusions. The findings suggest that the continued reliance on Wilcoxon in IR evaluation is unjustified and that abandoning this practice would enhance the methodological rigor of the field.

Key Contribution

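The paper shows that the Wilcoxon signed-rank test, routinely misapplied in IR research, creates a false sense of safety: in IR settings it easily loses control of its Type I error rate, producing misleading results that undermine the validity of reported findings.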

Abstract

In benchmarking of Information Retrieval systems, the Wilcoxon signed-rank test is often treated as a safer alternative to the t-test. This belief is fueled by textbooks and recommendations that portray Wilcoxon as the proper non-parametric alternative because metric scores are not normally distributed. We argue that this narrative is misleading and harmful. A careful review of Statistics textbooks reveals inconsistencies and omissions in how the assumptions underlying these tests are presented, fostering confusion that has propagated into IR research. As a result, Wilcoxon has been routinely misapplied for decades, creating a false sense of safety against a threat that was never there to begin with, while introducing another one so severe that it virtually guarantees the test will break down and mislead researchers. Through a combination of systematic literature review, analysis and empirical demonstrations with TREC data, we show how and why the Wilcoxon test easily loses control of its Type I error rate in IR settings. We conclude that the continued use of Wilcoxon in IR evaluation is unjustified and that abandoning it would improve the methodological soundness of our field.
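
One way to see the kind of failure the abstract describes is a small simulation. The sketch below is not the authors' experiment and uses no TREC data. It assumes, for illustration only, that per-topic score differences between two systems follow a right-skewed distribution whose true mean is exactly zero (an exponential variate shifted by its mean), and that the violated assumption is symmetry of the differences, which the abstract itself does not name. The function names, the topic count of 50, and the trial count are all hypothetical choices.

```python
import math
import random

N_TOPICS = 50        # hypothetical number of topics per simulated comparison
Z_CRIT = 1.959964    # two-sided 5% critical value, standard normal
T_CRIT = 2.009575    # two-sided 5% critical value, Student t with 49 df


def wilcoxon_signed_rank_z(diffs):
    """Normal-approximation z statistic of the Wilcoxon signed-rank test.

    Assumes no zero differences and no ties, which holds almost surely
    for the continuous synthetic data used in this sketch.
    """
    n = len(diffs)
    # Rank absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    # W+ is the sum of the ranks that belong to positive differences.
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)
    null_mean = n * (n + 1) / 4
    null_sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - null_mean) / null_sd


def paired_t_statistic(diffs):
    """Paired t statistic for the null hypothesis of zero mean difference."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)


def simulate(n_trials=2000, seed=0):
    """Return (wilcoxon_rate, t_rate): rejection rates at the 5% level
    when the true mean difference is zero but the differences are skewed."""
    rng = random.Random(seed)
    wilcoxon_rejections = 0
    t_rejections = 0
    for _ in range(n_trials):
        # Assumed synthetic differences: Exp(1) - 1 has mean 0 and is
        # right-skewed, so its median is below zero.
        diffs = [rng.expovariate(1.0) - 1.0 for _ in range(N_TOPICS)]
        if abs(wilcoxon_signed_rank_z(diffs)) > Z_CRIT:
            wilcoxon_rejections += 1
        if abs(paired_t_statistic(diffs)) > T_CRIT:
            t_rejections += 1
    return wilcoxon_rejections / n_trials, t_rejections / n_trials


if __name__ == "__main__":
    wilcoxon_rate, t_rate = simulate()
    print(f"Wilcoxon rejection rate: {wilcoxon_rate:.3f}")
    print(f"Paired t rejection rate: {t_rate:.3f}")
```

Calling `simulate()` returns both rejection rates as a tuple. Under these assumptions, the Wilcoxon rate should come out well above the nominal 5% even though the two simulated systems have identical mean scores, while the paired t-test should stay much closer to it. The critical value `T_CRIT` is tied to 50 topics, so changing `N_TOPICS` requires updating it.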

Eval Frameworks & Benchmarks · Natural Language Processing · Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 76

Year: 2026

Venue: N/A
