Apr 29, 2026 · arXiv:2604.27006

Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs

Gilberto Sussumu Hida, Danilo Monteiro Ribeiro, Erika Yahata

AI Summary

This paper benchmarks 12 LLMs from 4 providers against 4 classical ML models on study screening for software engineering systematic literature reviews (SLRs), using 2 real SLRs totaling 518 papers. The authors find substantial performance variability across LLMs and sensitivity to input features, with abstract availability the decisive factor. LLMs did not consistently outperform classical models, so their adoption should be justified by operational constraints and supported by pilot validation.

Key Contribution

LLMs don't automatically win at study screening for software engineering SLRs: their performance is highly variable, sensitive to input data, and not consistently better than classical models.

Abstract

Context: Study screening in systematic literature reviews is costly, inconsistency-prone, and risk-asymmetric, since false negatives can compromise validity. Despite rapid uptake of Large Language Models (LLMs), there is limited evidence on how such models behave during the study screening phase, particularly regarding the choice of specific LLMs and their comparison with classical models.

Objective: To assess LLM performance and variability in screening, quantify the impact of input metadata (abstract, title, keywords), and compare LLMs with classical classifiers under a shared protocol.

Methods: We analyzed 12 LLMs from 4 providers (OpenAI, Google Gemini, Anthropic, Llama) and 4 classical models (Logistic Regression, Support Vector Classification, Random Forest, and Naive Bayes) on 2 real Systematic Literature Reviews (SLRs), totaling 518 papers. The experimental design investigated 3 critical dimensions: (i) LLM performance variability, (ii) the impact of input feature composition (abstract, title, and keywords) on LLM performance, and (iii) the actual gain from using LLMs instead of more traditional classification models.

Results: LLMs exhibited substantial heterogeneity and residual non-determinism even at temperature zero. Abstract availability was decisive: removing it consistently degraded performance, while adding title and/or keywords to the abstract yielded no robust gains. Compared to classical models, performance differences were not consistent enough to support generalizable LLM superiority.

Discussion: LLM adoption should be justified by operational and governance constraints (reproducibility, cost, metadata availability), supported by pilot validation and explicit reporting of variability and input configuration.
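
To make two ideas from the abstract concrete (false negatives as the costly error, and decisions that change between repeated runs at temperature zero), here is a minimal sketch in Python. It is not the authors' evaluation code: the abstract does not name the metrics used, so the choice of recall, precision, and an "unstable fraction" is an assumption, as are the function names and the toy labels.

```python
def screening_metrics(gold, predicted):
    """Recall, precision, and false-negative count for include/exclude labels.

    Labels are booleans (True = include). Hypothetical helper, not from the paper.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fn = sum(1 for g, p in pairs if g and not p)  # relevant papers wrongly excluded
    fp = sum(1 for g, p in pairs if not g and p)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {"recall": recall, "precision": precision, "false_negatives": fn}


def unstable_fraction(runs):
    """Fraction of papers whose decision differs across repeated runs.

    `runs` is a list of equally long decision lists, one per repeated run
    of the same model on the same input. Hypothetical helper.
    """
    per_paper = list(zip(*runs))
    unstable = sum(1 for decisions in per_paper if len(set(decisions)) > 1)
    return unstable / len(per_paper)


# Toy data, invented purely for illustration.
gold = [True, True, False, False, True]
run1 = [True, True, False, False, False]
run2 = [True, False, False, False, False]
run3 = [True, True, False, True, False]

print(screening_metrics(gold, run1))
print(unstable_fraction([run1, run2, run3]))
```

Reporting the false-negative count alongside recall keeps the risk-asymmetric error visible, and computing instability per paper rather than per run shows how many individual screening decisions a reviewer could not reproduce.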

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
