TU DelftVNU University of Engineering and TechnologyApr 21, 2026arXiv:2604.19663

From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems

Quang-Huy Nguyen, Thanh-Hai Nguyen, Khac-Manh Thai, Duc-Hoang Pham, Huy-Son Nguyen, Cam-Van Thi Nguyen, Masoud Mansoury, Duc-Trong Le, Hoang-Quynh Le

AI Summary

This paper presents a systematic reproducibility study and benchmark of 11 counterfactual explanation (CE) methods for recommender systems, encompassing both native and graph-based explainers. They propose a unified benchmarking framework evaluating explainers based on explanation format (implicit vs. explicit), evaluation level (item-level vs. list-level), and perturbation scope (user interaction vectors vs. user-item interaction graphs). Results reveal that the effectiveness-sparsity trade-off is highly dependent on the method and evaluation setting, and that graph-based explainers face scalability issues on large graphs, challenging previous conclusions about CE robustness.

Key Contribution

Counterfactual explainers for recommender systems don't generalize as well as we thought: their effectiveness and sparsity depend heavily on the evaluation setting, and graph-based methods struggle to scale.

Abstract

Counterfactual explanations (CEs) provide an intuitive way to understand recommender systems by identifying minimal modifications to user-item interactions that alter recommendation outcomes. Existing CE methods for recommender systems, however, have been evaluated under heterogeneous protocols, using different datasets, recommenders, metrics, and even explanation formats, which hampers reproducibility and fair comparison. Our paper systematically reproduces, re-implement, and re-evaluate eleven state-of-the-art CE methods for recommender systems, covering both native explainers (e.g., LIME-RS, SHAP, PRINCE, ACCENT, LXR, GREASE) and specific graph-based explainers originally proposed for GNNs. Here, a unified benchmarking framework is proposed to assess explainers along three dimensions: explanation format (implicit vs. explicit), evaluation level (item-level vs. list-level), and perturbation scope (user interaction vectors vs. user-item interaction graphs). Our evaluation protocol includes effectiveness, sparsity, and computational complexity metrics, and extends existing item-level assessments to top-K list-level explanations. Through extensive experiments on three real-world datasets and six representative recommender models, we analyze how well previously reported strengths of CE methods generalize across diverse setups. We observe that the trade-off between effectiveness and sparsity depends strongly on the specific method and evaluation setting, particularly under the explicit format; in addition, explainer performance remains largely consistent across item level and list level evaluations, and several graph-based explainers exhibit notable scalability limitations on large recommender graphs. Our results refine and challenge earlier conclusions about the robustness and practicality of CE generation methods in recommender systems: https://github.com/L2R-UET/CFExpRec.

Eval Frameworks & Benchmarks Interpretability & Mechanistic Interp Recommendation & Information Retrieval

Citation Metrics

Citations0

Influential citations0

References37

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems

Related Papers