Tsinghua AI · Imperial Global Singapore · Nankai University · NTU · SCU · SMU · Mar 29, 2026 · arXiv:2603.27549

Understanding NPM Malicious Package Detection: A Benchmark-Driven Empirical Analysis

Wenbo Guo, Zhongwen Chen, Zhengzi Xu, Chengwei Liu, Ming Kang, Shiwen Song, Chengyue Liu, Yijia Xu, Weisong Sun

AI Summary

This paper introduces a benchmark dataset of 6,420 malicious and 7,288 benign NPM packages to rigorously evaluate 8 malware detection tools across 13 variants. Through quantitative analysis and source code inspection, the authors reveal that detection performance hinges on how tools resolve ambiguity between code behavior and intent, with behavioral chains significantly improving detection rates. The study also finds that malware simplicity and lack of pre-publication scanning contribute to detection challenges, and that strategic tool combinations, rather than paradigm diversity, yield the best results.

Key Contribution

NPM malware detection tools often fail because they struggle to distinguish between innocuous code behavior and malicious intent, a problem addressable by analyzing behavioral chains.
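To make the idea of a behavioral chain concrete, the sketch below flags a package only when several behaviors occur in order, rather than on any single API call. It is a minimal illustration, not the mechanism of any evaluated tool: the behavior labels (`read_env`, `serialize`, `network_send`) and the `has_chain` helper are hypothetical names chosen for this example, and the event lists are toy inputs.

```python
from typing import Sequence

# Hypothetical behavior labels for illustration only; they are not the
# paper's behavior categories.
EXFIL_CHAIN = ("read_env", "serialize", "network_send")


def has_chain(events: Sequence[str], chain: Sequence[str] = EXFIL_CHAIN) -> bool:
    """Return True if `chain` occurs as an ordered subsequence of `events`.

    Unrelated events may appear between the steps, but the steps
    themselves must appear in the given order.
    """
    remaining = iter(events)
    # `step in remaining` advances the iterator until it finds `step`,
    # so each later step is searched for only after the earlier one.
    return all(step in remaining for step in chain)


# Toy traces: reading an environment variable alone is ambiguous,
# while collect -> serialize -> send forms the full chain.
benign_trace = ["read_env", "log"]
suspicious_trace = ["read_file", "read_env", "serialize", "dns_lookup", "network_send"]

print(has_chain(benign_trace))      # single ambiguous call, chain absent
print(has_chain(suspicious_trace))  # full ordered chain present
```

Requiring order matters here: a trace containing the same three behaviors in reverse would not match, which is one simple way to encode the direction of data flow that a lone API call cannot express.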

Abstract

The NPM ecosystem has become a primary target for software supply chain attacks, yet existing detection tools are evaluated in isolation on incompatible datasets, making cross-tool comparison unreliable. We conduct a benchmark-driven empirical analysis of NPM malware detection, building a dataset of 6,420 malicious and 7,288 benign packages annotated with 11 behavior categories and 8 evasion techniques, and evaluating 8 tools across 13 variants. Unlike prior work, we complement quantitative evaluation with source-code inspection of each tool to expose the structural mechanisms behind its performance. Our analysis reveals five key findings. Tool precision-recall positions are structurally determined by how each tool resolves the ambiguity between what code can do and what it intends to do, with GuardDog achieving the best balance at 93.32% F1. A single API call carries no directional intent, but a behavioral chain such as collecting environment variables, serializing, and exfiltrating disambiguates malicious purpose, raising SAP_DT detection from 3.2% to 79.3%. Most malware requires no evasion because the ecosystem lacks mandatory pre-publication scanning. ML degradation stems from concept convergence rather than concept drift: malware became simpler and statistically indistinguishable from benign code in feature space. Tool combination effectiveness is governed by complementarity minus false-positive introduction, not paradigm diversity, with strategic combinations reaching 96.08% accuracy and 95.79% F1. Our benchmark and evaluation framework are publicly available.
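The statement that combination effectiveness is governed by complementarity minus false-positive introduction can be illustrated with a small calculation. The sketch below assumes a simple OR rule (a package is flagged if either tool flags it); the abstract does not say which combination rule the authors use, so this rule, the helper names, and the toy verdicts are all assumptions made for illustration and are not results from the paper.

```python
from typing import List, Tuple


def f1_score(pred: List[bool], truth: List[bool]) -> float:
    """F1 over binary verdicts, where True means 'malicious'."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def combine_or(a: List[bool], b: List[bool]) -> List[bool]:
    """Assumed combination rule: flag if either tool flags."""
    return [x or y for x, y in zip(a, b)]


def marginal_gain(base: List[bool], extra: List[bool],
                  truth: List[bool]) -> Tuple[int, int]:
    """Count what `extra` adds on top of `base` under the OR rule.

    Returns (new true positives, new false positives): the first is the
    complementarity, the second is the false-positive cost.
    """
    new_tp = sum(1 for b, e, t in zip(base, extra, truth) if not b and e and t)
    new_fp = sum(1 for b, e, t in zip(base, extra, truth) if not b and e and not t)
    return new_tp, new_fp


# Toy verdicts on 8 packages (4 malicious, 4 benign); not data from the paper.
truth = [True, True, True, True, False, False, False, False]
tool_a = [True, True, False, False, False, False, False, False]
tool_b = [False, True, True, False, True, False, False, False]

combined = combine_or(tool_a, tool_b)
print(marginal_gain(tool_a, tool_b, truth))
print(f1_score(tool_a, truth), f1_score(combined, truth))
```

In this toy case the second tool contributes one new detection and one new false alarm, and the combined F1 still improves. With a noisier second tool the same calculation can show a net loss, which is the trade-off the finding describes.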

Code Generation & Program Synthesis · Data Curation & Synthetic Data · Eval Frameworks & Benchmarks

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
