Concordia University, Poly Montreal | Mar 15, 2026 | arXiv:2603.14191

Mining the YARA Ecosystem: From Ad-Hoc Sharing to Data-Driven Threat Intelligence

Dectot--Le Monnier de Gouville Esteban, Mohammad Hamdaqa, Moataz Chouchen

AI Summary

This paper performs a large-scale analysis of the YARA rule ecosystem by mining 8.4 million rules from 1,853 GitHub repositories, statically analyzing their syntax, and dynamically benchmarking their effectiveness against malware and goodware samples. The study reveals a highly centralized and largely inactive ecosystem with significant noise (false positives) and low recall, biased toward legacy threats. The authors argue for a shift towards rigorous rule engineering and release their dataset and pipeline to support future data-driven curation tools.

Key Contribution

Despite high static quality scores, YARA rules in the wild suffer from significant noise, low recall, and a bias towards legacy threats, exposing a "double penalty" for defenders.

Abstract

YARA has established itself as the de facto standard for "Detection as Code," enabling analysts and DevSecOps practitioners to define signatures for malware identification across the software supply chain. Despite its pervasive use, the open-source YARA ecosystem remains characterized by ad-hoc sharing and opaque quality. Practitioners currently rely on public repositories without empirical evidence regarding the ecosystem's structural characteristics, maintenance and diffusion dynamics, or operational reliability. We conducted a large-scale mixed-method study of 8.4 million rules mined from 1,853 GitHub repositories. Our pipeline integrates repository mining to map supply chain dynamics, static analysis to assess syntactic quality, and dynamic benchmarking against 4,026 malware and 2,000 goodware samples to measure operational effectiveness. We reveal a highly centralized structure where 10 authors drive 80% of rule adoption. The ecosystem functions as a "static supply chain": repositories show a median inactivity of 782 days and a median technical lag of 4.2 years. While static quality scores appear high (mean = 99.4/100), operational benchmarking uncovers significant noise (false positives) and low recall. Furthermore, coverage is heavily biased toward legacy threats (Ransomware), leaving modern initial access vectors (Loaders, Stealers) severely underrepresented. These findings expose a systemic "double penalty": defenders incur high performance overhead for decayed intelligence. We argue that public repositories function as raw data dumps rather than curated feeds, necessitating a paradigm shift from ad-hoc collection to rigorous rule engineering. We release our dataset and pipeline to support future data-driven curation tools.
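The dynamic benchmarking step described above can be illustrated with a small sketch. It assumes scan results are already available as a mapping from rule name to the set of sample IDs that rule matched, and it scores each rule by recall over the malware set and false-positive rate over the goodware set. The function name `score_rules`, the `RuleScore` type, the per-rule metric definitions, and the toy data are all hypothetical choices made for illustration; the paper's released pipeline may define and compute these metrics differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleScore:
    rule: str
    recall: float               # fraction of malware samples the rule matched
    false_positive_rate: float  # fraction of goodware samples the rule matched


def score_rules(matches, malware_ids, goodware_ids):
    """Score each rule from a mapping of rule name -> matched sample IDs.

    Assumed definitions (not taken from the paper):
      recall              = matched malware / all malware
      false_positive_rate = matched goodware / all goodware
    """
    malware_ids, goodware_ids = set(malware_ids), set(goodware_ids)
    scores = []
    for rule, hit_ids in sorted(matches.items()):
        hits = set(hit_ids)
        true_positives = len(hits & malware_ids)
        false_positives = len(hits & goodware_ids)
        scores.append(RuleScore(
            rule=rule,
            recall=true_positives / len(malware_ids) if malware_ids else 0.0,
            false_positive_rate=(
                false_positives / len(goodware_ids) if goodware_ids else 0.0
            ),
        ))
    return scores


# Toy data (hypothetical, not drawn from the paper's corpus).
malware = {"m1", "m2", "m3", "m4"}
goodware = {"g1", "g2"}
matches = {
    "ransom_generic": {"m1", "m2", "g1"},  # noisy: also fires on goodware
    "loader_x": {"m3"},                    # clean but narrow
}

scores = score_rules(matches, malware, goodware)
for s in scores:
    print(f"{s.rule}: recall={s.recall:.2f} fpr={s.false_positive_rate:.2f}")
```

In a real run, the `matches` mapping would come from scanning the 4,026 malware and 2,000 goodware samples with a YARA engine. Keeping the scoring separate from the scanning makes it possible to recompute metrics without rescanning.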


