Google Research, Arizona, UIUC, UMich | Apr 20, 2026 | arXiv:2604.17717

Revisiting Code Debloating with Ground Truth-based Evaluation

Muhammad Bilal, Moiz Ali, Mohit Kumar, Fareed Zaffar, Fahad Shaon, Ashish Gehani, Sazzadur Rahaman

AI Summary

This paper revisits application-level code debloating by introducing a ground-truth-based evaluation paradigm to overcome the limitations of traditional proxy metrics like code size and test case coverage. The authors evaluated eight state-of-the-art debloating tools across different transformation paradigms (source-to-source, IR-to-IR, binary-to-binary). Results show that dynamic analysis-based tools often remove code that should be retained, while static analysis-based approaches exhibit high false retention rates, leading to functional incorrectness and potential vulnerabilities.

Key Contribution

Debloating tools, intended to shrink code and improve security, can actually add code or remove essential functionality, with dynamic methods being overly aggressive and static methods overly conservative.

Abstract

Program debloating aims to remove unused code to reduce performance overhead, attack surfaces, and maintenance costs. Over time, debloating has evolved across multiple layers (container, library, and application), each building on the principles of application-level debloating. Despite its central role, application-level debloating continues to rely on imperfect proxies for measuring performance, such as test-case-driven evaluation for correctness, code size for runtime efficiency, and gadget count reduction for estimating security posture. While there is widespread skepticism about using such imperfect proxies, the community still lacks standardized methodologies or benchmarks to assess the true performance of application-level software debloating. This experience paper aims to address this gap. We revisit the foundations of application-level debloating through a ground-truth-based evaluation paradigm. Our analysis of eight state-of-the-art debloaters (Blade, Chisel, Cov, CovA, Lmcas, Trimmer, Occam, and Razor) uncovers insights previously unattainable through traditional evaluations. These tools collectively span the spectrum of source-to-source, IR-to-IR, and binary-to-binary transformation paradigms, allowing a holistic reassessment across abstraction levels. Our analysis reveals that while dynamic analysis-based tools often remove up to 94% of code that should be retained, static analysis-based approaches exhibit the opposite behavior, showing high false retention rates due to coarse-grained dependency over-approximation. Additionally, static analyses may add code by introducing specialized variants of functions. False retentions and removals not only cause functional incorrectness but may also lead to systematic inconsistency, robustness failures, and exploitable vulnerabilities.
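To make the notions of false removal and false retention concrete, the following is a minimal sketch of how such rates could be computed once a ground truth is available. It is not the paper's evaluation code: the function-level granularity, the `evaluate` helper, the `DebloatReport` type, and the exact rate definitions (false removals over needed code, false retentions over true bloat) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebloatReport:
    false_removals: frozenset    # needed per ground truth, but removed
    false_retentions: frozenset  # not needed, but still present
    added: frozenset             # present after debloating, absent from the original
    false_removal_rate: float    # |false_removals| / |needed|
    false_retention_rate: float  # |false_retentions| / |original - needed|


def evaluate(original, needed, debloated):
    """Compare a debloated program against a ground-truth set of needed units.

    All three arguments are iterables of unit names (here: function names).
    The definitions are illustrative assumptions, not the paper's metrics.
    """
    original, needed, debloated = map(frozenset, (original, needed, debloated))
    if not needed <= original:
        raise ValueError("ground truth must be a subset of the original program")

    bloat = original - needed            # code that should be removed
    false_removals = needed - debloated  # over-aggressive removal
    false_retentions = bloat & debloated # over-conservative retention
    added = debloated - original         # e.g. specialized function variants

    return DebloatReport(
        false_removals,
        false_retentions,
        added,
        len(false_removals) / len(needed) if needed else 0.0,
        len(false_retentions) / len(bloat) if bloat else 0.0,
    )
```

For example, with a hypothetical original program `{"main", "parse", "sort", "help", "debug"}`, ground truth `{"main", "parse", "sort"}`, and debloated output `{"main", "parse", "help", "sort_spec"}`, the report lists `sort` as a false removal, `help` as a false retention, and `sort_spec` as added code. Size-based proxies would miss all three, since the output is simply smaller than the input.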

Topics: Code Generation & Program Synthesis; Eval Frameworks & Benchmarks

Citation Metrics

Citations: 0

Influential citations: 0

References: 29

Year: 2026

Venue: N/A
