The paper investigates the consistency of automated code revision (ACR) tools when presented with semantically equivalent code variants generated through semantics-preserving perturbations (SPP). The authors designed nine types of SPPs, applied them to Java methods from real-world GitHub projects, and evaluated five transformer-based ACR tools on the resulting variants. Results show that the tools' ability to generate correct revisions can drop by up to 45.3% on semantically equivalent code, and that attention-guiding heuristics offer only marginal improvement.
Semantically equivalent code can reduce the ability of state-of-the-art code revision tools to generate correct revisions by up to 45.3%, revealing a surprising brittleness in their understanding of code.
Automated Code Revision (ACR) tools aim to reduce manual effort by automatically generating code revisions based on reviewer feedback. While ACR tools have shown promising performance on historical data, their real-world utility depends on their ability to handle similar code variants that express the same issue, a property we define as consistency. However, the probabilistic nature of ACR tools often compromises consistency, which may lead to divergent revisions even for semantically equivalent code variants. In this paper, we investigate the extent to which ACR tools maintain consistency when presented with semantically equivalent code variants. To do so, we first designed nine types of semantics-preserving perturbations (SPP) and applied them to 2032 Java methods from real-world GitHub projects, generating over 10K perturbed variants for evaluation. We then used these variants to evaluate the consistency of five state-of-the-art transformer-based ACR tools. We found that the ACR tools' ability to generate correct revisions can drop by up to 45.3% when presented with semantically equivalent code. The closer the perturbation is to the code region targeted for revision, the more likely an ACR tool is to fail to generate the correct revision. We explored potential mitigation strategies that modify the input representation, but found that these attention-guiding heuristics yielded only marginal improvements, leaving the solution to this problem as an open research question.
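To make the idea of a semantics-preserving perturbation concrete, the following is a minimal sketch, not the authors' implementation. It assumes that renaming a local variable is one kind of SPP (the abstract does not list the nine types) and that the reported drop is measured relative to the number of correct revisions on the original code (the abstract does not say how it is computed). The helper names `rename_identifier` and `relative_drop`, the sample Java method, and the counts are all hypothetical.

```python
import re


def rename_identifier(java_source: str, old: str, new: str) -> str:
    """Rename every whole-word occurrence of `old` to `new`.

    Naive on purpose: a regex cannot tell code from string literals or
    comments, and it does not check that `new` is unused. A real tool
    would rename through a Java parser instead.
    """
    return re.sub(rf"\b{re.escape(old)}\b", new, java_source)


def relative_drop(correct_original: int, correct_perturbed: int) -> float:
    """Percentage drop in correct revisions, relative to the original code.

    Assumption: the drop is relative, not absolute percentage points.
    """
    if correct_original <= 0:
        raise ValueError("need at least one correct revision on the original code")
    return 100.0 * (correct_original - correct_perturbed) / correct_original


# Hypothetical Java method used only for illustration.
original = (
    "int sum(int[] xs) { int total = 0; "
    "for (int x : xs) { total += x; } return total; }"
)

# The variant computes the same result; only the local variable name differs.
variant = rename_identifier(original, "total", "acc")

# Hypothetical counts: 200 correct revisions on original code, 150 on variants.
drop = relative_drop(200, 150)
```

A consistent ACR tool would produce equivalent revisions for `original` and `variant` when given the same reviewer comment; the drop quantifies how often it does not.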