This paper introduces a metamorphic testing approach, combined with negative log-likelihood (NLL) analysis, to diagnose memorization in LLM-based program repair. Semantics-preserving transformations are applied to the Defects4J and GitBug-Java datasets to create variant benchmarks, which are then used to evaluate the performance of seven LLMs. Results demonstrate that LLMs exhibit significant performance degradation on transformed benchmarks, and this degradation correlates strongly with NLL on the original benchmarks, indicating memorization.
LLMs' apparent success at program repair crumbles when faced with slightly altered versions of known bugs, revealing a reliance on memorization rather than true understanding.
LLM-based automated program repair (APR) techniques have shown promising results in reducing debugging costs. However, prior results can be affected by data leakage: large language models (LLMs) may memorize bug fixes when evaluation benchmarks overlap with their pretraining data, leading to inflated performance estimates. In this paper, we investigate whether we can better reveal data leakage by combining metamorphic testing (MT) with negative log-likelihood (NLL), which has been used in prior work as a proxy for memorization. We construct variant benchmarks by applying semantics-preserving transformations to two widely used datasets, Defects4J and GitBug-Java. Using these benchmarks, we evaluate the repair success rates of seven LLMs on both original and transformed versions, and analyze the relationship between performance degradation and NLL. Our results show that all evaluated state-of-the-art LLMs exhibit substantial drops in patch generation success rates on transformed benchmarks, ranging from 4.1% for GPT-4o to 15.98% for Llama-3.1. Furthermore, we find that this degradation strongly correlates with NLL on the original benchmarks, suggesting that models perform better on instances they are more likely to have memorized. These findings show that combining MT with NLL provides stronger and more reliable evidence of data leakage, while metamorphic testing alone can help mitigate its effects in LLM-based APR evaluations.
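To illustrate the metamorphic idea, here is a minimal sketch of one kind of semantics-preserving transformation that could be applied to a buggy Java snippet: renaming local identifiers. The snippet, function name, and chosen renamings are illustrative assumptions, not taken from the paper's actual transformation set.

```python
import re

def rename_identifier(java_src: str, old: str, new: str) -> str:
    """Rename a local identifier in Java source text.

    Word boundaries (\\b) ensure that occurrences of `old` inside
    longer names are left untouched, so the program's semantics
    are preserved (assuming `new` does not collide with an
    existing name in scope).
    """
    return re.sub(rf"\b{re.escape(old)}\b", new, java_src)

# A toy buggy snippet (hypothetical, for illustration only).
buggy = "int total = 0; for (int i = 0; i < n; i++) { total += arr[i]; }"

# Apply two renamings to produce a behaviorally identical variant.
variant = rename_identifier(rename_identifier(buggy, "total", "acc"), "i", "idx")
print(variant)
```

A model that genuinely understands the bug should repair the variant as readily as the original; a model that memorized the original fix may fail on the renamed version.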
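The NLL side of the analysis can be sketched as follows: mean negative log-likelihood over a patch's tokens measures how "expected" the fix is to the model, so lower NLL on the original benchmark is consistent with memorization. The per-token probabilities below are made-up values for illustration; in practice they would come from the LLM's output distribution over the ground-truth patch tokens.

```python
import math

def mean_nll(token_probs):
    """Mean negative log-likelihood of a token sequence,
    given the model's probability for each ground-truth token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities for two patches.
memorized_fix = [0.95, 0.90, 0.98, 0.92]  # model is confident: low NLL
unseen_fix = [0.30, 0.12, 0.25, 0.18]     # model is uncertain: high NLL

print(round(mean_nll(memorized_fix), 3))
print(round(mean_nll(unseen_fix), 3))
```

Under this reading, a strong correlation between low NLL on the original benchmark and a large success-rate drop on the transformed one is the paper's combined signal of data leakage.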