The paper introduces BloomAPR, a dynamic evaluation framework grounded in Bloom's Taxonomy for assessing LLM-powered Automated Program Repair (APR) solutions, addressing limitations of static benchmarks such as data contamination and limited diversity. BloomAPR evaluates APR capabilities across progressively complex reasoning levels. The framework was used to evaluate ChatRepair and CigaR with GPT-3.5-Turbo, Llama-3.1, and StarCoder-2, revealing that while these solutions perform well on basic reasoning and memorization, they struggle with minor syntactic changes and with repairing bugs injected into real-world projects.
LLM-powered program repair tools can memorize bug fixes but struggle with minor syntactic changes and with applying fixes in real-world projects, revealing a critical gap in their reasoning abilities.
Recent advances in large language models (LLMs) have accelerated the development of AI-driven automated program repair (APR) solutions. However, these solutions are typically evaluated on static benchmarks such as Defects4J and SWE-bench, which suffer from two key limitations: (1) the risk of data contamination, potentially inflating evaluation results due to overlap with LLM training data, and (2) a limited ability to assess APR capabilities in dynamic and diverse contexts. In this paper, we introduce BloomAPR, a novel dynamic evaluation framework grounded in Bloom's Taxonomy. Our framework offers a structured approach to assessing the cognitive capabilities of LLM-powered APR solutions across progressively complex reasoning levels. Using Defects4J as a case study, we evaluate two state-of-the-art LLM-powered APR solutions, ChatRepair and CigaR, with three different LLMs: GPT-3.5-Turbo, Llama-3.1, and StarCoder-2. Our findings show that these solutions exhibit basic reasoning skills and effectively memorize bug-fixing patterns (fixing up to 81.57% of bugs at the Remember layer), and their performance even improves on synthetically generated bugs (up to a 60.66% increase at the Understand layer). However, they perform worse when minor syntactic changes are introduced (fixing at most 43.32% of bugs at the Apply layer), and they struggle to repair similar bugs once those bugs are injected into real-world projects (solving only 13.46% to 41.34% of bugs at the Analyze layer). These results underscore the urgent need for evolving benchmarks and provide a foundation for more trustworthy evaluation of LLM-powered software engineering solutions.