Saarland University | Feb 25, 2026 | arXiv:2602.21833

From Restructuring to Stabilization: A Large-Scale Experiment on Iterative Code Readability Refactoring with Large Language Models

Norman Peitek, Julia Hess, Sven Apel

AI Summary

This paper investigates the use of GPT5.1 for iterative code readability refactoring on 230 systematically varied Java snippets across five iterations using three prompting strategies. The study categorizes code changes into implementation, syntactic, and comment-level transformations, and assesses functional correctness and robustness. Results show an initial restructuring phase followed by stabilization, suggesting an internalized understanding of optimal readability, with convergence patterns robust across code variants and influenced by explicit readability prompts.

Key Contribution

Iterative code refactoring with GPT5.1 reveals a "restructuring then stabilization" pattern, hinting that the model has an internalized sense of optimal code readability.

Abstract

Large language models (LLMs) are increasingly used for automated code refactoring tasks. Although these models can quickly refactor code, the quality may exhibit inconsistencies and unpredictable behavior. In this article, we systematically study the capabilities of LLMs for code refactoring with a specific focus on improving code readability. We conducted a large-scale experiment using GPT5.1 with 230 Java snippets, each systematically varied and refactored regarding code readability across five iterations under three different prompting strategies. We categorized fine-grained code changes during the refactoring into implementation, syntactic, and comment-level transformations. Subsequently, we investigated the functional correctness and tested the robustness of the results with novel snippets. Our results reveal three main insights: First, iterative code refactoring exhibits an initial phase of restructuring followed by stabilization. This convergence tendency suggests that LLMs possess an internalized understanding of an "optimally readable" version of code. Second, convergence patterns are fairly robust across different code variants. Third, explicit prompting toward specific readability factors slightly influences the refactoring dynamics. These insights provide an empirical foundation for assessing the reliability of LLM-assisted code refactoring, which opens pathways for future research, including comparative analyses across models and a systematic evaluation of additional software quality dimensions in LLM-refactored code.
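
The "restructuring followed by stabilization" pattern can be illustrated with a minimal sketch. It is not the authors' pipeline or measurement: the paper categorizes fine-grained changes, whereas this example assumes a simple line-level dissimilarity score, and `toy_refactor` is a hypothetical stand-in for the LLM call. The names `change_ratio` and `iterate_refactoring` are likewise illustrative.

```python
import difflib
import re
from typing import Callable, List


def change_ratio(before: str, after: str) -> float:
    """Line-level dissimilarity between two snippets (0.0 means identical)."""
    matcher = difflib.SequenceMatcher(None, before.splitlines(), after.splitlines())
    return 1.0 - matcher.ratio()


def iterate_refactoring(
    snippet: str, refactor: Callable[[str], str], iterations: int = 5
) -> List[float]:
    """Feed each refactored output back in and record how much it changed."""
    ratios = []
    current = snippet
    for _ in range(iterations):
        refactored = refactor(current)
        ratios.append(change_ratio(current, refactored))
        current = refactored
    return ratios


def toy_refactor(code: str) -> str:
    """Hypothetical stand-in for an LLM: renames the variable `x` to `count`.

    It has nothing left to change after the first pass, which mimics a
    refactoring process that has stabilized.
    """
    return re.sub(r"\bx\b", "count", code)


# A small Java snippet held as a string (assumed example, not from the paper).
JAVA_SNIPPET = "int x = 0;\nx++;\nreturn x;"

ratios = iterate_refactoring(JAVA_SNIPPET, toy_refactor)
print(ratios)
```

With a real model in place of `toy_refactor`, a sequence that starts high and drops toward zero would correspond to the convergence the abstract describes; a sequence that stays high would indicate continued restructuring.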

Topics: Code Generation & Program Synthesis; Eval Frameworks & Benchmarks

Citation Metrics

Citations: 0

Influential citations: 0

References: 60

Year: 2026

Venue: N/A
