Mar 31, 2026 · arXiv:2603.29623

Enhancing LLM-Based Bug Reproduction for Android Apps via Pre-Assessment of Visual Effects

Xiangyang Xiao, Huaxun Huang, Rongxin Wu

AI Summary

The paper introduces LTGDroid, an approach that enhances LLM-based bug reproduction for Android apps by pre-assessing the visual effects of UI actions. During exploration, LTGDroid executes all possible UI actions on the current UI page, records their visual effects, and uses these cues to guide the LLM (GPT-4.1) in selecting the actions most likely to reproduce a given bug. In experiments on 75 bug reports from 45 popular Android apps, LTGDroid achieves an 87.51% reproduction success rate, improving over the state-of-the-art baselines by 49.16% and 556.30%.

Key Contribution

LTGDroid, instantiated with GPT-4.1, reproduces bugs with an 87.51% success rate on a benchmark of 75 Android bug reports by pre-assessing the visual effects of UI actions before the LLM chooses which one to take.

Abstract

In the development and maintenance of Android apps, the quick and accurate reproduction of user-reported bugs is crucial to ensure application quality and improve user satisfaction. However, this process is often time-consuming and complex. Therefore, there is a need for an automated approach that can explore the Application Under Test (AUT) and identify the correct sequence of User Interface (UI) actions required to reproduce a bug, given only a complete bug report. Large Language Models (LLMs) have shown remarkable capabilities in understanding textual and visual semantics, making them a promising tool for planning UI actions. Nevertheless, our study shows that even when using state-of-the-art LLM-based approaches, these methods still struggle to follow detailed bug reproduction instructions and replan based on new information, due to their inability to accurately predict and interpret the visual effects of UI components. To address these limitations, we propose LTGDroid. Our insight is to execute all possible UI actions on the current UI page during exploration, record their corresponding visual effects, and leverage these visual cues to guide the LLM in selecting UI actions that are likely to reproduce the bug. We evaluated LTGDroid, instantiated with GPT-4.1, on a benchmark consisting of 75 bug reports from 45 popular Android apps. The results show that LTGDroid achieves a reproduction success rate of 87.51%, improving over the state-of-the-art baselines by 49.16% and 556.30%, while requiring an average of 20.45 minutes and approximately $0.27 to successfully reproduce a bug. The LTGDroid implementation is publicly available at https://github.com/N3onFlux/LTGDroid.
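The core loop described in the abstract (probe every action on the current page, record its effect, then let the model choose using those effects) can be sketched as a toy program. This is a minimal illustration under stated assumptions, not LTGDroid's implementation: the app is a hypothetical in-memory screen graph (`ToyApp`), a "visual effect" is reduced to a text label of the resulting screen instead of a screenshot, restoring the probed page is assumed to be free, and `choose_action` uses simple word overlap as a stand-in for the GPT-4.1 query. All names (`Effect`, `pre_assess`, `choose_action`, `reproduce`, `GRAPH`, `STEPS`) are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Effect:
    """Recorded effect of one probed action (a screen label stands in for a screenshot)."""
    action: str
    resulting_screen: str


class ToyApp:
    """Hypothetical app modelled as a screen graph: screen -> {action label: next screen}."""

    def __init__(self, graph: dict[str, dict[str, str]], start: str) -> None:
        self.graph = graph
        self.screen = start

    def available_actions(self) -> list[str]:
        return list(self.graph.get(self.screen, {}))

    def perform(self, action: str) -> str:
        self.screen = self.graph[self.screen][action]
        return self.screen

    def restore(self, screen: str) -> None:
        # Assumption: returning to the probed page is free; a real device would
        # need back navigation or replaying the action prefix.
        self.screen = screen


def pre_assess(app: ToyApp) -> list[Effect]:
    """Execute every action on the current page, record its effect, then restore."""
    origin = app.screen
    effects = []
    for action in app.available_actions():
        effects.append(Effect(action, app.perform(action)))
        app.restore(origin)
    return effects


def choose_action(step: str, effects: list[Effect]) -> str | None:
    """Stand-in for the LLM query: pick the action whose effect best matches the step."""
    step_words = set(step.lower().split())

    def score(effect: Effect) -> int:
        # Count words shared by the report step and the recorded resulting screen.
        return len(step_words & set(effect.resulting_screen.lower().split()))

    best = max(effects, key=score)  # ties go to the first action listed
    return best.action if score(best) > 0 else None


def reproduce(app: ToyApp, steps: list[str], crash_screen: str) -> tuple[list[str], bool]:
    """Follow the bug-report steps, pre-assessing each page before acting."""
    trace: list[str] = []
    for step in steps:
        effects = pre_assess(app)
        action = choose_action(step, effects) if effects else None
        if action is None:
            break  # no recorded effect matches this step; this sketch simply stops
        app.perform(action)
        trace.append(action)
        if app.screen == crash_screen:
            return trace, True
    return trace, app.screen == crash_screen


# Toy scenario: the "gear icon" label never mentions settings, but its effect does.
GRAPH = {
    "home": {"tap gear icon": "settings page", "tap plus icon": "new note editor"},
    "settings page": {"tap toggle": "dark theme enabled", "tap back": "home"},
    "dark theme enabled": {"tap about": "crash dialog"},
}
STEPS = ["open the settings page", "turn on the dark theme", "tap about and see the crash"]

if __name__ == "__main__":
    print(reproduce(ToyApp(GRAPH, "home"), STEPS, "crash dialog"))
```

Running the script prints `(['tap gear icon', 'tap toggle', 'tap about'], True)`. The toy mirrors the paper's stated insight in miniature: matching each report step against recorded effects, rather than against widget labels alone, lets the selector pick the unlabeled gear icon for "open the settings page".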

Code Generation & Program Synthesis · Natural Language Processing · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A