This paper introduces LLM-powered random GUI testing, a hybrid approach that uses LLMs to suggest events for escaping UI tarpits, a common problem in mobile app testing where exploration gets stuck in local UI regions. The approach monitors UI similarity to detect tarpits and then prompts an LLM to suggest actions for escaping them. Experiments on 12 real-world apps show that this approach significantly improves code coverage (improvements of 54.8% for HybridMonkey and 44.8% for HybridDroidbot) and bug detection compared to baseline random testing methods.
LLMs can help automated mobile app testing break free from UI "tarpits," yielding roughly 50% higher code coverage and the discovery of dozens of previously unknown bugs in real-world apps.
Random GUI testing is a widely used technique for testing mobile apps. However, its effectiveness is limited by a notorious issue: UI exploration tarpits, where exploration becomes trapped in local UI regions, impeding test coverage and bug discovery. In this experience paper, we introduce LLM-powered random GUI testing, a novel hybrid testing approach for mitigating UI tarpits during random testing. Our approach monitors UI similarity to identify tarpits and queries LLMs to suggest promising events for escaping the encountered tarpits. We implement our approach on top of two different automated input generation (AIG) tools for mobile apps: (1) HybridMonkey upon Monkey, a state-of-the-practice tool; and (2) HybridDroidbot upon Droidbot, a state-of-the-art tool. We evaluate them on 12 popular, real-world apps. The results show that HybridMonkey and HybridDroidbot outperform all baselines, achieving average coverage improvements of 54.8% and 44.8%, respectively, and detecting the highest number of unique crashes. In total, we found 75 unique bugs, including 34 previously unknown bugs. To date, 26 bugs have been confirmed and fixed. We also applied HybridMonkey to WeChat, a popular industrial app with billions of monthly active users; there, HybridMonkey achieved higher activity coverage and found more bugs than random testing.
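The tarpit-detection step described above can be sketched as follows. This is a minimal, hypothetical illustration of the general idea (flagging exploration as trapped when recent UI states remain highly similar), not the paper's actual implementation; the Jaccard metric over widget identifiers, the function names, and the threshold are all illustrative assumptions.

```python
# Hypothetical sketch: detect a UI tarpit by monitoring similarity
# between recently observed UI states. A state is modeled as a set of
# widget identifiers; the metric and threshold are illustrative only.

def ui_similarity(state_a: set, state_b: set) -> float:
    """Jaccard similarity between two UI states (sets of widget ids)."""
    if not state_a and not state_b:
        return 1.0
    return len(state_a & state_b) / len(state_a | state_b)

def in_tarpit(recent_states: list, threshold: float = 0.7) -> bool:
    """Flag a tarpit when every consecutive pair of recent states
    exceeds the similarity threshold."""
    if len(recent_states) < 2:
        return False
    return all(
        ui_similarity(a, b) >= threshold
        for a, b in zip(recent_states, recent_states[1:])
    )

# Example: three nearly identical screens are flagged as a tarpit,
# after which the tester would query an LLM for an escape event.
screens = [{"btn_ok", "list", "menu"},
           {"btn_ok", "list", "menu", "toast"},
           {"btn_ok", "list", "menu"}]
print(in_tarpit(screens))  # True
```

Once a tarpit is flagged, the hybrid loop would hand the current UI context to the LLM to propose an escape event, then resume random exploration.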