Center of Excellence for Generative AI | Apr 6, 2026 | arXiv:2604.05159

Planning to Explore: Curiosity-Driven Planning for LLM Test Generation

Alfonso Amayuelas, Firas Laakom, Piotr Pikekos, Wenyi Wang, Yuhui Wang, Jurgen Schmidhuber, W. Wang

AI Summary

The paper introduces CovQValue, a method for LLM-based test generation that uses curiosity-driven planning to overcome the limitations of greedy coverage maximization. CovQValue treats the program's branch structure as an unknown environment and feeds an evolving coverage map back to the LLM, which generates diverse candidate plans; the most informative plan is then selected by LLM-estimated Q-values. Experiments on TestGenEval Lite and a new benchmark, RepoExploreBench, show that CovQValue outperforms greedy selection, achieving 51-77% higher branch coverage on TestGenEval Lite and 40-74% on RepoExploreBench.

Key Contribution

LLMs can explore codebases far more effectively by planning test sequences that balance immediate coverage with long-term reachability, boosting branch coverage by up to 77% compared to greedy methods.

Abstract

The use of LLMs for code generation has naturally extended to code testing and evaluation. As codebases grow in size and complexity, so does the need for automated test generation. Current approaches for LLM-based test generation rely on strategies that maximize immediate coverage gain, a greedy approach that plateaus on code where reaching deep branches requires setup steps that individually yield zero new coverage. Drawing on principles of Bayesian exploration, we treat the program's branch structure as an unknown environment, and an evolving coverage map as a proxy probabilistic posterior representing what the LLM has discovered so far. Our method, CovQValue, feeds the coverage map back to the LLM, generates diverse candidate plans in parallel, and selects the most informative plan by LLM-estimated Q-values, seeking actions that balance immediate branch discovery with future reachability. Our method outperforms greedy selection on TestGenEval Lite, achieving 51-77% higher branch coverage across three popular LLMs and winning on 77-84% of targets. In addition, we build a benchmark for iterative test generation, RepoExploreBench, where our method achieves improvements of 40-74%. These results show the potential of curiosity-driven planning methods for LLM-based exploration, enabling more effective discovery of program behavior through sequential interaction.
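The contrast between greedy selection and Q-value selection can be illustrated with a small sketch. This is not the paper's implementation: the `Plan` class, the branch ids, the discount `gamma`, and the hard-coded `est_future` numbers are all hypothetical stand-ins for quantities that, in the method described above, would come from the LLM and from real coverage measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A candidate test plan (hypothetical structure, for illustration only)."""
    name: str
    covers: frozenset   # branch ids the plan is expected to hit right away
    est_future: float   # stand-in for an LLM's estimate of branches unlocked later


def immediate_gain(plan, coverage_map):
    # New branches only: anything already in the coverage map does not count.
    return len(plan.covers - coverage_map)


def select_greedy(plans, coverage_map):
    # Greedy baseline: pick the plan with the largest immediate coverage gain.
    return max(plans, key=lambda p: immediate_gain(p, coverage_map))


def select_by_q(plans, coverage_map, gamma=0.9):
    # Q-style score: immediate gain plus discounted estimate of future reachability.
    return max(
        plans,
        key=lambda p: immediate_gain(p, coverage_map) + gamma * p.est_future,
    )


# Branch "b1" has already been covered by earlier tests.
coverage_map = frozenset({"b1"})

candidates = [
    # Hits one new branch now, but opens nothing further.
    Plan("call_getter", frozenset({"b1", "b2"}), est_future=0.0),
    # A setup step: zero new coverage now, but assumed to unlock deep branches.
    Plan("build_session", frozenset({"b1"}), est_future=5.0),
]

greedy_choice = select_greedy(candidates, coverage_map).name
q_choice = select_by_q(candidates, coverage_map).name
print(greedy_choice, q_choice)
```

Under these assumed numbers the greedy rule never picks the setup step, because its immediate gain is zero, while the Q-style score prefers it. This is the plateau the abstract attributes to greedy coverage maximization.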

Code Generation & Program Synthesis | Eval Frameworks & Benchmarks | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 23

Year: 2026

Venue: N/A
