This paper introduces Asuka-Bench, a benchmark for evaluating code agents under underspecified user intent and iterative refinement. It simulates real-world web development, where user requirements evolve through feedback on intermediate outputs, and assesses eight large language models (LLMs) on 50 web tasks covering 784 evaluation criteria. The results show large performance disparities among models: weighted Task Pass Rate spans 38 percentage points, and even the strongest model completes only 52% of projects after three rounds of refinement, which highlights how hard it is for code agents to adapt to user feedback.
Code agents struggle with evolving user requirements: under iterative feedback, leading LLMs are separated by a 38-percentage-point gap in weighted Task Pass Rate.
Existing code-generation benchmarks score a single mapping from a complete prompt to a one-shot output. However, real web development is different. Users seldom write a full spec at the start; many requirements only become clear once they look at an intermediate result and react to it. We present Asuka-Bench, a benchmark that pairs underspecified user intent with multi-round refinement, grounded in browser-rendered behavior. Each task is resolved through a closed loop: a Code Agent generates a web project, a UI Agent executes test cases on the deployed site, and a User LLM turns evaluation outcomes into natural-language feedback for the next round. The benchmark comprises 50 web tasks with 784 evaluation criteria and 2402 expected outcomes. We benchmark 8 LLMs across 2 agent frameworks. The results separate models clearly: weighted Task Pass Rate varies by 38 percentage points and models also differ substantially in their ability to repair from feedback. Asuka-Bench is also far from saturated: even the strongest model completes only 52% of projects after three rounds.
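The closed loop described in the abstract can be pictured with a minimal sketch. Everything named here is a hypothetical stand-in rather than Asuka-Bench's actual interface: the three agents are modeled as plain callables, a "project" is reduced to a string, and stopping early once every criterion passes is an assumption. The default of three rounds follows the round count reported in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CriterionResult:
    criterion: str
    passed: bool


# Hypothetical agent signatures (assumptions, not the benchmark's real API).
CodeAgent = Callable[[str, Optional[str]], str]    # (intent, feedback) -> project
UIAgent = Callable[[str], List[CriterionResult]]   # project -> per-criterion results
UserLLM = Callable[[List[CriterionResult]], str]   # results -> natural-language feedback


def run_task(intent: str, code_agent: CodeAgent, ui_agent: UIAgent,
             user_llm: UserLLM, max_rounds: int = 3) -> List[List[CriterionResult]]:
    """Run the generate -> test -> feedback loop and return the results of each round."""
    feedback: Optional[str] = None
    history: List[List[CriterionResult]] = []
    for _ in range(max_rounds):
        project = code_agent(intent, feedback)   # Code Agent builds or repairs the project
        results = ui_agent(project)              # UI Agent checks the deployed site
        history.append(results)
        if all(r.passed for r in results):       # assumed stopping rule
            break
        feedback = user_llm(results)             # User LLM turns outcomes into feedback
    return history


# Toy stand-ins so the sketch runs end to end.
CRITERIA = ["form", "validation"]


def toy_code_agent(intent: str, feedback: Optional[str]) -> str:
    # The "project" is just a comma-separated list of implemented features.
    features = {"form"}
    if feedback:
        features |= set(feedback.split(","))
    return ",".join(sorted(features))


def toy_ui_agent(project: str) -> List[CriterionResult]:
    features = set(project.split(","))
    return [CriterionResult(c, c in features) for c in CRITERIA]


def toy_user_llm(results: List[CriterionResult]) -> str:
    # Feedback simply names the failed criteria.
    return ",".join(r.criterion for r in results if not r.passed)


history = run_task("signup page", toy_code_agent, toy_ui_agent, toy_user_llm)
```

In this toy run, the first round misses the "validation" criterion, the feedback names it, and the second round passes, so the loop stops before the round limit. A real setup would replace the stand-ins with an LLM-backed code agent, a browser-driving UI agent, and an LLM that phrases feedback the way a user would.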