Independent researchers *Equally | University of Science and Technology | USTC | Jun 4, 2026 | arXiv:2606.05920

Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement

Liangtai Sun, Yaoming Zhu, Shuang Zhou, Jiaxing Liu, Fengjiao Chen, Lin Qiu, Xuezhi Cao, Xunliang Cai, Licheng Zhang, Zhendong Mao

AI Summary

This paper introduces Asuka-Bench, a benchmark for evaluating code agents under underspecified user intent and multi-round refinement. It simulates web development scenarios in which user requirements emerge through feedback on intermediate outputs, and evaluates 8 large language models (LLMs) across 2 agent frameworks on 50 web tasks with 784 evaluation criteria and 2402 expected outcomes. The results show a 38 percentage point spread in weighted Task Pass Rate across models, and even the strongest model completes only 52% of projects after three rounds of refinement, which underlines how hard it remains for code agents to adapt to user feedback.

Key Contribution

Code agents struggle with evolving user requirements: across the 8 LLMs evaluated, weighted Task Pass Rate varies by 38 percentage points, and models differ substantially in their ability to repair code from feedback.

Abstract

Existing code-generation benchmarks score a single mapping from a complete prompt to a one-shot output. However, real web development is different. Users seldom write a full spec at the start; many requirements only become clear once they look at an intermediate result and react to it. We present Asuka-Bench, a benchmark that pairs underspecified user intent with multi-round refinement, grounded in browser-rendered behavior. Each task is resolved through a closed loop: a Code Agent generates a web project, a UI Agent executes test cases on the deployed site, and a User LLM turns evaluation outcomes into natural-language feedback for the next round. The benchmark comprises 50 web tasks with 784 evaluation criteria and 2402 expected outcomes. We benchmark 8 LLMs across 2 agent frameworks. The results separate models clearly: weighted Task Pass Rate varies by 38 percentage points and models also differ substantially in their ability to repair from feedback. Asuka-Bench is also far from saturated: even the strongest model completes only 52% of projects after three rounds.
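The closed loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the function `run_refinement_loop`, the `RoundResult` record, the callable signatures, and the toy agents are all hypothetical names introduced here, and stopping early once every test case passes is an assumption. Only the three roles (Code Agent, UI Agent, User LLM) and the three-round budget come from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set

Project = Set[str]  # toy stand-in for a deployed web project (assumption)


@dataclass
class RoundResult:
    round_index: int
    passed: int
    total: int
    feedback: Optional[str]  # None when nothing is left to fix


def run_refinement_loop(
    request: str,
    code_agent: Callable[[str, Project, Optional[str]], Project],
    ui_agent: Callable[[Project], Dict[str, bool]],
    user_llm: Callable[[Dict[str, bool]], str],
    max_rounds: int = 3,
) -> List[RoundResult]:
    """Run generate -> test -> feedback rounds (hypothetical interface)."""
    history: List[RoundResult] = []
    project: Project = set()
    feedback: Optional[str] = None
    for round_index in range(1, max_rounds + 1):
        # Code Agent builds or repairs the project from the latest feedback.
        project = code_agent(request, project, feedback)
        # UI Agent runs the test cases against the deployed site.
        outcomes = ui_agent(project)
        passed = sum(outcomes.values())
        if passed == len(outcomes):
            history.append(RoundResult(round_index, passed, len(outcomes), None))
            break  # assumed early stop once every test case passes
        # User LLM turns the outcomes into natural-language feedback.
        feedback = user_llm(outcomes)
        history.append(RoundResult(round_index, passed, len(outcomes), feedback))
    return history


# Toy agents, for illustration only.
REQUIRED = ["login form", "dark mode", "search box"]
PREFIX = "Please add: "


def toy_code_agent(request: str, project: Project, feedback: Optional[str]) -> Project:
    updated = set(project)
    # First round: implement only the first requirement; later: follow feedback.
    updated.add(REQUIRED[0] if feedback is None else feedback[len(PREFIX):])
    return updated


def toy_ui_agent(project: Project) -> Dict[str, bool]:
    return {feature: feature in project for feature in REQUIRED}


def toy_user_llm(outcomes: Dict[str, bool]) -> str:
    first_failure = next(name for name, ok in outcomes.items() if not ok)
    return PREFIX + first_failure


history = run_refinement_loop(
    "Build me a small site", toy_code_agent, toy_ui_agent, toy_user_llm
)
for result in history:
    print(result)
```

In this toy run the underspecified request leaves two requirements undiscovered until the simulated user reacts to the intermediate result, which is the situation the benchmark is built to measure.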

Code Generation & Program Synthesis

Eval Frameworks & Benchmarks

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
