Shenzhen Institute of Advanced
UNSW
Apr 30, 2026
arXiv:2604.27419

InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

Qiyao Wang, Haoran Hu, Longze Chen, Hongbo Wang, Hamid Alinejad-Rokny, Yuan Lin, Min Yang

AI Summary

The paper introduces InteractWeb-Bench, a new multimodal interactive benchmark designed to evaluate website generation by agents under realistic, non-expert user conditions with ambiguous and potentially contradictory instructions. It simulates diverse user behaviors through persona-driven instruction perturbations and provides an interactive environment with actions like Clarify, Implement, Verify, and Submit. Experiments show that current MLLM-based agents struggle with "blind execution" due to limitations in intent recognition and adaptive interaction, highlighting the need for improved models.

Key Contribution

Today's best multimodal agents still fall into "blind execution" traps when building websites from ambiguous user requests, revealing critical gaps in intent recognition and adaptive interaction.

Abstract

With the advancement of multimodal large language models (MLLMs) and coding agents, website development has shifted from manual programming to agent-based, project-level code synthesis. Existing benchmarks rely on idealized assumptions, notably well-structured, information-rich inputs and static execution settings. In contrast, real-world development is constrained by a critical bottleneck: the semantic misalignment between ambiguous, low-quality instructions from non-expert users and model understanding, which results in a failure mode that we term blind execution. To address this gap, we introduce InteractWeb-Bench, the first multimodal interactive benchmark for website generation under non-expert, low-code user conditions. InteractWeb-Bench introduces four types of user agents and persona-driven instruction perturbations to systematically simulate diverse user behaviors, including ambiguity, redundancy, and contradiction, grounded in requirements engineering defect taxonomies. We develop an interactive execution environment for agents, featuring a unified action space comprising Clarify, Implement, Verify, and Submit, which enables iterative intent refinement, code synthesis, and visual feedback-based validation. Extensive experiments and analysis reveal that frontier MLLM-based agents remain trapped in blind execution, exposing limitations in intent recognition and adaptive interaction.
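
The four-action loop described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the benchmark's actual interface: the `Episode` fields, the rule-based `choose_action` policy, and the `clarify_first` flag are all hypothetical names introduced here, and a real agent would be an MLLM rather than a fixed rule.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    # The four actions named in the abstract.
    CLARIFY = "Clarify"
    IMPLEMENT = "Implement"
    VERIFY = "Verify"
    SUBMIT = "Submit"


@dataclass
class Episode:
    """Hypothetical episode state; field names are illustrative only."""
    instruction: str
    open_questions: list[str]          # ambiguities the agent has not resolved
    code: str = ""
    verified: bool = False
    trace: list[Action] = field(default_factory=list)


def choose_action(ep: Episode, clarify_first: bool = True) -> Action:
    """Toy rule-based policy standing in for an MLLM agent (assumption)."""
    if clarify_first and ep.open_questions:
        return Action.CLARIFY
    if not ep.code:
        return Action.IMPLEMENT
    if not ep.verified:
        return Action.VERIFY
    return Action.SUBMIT


def run_episode(ep: Episode, clarify_first: bool = True,
                max_steps: int = 10) -> list[Action]:
    """Step the toy policy until it submits or runs out of steps."""
    for _ in range(max_steps):
        action = choose_action(ep, clarify_first)
        ep.trace.append(action)
        if action is Action.CLARIFY:
            ep.open_questions.pop(0)   # pretend the user agent answered
        elif action is Action.IMPLEMENT:
            ep.code = "<html><!-- generated site placeholder --></html>"
        elif action is Action.VERIFY:
            ep.verified = True         # stand-in for a visual feedback check
        else:
            break                      # SUBMIT ends the episode
    return ep.trace


interactive = Episode("make a shop site", ["Which payment methods?"])
blind = Episode("make a shop site", ["Which payment methods?"])
print([a.value for a in run_episode(interactive)])
print([a.value for a in run_episode(blind, clarify_first=False)])
```

In this toy setting, the second run never issues Clarify and submits with the ambiguity still open, which is one way to picture the blind execution failure mode the abstract describes.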

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Multimodal Models · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
