Tsinghua AI | SJTU | Waseda | Jun 2, 2026 | arXiv:2606.03220

WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

Yu Meng, Yuxin Meng, Yuhan Suo, Junjie Wang, Yuhan Sun, Yiyao Yu, Ruixu Zhang, Ruining Hu, Yubin Wang, Shouwei Ruan, Bin Wang, Yuxiang Zhang, Yujiu Yang

AI Summary

This paper introduces WebRISE, an evaluation framework that compiles task requirements into Interaction Contract Graphs (ICGs) and scores MLLM-generated web artifacts on requirement-induced states and transitions. Across 442 tasks in five input modalities (Text, Markdown, Sketch, Image, Video), the strongest of 14 models reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality does not predict functional behavior. The results argue for evaluating generated pages with user-intent transitions and requirement checks rather than local evidence alone.

Key Contribution

Even the strongest of the 14 MLLMs evaluated reaches only 65.6% transition validity and 66.3% requirement coverage, so high visual quality does not imply a working page.

Abstract

Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist; defect injection shows ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.
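To make the ICG idea concrete, the following is a minimal sketch of how a contract graph and the two headline metrics could be represented. It is not the authors' implementation: the class names (`Transition`, `ContractGraph`), the `observed` and `passed` inputs, the shopping-cart example, and the definition of each metric as a plain pass fraction are all assumptions made for illustration. The abstract does not give the exact scoring formulas, and browser execution is omitted entirely.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Transition:
    """A user-intent edge: doing `action` in `source` should reach `target`."""
    source: str
    action: str
    target: str


@dataclass
class ContractGraph:
    """Hypothetical ICG container (names are illustrative, not from the paper)."""
    states: set[str]
    transitions: list[Transition]
    # requirement id -> "explicit" (user-stated) or "implicit" (product-level)
    requirements: dict[str, str] = field(default_factory=dict)


def transition_validity(graph: ContractGraph,
                        observed: dict[tuple[str, str], str]) -> float:
    """Fraction of contract transitions whose observed end state matches the target.

    `observed` maps (source, action) to the state the page actually reached.
    Assumed definition: a simple pass rate over all transitions.
    """
    if not graph.transitions:
        return 0.0
    ok = sum(1 for t in graph.transitions
             if observed.get((t.source, t.action)) == t.target)
    return ok / len(graph.transitions)


def requirement_coverage(graph: ContractGraph, passed: set[str],
                         kind: Optional[str] = None) -> float:
    """Fraction of requirement checks that passed, optionally filtered by kind."""
    ids = [r for r, k in graph.requirements.items() if kind is None or k == kind]
    if not ids:
        return 0.0
    return sum(1 for r in ids if r in passed) / len(ids)


# Toy shopping-cart contract (invented example data).
graph = ContractGraph(
    states={"empty_cart", "cart_with_item", "checkout"},
    transitions=[
        Transition("empty_cart", "click_add", "cart_with_item"),
        Transition("cart_with_item", "click_checkout", "checkout"),
        Transition("cart_with_item", "click_remove", "empty_cart"),
    ],
    requirements={
        "R1_add_item": "explicit",
        "R2_checkout": "explicit",
        "R3_empty_state_message": "implicit",
    },
)

# What a browser run might have recorded: the remove button does nothing.
observed = {
    ("empty_cart", "click_add"): "cart_with_item",
    ("cart_with_item", "click_checkout"): "checkout",
    ("cart_with_item", "click_remove"): "cart_with_item",
}
passed = {"R1_add_item", "R2_checkout"}

t_score = transition_validity(graph, observed)            # 2 of 3 transitions hold
explicit = requirement_coverage(graph, passed, "explicit")
implicit = requirement_coverage(graph, passed, "implicit")
```

In this toy run the page satisfies every explicit requirement yet misses the implicit one and fails a transition, which is the kind of gap between stated functions and product-level constraints that the abstract describes.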

Eval Frameworks & Benchmarks | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
