Feb 24, 2026 · arXiv:2602.21015

From Perception to Action: An Interactive Benchmark for Vision Reasoning

Yuhao Wu, Maojia Song, Yihuai Lan, Lei Wang, Zhiqiang Hu, Yao Xiao, Heng Zhou, Weihua Zheng, Dylan Raharja, Soujanya Poria

AI Summary

The paper introduces CHAIN, a new interactive benchmark for evaluating vision reasoning in dynamic, physics-driven 3D environments, focusing on tasks like interlocking puzzles and 3D packing. CHAIN assesses a model's ability to understand, plan, and execute structured action sequences grounded in physical constraints, moving beyond static perception-based evaluations. Experiments with state-of-the-art VLMs and diffusion models reveal limitations in internalizing physical structure, causal constraints, and translating perception into effective actions for long-horizon planning.

Key Contribution

Current VLMs can ace image quizzes, but completely fumble when asked to stack blocks in a physically plausible way, revealing a critical gap in understanding real-world physics.

Abstract

Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints: they often fail to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.

Eval Frameworks & Benchmarks, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 39

Year: 2026

Venue: N/A
