Tsinghua AIPKU | Apr 20, 2026 | arXiv:2604.18000

Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

Haiweng Xu, Sipeng Zheng, Haoming Luo, Hao Luo, Wanpeng Zhang, Ziheng Xi, Zongqing Lu

AI Summary

The paper introduces BeTTER, a diagnostic benchmark designed to evaluate true embodied reasoning in Vision-Language-Action (VLA) models by applying targeted causal interventions and enforcing kinematic isolation. Evaluations using BeTTER reveal that state-of-the-art VLAs exhibit failures like lexical-kinematic shortcuts and semantic feature collapse in dynamic scenarios, which are masked by static evaluation protocols. The authors trace these failures to architectural bottlenecks like capacity compression and myopic downsampling that degrade the model's semantic representation, further validating these findings with real-world robotic experiments.

Key Contribution

Seemingly impressive VLA performance on robotic benchmarks crumbles when stress-tested with causal interventions, exposing a reliance on brittle shortcuts rather than genuine embodied reasoning.

Abstract

Recent Vision-Language-Action (VLA) models report impressive success rates on standard robotic benchmarks, fueling optimism about general-purpose physical intelligence. However, recent evidence suggests a systematic misalignment between standard benchmark success and true embodied reasoning, raising the question of whether these high scores reflect genuine cognitive capability. To address this gap, we introduce BeTTER, a diagnostic Benchmark for Testing True Embodied Reasoning in robotic policies. BeTTER applies targeted causal interventions (e.g., spatial layout shifts, temporal extrapolation) while enforcing kinematic isolation to explicitly decouple high-level reasoning failures from low-level execution limits. Through systematic evaluation, we reveal that state-of-the-art VLAs catastrophically fail in dynamic scenarios, exhibiting severe lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse. Crucially, our mechanistic analysis traces these symptoms to fundamental architectural bottlenecks, such as capacity compression and myopic downsampling, which systematically degrade the model's foundational semantic representation. We demonstrate that highly static evaluation protocols effectively mask this degradation by allowing optimization to overfit to sensorimotor priors. Supported by real-world robotic validation, our findings confirm that this representational breakdown is not a simulation artifact, highlighting the critical need for future VLA paradigms to resolve the structural tension between high-frequency control and high-level reasoning.

Eval Frameworks & Benchmarks, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 59

Year: 2026

Venue: N/A
