The paper introduces BeTTER, a diagnostic benchmark designed to evaluate true embodied reasoning in Vision-Language-Action (VLA) models by applying targeted causal interventions and enforcing kinematic isolation. Evaluations using BeTTER reveal that state-of-the-art VLAs exhibit failures such as lexical-kinematic shortcuts and semantic feature collapse in dynamic scenarios, failures that static evaluation protocols mask. The authors trace these failures to architectural bottlenecks, such as capacity compression and myopic downsampling, that degrade the model's semantic representation, and they validate the findings with real-world robotic experiments.
Seemingly impressive VLA performance on robotic benchmarks crumbles when stress-tested with causal interventions, exposing a reliance on brittle shortcuts rather than genuine embodied reasoning.
Recent Vision-Language-Action (VLA) models report impressive success rates on standard robotic benchmarks, fueling optimism about general-purpose physical intelligence. However, recent evidence suggests a systematic misalignment between standard benchmark success and true embodied reasoning, raising the question of whether these high scores reflect genuine cognitive capability. To address this gap, we introduce BeTTER, a diagnostic Benchmark for Testing True Embodied Reasoning in robotic policies. BeTTER applies targeted causal interventions (e.g., spatial layout shifts, temporal extrapolation) while enforcing kinematic isolation to explicitly decouple high-level reasoning failures from low-level execution limits. Through systematic evaluation, we reveal that state-of-the-art VLAs catastrophically fail in dynamic scenarios, exhibiting severe lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse. Crucially, our mechanistic analysis traces these symptoms to fundamental architectural bottlenecks, such as capacity compression and myopic downsampling, which systematically degrade the model's foundational semantic representation. We demonstrate that highly static evaluation protocols effectively mask this degradation by allowing optimization to overfit to sensorimotor priors. Supported by real-world robotic validation, our findings confirm that this representational breakdown is not a simulation artifact, highlighting the critical need for future VLA paradigms to resolve the structural tension between high-frequency control and high-level reasoning.