HIT · May 6, 2026 · arXiv:2605.05126

ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

Wei Li, Jizhihui Liu, Li Yixing, Junwen Tong, Rui Shao, Liqiang Nie

AI Summary

The paper introduces ConsisVLA-4D, a Vision-Language-Action (VLA) framework designed to improve spatiotemporal consistency in 3D perception and 4D reasoning for robotic manipulation by addressing limitations in existing 2D-centric VLA models. ConsisVLA-4D incorporates three key modules: CV-Aligner for cross-view object semantic consistency, CO-Fuser for cross-object spatial geometric consistency, and CS-Thinker for cross-scene spatiotemporal consistency. Compared to OpenVLA, ConsisVLA-4D achieves performance improvements of 21.6% on the LIBERO benchmark and 41.5% on real-world platforms, with inference speedups of 2.3x and 2.4x, respectively.

Key Contribution

Robotic manipulation gets a serious upgrade: ConsisVLA-4D boosts performance by up to 41.5% and speeds up inference by 2.4x, all while ensuring your robot understands the scene in 3D *and* how it changes over time.

Abstract

Current Vision-Language-Action (VLA) models primarily focus on mapping 2D observations to actions, but exhibit notable limitations in spatiotemporal perception and reasoning: 1) spatial representations often rely on additional sensors, introducing substantial computational overhead; 2) visual reasoning is typically limited to future-frame prediction, lacking alignment with the instruction-grounded scene and thus compromising spatiotemporal consistency. To address these challenges, we propose ConsisVLA-4D, a unified and efficient framework that enhances spatiotemporal consistency in 3D perception and 4D reasoning. Specifically, we design: 1) CV-Aligner, which ensures cross-view object semantic consistency by filtering instruction-relevant regions and aligning object identities across multiple viewpoints; 2) CO-Fuser, which guarantees cross-object spatial geometric consistency by eliminating spatial relation ambiguities between objects across views using compact latent representations. Building upon these, we introduce 3) CS-Thinker to achieve cross-scene spatiotemporal consistency as actions unfold. It learns implicit knowledge of local dynamics from object-semantic tokens of CV-Aligner and global depth from geometric tokens of CO-Fuser, thereby enhancing efficient visual reasoning under scene variations. Extensive experiments demonstrate that, benefiting from its efficient spatiotemporal consistency design, ConsisVLA-4D achieves 21.6% and 41.5% performance improvements, along with 2.3-fold and 2.4-fold inference speedups compared to OpenVLA on the LIBERO benchmark and real-world platforms, respectively. ConsisVLA-4D is open-sourced and publicly available at
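As a rough illustration of what "filtering instruction-relevant regions" across viewpoints could look like, the sketch below keeps, for each camera view, the visual tokens whose embeddings are most similar to an instruction embedding. This is not the paper's CV-Aligner: the function names, the cosine-similarity scoring, and the fixed top-k budget are all assumptions made for illustration, and the abstract does not specify how the filtering is actually done.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_relevant_tokens(view_tokens, instruction_embedding, keep):
    """Hypothetical filter: return indices of the `keep` tokens most similar
    to the instruction embedding, listed in their original order."""
    scores = [cosine(token, instruction_embedding) for token in view_tokens]
    ranked = sorted(range(len(view_tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def filter_all_views(views, instruction_embedding, keep):
    """Apply the same hypothetical filter independently to every camera view."""
    return {
        name: filter_relevant_tokens(tokens, instruction_embedding, keep)
        for name, tokens in views.items()
    }
```

With two toy views of three 2-D tokens each and the instruction embedding `[1.0, 0.0]`, calling `filter_all_views(views, [1.0, 0.0], keep=2)` returns the two best-aligned token indices per view. Discarding the remaining tokens before later stages is one plausible way such filtering could reduce inference cost, though the paper's own mechanism may differ.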

Computer Vision · Multimodal Models · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
