Manchester | May 28, 2026 | arXiv:2605.30117

VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing

Haoyuan Shi, Xiancong Ren, Yingji Zhang, Qinfang Zhang, Jiayu Hu, Haozhe Shan, Hanning Dong, Jinpeng Lu, Yinda Chen, Y. Zhang, Yong Dai, Xiaozhu Ju

AI Summary

VLA-Trace is introduced as a diagnostic framework for Vision-Language-Action (VLA) models, using representation tracing via CKA, attention knockout interventions, and behavioral probes. The framework reveals distinct modality-specific adaptation dynamics, differing multimodal routing strategies, and limitations in fine-grained semantic following in $\pi_{0.5}$ and OpenVLA. This analysis provides insights into representation-preserving adaptation, causal VLA circuits, and compositional semantic control for VLA models.

Key Contribution

VLA-Trace shows that $\pi_{0.5}$ and OpenVLA excel at visually grounded trajectory generation but remain limited in fine-grained semantic following, and that the two models differ in their modality-specific adaptation dynamics and multimodal routing strategies.

Abstract

Understanding how Vision-Language-Action (VLA) models transform multimodal knowledge into embodied control remains an open challenge. We present VLA-Trace, a progressive diagnostic framework that analyzes VLA models through a unified evidence chain from representation dynamics to causal control attribution and behavioral manifestation. It specifically combines cross-modal and checkpoint-drift centered kernel alignment (CKA) to trace representation evolution, attention knockout interventions to identify modality-specific control pathways, and rollout-level behavioral probes to examine grounding, shortcut dependence, and semantic following. Experiments on $\pi_{0.5}$ and OpenVLA reveal three key findings. First, the two models exhibit distinct modality-specific adaptation dynamics during VLA finetuning. Second, they rely on different multimodal routing strategies and layer-wise dependencies during action decoding. Third, although VLA policies excel at visually grounded trajectory generation, they remain limited in fine-grained semantic following. These findings highlight future directions for representation-preserving adaptation, causal VLA circuits, and compositional semantic control.
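As an illustration of the representation-tracing step, the following is a minimal sketch of linear CKA between two sets of activations. The abstract does not say which CKA variant, layers, or token pooling VLA-Trace uses, so the linear form, the function name `linear_cka`, and the synthetic arrays are assumptions made only for this example. For checkpoint-drift CKA, `X` and `Y` would hold activations of the same layer on the same inputs before and after finetuning; for cross-modal CKA, they would hold activations from two modalities.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between activation matrices of shape (n_examples, n_features).

    X and Y must share the number of rows; feature widths may differ.
    """
    # Center each feature so the score reflects covariance structure only.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Synthetic stand-ins for layer activations (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
feats_before = rng.normal(size=(64, 16))

# A rotated and rescaled copy: linear CKA is invariant to both operations.
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
feats_rotated = 2.0 * feats_before @ Q

# Independent features, standing in for a representation that has drifted.
feats_unrelated = rng.normal(size=(64, 16))

cka_rotated = linear_cka(feats_before, feats_rotated)
cka_unrelated = linear_cka(feats_before, feats_unrelated)
print(f"rotated copy: {cka_rotated:.3f}, unrelated: {cka_unrelated:.3f}")
```

A score near 1 indicates that the two representations share the same similarity structure up to rotation and scaling, while lower scores indicate drift. Computing the score layer by layer across checkpoints is one plausible way to produce the kind of drift profile the abstract refers to.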

Interpretability & Mechanistic Interp | Multimodal Models | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 29

Year: 2026

Venue: N/A
