VLA-Trace is a diagnostic framework for Vision-Language-Action (VLA) models that combines representation tracing via centered kernel alignment (CKA), attention knockout interventions, and behavioral probes. Applied to $\pi_{0.5}$ and OpenVLA, it reveals distinct modality-specific adaptation dynamics, different multimodal routing strategies, and limits in fine-grained semantic following. The findings point to future directions in representation-preserving adaptation, causal VLA circuits, and compositional semantic control.
VLA policies excel at visually grounded trajectory generation, but VLA-Trace shows that they remain limited in fine-grained semantic following and that $\pi_{0.5}$ and OpenVLA route and adapt modalities in distinct ways.
Understanding how Vision-Language-Action (VLA) models transform multimodal knowledge into embodied control remains an open challenge. We present VLA-Trace, a progressive diagnostic framework that analyzes VLA models through a unified evidence chain from representation dynamics to causal control attribution and behavioral manifestation. It specifically combines cross-modal and checkpoint-drift centered kernel alignment (CKA) to trace representation evolution, attention knockout interventions to identify modality-specific control pathways, and rollout-level behavioral probes to examine grounding, shortcut dependence, and semantic following. Experiments on $\pi_{0.5}$ and OpenVLA reveal three key findings. First, the two models exhibit distinct modality-specific adaptation dynamics during VLA finetuning. Second, they rely on different multimodal routing strategies and layer-wise dependencies during action decoding. Third, although VLA policies excel at visually grounded trajectory generation, they remain limited in fine-grained semantic following. These findings highlight future directions for representation-preserving adaptation, causal VLA circuits, and compositional semantic control.
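To make the representation-tracing step concrete, the following is a minimal sketch of linear CKA between two activation matrices. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which CKA variant VLA-Trace uses (linear or kernel, debiased or not), and the function name `linear_cka` and the synthetic arrays are hypothetical. One plausible reading is that "checkpoint-drift" CKA compares the same layer's activations across two finetuning checkpoints, while "cross-modal" CKA compares vision-token and language-token representations on the same inputs.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between activations x (n, d1) and y (n, d2).

    Rows are the same n inputs passed through two layers, two
    checkpoints, or two modality streams (hypothetical usage).
    """
    # Center each feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Synthetic stand-ins for layer activations (assumed shapes, not real data).
rng = np.random.default_rng(0)
base = rng.normal(size=(256, 64))

# An orthogonal rotation of the same features: CKA should stay near 1.
rotation, _ = np.linalg.qr(rng.normal(size=(64, 64)))
rotated = base @ rotation

# Independent features: CKA should be clearly lower.
unrelated = rng.normal(size=(256, 64))

cka_rotated = linear_cka(base, rotated)
cka_unrelated = linear_cka(base, unrelated)
print(f"rotated: {cka_rotated:.3f}, unrelated: {cka_unrelated:.3f}")
```

Because linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, a low score between a pretrained and a finetuned checkpoint suggests genuine representational drift rather than a mere change of basis.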