Mar 8, 2026 · arXiv:2603.07647

TempoFit: Plug-and-Play Layer-Wise Temporal KV Memory for Long-Horizon Vision-Language-Action Manipulation

Jun Sun, Boyu Yang, Jiahao Zhang, Ning Ma, Chencheng Wu, Siqing Zhang, Yiou Huang, Qiufeng Wang, Shan Liang, Yaran Chen

AI Summary

TempoFit is a training-free method that equips frozen Vision-Language-Action (VLA) policies with temporal memory for long-horizon manipulation. It treats the prefix attention K/V already present in the VLA model as a content-addressable runtime state and reuses them across timesteps, without introducing new tokens or trainable modules. Using Frame-Gap Temporal Bias (FGTB) and pre-attention residual loading, TempoFit improves average success rate on LIBERO-LONG by up to +4.0% while maintaining near-real-time latency, and it transfers to CALVIN and real-robot long-horizon tasks.

Key Contribution

TempoFit retrofits temporal context onto memoryless VLA policies by reusing their existing attention keys and values, improving long-horizon task success without retraining while keeping latency near real time.

Abstract

Pretrained Vision-Language-Action (VLA) policies have achieved strong single-step manipulation, but their inference remains largely memoryless, which is brittle in non-Markovian long-horizon settings with occlusion, state aliasing, and subtle post-action changes. Prior approaches inject history either by stacking frames, which scales visual tokens and latency while adding near-duplicate pixels, or by learning additional temporal interfaces that require (re-)training and may break the original single-frame inference graph. We present TempoFit, a training-free temporal retrofit that upgrades frozen VLAs through state-level memory. Our key insight is that prefix attention K/V already form a model-native, content-addressable runtime state; reusing them across timesteps introduces history without new tokens or trainable modules. TempoFit stores layer-wise FIFO prefix K/V at selected intermediate layers, performs parameter-free K-to-K retrieval with Frame-Gap Temporal Bias (FGTB), a fixed recency bias inspired by positional biases in NLP, to keep decisions present-dominant, and injects the retrieved context via pre-attention residual loading with norm-preserving rescaling to avoid distribution shift under frozen weights. On LIBERO-LONG, TempoFit improves strong pretrained backbones by up to +4.0% average success rate while maintaining near-real-time latency, and it transfers consistently to CALVIN and real-robot long-horizon tasks.
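The three steps named in the abstract (a layer-wise FIFO of prefix K/V, K-to-K retrieval with a recency bias, and residual loading with norm-preserving rescaling) can be illustrated with a minimal sketch. This is not the authors' implementation: the class `LayerKVMemory`, the function `residual_load`, the parameters `gap_penalty` and `alpha`, the linear form of the frame-gap bias, and the choice to rescale each vector back to its original norm are all assumptions made for illustration. It uses plain Python lists for a single layer and a single head.

```python
import math
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LayerKVMemory:
    """FIFO of past prefix keys/values for one layer (hypothetical structure)."""

    def __init__(self, max_frames):
        # deque with maxlen drops the oldest frame automatically
        self.frames = deque(maxlen=max_frames)

    def push(self, keys, values):
        self.frames.append((keys, values))

    def retrieve(self, cur_keys, gap_penalty=1.0):
        """K-to-K retrieval: current keys act as queries over stored keys.

        Assumed bias: each stored frame's scores are lowered by
        gap_penalty * frame_gap, so recent frames dominate.
        """
        if not self.frames:
            return None
        d = len(cur_keys[0])
        n = len(self.frames)
        mem_k, mem_v, bias = [], [], []
        for idx, (keys, values) in enumerate(self.frames):
            gap = n - idx  # 1 for the most recent stored frame
            for k, v in zip(keys, values):
                mem_k.append(k)
                mem_v.append(v)
                bias.append(-gap_penalty * gap)
        out_k, out_v = [], []
        for q in cur_keys:
            scores = [dot(q, k) / math.sqrt(d) + b for k, b in zip(mem_k, bias)]
            w = softmax(scores)
            out_k.append([sum(wi * k[j] for wi, k in zip(w, mem_k))
                          for j in range(d)])
            out_v.append([sum(wi * v[j] for wi, v in zip(w, mem_v))
                          for j in range(len(mem_v[0]))])
        return out_k, out_v


def residual_load(current, retrieved, alpha=0.2):
    """Add retrieved context as a residual, then restore each vector's norm."""
    loaded = []
    for c, r in zip(current, retrieved):
        mixed = [ci + alpha * ri for ci, ri in zip(c, r)]
        scale = norm(c) / (norm(mixed) or 1.0)  # guard against a zero vector
        loaded.append([scale * m for m in mixed])
    return loaded
```

In this sketch, a caller would push the current frame's prefix K/V after each control step and, on the next step, blend the retrieved context into the current K/V before attention is computed. Restoring the original norm is one simple way to keep the blended vectors close to the scale the frozen weights were trained on; the paper's exact rescaling rule is not given in the abstract.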

Architecture Design (Transformers, SSMs, MoE) · Multimodal Models · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
