RAI | UT Austin | Jun 15, 2026 | arXiv:2606.16178

Scaling Short-Term Memory of Visuomotor Policies for Long-Horizon Tasks

Rutav Shah, Rajat Kumar Jenamani, Xiaohan Zhang, Lingfeng Sun, Roberto Martín-Martín, Yuke Zhu, Deva Ramanan, Karl Schmeckpeper

AI Summary

This paper introduces PRISM, a transformer-based architecture that gives visuomotor policies short-term memory for long-horizon tasks, addressing a limitation of typical imitation-learning policies, which rely only on immediate sensory input. PRISM uses gated attention to filter irrelevant details out of retrieved history and a hierarchical architecture that compresses local information into compact tokens before integrating them. Together these let the policy capture temporally extended dependencies and scale its short-term memory to up to two minutes. On the newly introduced ReMemBench benchmark of eight household manipulation tasks, PRISM improves on the strongest baseline by 5% to 12% in absolute terms.

Key Contribution

PRISM achieves absolute improvements of up to 15% over its no-memory variant on the RoboCasa and LIBERO benchmarks by leveraging short-term memory, challenging the notion that immediate sensory input suffices for complex decision-making.

Abstract

Many robotic tasks require short-term memory, whether it's retrieving an object that's no longer visible or turning off an appliance after a set period. Yet, most visuomotor policies trained via imitation learning rely only on immediate sensory input without using past experiences to guide decisions. We present PRISM, a transformer-based architecture that lets visuomotor policies use short-term memory effectively via two key components: (i) gated attention, which filters retrieved information to suppress irrelevant details, improving performance by reducing spurious correlations between the history and the current action prediction, and (ii) a hierarchical architecture that first compresses local information into compact tokens and then integrates them to capture temporally extended dependencies, reducing the compute and memory footprint. Together, these mechanisms enable us to scale short-term memory in visuomotor policies to up to two minutes. To systematically evaluate memory in visuomotor control, we introduce ReMemBench -- a benchmark of eight diverse household manipulation tasks spanning four categories of short-term memory -- designed to foster general memory mechanisms rather than siloed, task-specific solutions. PRISM consistently outperforms prior works, including recurrent architectures, transformers, and their variants -- achieving an absolute improvement of 5%--12% over the strongest baseline. On the RoboCasa and LIBERO benchmarks, it achieves absolute improvements of 11%--15% over its no-memory variant and fine-tuned Vision-Language-Action baselines such as GR00T-N1-3B and OpenVLA, despite not leveraging any large-scale pretraining. Together, PRISM and ReMemBench establish a foundation for developing and evaluating short-term memory-augmented visuomotor policies that scale to long-horizon tasks. Additional materials are available at https://shahrutav.github.io/short-term-memory
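
The two components named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no architectural details, so the mean-pooling compressor, the sigmoid gate computed from the query, and the names `compress_history`, `gated_attention`, and `w_gate` are all assumptions chosen only to show the general idea of compressing history into a few tokens and then gating what is retrieved from them.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def compress_history(history, chunk_size):
    """Hypothetical compressor: mean-pool every `chunk_size` consecutive
    frame features into one token. A learned module would go here; mean
    pooling is only a stand-in. Trailing frames that do not fill a chunk
    are dropped."""
    num_frames, dim = history.shape
    num_chunks = num_frames // chunk_size
    trimmed = history[: num_chunks * chunk_size]
    return trimmed.reshape(num_chunks, chunk_size, dim).mean(axis=1)


def gated_attention(query, memory, w_gate):
    """Hypothetical gated attention for a single query vector.

    query:  (dim,)     current-step feature
    memory: (n, dim)   compressed history tokens
    w_gate: (dim, dim) gate weights (assumed; learned in practice)
    """
    dim = query.shape[-1]
    scores = memory @ query / np.sqrt(dim)  # (n,) scaled dot-product scores
    weights = softmax(scores)               # attention over history tokens
    retrieved = weights @ memory            # (dim,) retrieved summary
    gate = sigmoid(w_gate @ query)          # (dim,) values in (0, 1)
    return gate * retrieved                 # element-wise filtering
```

In this sketch, a gate value near 0 suppresses the corresponding feature of the retrieved history and a value near 1 passes it through, while compressing 6 frames with `chunk_size=3` leaves only 2 tokens for attention to process.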

Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
