WHU | May 1, 2026 | arXiv:2605.00814

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

Siyuan Huang, Xiaoye Qu, Yafu Li, T. Zhu, Zefeng He, Muxin Fu, Daizong Liu, Weibo Zheng, Yu Cheng

AI Summary

The paper identifies a "Visual Signal Dilution" problem in LVLMs, where attention to visual tokens decays as textual history grows during autoregressive generation. To address this, the authors introduce Persistent Visual Memory (PVM), a lightweight learnable module that provides a distance-agnostic retrieval pathway for visual embeddings. Experiments on Qwen3-VL show that PVM consistently improves accuracy, especially in complex reasoning, while resisting signal decay and accelerating convergence.
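The dilution effect can be illustrated with a toy calculation. The sketch below is not taken from the paper: it assumes every visual token shares one attention logit and every text token shares another, so the softmax collapses to a ratio of sums. The function name `visual_attention_mass` and the token counts are hypothetical choices for illustration.

```python
import math

def visual_attention_mass(num_visual, num_text, visual_logit=0.0, text_logit=0.0):
    """Fraction of softmax attention that lands on visual tokens.

    Toy assumption (not from the paper): all visual tokens share one logit
    and all text tokens share another, so softmax reduces to a ratio of sums.
    """
    visual_weight = num_visual * math.exp(visual_logit)
    text_weight = num_text * math.exp(text_logit)
    # The denominator plays the role of the partition function;
    # it grows as more text tokens are generated.
    return visual_weight / (visual_weight + text_weight)

# Hypothetical setting: 256 visual tokens, increasing amounts of generated text.
for generated in (0, 256, 1024, 4096):
    mass = visual_attention_mass(num_visual=256, num_text=generated)
    print(f"text tokens = {generated:5d}  visual attention mass = {mass:.3f}")
```

Under these assumptions the visual share equals `num_visual / (num_visual + num_text)` when the logits are equal, so it falls off roughly as the inverse of the text length once text tokens dominate. Real attention logits are not uniform, so this only shows the direction of the effect.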

Key Contribution

LVLMs can maintain sharper visual focus during long-form generation by adding a lightweight, learnable memory module that bypasses attention dilution.

Abstract

While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to ensure sustained, on-demand visual perception. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for precise visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM can resist length-induced signal decay and accelerate internal prediction convergence.
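To make the idea of a parallel retrieval branch concrete, here is a minimal sketch under stated assumptions. The abstract does not give PVM's internals, so this is not the authors' implementation: the dot-product read, the scalar `gate`, and the names `visual_memory_read` and `block_update` are all hypothetical. Learned projections are omitted, and the hidden state is assumed to have the same width as the visual embeddings.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def visual_memory_read(hidden, visual_memory):
    """Retrieve a visual summary for one decoding position.

    Hypothetical design: the softmax runs over visual entries only, so its
    normalizer does not grow with the amount of generated text.
    """
    scale = 1.0 / math.sqrt(len(hidden))
    scores = [scale * sum(h * v for h, v in zip(hidden, row)) for row in visual_memory]
    weights = softmax(scores)
    dim = len(visual_memory[0])
    return [sum(w * row[i] for w, row in zip(weights, visual_memory)) for i in range(dim)]

def block_update(hidden, visual_memory, ffn, gate=0.1):
    """Residual update with a memory branch running in parallel to the FFN.

    `gate` stands in for whatever learnable scaling a real module would use.
    """
    ffn_out = ffn(hidden)
    mem_out = visual_memory_read(hidden, visual_memory)
    return [h + f + gate * m for h, f, m in zip(hidden, ffn_out, mem_out)]

# Usage with two toy visual embeddings and a stand-in ReLU "FFN".
visual_memory = [[1.0, 0.0], [0.0, 1.0]]
hidden = [2.0, 0.0]
relu_ffn = lambda x: [max(0.0, v) for v in x]
print(block_update(hidden, visual_memory, relu_ffn))
```

Because the read depends only on the current hidden state and the stored visual embeddings, its output does not change with how many text tokens precede the current position. That is one way to read "distance-agnostic" in the abstract.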

Architecture Design (Transformers, SSMs, MoE) | Computer Vision Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 94

Year: 2026

Venue: N/A
