BAIR | Jun 3, 2026 | arXiv:2606.04433

Stateful Visual Encoders for Vision-Language Models

Zirui Wang, Junwei Yu, Adam Yala, David M. Chan, Joseph E. Gonzalez, Trevor Darrell

AI Summary

This paper introduces a Stateful Visual Encoder, which conditions each visual representation on prior visual features and so addresses a limitation of stateless encoders, which encode each image independently. Under supervised finetuning, the method yields consistent improvements on tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning, and the gains hold across input resolutions, language model sizes, and VLM backbones. On real-world tasks in longitudinal radiology, fine-grained image comparison, and remote sensing, stateful encoders improve on generalist VLM baselines and can match or surpass specialized models in selected domains.

Key Contribution

Stateful Visual Encoders let VLMs condition each image's representation on prior visual context, yielding consistent gains on multi-image tasks.

Abstract

Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially when those changes do not affect the high-level semantics of the scene. We introduce a Stateful Visual Encoder, which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with stateful encoders achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These improvements are consistent across input resolutions, language model sizes, and VLM backbones. Finally, we validate our model on real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, where stateful encoders consistently improve generalist VLM baselines and can match or surpass specialized models in selected domains. Project page: https://statefulvisualencoders.github.io/
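To make the stateless versus stateful distinction concrete, the following is a minimal illustrative sketch, not the authors' architecture. The abstract does not say how the conditioning is implemented; this sketch assumes a single cross-attention step from the current image's patch features to the features of earlier images, added back as a residual. All function names, weight matrices, and the choice to store the conditioned features as state are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def stateless_encode(patches, w_embed):
    # Stateless baseline: each image is encoded on its own.
    # patches: (num_patches, patch_dim) -> features: (num_patches, d)
    return np.tanh(patches @ w_embed)

def stateful_encode(patches, prior_features, w_embed, w_q, w_k, w_v):
    # ASSUMPTION: conditioning is modeled as one cross-attention layer
    # over prior visual features, with a residual connection.
    current = stateless_encode(patches, w_embed)
    if prior_features is None or len(prior_features) == 0:
        return current  # first image: nothing to condition on
    q = current @ w_q
    k = prior_features @ w_k
    v = prior_features @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return current + attn @ v

def encode_sequence(images, params):
    # Encode images in order, carrying earlier features forward as state.
    state = None
    outputs = []
    for patches in images:
        feats = stateful_encode(patches, state, *params)
        outputs.append(feats)
        # ASSUMPTION: the state is the concatenation of all earlier features.
        state = feats if state is None else np.concatenate([state, feats], axis=0)
    return outputs
```

In this sketch the first image in a sequence is encoded exactly as the stateless baseline would encode it, while every later image's features depend on what came before, so the same image can receive different representations depending on its visual history.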

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
