This paper introduces a Stateful Visual Encoder that conditions each visual representation in a vision-language model (VLM) on prior visual features, addressing a limitation of stateless encoders, which encode each image independently. Under supervised finetuning, the method yields consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning, and these gains hold across input resolutions, language model sizes, and VLM backbones. On real-world tasks in longitudinal radiology, fine-grained image comparison, and remote sensing, stateful encoders improve on generalist VLM baselines and can match or surpass specialized models in selected domains.
Stateful Visual Encoders let VLMs draw on prior visual context at encoding time, yielding consistent gains on multi-image tasks.
Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially when those changes do not affect the high-level semantics of the scene. We introduce a Stateful Visual Encoder, which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with stateful encoders achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These improvements are consistent across input resolutions, language model sizes, and VLM backbones. Finally, we validate our model on real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, where stateful encoders consistently improve generalist VLM baselines and can match or surpass specialized models in selected domains. Project page: https://statefulvisualencoders.github.io/
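To make the stateless/stateful distinction concrete, here is a minimal NumPy sketch. It is not the paper's architecture: the abstract does not specify how conditioning is implemented, so this sketch assumes a single residual cross-attention step from the current image's tokens to features cached from earlier images. All names (`stateless_encode`, `stateful_encode`, `encode_sequence`) and all weight matrices are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def stateless_encode(patches, w_embed):
    # Stateless baseline: each image is encoded on its own.
    # patches: (num_patches, patch_dim) -> tokens: (num_patches, d)
    return patches @ w_embed

def stateful_encode(patches, prior_feats, w_embed, w_q, w_k, w_v):
    # ASSUMPTION: conditioning is modeled as one residual cross-attention
    # step over prior features. The paper's actual mechanism may differ.
    tokens = stateless_encode(patches, w_embed)
    if prior_feats is None or len(prior_feats) == 0:
        return tokens  # no history yet: identical to the stateless encoder
    q = tokens @ w_q
    k = prior_feats @ w_k
    v = prior_feats @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return tokens + attn @ v

def encode_sequence(images, params):
    # Encode images in order, carrying earlier features forward as state.
    state, outputs = None, []
    for patches in images:
        feats = stateful_encode(patches, state, *params)
        outputs.append(feats)
        state = feats if state is None else np.concatenate([state, feats], axis=0)
    return outputs

rng = np.random.default_rng(0)
shapes = [(12, 8), (8, 8), (8, 8), (8, 8)]  # w_embed, w_q, w_k, w_v
params = tuple(rng.normal(size=s) for s in shapes)
images = [rng.normal(size=(4, 12)) for _ in range(3)]
outs = encode_sequence(images, params)
```

In this sketch the first image's features equal the stateless encoding, while every later image's features depend on what came before, so a downstream language model receives representations that already reflect the visual history rather than having to compare independently encoded images.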