SJTU, WeChat AI | Apr 13, 2026 | arXiv:2604.11627

POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

Haicheng Wang, Yu-An Liu, Yikun Liu, Zhemeng Yu, Zhongyin Zhao, Yangxiu You, Zilin Yu, Le Tian, Weidi Xie, Yanfeng Wang

AI Summary

POINTS-Long is a dual-mode MLLM that dynamically scales its visual token count to trade off efficiency against accuracy, inspired by the human visual system. Focus mode keeps full performance on fine-grained visual tasks; standby mode uses only 1/40 to 1/10 of the visual tokens for long-form general visual understanding while retaining 97.7-99.7% of the original accuracy. The model also supports streaming visual understanding through a dynamically detachable KV-cache, which allows efficient maintenance of ultra-long visual memory.

Key Contribution

In standby mode, an MLLM can retain 97.7-99.7% of its original accuracy on long-form visual tasks with only 2.5-10% (1/40-1/10) of the original visual tokens, using a dual-mode design inspired by the human visual system.

Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences--especially in long-video and streaming scenarios--poses a major challenge to their scalability and real-world deployment. Thus, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes: focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains the optimal performance, while on long-form general visual understanding, the standby mode retains 97.7-99.7% of the original accuracy using only 1/40-1/10th of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive and efficient long-form visual understanding.
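To make the 1/40-1/10 figures concrete, the short Python sketch below works out how many visual tokens would remain at each end of that range. It is only an arithmetic illustration, not the paper's token-scaling method: the function name `standby_token_budget`, the round-up rule, and the example input of 64 frames at 256 tokens per frame are all assumptions chosen for illustration.

```python
from fractions import Fraction


def standby_token_budget(focus_tokens: int, ratio: Fraction) -> int:
    """Return the number of visual tokens left after scaling by `ratio`.

    Hypothetical helper: rounds up so the budget is never zero.
    The rounding rule is an assumption, not taken from the paper.
    """
    if focus_tokens <= 0:
        raise ValueError("focus_tokens must be positive")
    if not (0 < ratio <= 1):
        raise ValueError("ratio must be in (0, 1]")
    # Ceiling division on integers avoids floating-point rounding issues.
    return -(-focus_tokens * ratio.numerator // ratio.denominator)


if __name__ == "__main__":
    # Assumed example input: 64 frames x 256 visual tokens per frame.
    focus_tokens = 64 * 256  # 16384
    for ratio in (Fraction(1, 40), Fraction(1, 10)):
        budget = standby_token_budget(focus_tokens, ratio)
        print(f"ratio {ratio}: {budget} of {focus_tokens} tokens "
              f"({float(ratio):.1%})")
```

Under these assumed inputs, the 1/40 setting leaves 410 tokens and the 1/10 setting leaves 1639, which is the 2.5% to 10% range the abstract refers to.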

Architecture Design (Transformers, SSMs, MoE) | Inference & Quantization | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
