The paper introduces Frequency-Modulated Visual Restoration (FMVR), a plug-and-play strategy to improve LMM reasoning under visual token reduction by disentangling visual representations into low- and high-frequency components. FMVR uses AvgPool and MaxPool to derive and modulate these frequencies with lightweight learnable parameters, enhancing both salient and weak visual semantics. Integrating FMVR with Matryoshka Representation Learning allows for elastic adjustment of visual tokens during inference while preserving performance, achieving significant FLOPs reduction with minimal accuracy loss.
LMMs can slash FLOPs by 89% without sacrificing accuracy, thanks to a frequency-modulated visual restoration technique that preserves crucial visual semantics even with fewer tokens.
Large Multimodal Models (LMMs) struggle to adapt to varying computational budgets due to their numerous visual tokens. Previous methods attempted to reduce the number of visual tokens before or within LLMs; however, these strategies inevitably result in the loss of visual semantics. To address these issues, we introduce FMVR, a plug-and-play and extremely simple Frequency-Modulated Visual Restoration strategy that boosts the reasoning ability of LMMs under visual token reduction. Specifically, FMVR disentangles the visual representation of fewer visual tokens into low- and high-frequency components through AvgPool and MaxPool. The derived frequencies are subsequently modulated using lightweight learnable parameters. The high-frequency component from AvgPool acts as a saliency filter to enhance salient visual semantics, while the low-frequency component from MaxPool acts as an anti-saliency filter to strengthen weak visual semantics. This enables the preservation of visual semantics dominated by a few visual tokens and the restoration of diluted visual semantics. Additionally, we inject FMVR into Matryoshka Representation Learning to learn coarse-to-fine visual token sets, enabling the number of visual tokens to be elastically adjusted during inference while maintaining comparable performance. Experiments across 10 image-based and 4 video-based benchmarks demonstrate that FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89% while maintaining almost 100% of the original accuracy. The code will be open-sourced.
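The pooling-based frequency disentanglement described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the window size, the residual definitions of the low- and high-frequency components, and the scalar parameters `alpha`/`beta` (standing in for the lightweight learnable parameters) are all illustrative assumptions.

```python
import numpy as np

def fmvr_sketch(tokens, alpha=0.5, beta=0.5, window=4):
    """Illustrative FMVR-style modulation over an (N, d) matrix of visual tokens.

    Assumptions (not from the paper): high-frequency is taken as the residual
    of AvgPool, low-frequency as the residual of MaxPool, and both are scaled
    by hypothetical learnable scalars alpha and beta.
    """
    n, d = tokens.shape
    assert n % window == 0, "for simplicity, N must be divisible by the window"
    grouped = tokens.reshape(n // window, window, d)

    avg = grouped.mean(axis=1, keepdims=True)   # AvgPool: smooth summary per window
    mx = grouped.max(axis=1, keepdims=True)     # MaxPool: salient peaks per window

    high = grouped - avg   # high-frequency residual (saliency filter)
    low = grouped - mx     # anti-saliency residual (weak semantics)

    # Modulate the token representation with both components.
    out = grouped + alpha * high + beta * low
    return out.reshape(n, d)

tokens = np.arange(32, dtype=np.float64).reshape(8, 4)
restored = fmvr_sketch(tokens)
print(restored.shape)  # (8, 4): same token layout, modulated semantics
```

In a trained model, `alpha` and `beta` would be learned jointly with the LMM rather than fixed constants, and the pooling would operate on the reduced visual token sequence fed to the LLM.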