NUS · Apr 13, 2026 · arXiv:2604.11399

Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging

Zihang Fu, Haonan Wang, Jian Kang, Kenji Kawaguchi, Jiaying Wu

AI Summary

The paper introduces MERIT, a training-free framework to restore temporal reasoning (TR) in Video-Language Models (VLMs) by selectively merging layers from a text-only LLM backbone. MERIT optimizes for improved TR while maintaining temporal perception (TP) by searching for optimal layer-wise self-attention merging recipes. Experiments across multiple VLMs and benchmarks demonstrate that MERIT improves TR, preserves TP, and outperforms full-model merging and random layer selection, highlighting the importance of targeted layer selection.

Key Contribution

VLMs can regain lost temporal reasoning abilities without retraining, simply by strategically merging the right layers from their text-only LLM backbone.

Abstract

Multimodal adaptation equips large language models (LLMs) with perceptual capabilities, but often weakens the reasoning ability inherited from language-only pretraining. This trade-off is especially pronounced in video-language models (VLMs), where visual alignment can impair temporal reasoning (TR) over sequential events. We propose MERIT, a training-free, task-driven model merging framework for restoring TR in VLMs. MERIT searches over layer-wise self-attention merging recipes between a VLM and its paired text-only backbone using an objective that improves TR while penalizing degradation in temporal perception (TP). Across three representative VLMs and multiple challenging video benchmarks, MERIT consistently improves TR, preserves or improves TP, and generalizes beyond the search set to four distinct benchmarks. It also outperforms uniform full-model merging and random layer selection, showing that effective recovery depends on selecting the right layers. Interventional masking and frame-level attribution further show that the selected layers are disproportionately important for reasoning and shift model decisions toward temporally and causally relevant evidence. These results show that targeted, perception-aware model merging can effectively restore TR in VLMs without retraining.
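The layer-selective merging idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. The parameter naming scheme (`layers.<i>.self_attn.<proj>`), the use of linear interpolation with a coefficient `alpha`, and the exact form of the scoring function (`recipe_score`, with a `penalty` weight) are all assumptions made for illustration; the abstract states only that self-attention layers are merged selectively and that the objective rewards TR while penalizing TP degradation.

```python
from typing import Dict, Iterable, List

# Toy "state dict": parameter name -> flat list of weights.
# Real checkpoints hold tensors; plain lists keep the sketch dependency-free.
Weights = Dict[str, List[float]]


def merge_selected_layers(
    vlm: Weights,
    llm: Weights,
    layers: Iterable[int],
    alpha: float = 0.5,
) -> Weights:
    """Interpolate self-attention weights toward the text-only backbone,
    but only in the chosen layers. Everything else stays as in the VLM.

    Assumptions (hypothetical, not from the paper): parameters are named
    "layers.<i>.self_attn.<proj>", and merging is linear interpolation.
    """
    prefixes = tuple(f"layers.{i}.self_attn." for i in layers)
    merged: Weights = {}
    for name, w_vlm in vlm.items():
        if prefixes and name.startswith(prefixes):
            w_llm = llm[name]
            merged[name] = [
                (1.0 - alpha) * a + alpha * b for a, b in zip(w_vlm, w_llm)
            ]
        else:
            merged[name] = list(w_vlm)  # copy untouched parameters
    return merged


def recipe_score(
    tr_acc: float,
    tp_acc: float,
    base_tr: float,
    base_tp: float,
    penalty: float = 1.0,
) -> float:
    """Assumed objective: reward the TR gain over the unmerged VLM and
    subtract a penalty only when TP falls below its baseline."""
    tr_gain = tr_acc - base_tr
    tp_drop = max(0.0, base_tp - tp_acc)
    return tr_gain - penalty * tp_drop
```

Under these assumptions, a search procedure would propose candidate layer sets, build each merged model with `merge_selected_layers`, evaluate it on TR and TP search sets, and keep the recipe with the highest `recipe_score`.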

Architecture Design (Transformers, SSMs, MoE) · Multimodal Models · Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
