This paper systematically analyzes how efficient attention modules shape hybrid language-model architectures from three perspectives: scaling behavior, mechanism analysis, and architecture design. The authors find that efficient-attention design mainly affects how quickly long-context capability emerges, while different hybrids eventually converge to comparable long-context performance under sufficient training. They also introduce the concept of "Large-Window Laziness": larger sliding-window attention (SWA) windows can delay the formation of retrieval heads in full-attention layers. Guided by this mechanism, they show that applying NoPE to only the full-attention layers of a small-window SWA hybrid substantially improves long-context performance with negligible impact on short-context performance.
Larger sliding-window attention can paradoxically slow down the formation of critical retrieval mechanisms in language models, challenging conventional design assumptions.
Modern language models increasingly adopt hybrid architectures that combine full attention with efficient attention modules, such as sliding-window attention (SWA) and recurrent sequence mixers. However, how these efficient modules shape model capabilities remains poorly understood. To address this gap, we conduct a systematic analysis across hybrid architectures from three perspectives: scaling behavior, mechanism analysis, and architecture design. First, from a scaling perspective, we find that efficient-attention design primarily affects how fast long-context capability emerges, while different hybrids eventually converge to comparable long-context performance under sufficient training. Second, mechanistically, we show that long-range retrieval is mainly carried by full attention, whereas efficient attention shapes its optimization trajectory. This explains a counter-intuitive phenomenon we call Large-Window Laziness: larger SWA windows can delay the formation of retrieval heads in full-attention layers. Third, guided by this mechanism, we show that applying NoPE to only the full-attention layers of a small-window SWA hybrid substantially improves long-context performance with negligible impact on short-context performance.
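To make the contrast between full attention and SWA concrete, the following is a minimal sketch of the two attention masks. It is an illustration only, not the authors' implementation: the function names are hypothetical, and the window convention (a window of size w covers the current token plus the w - 1 tokens before it) is an assumption that may differ from the paper's definition.

```python
def causal_mask(seq_len):
    """Full causal attention: query i may attend to every key j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Causal sliding-window attention (assumed convention):
    query i may attend to key j only if j <= i and i - j < window."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def max_lookback(mask):
    """Largest distance i - j over all allowed (query, key) pairs."""
    return max(
        i - j
        for i, row in enumerate(mask)
        for j, allowed in enumerate(row)
        if allowed
    )


if __name__ == "__main__":
    full = causal_mask(8)
    swa = sliding_window_mask(8, window=3)
    # A full-attention layer can reach the first token from the last one;
    # a single SWA layer can only reach window - 1 tokens back.
    print(max_lookback(full), max_lookback(swa))
```

Within a single layer, only the full-attention mask allows a query to reach arbitrarily distant tokens directly, which is consistent with the abstract's finding that long-range retrieval is mainly carried by full attention.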