Flux Attention is a context-aware hybrid attention mechanism that dynamically routes each layer to either Full Attention (FA) or Sparse Attention (SA) based on the input context, addressing the quadratic-complexity bottleneck of standard attention in long-context LLMs. A lightweight Layer Router, trained for only 12 hours on 8×A800 GPUs, is integrated into frozen pretrained LLMs to enable this adaptive routing. Experiments on long-context and mathematical reasoning benchmarks show a better trade-off between performance and inference speed than baseline models, with speedups of up to 2.8× in the prefill stage and 2.0× in the decode stage.
Forget static attention allocation – Flux Attention dynamically routes layers between full and sparse attention based on context, delivering significant speedups without sacrificing performance in long-context LLMs.
The quadratic computational complexity of standard attention mechanisms presents a severe scalability bottleneck for LLMs in long-context scenarios. While hybrid attention mechanisms combining Full Attention (FA) and Sparse Attention (SA) offer a potential solution, existing methods typically rely on static allocation ratios that fail to accommodate the variable retrieval demands of different tasks. Furthermore, head-level dynamic sparsity often introduces severe computational load imbalance and synchronization long-tails, which hinder hardware acceleration during autoregressive decoding. To bridge this gap, we introduce Flux Attention, a context-aware framework that dynamically optimizes attention computation at the layer level. By integrating a lightweight Layer Router into frozen pretrained LLMs, the proposed method adaptively routes each layer to FA or SA based on the input context. This layer-wise routing preserves high-fidelity information retrieval while ensuring contiguous memory access, translating theoretical computational reductions into practical wall-clock speedups. As a parameter-efficient approach, our framework requires only 12 hours of training on 8$\times$A800 GPUs. Extensive experiments across multiple long-context and mathematical reasoning benchmarks demonstrate that Flux Attention achieves a superior trade-off between performance and inference speed compared with baseline models, with speed improvements of up to $2.8\times$ and $2.0\times$ in the prefill and decode stages.
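The layer-level routing idea can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation: the abstract does not specify the router architecture or the sparse pattern, so the mean-pooled linear scorer, the zero threshold, the `sink`/`window` sparse pattern, and all names (`LayerRouter`, `route_layers`, `attend`) are assumptions made for illustration. The sketch only shows the structural point that one decision is made per layer, so every head in that layer runs the same attention variant.

```python
import math


def full_attention(query, keys, values):
    """Softmax attention of one query vector over every key (cost grows with len(keys))."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def sparse_attention(query, keys, values, sink=1, window=2):
    """Assumed sparse pattern: attend only to the first `sink` and last `window` tokens."""
    n = len(keys)
    kept = sorted(set(range(min(sink, n))) | set(range(max(0, n - window), n)))
    return full_attention(query, [keys[i] for i in kept], [values[i] for i in kept])


class LayerRouter:
    """Hypothetical router: a linear score on the mean-pooled context, thresholded at 0."""

    def __init__(self, weights, bias=0.0):
        self.weights = weights
        self.bias = bias

    def choose(self, context):
        dim = len(context[0])
        pooled = [sum(tok[d] for tok in context) / len(context) for d in range(dim)]
        score = sum(w * p for w, p in zip(self.weights, pooled)) + self.bias
        return "FA" if score > 0 else "SA"


def route_layers(context, routers):
    """One decision per layer, so every head in a layer runs the same variant."""
    return [router.choose(context) for router in routers]


def attend(mode, query, keys, values):
    """Dispatch to the attention variant chosen for this layer."""
    if mode == "FA":
        return full_attention(query, keys, values)
    return sparse_attention(query, keys, values)
```

In this sketch the routing decision depends only on the input context, so it can be computed once per input and reused for every token in that layer. Because the choice is made per layer rather than per head, all heads in a layer do the same amount of work, which is the property the abstract credits with avoiding load imbalance during decoding.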