\hat{H}_{i}^{(l,s)}=\frac{H_{i}^{(l,s)}}{\sum_{j=1}^{N_{s,l}}H_{j}^{(l,s)}},\qquad\text{where}\quad\sum_{i=1}^{N_{s,l}}\hat{H}_{i}^{(l,s)}=1. (5)

We then integrate this normalized entropy with the layer score \mathcal{R}^{(l,s)} and a monotonic scale factor \phi(s)=s/S_{\max} to define a unified pruning tendency: q_{i}^{(s,l)}=\phi(s)\cdot\mathcal{R}^{(l,s)}\cdot\hat{H}_{i}^{(s,l)}. This formulation smoothly incorporates all three dimensions, ensuring that tokens with higher entropy, in layers with broader semantic scope, and at deeper scales are more likely to be pruned. Finally, the pruning tendency is mapped to a retention probability:

P_{\text{keep}}(i\mid s,l)=\begin{cases}1,&\text{if }s<D,\\ 1-\operatorname{clip}\!\big(\alpha_{\min}+(\alpha_{\max}-\alpha_{\min})\,q_{i}^{(s,l)},\;0,\;1\big),&\text{otherwise.}\end{cases} (6)

This three-factor integration provides a coherent pruning policy across tokens, layers, and scales, effectively discarding redundant details while preserving semantically critical structures.

Computational Optimization for Attention Entropy. The attention entropy is formally defined in Equation 2. A straightforward implementation would require explicitly materializing the full N\times N attention matrix in order to compute row-wise probability distributions and their entropy. However, such an approach is computationally prohibitive in practice, since modern attention implementations such as FlashAttention never instantiate the dense matrix explicitly, due to memory and runtime constraints. To address this challenge, we extend the original FlashAttention algorithm with an online entropy computation mechanism, which we refer to as Flash Attention Entropy.
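To make the retention policy in Eqs. (5)–(6) concrete, the following is a minimal NumPy sketch. The function names and the particular values of the layer score, scale indices, and the bounds alpha_min/alpha_max are illustrative placeholders, not values taken from the paper:

```python
import numpy as np

def keep_probability(H, R_ls, s, S_max, D, alpha_min=0.1, alpha_max=0.9):
    """Per-token retention probability at scale s, layer l (Eqs. 5-6).

    H      : per-token attention entropies H_i^{(l,s)}, shape (N,)
    R_ls   : layer score R^{(l,s)} (scalar)
    s      : current scale index; S_max : deepest scale; D : protected early scales
    alpha_min / alpha_max : hypothetical bounds on the pruning strength
    """
    if s < D:                       # early scales are never pruned
        return np.ones_like(H)
    H_hat = H / H.sum()             # Eq. (5): normalized entropy, sums to 1
    phi = s / S_max                 # monotonic scale factor phi(s)
    q = phi * R_ls * H_hat          # unified pruning tendency q_i^{(s,l)}
    # Eq. (6): map tendency to a retention probability via clipping
    return 1.0 - np.clip(alpha_min + (alpha_max - alpha_min) * q, 0.0, 1.0)
```

As in the text, tokens with higher normalized entropy, higher layer score, or deeper scale receive a larger pruning tendency and hence a lower retention probability, while all tokens at scales shallower than D are kept with probability 1.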
The key idea is to preserve the memory efficiency of FlashAttention while simultaneously maintaining sufficient statistics for entropy computation. Inspired by the online softmax strategy in FlashAttention, we design an incremental update scheme that avoids forming the full attention matrix. More concretely, recall that the entropy involves terms of the form x\log x over normalized attention scores. By leveraging the algebraic identity

kx\log(kx)=kx\log x+(\log k)\,kx,

we can decompose the entropy computation into two accumulative statistics: the standard normalization terms (the row max m and the exp-sum l) that are already tracked in FlashAttention, and an additional intermediate statistic that accumulates x\log x. This decomposition ensures that the entropy can be computed in a streaming fashion with negligible overhead relative to the baseline FlashAttention kernel. The resulting algorithm, termed Flash Attention Entropy, thus inherits the linear-memory and IO-efficient properties of FlashAttention while enabling exact entropy computation without approximation.

5 Experiments

5.1 Experimental Setup

Table 1: Quantitative comparison on GenEval and DPG. Note that GenEval follows the official protocol without rewritten prompts. Latency is measured on a single GPU with batch size 1.

Methods | GenEval: Two Obj. / Position / Color Attri. / Overall↑ | DPG: Entity / Relation / Attribute / Overall↑ | Latency(s)↓ | Speedup
Infinity-