VLA models achieve a 1.73x inference speedup with only 5-7% overhead using RAPID, a new edge-cloud collaborative inference framework that exploits tolerance to visual noise and motion continuity.
Attention entropy reveals exploitable sparsity in VAR models, enabling up to 3.4x faster image generation without sacrificing quality.