The paper addresses the computational bottleneck of the prefill stage in long-context LLM inference by introducing Cross-Layer Attention Aggregation (CLAA) to improve token-ranking heuristics. The authors diagnose instability in existing token importance estimates across layers using a novel Answer-Informed Oracle, which defines ground-truth token importance by measuring attention from generated answers back to the prompt. CLAA aggregates token importance scores across layers, which mitigates this variance and reduces Time-to-First-Token (TTFT) by up to 39% compared to the full KV cache baseline.
Token-ranking heuristics for LLM prefill are surprisingly unstable across layers, but simply aggregating attention scores across layers mitigates this variance and reduces Time-to-First-Token by up to 39%.
The prefill stage in long-context LLM inference remains a computational bottleneck. Recent token-ranking heuristics accelerate inference by selectively processing a subset of semantically relevant tokens. However, existing methods suffer from unstable token importance estimation, often varying between layers. Evaluating token-ranking quality independently from heuristic-specific architectures is challenging. To address this, we introduce an Answer-Informed Oracle, which defines ground-truth token importance by measuring attention from generated answers back to the prompt. This oracle reveals that existing heuristics exhibit high variance across layers: rankings can degrade sharply at specific layers, a failure mode invisible to end-to-end benchmarks. The diagnosis suggests a simple fix: aggregate scores across layers rather than relying on any single one. We implement this as Cross-Layer Attention Aggregation (CLAA), which closes the gap to the oracle upper bound and reduces Time-to-First-Token (TTFT) by up to 39% compared to the Full KV Cache baseline.
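The idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of a plain mean as the aggregation rule, the top-k overlap metric, and the toy scores are all assumptions made for illustration. The abstract states only that scores are aggregated across layers and that the oracle is built from answer-to-prompt attention.

```python
from typing import List

def oracle_importance(answer_to_prompt_attn: List[List[float]]) -> List[float]:
    """Hypothetical oracle: mean attention each prompt token receives from
    the generated answer tokens. Row i holds attention from answer token i
    to every prompt token (assumed already reduced over heads)."""
    n_answer = len(answer_to_prompt_attn)
    return [sum(col) / n_answer for col in zip(*answer_to_prompt_attn)]

def aggregate_across_layers(layer_scores: List[List[float]]) -> List[float]:
    """Assumed aggregation rule: average per-token scores over layers."""
    n_layers = len(layer_scores)
    return [sum(col) / n_layers for col in zip(*layer_scores)]

def top_k_tokens(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring prompt tokens, in prompt order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

def recall_at_k(scores: List[float], oracle: List[float], k: int) -> float:
    """Fraction of the oracle's top-k tokens that the heuristic also keeps."""
    kept = set(top_k_tokens(scores, k))
    gold = set(top_k_tokens(oracle, k))
    return len(kept & gold) / k

# Toy data (invented): 3 layers scoring 4 prompt tokens.
# The second layer is "degraded" and favors token 1.
layer_scores = [
    [0.5, 0.1, 0.3, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.4, 0.1, 0.4, 0.1],
]
# Attention from 2 answer tokens back to the 4 prompt tokens.
oracle = oracle_importance([
    [0.6, 0.0, 0.3, 0.1],
    [0.4, 0.1, 0.4, 0.1],
])

single_layer_recall = recall_at_k(layer_scores[1], oracle, k=2)
aggregated_recall = recall_at_k(aggregate_across_layers(layer_scores), oracle, k=2)
```

On this toy data, ranking by the degraded layer alone recovers one of the oracle's two most important tokens, while the cross-layer mean recovers both. A real system would take the scores from the model's attention maps rather than hand-written lists.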