Queen Mary University of London, UK
{waylon.li,tiejun.ma}@ed.ac.uk  scohen@inf.ed.ac.uk

Abstract

Attention steering is an important technique for controlling model focus, enabling capabilities such as prompt highlighting, where the model prioritises user-specified text. However, existing attention steering methods require explicit storage of the full attention matrix, making them incompatible with memory-efficient implementations like FlashAttention. We introduce Spectral Editing Key Amplification (SEKA), a training-free steering method that tackles this by directly editing key embeddings before the attention computation. SEKA uses spectral decomposition to steer key embeddings towards latent directions that amplify attention scores for selected tokens. We extend this to Adaptive SEKA (AdaSEKA), a query-adaptive variant that uses a training-free routing mechanism to dynamically combine multiple expert subspaces based on the prompt’s semantic intent. Our experiments show that both methods significantly outperform strong baselines on standard steering benchmarks while adding far lower latency and memory overhead and remaining compatible with optimised attention implementations.

1 Introduction

The ability to precisely guide the behaviour of large language models (LLMs) is paramount as they are increasingly deployed in high-stakes domains. This broad field of model steering encompasses various techniques, from activation steering, which aims to control high-level semantic attributes like style or factual recall by intervening in MLP layers (Subramani et al., 2022; Turner et al., 2023; Qiu et al., 2024; Turner et al., 2024; Wang et al., 2025; Stolfo et al., 2025), to attention steering, which operates at a more granular level to direct the model’s focus to specific tokens within a prompt. This paper focuses on the latter, where prompt highlighting is one of the key applications.

Current state-of-the-art methods, such as PASTA (Zhang et al., 2024), operate by editing the attention score matrix after it has been computed. This post-hoc manipulation creates a critical bottleneck: it requires computing the full attention matrix, making these methods incompatible with modern, IO-aware implementations like FlashAttention (Dao et al., 2022; Dao, 2024) that are essential for efficient processing. This architectural limitation, coupled with the need for costly, task-specific searches to identify which attention heads to steer, makes such methods less practical.

In this paper, we propose to intervene in the input of the attention mechanism rather than edit its output. We introduce Spectral Editing Key Amplification (SEKA), a novel, training-free framework that steers attention by directly modifying key vectors before the attention scores are calculated. Our core insight is that we can learn a universal “relevance subspace” for a given task by applying spectral decomposition to key embeddings derived from contrastive prompts. These learned directions are then used to construct a projection matrix that amplifies the relevant features of highlighted keys via a simple, geometrically interpretable transformation: ${\bm{k}}^{\prime} = {\bm{k}} + g{\bm{P}}{\bm{k}}$. Additionally, we propose Adaptive SEKA (AdaSEKA), an advanced variant that learns a bank of task-specific “expert” projections (e.g., for factual recall versus instruction following). At inference time, AdaSEKA uses a computationally cheap, training-free routing mechanism to create a dynamic, query-aware steering operator by blending these experts based on the prompt’s semantic intent.
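To make these two operations concrete, the sketch below gives one illustrative reading of them, not the released implementation: the relevance subspace is taken as the top right-singular directions of contrastive key differences, the projection is ${\bm{P}} = {\bm{U}}{\bm{U}}^{\top}$, highlighted keys are edited as ${\bm{k}}^{\prime} = {\bm{k}} + g{\bm{P}}{\bm{k}}$, and AdaSEKA-style routing is approximated by cosine-similarity weights over hypothetical per-expert centroids. The rank, gain, shapes, and routing details are assumptions for illustration.

```python
import torch

def relevance_projection(pos_keys: torch.Tensor,
                         neu_keys: torch.Tensor,
                         rank: int = 8) -> torch.Tensor:
    """Spectral construction of a rank-r projection onto a "relevance subspace".

    pos_keys / neu_keys: (N, d_k) key embeddings collected from contrastive
    (positive vs. neutral) prompts. The leading right-singular vectors of
    their difference give directions along which relevance shifts the keys.
    """
    diff = pos_keys - neu_keys                       # (N, d_k)
    _, _, vh = torch.linalg.svd(diff, full_matrices=False)
    u = vh[:rank].T                                  # (d_k, rank) leading directions
    return u @ u.T                                   # P = U U^T

def seka_edit(keys: torch.Tensor,
              highlight_mask: torch.Tensor,
              proj: torch.Tensor,
              gain: float = 1.0) -> torch.Tensor:
    """Apply k' = k + g P k to highlighted key vectors, before attention runs.

    keys: (T, d_k) per-token keys for one head; highlight_mask: (T,) bool over H.
    Because only the key tensor changes, the subsequent attention call can
    still use a fused kernel such as FlashAttention.
    """
    steered = keys + gain * keys @ proj              # P is symmetric
    return torch.where(highlight_mask.unsqueeze(-1), steered, keys)

def adaseka_operator(prompt_repr: torch.Tensor,
                     expert_projs: list[torch.Tensor],
                     expert_centroids: list[torch.Tensor]) -> torch.Tensor:
    """Training-free routing (hypothetical variant): blend expert projections by
    the similarity between the prompt representation and each expert's centroid."""
    sims = torch.stack([torch.cosine_similarity(prompt_repr, c, dim=0)
                        for c in expert_centroids])
    weights = torch.softmax(sims, dim=0)
    return sum(w * p for w, p in zip(weights, expert_projs))
```

In this sketch the blended operator returned by `adaseka_operator` can simply be passed to `seka_edit` in place of a single task-specific projection.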
Our method is fully compatible with FlashAttention as it operates directly on the key embeddings with negligible computational overhead. Our experiments confirm the effectiveness of this approach. Both SEKA and AdaSEKA achieve superior results on standard benchmarks for knowledge conflicts, occupation extraction, and instruction following. Furthermore, AdaSEKA’s query-adaptive routing mechanism demonstrates superior performance by dynamically tailoring the steering to the prompt’s semantic intent. Crucially, we show that these performance gains are achieved with negligible overhead. SEKA adds only $\approx$0.03s of latency per sample, in stark contrast to comparable methods like PASTA, which incur an additional 1.03s of inference time and nearly double the memory usage.

2 Problem Definition and Motivations

In this section, we formalise the problem of prompt highlighting as an instance of attention bias and present the motivation for our spectral attention steering approach, which aims to address the limitations of existing methods.

Problem Definition. Given a prompt ${\bm{x}} = (x_{1}, \ldots, x_{T})$ consisting of $T$ tokens, with a subset of token indices ${\mathcal{H}} \subset \{1, \ldots, T\}$ identifying the highlighted tokens (in practice, surrounded by markers such as **), our goal is to steer the attention of the model so that these tokens receive increased focus from queries. In standard multi-head attention, the unnormalised attention score between query $i$ and key $j$ is $\textrm{Attn}(i,j) = \frac{{\bm{q}}_{i}^{\top}{\bm{k}}_{j}}{\sqrt{d_{k}}}$, where ${\bm{q}}_{i}, {\bm{k}}_{j} \in \mathbb{R}^{d_{k}}$ are the query and key vectors, and $d_{k}$ is the head dimension.

Objective. We aim to amplify the attention assigned to highlighted tokens by introducing an additive, controllable term to the attention score for each $(i,j)$ where $j \in {\mathcal{H}}$: $A_{ij}^{\prime} = A_{ij} + \Delta_{ij}$, where $\Delta_{ij}$ is designed to selectively boost the attention towards user-specified highlighted tokens.

Motivation. Existing approaches typically modify attention after it has been computed. For example, PASTA (Zhang et al., 2024) rescales rows of the attention matrix as shown in Equation 1, where $C_{i}$ is a row normalisation factor and $\alpha > 1$ scales attention to highlighted tokens.

$$
[T({\bm{A}})]_{ij} =
\begin{cases}
\alpha \dfrac{A_{ij}}{C_{i}}, & \text{if } j \in {\mathcal{H}}, \\[4pt]
\dfrac{A_{ij}}{C_{i}}, & \text{otherwise,}
\end{cases}
\qquad (1)
$$

Similarly, positional calibration methods such as Found-in-the-Middle (Hsieh et al., 2024) subtract a baseline from the positional attention bias. Let $x_{k}$ denote the position of the $k$-th token, and $\textrm{Attn}_{\text{ori}}(x_{k})$ the original positional bias. The calibrated bias is $\textrm{Attn}_{\text{calibrated}}(x_{k}) = \textrm{Attn}_{\text{ori}}(x_{k}) - \textrm{Attn}_{\text{baseline}}(x_{k})$, where $\textrm{Attn}_{\text{baseline}}(x_{k})$ is estimated independently of content relevance.

Both strategies require explicit storage of the full attention matrix, which is incompatible with memory-efficient implementations such as FlashAttention (Dao et al., 2022; Dao, 2024). Moreover, methods like PASTA often rely on a costly head search to decide which attention heads to steer.
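As a concrete contrast between the two intervention points, the snippet below shows a simplified PASTA-style rescaling of a materialised attention matrix in the spirit of Equation (1) (not PASTA’s full head-selective procedure), and then verifies numerically that editing keys as ${\bm{k}}^{\prime} = {\bm{k}} + g{\bm{P}}{\bm{k}}$ induces exactly the additive form $A_{ij}^{\prime} = A_{ij} + \Delta_{ij}$ with $\Delta_{ij} = g\,{\bm{q}}_{i}^{\top}{\bm{P}}{\bm{k}}_{j}/\sqrt{d_{k}}$, by linearity of the dot product. The value of $\alpha$, the toy subspace, and all shapes are illustrative assumptions.

```python
import torch

def pasta_style_rescale(attn: torch.Tensor,
                        highlight_mask: torch.Tensor,
                        alpha: float = 2.0) -> torch.Tensor:
    """Post-hoc steering in the spirit of Equation (1): boost highlighted columns
    of the post-softmax attention matrix and renormalise each row by C_i.

    attn: (T, T) attention matrix A. It must be fully materialised, which is
    exactly what IO-aware kernels such as FlashAttention avoid.
    """
    boosted = torch.where(highlight_mask.unsqueeze(0), alpha * attn, attn)
    c = boosted.sum(dim=-1, keepdim=True)           # row normalisation factor C_i
    return boosted / c

def scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Unnormalised attention scores q_i^T k_j / sqrt(d_k)."""
    return q @ k.T / k.shape[-1] ** 0.5

# Key-side editing realises the additive objective without storing A:
T, d_k, gain = 6, 16, 0.5
q, k = torch.randn(T, d_k), torch.randn(T, d_k)
u = torch.linalg.qr(torch.randn(d_k, 4)).Q          # toy rank-4 subspace basis
P = u @ u.T                                         # symmetric projection
k_edited = k + gain * k @ P                         # k' = k + g P k
delta = gain * q @ P @ k.T / d_k ** 0.5             # induced Δ_ij
assert torch.allclose(scores(q, k_edited), scores(q, k) + delta, atol=1e-4)
```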
These limitations motivate an alternative steering mechanism that operates before attention scores are computed, avoiding any need to materialise or modify the attention matrix. Since attention depends on query–key inner products, equivalent control can be achieved by editing either representation (shown in Section 3.2). Given our objective of amplifying attention to a specific subset of tokens ${\mathcal{H}}$, key-side intervention is the natural choice: the key vector ${\bm{k}}_{j}$ is indexed by token position $j$ and therefore governs how strongly each individual token is attended to.

To provide empirical evidence on whether such a pre-attention intervention is feasible, we analyse how key representations change under shifts in contextual relevance. We first construct synthetic contrastive prompt triplets under three conditions: (1) neutral (context only), (2) positive (context aligned with a relevant query), and (3) negative (context paired with an irrelevant query). The construction of such synthetic triplets is described in Appendix A. Using the Qwen3-1.
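A minimal sketch of this kind of triplet-based probing is given below, assuming a Llama/Qwen-style Hugging Face checkpoint. The model identifier, the example triplet strings, the `self_attn.k_proj` hook location, and the trailing-token comparison are all illustrative assumptions rather than the paper’s Appendix A protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint and triplet; substitute the actual model and prompts.
MODEL_ID = "Qwen/Qwen2-1.5B"   # placeholder identifier
TRIPLET = {
    "neutral":  "The Eiffel Tower is located in Paris.",
    "positive": "Question: Where is the Eiffel Tower? Context: The Eiffel Tower is located in Paris.",
    "negative": "Question: Who wrote Hamlet? Context: The Eiffel Tower is located in Paris.",
}

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def capture_keys(text: str, layer: int = 0) -> torch.Tensor:
    """Capture the key-projection outputs of one layer for every token in `text`."""
    captured = {}

    def hook(module, inputs, output):
        captured["k"] = output.detach()

    # Llama/Qwen-style decoder blocks expose `self_attn.k_proj`; adjust the path
    # if the architecture names its key projection differently.
    handle = model.model.layers[layer].self_attn.k_proj.register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(text, return_tensors="pt"))
    handle.remove()
    return captured["k"].squeeze(0)                  # (T, num_kv_heads * head_dim)

keys = {cond: capture_keys(text) for cond, text in TRIPLET.items()}
# The shared context sits at the end of every condition, so comparing the
# trailing tokens gives a rough picture of how relevance shifts the keys.
n = 5
shift_pos = (keys["positive"][-n:] - keys["neutral"][-n:]).norm(dim=-1)
shift_neg = (keys["negative"][-n:] - keys["neutral"][-n:]).norm(dim=-1)
print(shift_pos, shift_neg)
```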