This paper introduces two training-only techniques, Regime Position Alignment (RPA) and a gain-aware controller called Guardian, to improve the efficiency of reasoning in small and medium Transformers without increasing test-time computational cost. RPA uses a length-aware attention prior to guide attention during training, while Guardian dynamically adjusts attention sharpness based on validation improvements. Experiments on WikiText-2 demonstrate that these techniques reduce validation cross-entropy while matching baseline inference latency and memory usage, since the prior enters inference only as a single precomputed additive bias.
Achieve more efficient reasoning in Transformers without increasing test-time cost by using training-only techniques that guide attention and dynamically adjust sharpness.
We study efficient reasoning under tight compute: how to make structured, correct decisions without increasing test-time cost. We add two training-only components to small and medium Transformers that also transfer to broader differentiable optimizers. First, a length-aware attention prior built via fuzzy regime position alignment (RPA) yields a normalized pre-softmax bias that guides attention like a structured regularizer while adding no new inference parameters. Second, a minimal gain-aware controller, Guardian, nudges attention sharpness only when validation improvements warrant it, following a two-timescale policy-gradient view of nonconvex optimization; it is disabled at inference. A KL perspective interprets softmax(z + log π) as MAP inference under KL regularization, grounding the prior in a principled objective. Under strict compute parity on WikiText-2, we reduce validation cross-entropy while matching baseline latency and memory. At inference, we add a precomputed, cached prior B(T) as a single additive bias per head; the controller does not run. In practice, this incurs negligible overhead, a cached bias add per head, with no measurable p50 latency shift. Our results suggest that length-aware priors and late-phase gain control preserve scarce improvements, especially in long-span, noisy-logit regimes, while keeping test-time costs effectively unchanged.
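The KL perspective mentioned above can be made concrete: maximizing expected logit mass under a KL penalty toward the prior π recovers exactly the biased softmax. A short derivation, using generic symbols consistent with the abstract (z the logits, π the prior, Δ the probability simplex):

```latex
p^\star
  = \arg\max_{p \in \Delta} \; \mathbb{E}_{p}[z] - \mathrm{KL}(p \,\|\, \pi)
\quad\Longrightarrow\quad
  z_i - \log\frac{p_i}{\pi_i} - 1 - \lambda = 0
\quad\Longrightarrow\quad
  p_i^\star \propto \pi_i \, e^{z_i},
```

so the MAP solution is softmax(z + log π): the prior appears precisely as an additive pre-softmax bias, matching the form of the cached bias B(T).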
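The abstract describes the prior entering inference as a single precomputed additive bias per head before the softmax. A minimal sketch of that mechanism, assuming a generic scaled dot-product attention with shapes (heads, T, d) and a cached bias tensor B of shape (heads, T, T); the function and variable names are illustrative, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_prior(Q, K, V, B):
    # Q, K, V: (heads, T, d); B: (heads, T, T) precomputed, cached bias.
    # The prior costs one additive bias per head before the softmax.
    d = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)  # (heads, T, T)
    weights = softmax(scores + B, axis=-1)          # prior enters pre-softmax
    return weights @ V                              # (heads, T, d)

# Toy usage: a zero prior reduces to standard attention.
rng = np.random.default_rng(0)
H, T, d = 2, 4, 8
Q, K, V = (rng.standard_normal((H, T, d)) for _ in range(3))
B = np.zeros((H, T, T))
out = attention_with_prior(Q, K, V, B)
print(out.shape)  # (2, 4, 8)
```

Because B depends only on the sequence length T and the head index, it can be computed once and cached, which is why the test-time overhead is a single bias add rather than any new parameters or control logic.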