Mar 9, 2026 · arXiv:2603.08343

Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers

Shubham Aggarwal, L. Kumar, Lokendra Kumar

AI Summary

This paper replaces the dense output projection in multi-head attention with a fixed Walsh-Hadamard Transform followed by a lightweight learnable affine rescaling. The substitution removes approximately 25% of the attention parameters in each block while preserving global cross-head interaction. On standard benchmarks, downstream task performance is comparable to or slightly better than the dense baseline. At the model level, the paper reports up to 7% aggregate parameter reduction, 8.9% peak memory savings, and 6.6% throughput improvement, with the efficiency gains increasing with model size.
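The 25% figure follows from simple parameter counting. The sketch below is illustrative only: it assumes a standard attention block with four bias-free d_model x d_model projections (query, key, value, output) and assumes the affine rescaling is one scale and one shift per channel. Neither detail is confirmed by this summary, and the function name `attention_param_counts` is hypothetical.

```python
def attention_param_counts(d_model: int) -> dict:
    """Compare attention parameter counts with and without a dense output projection.

    Assumptions (not taken from the paper): four bias-free d_model x d_model
    projections in the dense block, and a per-channel scale plus shift
    (2 * d_model parameters) after the parameter-free Hadamard transform.
    """
    dense = 4 * d_model * d_model                      # W_Q, W_K, W_V, W_O
    structured = 3 * d_model * d_model + 2 * d_model   # W_Q, W_K, W_V + scale and shift
    return {
        "dense": dense,
        "structured": structured,
        "fraction_removed": (dense - structured) / dense,
    }


counts = attention_param_counts(1024)
print(counts)
```

Under these assumptions the removed fraction is 1/4 minus a small 1/(2 * d_model) correction for the affine parameters, which is why the saving is stated as approximately 25% per attention block. The smaller aggregate figure reflects the parameters outside attention that are left unchanged.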

Key Contribution

Ditch roughly 25% of your Transformer's attention parameters while keeping downstream performance comparable by swapping the dense output projection for a structured Hadamard transform, and watch your throughput climb.

Abstract

The dense output projection in multi-head attention scales quadratically with model dimension, contributing significantly to parameter count, memory footprint, and inference cost. We propose replacing this projection with a fixed, parameter-free Walsh-Hadamard Transform followed by a lightweight learnable affine rescaling, eliminating approximately 25 percent of attention parameters per block while preserving global cross-head interaction through an orthogonal, norm-preserving transformation. Across different model sizes, we demonstrate that this structured substitution maintains comparable or slightly superior downstream task performance on standard benchmarks, while achieving up to 7 percent aggregate parameter reduction, 8.9 percent peak memory savings, and 6.6 percent throughput improvement at scale, with efficiency gains growing monotonically with model size, batch size, and sequence length. Interestingly, we observe that structured Hadamard-based models exhibit a steeper validation loss curve relative to training FLOPs than their dense counterparts, suggesting more favorable compute utilization during training.
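A minimal NumPy sketch of the idea described in the abstract may help make it concrete. It is not the authors' implementation. It assumes the model dimension is a power of two, that the transform is the orthonormal Sylvester-ordered Walsh-Hadamard transform applied to the concatenated head outputs, and that the affine rescaling is a per-channel scale and shift. The names `fwht` and `hadamard_output_projection` are hypothetical.

```python
import numpy as np


def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform along the last axis.

    Runs in O(d log d) per vector and has no learnable parameters.
    The last-axis length d must be a power of two.
    """
    d = x.shape[-1]
    if d == 0 or d & (d - 1):
        raise ValueError("last dimension must be a power of two")
    lead = x.shape[:-1]
    y = x.astype(np.float64, copy=True)
    h = 1
    while h < d:
        # Split into blocks of size 2h and apply the butterfly (a + b, a - b).
        y = y.reshape(*lead, d // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = np.stack((a + b, a - b), axis=-2).reshape(*lead, d)
        h *= 2
    return y / np.sqrt(d)  # 1/sqrt(d) scaling makes the transform norm-preserving


def hadamard_output_projection(heads_concat, scale, shift):
    """Stand-in for the dense W_O: fixed transform, then per-channel affine (assumed form)."""
    return fwht(heads_concat) * scale + shift


# Usage: (batch, sequence, d_model) activations from concatenated attention heads.
rng = np.random.default_rng(0)
heads = rng.standard_normal((2, 5, 64))
out = hadamard_output_projection(heads, scale=np.ones(64), shift=np.zeros(64))
print(out.shape)
print(np.allclose(np.linalg.norm(out, axis=-1), np.linalg.norm(heads, axis=-1)))
```

Because every output channel of the transform depends on every input channel, information from all heads is still mixed, which is the cross-head interaction the abstract refers to. The only learnable quantities in this sketch are `scale` and `shift`.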

Architecture Design (Transformers, SSMs, MoE) · Inference & Quantization · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 14

Year: 2026

Venue: N/A
