This paper replaces the dense output projection in multi-head attention with a fixed Walsh-Hadamard Transform followed by a lightweight learnable affine rescaling. The substitution removes approximately 25% of the attention parameters in each block while preserving global cross-head interaction. Experiments on standard benchmarks show comparable or slightly better downstream task performance, with up to 7% fewer parameters in the model as a whole, 8.9% lower peak memory, and 6.6% higher throughput. The efficiency gains grow with model size.
Ditch 25% of your Transformer's attention parameters without sacrificing performance by swapping the dense output projection for a structured Hadamard transform, and watch your throughput climb.
The dense output projection in multi-head attention scales quadratically with model dimension, contributing significantly to parameter count, memory footprint, and inference cost. We propose replacing this projection with a fixed, parameter-free Walsh-Hadamard Transform followed by a lightweight learnable affine rescaling. This eliminates approximately 25 percent of attention parameters per block while preserving global cross-head interaction through an orthogonal, norm-preserving transformation. Across different model sizes, we demonstrate that this structured substitution maintains comparable or slightly superior downstream task performance on standard benchmarks, while achieving up to 7 percent aggregate parameter reduction, 8.9 percent peak memory savings, and 6.6 percent throughput improvement at scale, with efficiency gains growing monotonically with model size, batch size, and sequence length. Interestingly, we observe that in the structured Hadamard-based models validation loss falls more steeply as a function of training FLOPs than in their dense counterparts, suggesting more favorable compute utilization during training.
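The substitution described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`fwht`, `hadamard_output_projection`, `attention_param_reduction`) are hypothetical, and two details are assumptions the abstract does not state: the affine rescaling is taken to be a per-channel scale and shift, and the dense baseline is taken to have four bias-free d x d projections (query, key, value, output). The model dimension must be a power of two for this form of the transform.

```python
import numpy as np


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis.

    Runs in O(d log d) and has no learnable parameters.
    """
    x = np.asarray(x, dtype=np.float64)
    d = x.shape[-1]
    if d < 1 or d & (d - 1):
        raise ValueError("last dimension must be a power of two")
    lead = x.shape[:-1]
    y = x.copy()
    h = 1
    while h < d:
        # Split into blocks of size 2h and apply the butterfly (a+b, a-b).
        y = y.reshape(*lead, d // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = np.stack((a + b, a - b), axis=-2).reshape(*lead, d)
        h *= 2
    # 1/sqrt(d) scaling makes the transform orthogonal (norm-preserving).
    return y / np.sqrt(d)


def hadamard_output_projection(heads_concat, scale, shift):
    """Stand-in for the dense output projection.

    heads_concat: (..., d) concatenated per-head attention outputs.
    scale, shift: (d,) learnable vectors (ASSUMED form of the affine rescaling).
    """
    return fwht(heads_concat) * scale + shift


def attention_param_reduction(d):
    """Fraction of attention parameters removed per block.

    ASSUMES a bias-free dense baseline with 4 * d^2 parameters, and a
    Hadamard variant with 3 * d^2 + 2 * d parameters.
    """
    dense = 4 * d * d
    structured = 3 * d * d + 2 * d
    return (dense - structured) / dense


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(2, 5, 64))  # (batch, sequence, model dimension)
    out = hadamard_output_projection(x, np.ones(64), np.zeros(64))
    print(out.shape)
    print(attention_param_reduction(1024))
```

Under these assumptions the removed fraction is (d^2 - 2d) / (4d^2), which approaches 25 percent as d grows, in line with the figure quoted in the abstract. Because every output coordinate of the transform depends on every input coordinate, each output channel still mixes information from all heads.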