Information Systems Technology and Design, SUTD
Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR), Singapore
May 25, 2026
arXiv:2605.26092

OrpQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

Maoyang Xiang, Bo Wang, Tao Luo

AI Summary

The paper introduces Orthogonal Residual Projection (ORP), a quantization method for Transformers that targets the "Low Angular Resolution Regime" of sub-4-bit Power-of-Two (PoT) quantization. ORP formulates quantization as a dual-basis geometric projection and synthesizes a higher-resolution residual lattice using only shift-and-add operations. Its analytical solver replaces gradient-based optimization and reduces full-model calibration time for LLaMA-2-7B to approximately 15 minutes. Under a 3-bit (W3/A16) constraint, ORP achieves a perplexity of 6.10 on LLaMA-2-7B, comparing favorably to MAC-intensive baselines such as AWQ without relying on asymmetric scaling. Standard-cell RTL synthesis at a 28nm node indicates that ORP mitigates the timing bottlenecks associated with dense multiplier trees.

Key Contribution

ORP quantizes LLaMA-2-7B weights to 3 bits (W3/A16) so that weight multiplications reduce to bit shifts and additions, reaching a perplexity of 6.10 that compares favorably to MAC-intensive methods such as AWQ.

Abstract

The deployment of Large Language Models (LLMs) and Vision Transformers (ViTs) on edge devices is significantly constrained by memory limitations and the critical timing bottlenecks introduced by dense Multiply-Accumulate (MAC) arrays. In the ultra-low bit regime, logarithmic Power-of-Two (PoT) quantization provides a hardware-efficient alternative by replacing MAC operations with bit-shifts. However, the non-uniform exponential lattice is inherently limited by a Low Angular Resolution Regime, a structural flaw that becomes particularly pronounced at sub-4-bit thresholds, leading to a notable degradation of high-dimensional feature manifolds. To address this geometric limitation, we propose Orthogonal Residual Projection (ORP), an algorithm-hardware co-design framework. By formulating quantization as a dual-basis geometric projection, ORP adaptively synthesizes a higher-resolution residual lattice using strictly shift-and-add operations. Furthermore, ORP's analytical solver offers a practical alternative to computationally intensive gradient-based optimization, reducing the full-model calibration time for LLaMA-2-7B to approximately 15 minutes. Extensive evaluations demonstrate ORP's applicability across modalities and its hardware efficiency. Under the 3-bit (W3/A16) constraint, ORP achieves a perplexity of 6.10 on LLaMA-2-7B, comparing favorably to conventional MAC-intensive baselines like AWQ without relying on asymmetric scaling, while maintaining competitive accuracy in 4-bit scenarios. At the silicon level, standard-cell RTL synthesis at a 28nm node indicates that ORP effectively mitigates the timing bottlenecks associated with dense multiplier trees.
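To make the shift-based arithmetic described in the abstract concrete, the following is a minimal sketch of generic PoT quantization with an optional second, residual power-of-two term. It is not the paper's ORP solver, whose dual-basis projection is not specified in this text. The helper names (`pot_round`, `two_term`, `shift_add_dot`), the exponent ranges, and the toy format (one sign bit, exponents from -3 to 0, no zero level) are all assumptions chosen for illustration.

```python
import math
import random

def pot_round(x, min_exp, max_exp):
    """Nearest signed power of two via log-domain rounding.

    Returns (sign, exponent). This toy format has no zero level,
    so x == 0 maps to the smallest magnitude.
    """
    sign = -1 if x < 0 else 1
    if x == 0.0:
        return sign, min_exp
    exp = round(math.log2(abs(x)))
    return sign, max(min_exp, min(max_exp, exp))

def pot_value(sign, exp):
    return sign * 2.0 ** exp

def two_term(x, min_exp, max_exp, res_min_exp):
    """Approximate x by one PoT term plus an optional residual PoT term.

    The residual term is kept only if it reduces the error, so the
    per-element error never exceeds that of the single-term code.
    """
    s1, e1 = pot_round(x, min_exp, max_exp)
    residual = x - pot_value(s1, e1)
    s2, e2 = pot_round(residual, res_min_exp, max_exp)
    terms = [(s1, e1)]
    if abs(residual - pot_value(s2, e2)) < abs(residual):
        terms.append((s2, e2))
    return terms

def dequant(codes):
    return [sum(pot_value(s, e) for s, e in terms) for terms in codes]

def shift_add_dot(acts, codes, frac_bits):
    """Integer dot product using only shifts and adds.

    Returns the dot product scaled by 2**frac_bits; frac_bits must be
    at least -(smallest exponent) so every shift amount is non-negative.
    """
    acc = 0
    for a, terms in zip(acts, codes):
        for sign, exp in terms:
            shifted = a << (exp + frac_bits)
            acc += shifted if sign > 0 else -shifted
    return acc

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def sq_err(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

random.seed(0)
weights = [random.gauss(0.0, 0.3) for _ in range(256)]  # synthetic weights
acts = [random.randint(-128, 127) for _ in range(256)]  # synthetic int8 activations

one = [[pot_round(w, -3, 0)] for w in weights]   # single PoT term per weight
two = [two_term(w, -3, 0, -6) for w in weights]  # plus optional residual term

print("cosine, one term :", cosine(weights, dequant(one)))
print("cosine, two terms:", cosine(weights, dequant(two)))
print("shift-add dot (x64):", shift_add_dot(acts, two, 6))
```

The cosine similarity between the original and dequantized weight vectors is one simple proxy for the angular resolution the abstract refers to. Note that the residual term in this sketch costs extra storage beyond a 3-bit code; how ORP fits its residual lattice into the bit budget is not described here.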

Architecture Design (Transformers, SSMs, MoE); Inference & Quantization; Training Efficiency & Optimization

Year: 2026

Venue: N/A
