Institut für Theoretische Physik · National High Magnetic Field Laboratory · VTT
Mar 30, 2026 · arXiv:2603.28534

Compressing Transformer Language Models via Matrix Product Operator Decomposition: A Case Study on PicoGPT

Younes Javanmard, Tanmoy Pandit, Masoud Mardani

AI Summary

This paper explores Matrix Product Operator (MPO) decomposition as a compression technique for Transformer language models, applied specifically to PicoGPT. The authors replace every linear layer with an MPOLinear module, which factorizes the weight matrix into a chain of low-rank cores trained with standard PyTorch autograd. MPO compression reaches up to 13x per transformer block at the smallest bond dimension (χ = 4), and at χ = 16 the model keeps 97.7% of baseline token accuracy with about 5.3x fewer parameters. The authors present MPO as a practical, theoretically grounded alternative to low-rank methods and unstructured pruning.
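To make the MPOLinear idea concrete, here is a minimal NumPy sketch of how a layer whose weight is stored as MPO cores can map a batch of inputs to outputs by contracting one core at a time, without ever forming the dense matrix. The function name `mpo_apply`, the core layout (left bond, input factor, output factor, right bond) and the convention y = x @ W are assumptions for illustration, not details taken from the paper's implementation.

```python
import numpy as np


def mpo_apply(x, cores):
    """Compute y = x @ W for a weight W stored as MPO cores.

    Hypothetical sketch, not the paper's MPOLinear implementation.
    Each core has shape (r_left, i_k, j_k, r_right), with r_left = 1 for
    the first core and r_right = 1 for the last one.
    x has shape (batch, prod(i_k)); the result has shape (batch, prod(j_k)).
    """
    batch = x.shape[0]
    remaining = x.shape[1]  # product of input factors not yet contracted
    # State layout: (batch, output factors done, open bond, inputs left)
    t = x.reshape(batch, 1, 1, remaining)
    for core in cores:
        r_left, i_k, j_k, r_right = core.shape
        out_done = t.shape[1]
        remaining //= i_k
        # Split off the leading input factor i_k for this site
        t = t.reshape(batch, out_done, r_left, i_k, remaining)
        # Sum over the left bond p and input factor i; emit j and right bond q
        t = np.einsum("bapim,pijq->bajqm", t, core)
        # Merge the new output factor into the finished output index
        t = t.reshape(batch, out_done * j_k, r_right, remaining)
    return t.reshape(batch, -1)
```

Because the forward pass is only reshapes and einsum calls, the same function written with `torch.einsum` on `nn.Parameter` cores would let autograd derive the backward pass, which is consistent with the paper's use of standard PyTorch autograd. The NumPy version is only meant to show the index bookkeeping.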

Key Contribution

MPO decomposition compresses each PicoGPT transformer block by up to 13x (at χ = 4), and at χ = 16 the full model retains 97.7% of baseline token accuracy with roughly 5.3x fewer parameters.
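Compression ratios like these come from replacing an I × J weight with a chain of small cores. The sketch below counts MPO parameters under stated assumptions: the row and column dimensions are split into per-site factors i_k and j_k, core k has shape (χ_{k-1}, i_k, j_k, χ_k) with boundary bonds χ_0 = χ_N = 1, and each internal bond is capped by χ and by the product of local dimensions on the smaller side of the cut. The helper names and the 128 × 512 example shape are hypothetical; they are not PicoGPT's actual layers or the paper's balanced factorisation schemes, so the printed ratios will differ from the reported ones.

```python
from math import prod


def mpo_bond_dims(in_factors, out_factors, chi):
    """Bond dimensions [chi_0, ..., chi_N] for an N-site MPO.

    Each internal bond is capped by chi and by the product of local
    dimensions on either side of the cut (beyond which more bond
    dimension cannot add anything).
    """
    local = [i * j for i, j in zip(in_factors, out_factors)]
    bonds = [1]
    for k in range(1, len(local)):
        bonds.append(min(chi, prod(local[:k]), prod(local[k:])))
    bonds.append(1)
    return bonds


def mpo_param_count(in_factors, out_factors, chi):
    """Total entries across all cores of shape (chi_{k-1}, i_k, j_k, chi_k)."""
    bonds = mpo_bond_dims(in_factors, out_factors, chi)
    return sum(
        bonds[k] * i * j * bonds[k + 1]
        for k, (i, j) in enumerate(zip(in_factors, out_factors))
    )


# Hypothetical 128 x 512 weight split as (4, 4, 8) x (8, 8, 8)
in_f, out_f = (4, 4, 8), (8, 8, 8)
dense = prod(in_f) * prod(out_f)
for chi in (4, 8, 16, 32):
    n = mpo_param_count(in_f, out_f, chi)
    print(f"chi={chi:2d}  params={n:6d}  ratio={dense / n:5.1f}x")
```

For this hypothetical shape the count falls from 65,536 dense entries to 9,728 at χ = 16. The same arithmetic applied to the reported full-model totals at χ = 16 (1,020,224 dense versus 191,872 MPO parameters) gives an overall ratio of about 5.3x, smaller than the 13x per-block figure at χ = 4.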

Abstract

Transformer-based language models achieve strong performance across NLP tasks, but their parameter count scales quadratically with the hidden dimension, which makes deployment on resource-constrained hardware expensive. We study Matrix Product Operator (MPO) decomposition as a principled compression method for transformers. MPO factorises weight matrices into chains of low-rank cores, with approximation quality controlled by the bond dimension χ. We replace every nn.Linear layer in PicoGPT, a GPT-2-style character-level language model with about 1M parameters, with an MPOLinear module parameterised as an MPO chain. Cores are initialised either by TT-SVD from pretrained dense weights or randomly, and are trained with standard PyTorch autograd without a custom backward pass. We derive balanced factorisation schemes for the five distinct weight shapes in PicoGPT and evaluate bond dimensions χ ∈ {4, 8, 16, 32} on Tiny Shakespeare. MPO achieves up to 13x compression per transformer block at χ = 4. At χ = 16, the model uses 191,872 parameters instead of 1,020,224 while retaining 97.7% of baseline token accuracy (51.6% vs 52.8%). Reconstruction error falls as χ increases, as expected, and is lower for three-site than for two-site factorisations at the same bond dimension. The χ = 8 model gives the best accuracy per parameter, exceeding the dense baseline by 2.7x on this metric. These results show that MPO parameterisation is a practical and theoretically grounded alternative to low-rank methods and unstructured pruning for transformer compression.
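The TT-SVD initialisation can be sketched as a sweep of truncated SVDs: the dense weight is reshaped so that each site's input and output factors are adjacent, and each internal bond comes from an SVD that keeps at most χ singular values. This NumPy sketch assumes y = x @ W, cores of shape (χ_{k-1}, i_k, j_k, χ_k), and a random Gaussian matrix as a stand-in for a pretrained weight; the function names are hypothetical and the code is not the authors' implementation.

```python
import numpy as np


def tt_svd_mpo(W, in_factors, out_factors, chi):
    """Hypothetical TT-SVD: split W (prod(in) x prod(out)) into MPO cores.

    Returns cores of shape (r_left, i_k, j_k, r_right), keeping at most
    chi singular values at each internal bond.
    """
    n = len(in_factors)
    assert W.shape == (int(np.prod(in_factors)), int(np.prod(out_factors)))
    # Interleave axes: (i1..iN, j1..jN) -> (i1, j1, i2, j2, ..., iN, jN)
    perm = [ax for k in range(n) for ax in (k, n + k)]
    T = W.reshape(*in_factors, *out_factors).transpose(perm)
    cores, rank = [], 1
    rest = T.reshape(in_factors[0] * out_factors[0], -1)
    for k in range(n - 1):
        U, S, Vh = np.linalg.svd(rest, full_matrices=False)
        r = min(chi, S.size)  # truncate this bond to at most chi
        cores.append(U[:, :r].reshape(rank, in_factors[k], out_factors[k], r))
        # Carry the singular-value-weighted remainder to the next site
        rest = (S[:r, None] * Vh[:r]).reshape(
            r * in_factors[k + 1] * out_factors[k + 1], -1
        )
        rank = r
    cores.append(rest.reshape(rank, in_factors[-1], out_factors[-1], 1))
    return cores


def mpo_to_dense(cores):
    """Contract MPO cores back into a dense prod(i_k) x prod(j_k) matrix."""
    n = len(cores)
    in_f = [c.shape[1] for c in cores]
    out_f = [c.shape[2] for c in cores]
    T = cores[0]
    for c in cores[1:]:
        T = np.tensordot(T, c, axes=([T.ndim - 1], [0]))  # join shared bond
    # Drop the size-1 boundary bonds, then undo the interleaving
    T = T.reshape([d for k in range(n) for d in (in_f[k], out_f[k])])
    perm = [2 * k for k in range(n)] + [2 * k + 1 for k in range(n)]
    return T.transpose(perm).reshape(int(np.prod(in_f)), int(np.prod(out_f)))


def relative_error(W, cores):
    """Frobenius-norm reconstruction error ||W - W_mpo|| / ||W||."""
    return np.linalg.norm(W - mpo_to_dense(cores)) / np.linalg.norm(W)


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))  # stand-in for a pretrained weight
for chi in (2, 4, 8, 16):
    cores = tt_svd_mpo(W, (4, 4, 4), (4, 4, 4), chi)
    print(chi, sum(c.size for c in cores), f"{relative_error(W, cores):.3f}")
```

Passing two or three factors per side yields the two-site and three-site factorisations that the abstract compares. When χ reaches the rank bound at every cut, the reconstruction is exact up to floating-point error. A Gaussian stand-in has slowly decaying singular values, so the errors it prints should not be read as predictions for PicoGPT's trained weights.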

Topics: Architecture Design (Transformers, SSMs, MoE) · Inference & Quantization · Open-Source Models & Weights

Citation Metrics

Citations: 0
Influential citations: 0
References: 26
Year: 2026
Venue: N/A
