Feb 23, 2026 | arXiv:2602.19982

A Computationally Efficient Multidimensional Vision Transformer

AI Summary

This paper introduces a novel Vision Transformer architecture (TCP-ViT) based on the Tensor Cosine Product (Cproduct) to reduce computational and memory costs. The method leverages multilinear structures in image data and the orthogonality of cosine transforms for efficient attention and structured feature representations. Experiments on classification and segmentation benchmarks show a 1/C parameter reduction (where C is the number of channels) with competitive accuracy.

Key Contribution

A tensor-based Vision Transformer built on the Tensor Cosine Product reduces the parameter count by a factor of C (the number of channels) while maintaining competitive accuracy.

Abstract

Vision Transformers have achieved state-of-the-art performance in a wide range of computer vision tasks, but their practical deployment is limited by high computational and memory costs. In this paper, we introduce a novel tensor-based framework for Vision Transformers built upon the Tensor Cosine Product (Cproduct). By exploiting multilinear structures inherent in image data and the orthogonality of cosine transforms, the proposed approach enables efficient attention mechanisms and structured feature representations. We develop the theoretical foundations of the tensor cosine product, analyze its algebraic properties, and integrate it into a new Cproduct-based Vision Transformer architecture (TCP-ViT). Numerical experiments on standard classification and segmentation benchmarks demonstrate that the proposed method achieves a uniform 1/C parameter reduction (where C is the number of channels) while maintaining competitive accuracy.
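The abstract does not spell out the definition of the Cproduct, so the following is only a minimal sketch under a stated assumption: that the product takes the usual transform-domain form, namely an orthonormal cosine transform (DCT-II) along the channel mode, independent matrix products on each transformed slice, and the inverse transform. The function names `dct_matrix` and `c_product` are hypothetical, and the paper's actual definition may differ.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix of size n x n (each row is one basis vector)."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    j = np.arange(n)[None, :]  # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * j + 1) * k / (2 * n))
    m[0, :] /= np.sqrt(2.0)    # rescale the constant row so that m @ m.T == I
    return m


def c_product(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Assumed transform-domain product of a (p x q x C) and b (q x r x C).

    ASSUMPTION: this is an illustrative form of a cosine-transform tensor
    product, not necessarily the exact definition used in the paper.
    """
    if a.shape[1] != b.shape[0] or a.shape[2] != b.shape[2]:
        raise ValueError("incompatible tensor shapes")
    m = dct_matrix(a.shape[2])
    # Step 1: apply the cosine transform along the third (channel) mode.
    a_hat = np.einsum("kc,pqc->pqk", m, a)
    b_hat = np.einsum("kc,qrc->qrk", m, b)
    # Step 2: multiply matching frontal slices independently.
    out_hat = np.einsum("pqk,qrk->prk", a_hat, b_hat)
    # Step 3: invert the transform; m is orthogonal, so its inverse is m.T.
    return np.einsum("kc,prk->prc", m, out_hat)


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4, 5))
w = rng.standard_normal((4, 2, 5))
y = c_product(x, w)  # result has shape (3, 2, 5)
```

Under this assumed form, a weight tensor of size d x d x C stores C·d² entries, whereas a dense matrix acting on the flattened d·C features stores (d·C)² = C²·d² entries. That is one way a 1/C parameter ratio could arise; the paper's own accounting may be different.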

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
