Tsinghua AI

May 28, 2026

arXiv:2605.02772

Linearizing Vision Transformer with Test-Time Training

Yining Li, Dongchen Han, Zeyu Liu, Hanyi Wang, Yulin Wang, Gao Huang

AI Summary

This paper addresses the problem of transferring weights from pretrained Softmax attention models to linear-complexity attention. For architectural alignment, it uses Test-Time Training (TTT), whose two-layer dynamic formulation is structurally aligned with Softmax attention and can therefore inherit pretrained attention weights directly. For representational alignment, it adds key instance normalization and a lightweight locality enhancement module. The authors linearize Stable Diffusion 3.5 to obtain SD3.5-T^5, which after 1 hour of fine-tuning on 4×H20 GPUs reaches text-to-image quality comparable to the fine-tuned Softmax model while running inference 1.32× and 1.47× faster at 1K and 2K resolutions.
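To make "linear-complexity attention" concrete, the following is a minimal NumPy sketch of generic kernel-based linear attention. It is not the paper's TTT layer. The feature map (ELU + 1) and all function names are assumptions chosen for illustration. The sketch shows that reordering the matrix products avoids the N×N score matrix that Softmax attention builds, and that the result differs from Softmax attention, which is the representational gap the summary refers to.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Standard Softmax attention; builds an explicit (N, N) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feature_map(x):
    """Hypothetical positive feature map (ELU + 1), used only for illustration."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_quadratic(q, k, v):
    """Kernel attention computed the slow way, via an explicit (N, N) matrix."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    scores = phi_q @ phi_k.T                      # shape (N, N)
    return (scores @ v) / scores.sum(axis=-1, keepdims=True)

def linear_attention(q, k, v):
    """Same quantity, reordered so that no (N, N) matrix is formed."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v                              # shape (d, d_v), independent of N
    z = phi_k.sum(axis=0)                         # shape (d,), normalizer
    return (phi_q @ kv) / (phi_q @ z)[:, None]

rng = np.random.default_rng(0)
n_tokens, dim = 64, 8
q, k, v = (rng.standard_normal((n_tokens, dim)) for _ in range(3))

fast = linear_attention(q, k, v)
slow = linear_attention_quadratic(q, k, v)
soft = softmax_attention(q, k, v)
```

Here `fast` and `slow` agree up to floating-point error, while both differ from `soft`. This is why simply swapping the attention operator in a pretrained model does not preserve its behavior.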

Key Contribution

A linearized Stable Diffusion 3.5 model that inherits pretrained weights, achieves text-to-image quality comparable to the fine-tuned Softmax model, and accelerates inference by up to 1.47×.

Abstract

While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We identify Test-Time Training (TTT) as a linear-complexity architecture whose two-layer dynamic formulation is structurally aligned with Softmax attention, enabling direct inheritance of pretrained attention weights. To further align representational properties, including key shift-invariance and locality, we introduce key instance normalization and a lightweight locality enhancement module. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T^5 (Transformer To Test Time Training). With only 1 hour of fine-tuning on 4×H20 GPUs, SD3.5-T^5 achieves comparable text-to-image quality to the fine-tuned Softmax model, while accelerating inference by 1.32× and 1.47× at 1K and 2K resolutions. Code is available at https://github.com/LeapLabTHU/Transformer-to-TTT.
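The key shift-invariance property mentioned in the abstract can be checked numerically. Adding the same vector to every key shifts all Softmax logits in a row by the same constant, which Softmax cancels. A generic kernel-based linear attention has no such cancellation. The sketch below is illustrative only: `instance_norm_keys` assumes that key instance normalization standardizes each key channel over the token axis, which may differ from the paper's exact formulation, and the ELU + 1 feature map and the generic linear attention stand in for the actual TTT layer.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Standard Softmax attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v

def linear_attention(q, k, v):
    """Generic kernel attention with an assumed ELU + 1 feature map (not TTT)."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    phi_q, phi_k = phi(q), phi(k)
    return (phi_q @ (phi_k.T @ v)) / (phi_q @ phi_k.sum(axis=0))[:, None]

def instance_norm_keys(k, eps=1e-6):
    """Assumed form: standardize each key channel over the token axis."""
    mean = k.mean(axis=0, keepdims=True)
    std = k.std(axis=0, keepdims=True)
    return (k - mean) / (std + eps)

def max_gap(a, b):
    """Largest absolute elementwise difference between two outputs."""
    return float(np.abs(a - b).max())

rng = np.random.default_rng(0)
n_tokens, dim = 64, 8
q, k, v = (rng.standard_normal((n_tokens, dim)) for _ in range(3))

# Add one shared offset vector to every key.
shift = 3.0 * rng.standard_normal((1, dim))
k_shifted = k + shift

softmax_gap = max_gap(softmax_attention(q, k, v),
                      softmax_attention(q, k_shifted, v))
linear_gap = max_gap(linear_attention(q, k, v),
                     linear_attention(q, k_shifted, v))
normed_gap = max_gap(linear_attention(q, instance_norm_keys(k), v),
                     linear_attention(q, instance_norm_keys(k_shifted), v))
```

Under these assumptions, `softmax_gap` and `normed_gap` are at floating-point noise level, while `linear_gap` is clearly nonzero. Removing the per-channel mean of the keys is one simple way to give a linear attention layer the shift-invariance that pretrained Softmax weights implicitly rely on.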

Architecture Design (Transformers, SSMs, MoE)

Multimodal Models

Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
