BAIR | Sydney | Together | UT Austin | Jun 15, 2026 | arXiv:2606.16429

Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation

Zhongzhu Zhou, Qingyang Wu, Junxiong Wang, Mayank Mishra, Shuaiwen Leon Song, S. Song, Ben Athiwaratkun, Chenfeng Xu, Chen Xu

AI Summary

This paper introduces Taylor-Calibrate, an initialization method for hybrid Gated DeltaNet (GDN) students obtained by converting pretrained Transformer models into efficient long-context inference systems. The method uses Taylor-guided teacher attention statistics to set the value projection, memory timescale, write gates, and output gate, then applies a short per-layer alignment step, addressing the poor starting point that converted models often begin from. Across four teacher settings and three retained-layer policies, Taylor-Calibrate gives substantially stronger zero-shot students and reaches matched recovery targets with 4.9x to 9.2x fewer training tokens than naive conversion.

Key Contribution

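Taylor-Calibrate initializes hybrid GDN students from Taylor-guided teacher attention statistics, giving up to an 88x zero-shot improvement in a representative ablation and reaching matched recovery targets with 4.9x to 9.2x fewer training tokens than naive conversion.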

Abstract

Hybrid linear attention models offer an appealing path to faster long-context inference: they reduce the quadratic cost and KV-cache burden of full softmax attention while retaining much of the quality of Transformer models. A practical way to obtain such models is to convert a pretrained Transformer instead of pretraining a new architecture from scratch, but this conversion is still brittle. Simply copying the teacher attention projections into a Gated DeltaNet (GDN) student does not specify the new recurrent decay, write, and output-gating dynamics. As a result, the converted model often starts in a poor dynamical regime and must spend many distillation tokens repairing initialization rather than learning the remaining teacher behavior. We propose Taylor-Calibrate, a lightweight initialization method for hybrid GDN students. The method uses Taylor-guided teacher attention statistics to set the value projection, memory timescale, write gates, and output gate, then applies a short per-layer alignment step to match each converted layer to the teacher output. Across four teacher settings and three retained-layer policies, Taylor-Calibrate gives substantially stronger zero-shot students, with up to an 88x improvement in a representative ablation, and reaches matched recovery targets with 4.9x to 9.2x fewer training tokens than naive conversion.
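
To make the "decay, write, and output-gating dynamics" concrete, the following is a minimal sketch of one step of a gated delta-rule recurrence as it is commonly written for Gated DeltaNet-style layers. It is not taken from this paper: the function name `gated_delta_step`, the scalar `alpha` (decay) and `beta` (write strength), and the plain elementwise `gate` are illustrative assumptions, and real layers typically add normalization and learned, input-dependent gates. The point is only that copying the teacher's query, key, and value projections leaves `alpha`, `beta`, and `gate` undetermined, which is the gap an initialization method has to fill.

```python
def gated_delta_step(S, q, k, v, alpha, beta, gate):
    """One recurrent step of a gated delta rule (illustrative sketch).

    S     : d_v x d_k memory, as a list of lists
    q, k  : query and key vectors of length d_k
    v     : value vector of length d_v
    alpha : decay in [0, 1]; sets the memory timescale (assumed scalar)
    beta  : write strength in [0, 1] (assumed scalar)
    gate  : elementwise output gate of length d_v (assumed, simplified)
    """
    d_v, d_k = len(S), len(k)
    # Decay the existing memory.
    S = [[alpha * S[i][j] for j in range(d_k)] for i in range(d_v)]
    # Read what the decayed memory currently predicts for key k.
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Delta-rule write: move the stored value for k toward v by a fraction beta.
    S = [[S[i][j] + beta * (v[i] - pred[i]) * k[j] for j in range(d_k)]
         for i in range(d_v)]
    # Read with the query, then apply the output gate.
    out = [gate[i] * sum(S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S, out


# Usage: write one key/value pair, then a second one after decay.
S0 = [[0.0, 0.0], [0.0, 0.0]]
S1, out1 = gated_delta_step(S0, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 4.0],
                            alpha=0.9, beta=1.0, gate=[1.0, 1.0])
S2, out2 = gated_delta_step(S1, q=[1.0, 0.0], k=[0.0, 1.0], v=[1.0, 1.0],
                            alpha=0.5, beta=1.0, gate=[1.0, 1.0])
```

In this toy run, the second step uses `alpha=0.5`, so the value stored under the first key is read back at half strength. A badly chosen decay or write strength therefore changes what the layer retains even when the projections match the teacher exactly.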

Architecture Design (Transformers, SSMs, MoE) | Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 53

Year: 2026

Venue: N/A
