This paper introduces FG$^2$-GDN, an improved variant of Gated DeltaNet (GDN) that enhances long-context associative recall by replacing the scalar learning rate in the delta update rule with a channel-wise vector, enabling dimension-specific adaptation. FG$^2$-GDN+ further decouples the scaling for keys and values, allowing independent control of erasure and write strength. Experiments on synthetic and real-world tasks demonstrate that FG$^2$-GDN and FG$^2$-GDN+ outperform GDN and KDA in associative recall and long-context understanding while maintaining comparable computational efficiency.
Channel-wise adaptive learning rates in Gated Delta Networks unlock superior long-context recall, rivaling softmax attention without the quadratic cost.
Linear attention mechanisms have emerged as promising alternatives to softmax attention, offering linear-time complexity during inference. Recent advances such as Gated DeltaNet (GDN) and Kimi Delta Attention (KDA) have demonstrated that the delta rule, an online gradient descent update, enables superior associative recall compared to simple additive updates. While KDA refined the coarse head-wise decay gate into channel-wise decay, the learning rate $\beta_t$ in the delta update remains a scalar, limiting the model's capacity for dimension-specific adaptation. We introduce FG$^2$-GDN, which replaces the scalar $\beta_t$ with a channel-wise vector analogous to the transition from SGD to per-coordinate adaptive optimizers such as AdaGrad and Adam. We further propose FG$^2$-GDN+, which decouples the scaling for keys and values, enabling independent control of erasure strength and write strength. Experiments on synthetic and real-world benchmarks show that FG$^2$-GDN and its variant improve associative recall and long-context understanding over GDN and KDA, with comparable computational efficiency.
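To make the abstract's distinction concrete, here is a minimal NumPy sketch of the three update rules it contrasts: the baseline gated delta rule with a scalar learning rate, the channel-wise variant, and the decoupled key/value variant. The exact shapes, gate placement, and the choice to index the write-side rate over value channels are assumptions for illustration, not the paper's reference implementation.

```python
import numpy as np

def gdn_step(S, k, v, alpha, beta):
    """Baseline gated delta rule (GDN-style), scalar learning rate.
    S: (d_v, d_k) associative state; k: (d_k,) key; v: (d_v,) value;
    alpha: decay gate; beta: scalar delta-rule learning rate."""
    S = alpha * S                   # decay old associations
    pred = S @ k                    # value currently bound to k
    # delta-rule correction: move the stored value toward v
    return S + beta * np.outer(v - pred, k)

def fg2_gdn_step(S, k, v, alpha, beta_vec):
    """Channel-wise sketch (FG^2-GDN): beta_vec holds one learning
    rate per key channel, analogous to per-coordinate adaptive
    optimizers such as AdaGrad/Adam."""
    S = alpha * S
    pred = S @ k
    return S + np.outer(v - pred, beta_vec * k)

def fg2_gdn_plus_step(S, k, v, alpha, beta_k, beta_v):
    """Decoupled sketch (FG^2-GDN+): beta_k (d_k,) scales the
    erasure term on the key side, beta_v (d_v,) scales the write
    term on the value side; their separation is the point, the
    exact factorization here is an assumption."""
    S = alpha * S
    pred = S @ k
    return S - np.outer(pred, beta_k * k) + np.outer(beta_v * v, k)
```

When `beta_vec` (or both `beta_k` and `beta_v`) collapses to a constant, each variant reduces to the baseline scalar update, mirroring the SGD-to-Adam analogy in the abstract.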