The paper introduces PVT-GDLA, a decoder-centric Transformer architecture for medical image segmentation that runs in linear time with respect to the number of tokens. Its core is Gated Differential Linear Attention (GDLA), which computes two kernelized attention paths on complementary query/key subspaces and subtracts one from the other using a learnable, channel-wise scale. A lightweight, head-specific gate then injects nonlinearity and input-adaptive sparsity. Experiments across CT, MRI, ultrasound, and dermoscopy benchmarks show that PVT-GDLA achieves state-of-the-art accuracy with comparable parameter counts but lower FLOPs than existing CNN-, Transformer-, hybrid-, and linear-attention-based methods.
Achieve state-of-the-art medical image segmentation accuracy with a linear-time Transformer decoder that counters the attention dilution of standard linear attention by subtracting complementary attention paths to cancel common-mode noise and amplify relevant context.
Medical image segmentation requires models that preserve fine anatomical boundaries while remaining efficient for clinical deployment. While transformers capture long-range dependencies, they suffer from quadratic attention cost and large data requirements, whereas CNNs are compute-friendly yet struggle with global reasoning. Linear attention offers $\mathcal{O}(N)$ scaling, but often exhibits training instability and attention dilution, yielding diffuse maps. We introduce PVT-GDLA, a decoder-centric Transformer that restores sharp, long-range dependencies at linear time. Its core, Gated Differential Linear Attention (GDLA), computes two kernelized attention paths on complementary query/key subspaces and subtracts them with a learnable, channel-wise scale to cancel common-mode noise and amplify relevant context. A lightweight, head-specific gate injects nonlinearity and input-adaptive sparsity, mitigating attention sink, and a parallel local token-mixing branch with depthwise convolution strengthens neighboring-token interactions, improving boundary fidelity, all while retaining $\mathcal{O}(N)$ complexity and low parameter overhead. Coupled with a pretrained Pyramid Vision Transformer (PVT) encoder, PVT-GDLA achieves state-of-the-art accuracy across CT, MRI, ultrasound, and dermoscopy benchmarks under equal training budgets, with comparable parameters but lower FLOPs than CNN-, Transformer-, hybrid-, and linear-attention baselines. PVT-GDLA provides a practical path to fast, scalable, high-fidelity medical segmentation in clinical environments and other resource-constrained settings.
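To make the GDLA description above more concrete, here is a minimal single-head NumPy sketch of the idea: two kernelized linear-attention paths, a channel-wise scaled subtraction, and a multiplicative gate. It is an illustration under stated assumptions, not the authors' implementation. The `elu(x) + 1` feature map, the choice to form the "complementary subspaces" by splitting the projected query/key channels in half, the sigmoid gate, and all function and parameter names (`feature_map`, `linear_attention`, `w_g`, `lam`, and so on) are hypothetical. The depthwise-convolution branch, multi-head handling, and the PVT encoder are omitted.

```python
import numpy as np


def feature_map(x):
    # Positive kernel feature map elu(x) + 1 (an assumption; the abstract
    # does not say which kernel the paper uses).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(q, k, v, eps=1e-6):
    # q, k: (N, d); v: (N, d_v). Cost is O(N * d * d_v), linear in N,
    # because phi(K)^T V is computed once and reused for every query.
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v            # (d, d_v) summary of keys and values
    z = phi_k.sum(axis=0)       # (d,) normalizer statistics
    num = phi_q @ kv            # (N, d_v)
    den = phi_q @ z + eps       # (N,)
    return num / den[:, None]


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_differential_linear_attention(x, w_q, w_k, w_v, w_g, lam):
    # x: (N, d_model) tokens; lam: (d_v,) learnable channel-wise scale.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    half = q.shape[1] // 2
    # Assumption: the two subspaces are the two halves of the q/k channels.
    path_a = linear_attention(q[:, :half], k[:, :half], v)
    path_b = linear_attention(q[:, half:], k[:, half:], v)
    # Differential step: subtract the second path, scaled per channel.
    diff = path_a - lam * path_b
    # Input-dependent gate (assumption: elementwise sigmoid of a projection).
    gate = sigmoid(x @ w_g)     # (N, d_v)
    return gate * diff


rng = np.random.default_rng(0)
n_tokens, d_model, d_qk, d_v = 16, 8, 6, 4
x = rng.normal(size=(n_tokens, d_model))
w_q = rng.normal(size=(d_model, d_qk))
w_k = rng.normal(size=(d_model, d_qk))
w_v = rng.normal(size=(d_model, d_v))
w_g = rng.normal(size=(d_model, d_v))
lam = np.full(d_v, 0.5)

out = gated_differential_linear_attention(x, w_q, w_k, w_v, w_g, lam)
print(out.shape)
```

The key design point is in `linear_attention`: the $N \times N$ attention matrix is never formed, so both paths, and therefore their difference, stay at $\mathcal{O}(N)$ cost in the number of tokens.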