Feb 19, 2026 · arXiv:2602.17387

DRetHTR: Linear-Time Decoder-Only Retentive Network for Handwritten Text Recognition

Changhun Kim, Martin Mayr, Thomas Gorges, Fei Wu, Mathias Seuret, Andreas Maier, Vincent Christlein

AI Summary

The paper introduces DRetHTR, a decoder-only Retentive Network (RetNet) for handwritten text recognition (HTR), designed to avoid the growing key-value (KV) cache that makes Transformer decoding slow and memory-intensive. DRetHTR replaces softmax attention with softmax-free retention and injects multi-scale sequential priors, so decoding is linear in output length in both time and memory. Layer-wise gamma scaling recovers the local-to-global inductive bias of attention, yielding best reported character error rates on several HTR datasets with faster decoding and lower memory use than an equally sized decoder-only Transformer.

Key Contribution

Achieves Transformer-level handwritten text recognition accuracy with 1.6-1.9x faster inference and 38-42% less memory than an equally sized decoder-only Transformer, by replacing softmax attention with linear-time retention (RetNet).

Abstract

State-of-the-art handwritten text recognition (HTR) systems commonly use Transformers, whose growing key-value (KV) cache makes decoding slow and memory-intensive. We introduce DRetHTR, a decoder-only model built on Retentive Networks (RetNet). Compared to an equally sized decoder-only Transformer baseline, DRetHTR delivers 1.6-1.9x faster inference with 38-42% less memory usage, without loss of accuracy. By replacing softmax attention with softmax-free retention and injecting multi-scale sequential priors, DRetHTR avoids a growing KV cache: decoding is linear in output length in both time and memory. To recover the local-to-global inductive bias of attention, we propose layer-wise gamma scaling, which progressively enlarges the effective retention horizon in deeper layers. This encourages early layers to model short-range dependencies and later layers to capture broader context, mitigating the flexibility gap introduced by removing softmax. Consequently, DRetHTR achieves best reported test character error rates of 2.26% (IAM-A, en), 1.81% (RIMES, fr), and 3.46% (Bentham, en), and is competitive on READ-2016 (de) with 4.21%. This demonstrates that decoder-only RetNet enables Transformer-level HTR accuracy with substantially improved decoding speed and memory efficiency.
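The abstract's two central ideas, retention without a growing KV cache and layer-wise gamma scaling, can be illustrated with a minimal sketch. It assumes the standard single-head RetNet recurrence (a decayed running state S_n = gamma * S_(n-1) + k_n^T v_n, read out as o_n = q_n S_n) and omits position rotation, normalization, and gating. The function names and the linear gamma schedule with endpoints 0.9 and 0.99 are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
import random


def retention_recurrent(q, k, v, gamma):
    """Recurrent form: one fixed-size state, updated once per token."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # d_k x d_v, never grows
    outputs = []
    for q_n, k_n, v_n in zip(q, k, v):
        # S_n = gamma * S_{n-1} + k_n^T v_n
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k_n[i] * v_n[j]
        # o_n = q_n S_n
        outputs.append(
            [sum(q_n[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def retention_parallel(q, k, v, gamma):
    """Reference form: o_n = sum over m <= n of gamma^(n-m) (q_n . k_m) v_m."""
    d_v = len(v[0])
    outputs = []
    for n in range(len(q)):
        out = [0.0] * d_v
        for m in range(n + 1):
            weight = gamma ** (n - m) * sum(a * b for a, b in zip(q[n], k[m]))
            for j in range(d_v):
                out[j] += weight * v[m][j]
        outputs.append(out)
    return outputs


def layerwise_gammas(num_layers, first=0.9, last=0.99):
    """Hypothetical schedule: gamma rises linearly from first to last layer."""
    if num_layers == 1:
        return [first]
    step = (last - first) / (num_layers - 1)
    return [first + i * step for i in range(num_layers)]


# Small random example: 6 tokens, key dimension 3, value dimension 2.
rng = random.Random(0)
q = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(6)]
k = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(6)]
v = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(6)]

rec = retention_recurrent(q, k, v, gamma=0.9)
par = retention_parallel(q, k, v, gamma=0.9)
gammas = layerwise_gammas(4)
```

Both functions compute the same outputs, but the recurrent form keeps only a d_k x d_v state regardless of how many tokens have been decoded, which is what removes the growing cache. Because the decay weights gamma^(n-m) sum to at most 1/(1 - gamma), a layer with gamma = 0.9 attends over roughly 10 recent tokens while gamma = 0.99 reaches roughly 100, so a schedule that increases gamma with depth gives early layers a short horizon and later layers a longer one.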

Architecture Design (Transformers, SSMs, MoE) · Inference & Quantization · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
