This paper introduces a multimodal in-context learning framework for few-shot writer adaptation in Handwritten Text Recognition (HTR), enabling personalization to unseen handwriting styles without parameter updates. The authors leverage a compact 8M-parameter CNN-Transformer and study how context length affects adaptation when only a few examples from the target writer are available. Experiments on the IAM and RIMES datasets demonstrate that this context-driven approach achieves state-of-the-art results, surpassing writer-independent HTR models with Character Error Rates of 3.92% and 2.34%, respectively.
Forget fine-tuning: this HTR model adapts to new handwriting styles in just a few shots, *without* any parameter updates.
While state-of-the-art Handwritten Text Recognition (HTR) models perform well on standard benchmarks, they frequently struggle with writers exhibiting highly specific styles that are underrepresented in the training data. To handle unseen and atypical writers, writer adaptation techniques personalize HTR models to individual handwriting styles. Leading writer adaptation methods require either offline fine-tuning or parameter updates at inference time, both involving gradient computation and backpropagation, which increase computational costs and demand careful hyperparameter tuning. In this work, we propose a novel context-driven HTR framework inspired by multimodal in-context learning, enabling inference-time writer adaptation using only a few examples from the target writer without any parameter updates. We further demonstrate the impact of context length, design a compact 8M-parameter CNN-Transformer that enables few-shot in-context adaptation, and show that combining context-driven and standard OCR training strategies leads to complementary improvements. Experiments on IAM and RIMES validate our approach with Character Error Rates of 3.92% and 2.34%, respectively, surpassing all writer-independent HTR models without requiring any parameter updates at inference time.
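To make the in-context adaptation idea concrete, here is a minimal sketch of how a multimodal prompt might be assembled at inference time: a few (line image, transcription) pairs from the target writer are interleaved ahead of the query image, and the decoder transcribes the query conditioned on that writer-specific context, with no gradient steps. All names here (`LineExample`, `build_context_prompt`, the `<sep>` token, and the list-of-floats stand-in for CNN features) are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LineExample:
    """Hypothetical container for one context example from the target writer."""
    features: List[float]   # stand-in for the CNN encoder's features of a line image
    transcription: str      # ground-truth text, given only for context examples

def build_context_prompt(context: List[LineExample],
                         query_features: List[float],
                         sep: str = "<sep>") -> List[Tuple[str, object]]:
    """Assemble an interleaved multimodal prompt:
        [img_1, txt_1, <sep>, ..., img_k, txt_k, <sep>, query_img]
    The Transformer decoder then generates the transcription of the query
    line conditioned on the k writer-specific examples -- adaptation happens
    purely through the context, with no parameter updates.
    """
    prompt: List[Tuple[str, object]] = []
    for ex in context:
        prompt.append(("image", ex.features))
        prompt.append(("text", ex.transcription))
        prompt.append(("sep", sep))
    prompt.append(("image", query_features))  # query line to transcribe
    return prompt

# Few-shot setting: k = 2 examples from the target writer
ctx = [LineExample([0.1, 0.2], "the quick"),
       LineExample([0.3, 0.4], "brown fox")]
prompt = build_context_prompt(ctx, query_features=[0.5, 0.6])
print(len(prompt))          # 2 examples * 3 slots + 1 query image = 7
print(prompt[-1][0])        # the final slot is the query image
```

Increasing k lengthens the prompt, which is why the paper's exploration of context length matters: more examples give the model more evidence about the writer's style, at the cost of a longer sequence for the Transformer to attend over.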