This paper investigates the "typography gap" in vision-language models (VLMs): models excel at reading text in images but struggle to recognize how that text looks. The authors evaluate recognition of font family, size, style, and color in 15 VLMs across 26 fonts, four scripts, and three difficulty levels, and identify a perception hierarchy in which color recognition is near-perfect while font style detection is universally poor. They show that LoRA fine-tuning on a small set of synthetic samples substantially improves an open-source model on font family and size, but font style remains resistant to fine-tuning, suggesting limitations in current patch-based encoders.
VLMs can ace the spelling test but flunk the typography quiz, revealing a surprising blindness to font styles that persists even with larger models.
Vision-Language Models achieve near-perfect accuracy at reading text in images, yet prove largely typography-blind: capable of recognizing what text says, but not how it looks. We systematically investigate this gap by evaluating font family, size, style, and color recognition across 26 fonts, four scripts, and three difficulty levels. Our evaluation of 15 state-of-the-art VLMs reveals a striking perception hierarchy: color recognition is near-perfect, yet font style detection remains universally poor. We further find that model scale fails to predict performance and that accuracy is uniform across difficulty levels, together pointing to a training-data omission rather than a capacity ceiling. LoRA fine-tuning on a small set of synthetic samples substantially improves an open-source model, narrowing the gap to the best closed-source system and surpassing it on font size recognition. Font style alone remains resistant to fine-tuning, suggesting that relational visual reasoning may require architectural innovation beyond current patch-based encoders. We release our evaluation framework, data, and fine-tuning recipe to support progress in closing the typographic gap in vision-language understanding.
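The perception hierarchy described above amounts to ranking typographic attributes by per-attribute accuracy. The following is a minimal sketch of that bookkeeping, not the authors' released framework: the function name `per_attribute_accuracy`, the record layout, and the case-insensitive exact-match scoring rule are all assumptions, and the sample records are made up for illustration rather than taken from the paper.

```python
from collections import defaultdict


def per_attribute_accuracy(records):
    """Return {attribute: accuracy} for (attribute, predicted, expected) records.

    Hypothetical helper: scoring is case-insensitive exact match, which is an
    assumption and not necessarily the paper's scoring rule.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for attribute, predicted, expected in records:
        total[attribute] += 1
        if predicted.strip().lower() == expected.strip().lower():
            correct[attribute] += 1
    return {attribute: correct[attribute] / total[attribute] for attribute in total}


# Made-up records for illustration only; these are not results from the paper.
records = [
    ("color", "red", "Red"),
    ("color", "blue", "blue"),
    ("style", "bold", "italic"),
    ("style", "regular", "regular"),
    ("size", "12", "12"),
]

scores = per_attribute_accuracy(records)

# Attributes ordered from best to worst recognized.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sorting the attributes by score gives the ordering from best to worst recognized; with real model outputs, the same tally could be broken down further by font, script, or difficulty level.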