This paper investigates how visual text style (functional vs. decorative) affects the attribute-based descriptions generated by Large Visual Language Models (LVLMs). The authors find that even when an LVLM correctly identifies the concept represented by the text, the visual style of the text significantly influences the attributes included in the model's description of that concept. This reveals non-trivial style leakage from visual text style into semantic inference within LVLMs.
LVLMs leak visual text style into semantic inference, meaning the font of a word can change the attributes a model associates with the concept it represents.
When the visual style of text is considered, a wide variety can be observed in font, color, and size. However, when a word is read, its meaning is independent of the style in which it has been written or rendered. In this paper, we investigate whether, and how, the style in which a word is visualized in an image impacts the description that a Large Visual Language Model (LVLM) provides for the concept to which that word refers. Specifically, we investigate how functional text styles (readability-oriented, e.g., black sans-serif) versus decorative styles (display-oriented, e.g., colored cursive/script) affect LVLMs' descriptions of a concept in terms of the attributes of that concept. Our experiments study the situation in which the LVLM is able to correctly identify the concept referred to by a visual text, i.e., by a word or words rendered as an image, and in which the visual text style should not influence the attribute-based description that the LVLM produces. Our experimental results reveal that even when the concept is correctly identified, text style influences the model's attribute-based descriptions of the concept. Our findings demonstrate non-trivial style leakage from text style into semantic inference and motivate style-aware evaluation and mitigation for LVLM-based multimedia systems.
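As a rough illustration of how such style leakage could be quantified, the sketch below compares the attribute sets a model returns for the same concept under a functional and a decorative rendering. The abstract does not state the paper's actual metric; the Jaccard distance, the function names, and the example attribute lists here are all assumptions made for illustration only.

```python
def normalize(attributes):
    """Lowercase and strip attribute strings so trivial variants match."""
    return {a.strip().lower() for a in attributes if a.strip()}


def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def style_leakage(functional_attrs, decorative_attrs):
    """Hypothetical leakage score: distance between the attribute sets
    produced for the same concept under two text styles.
    0.0 means identical attributes, 1.0 means no overlap."""
    return jaccard_distance(normalize(functional_attrs),
                            normalize(decorative_attrs))


# Made-up attribute lists for one concept, not results from the paper.
functional = ["red", "round", "sweet"]
decorative = ["Red", "elegant", "sweet", "fancy"]
score = style_leakage(functional, decorative)
```

If text style had no effect on semantic inference, a score of this kind would stay near the level seen when the same style is queried repeatedly; a consistently higher score across styles would be one sign of leakage.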