Feb 23, 2026 | arXiv:2602.19562

A Multimodal Framework for Aligning Human Linguistic Descriptions with Visual Perceptual Data

AI Summary

This paper introduces a computational framework for aligning natural language descriptions with visual perceptual data, aiming to model human referential interpretation. The framework combines SIFT alignment with the Universal Quality Index (UQI) to approximate human perceptual categorization, and uses linguistic preprocessing to handle pragmatic variability. Evaluated on the Stanford Repeated Reference Game corpus, the model achieves robust referential grounding, requiring fewer utterances than humans to reach stable mappings and outperforming humans in identifying target objects from single referring expressions.

Key Contribution

A surprisingly simple computational model rivals human performance in referential grounding, suggesting that complex cross-modal alignment may not require equally complex mechanisms.

Abstract

Establishing stable mappings between natural language expressions and visual percepts is a foundational problem for both cognitive science and artificial intelligence. Humans routinely ground linguistic reference in noisy, ambiguous perceptual contexts, yet the mechanisms supporting such cross-modal alignment remain poorly understood. In this work, we introduce a computational framework designed to model core aspects of human referential interpretation by integrating linguistic utterances with perceptual representations derived from large-scale, crowd-sourced imagery. The system approximates human perceptual categorization by combining scale-invariant feature transform (SIFT) alignment with the Universal Quality Index (UQI) to quantify similarity in a cognitively plausible feature space, while a set of linguistic preprocessing and query-transformation operations captures pragmatic variability in referring expressions. We evaluate the model on the Stanford Repeated Reference Game corpus (15,000 utterances paired with tangram stimuli), a paradigm explicitly developed to probe human-level perceptual ambiguity and coordination. Our framework achieves robust referential grounding. It requires 65% fewer utterances than human interlocutors to reach stable mappings and can correctly identify target objects from single referring expressions 41.66% of the time (versus 20% for humans). These results suggest that relatively simple perceptual-linguistic alignment mechanisms can yield human-competitive behavior on a classic cognitive benchmark, and offer insights into models of grounded communication, perceptual inference, and cross-modal concept formation. Code is available at https://anonymous.4open.science/r/metasequoia-9D13/README.md.
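
The abstract names the Universal Quality Index as the similarity measure but does not spell it out. As a rough illustration only, and not the authors' implementation, the sketch below computes the standard UQI, Q = 4·σxy·x̄·ȳ / ((σx² + σy²)(x̄² + ȳ²)), over a whole image at once. It assumes the two images are already aligned and have the same shape (the SIFT alignment step is skipped entirely), and it uses one global window instead of the sliding-window average from the original UQI definition. The names `universal_quality_index` and `best_match` are hypothetical helpers introduced for this example.

```python
import numpy as np


def universal_quality_index(x, y):
    """Global UQI between two equally shaped grayscale images.

    Assumption: a single window covering the whole image is used,
    rather than averaging over sliding windows.
    """
    x = np.asarray(x, dtype=np.float64).ravel()
    y = np.asarray(y, dtype=np.float64).ravel()
    if x.shape != y.shape:
        raise ValueError("images must have the same number of pixels")

    mean_x, mean_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mean_x) * (y - mean_y)).mean()

    denom = (var_x + var_y) * (mean_x ** 2 + mean_y ** 2)
    if denom == 0.0:
        # The index is undefined when both images are constant
        # or both have zero mean.
        return float("nan")
    return float(4.0 * cov_xy * mean_x * mean_y / denom)


def best_match(reference, candidates):
    """Return the key of the candidate image most similar to `reference`.

    `candidates` is a dict mapping a label to an image array
    (hypothetical structure, chosen for illustration).
    """
    return max(
        candidates,
        key=lambda name: universal_quality_index(reference, candidates[name]),
    )


if __name__ == "__main__":
    ref = np.arange(1, 17, dtype=np.float64).reshape(4, 4)
    pool = {"same": ref.copy(), "brighter": 2.0 * ref}
    for name, img in pool.items():
        print(name, universal_quality_index(ref, img))
    print("best:", best_match(ref, pool))
```

In a pipeline like the one described, a score of this kind would presumably be computed between each candidate tangram and imagery retrieved for a referring expression, with the highest-scoring candidate taken as the referent; how the paper combines it with SIFT matches is not specified in the abstract.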

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
