This paper introduces a vision-language integration framework that combines pre-trained visual encoders (CLIP, ViT) with large language models (GPT-based) to enable zero-shot scene understanding. The framework aligns the visual and textual modalities by embedding visual inputs and textual prompts in a shared space, followed by multimodal fusion and reasoning. Experiments on Visual Genome, COCO, ADE20K, and custom datasets demonstrate significant improvements in object recognition, activity detection, and scene captioning, with up to an 18% gain in top-1 accuracy.
Zero-shot scene understanding gets a boost: aligning pre-trained vision and language models yields up to 18% accuracy gains in real-world object recognition and captioning tasks.
Zero-shot scene understanding in real-world settings is challenging because of the complexity and variability of natural scenes: models must recognize new objects, actions, and contexts without prior labeled examples. This work proposes a vision-language integration framework that unifies pre-trained visual encoders (e.g., CLIP, ViT) and large language models (e.g., GPT-based architectures) to achieve semantic alignment between the visual and textual modalities, using natural language as a bridge for generalizing to unseen categories and contexts. The unified model embeds visual inputs and textual prompts in a shared space, followed by multimodal fusion and reasoning layers for contextual interpretation. Experiments on Visual Genome, COCO, ADE20K, and custom real-world datasets demonstrate significant gains over state-of-the-art zero-shot models in object recognition, activity detection, and scene captioning. The proposed system achieves up to an 18% improvement in top-1 accuracy along with notable gains on semantic-coherence metrics, highlighting the effectiveness of cross-modal alignment and language grounding for generalization in real-world scene understanding.
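The shared-space alignment step described above can be sketched in a few lines. The snippet below is a minimal, illustrative stand-in: it projects (dummy) image and prompt features into a joint embedding space and scores them by cosine similarity, in the style of CLIP zero-shot classification. The feature dimensions, projection matrices, and temperature are assumptions for illustration, not the paper's actual architecture, and random vectors stand in for real encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed feature sizes (illustrative, not from the paper):
# visual encoder output, text encoder output, shared embedding space.
D_VIS, D_TXT, D_SHARED = 768, 512, 256

# Random linear projection heads standing in for learned alignment layers.
W_vis = rng.normal(size=(D_VIS, D_SHARED)) / np.sqrt(D_VIS)
W_txt = rng.normal(size=(D_TXT, D_SHARED)) / np.sqrt(D_TXT)

def zero_shot_scores(vis_feat, txt_feats, temperature=0.07):
    """Cosine similarity between one image embedding and N prompt embeddings,
    computed in the shared space and scaled by a temperature."""
    v = vis_feat @ W_vis
    v = v / np.linalg.norm(v)
    t = txt_feats @ W_txt
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    return (t @ v) / temperature  # shape (N,): one logit per prompt

# Dummy features standing in for ViT/CLIP image features and text-encoder
# embeddings of prompts like "a photo of a {dog, car, tree}".
image_feature = rng.normal(size=D_VIS)
prompt_features = rng.normal(size=(3, D_TXT))

logits = zero_shot_scores(image_feature, prompt_features)
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()  # softmax over candidate labels
```

In a CLIP-style setup the highest-probability prompt is taken as the zero-shot prediction; the paper's fusion and reasoning layers would sit on top of these aligned embeddings.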