BUPT | Jun 10, 2026 | arXiv:2606.11837

LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation

Liwen Yi, Xianlin Zhang, Yue Zhang, Yue Ming, Xueming Li

AI Summary

This paper introduces LASA, a weakly supervised method for open-vocabulary scene sketch semantic segmentation that aggregates attention maps from different Vision Transformer layers to improve semantic understanding of sparse line drawings. Because shallow and deep layers capture complementary spatial cues, the aggregated attention gives a more robust structural prior than any single layer. On FS-COCO, SFSD, and FrISS, LASA improves mean Intersection over Union (mIoU) by +3.43, +8.01, and +15.74 over prior weakly supervised baselines, with gains in both segmentation accuracy and spatial coherence.
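For readers unfamiliar with the metric, the following is a minimal sketch of how mIoU is commonly computed from a predicted and a ground-truth label map. It is illustrative only: the function name `mean_iou` and the toy arrays are assumptions, and the paper's exact evaluation protocol (for example, whether only stroke pixels are scored) is not stated in this summary.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps, so it is skipped
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

# Toy 2x3 label maps (hypothetical data, three classes).
pred = np.array([[0, 0, 1], [1, 1, 2]])
target = np.array([[0, 1, 1], [1, 1, 2]])
score = mean_iou(pred, target, num_classes=3)
print(score)  # per-class IoUs are 0.5, 0.75, 1.0, so the mean is 0.75
```

Benchmarks usually report this value multiplied by 100.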

Key Contribution

Cross-layer attention aggregation in LASA improves mIoU over prior weakly supervised baselines by +3.43 on FS-COCO, +8.01 on SFSD, and +15.74 on FrISS, indicating that accumulated multi-layer attention is a more robust structural prior than any single layer.

Abstract

Open-vocabulary scene sketch semantic segmentation aims to assign dense semantic labels to sparse line drawings based on flexible category vocabularies specified at inference time, without relying on pixel-level annotations during training. Unlike natural images, sketches lack texture and color cues, making semantic understanding heavily dependent on stroke layout and spatial configuration, a challenge that renders single-layer vision-language features inherently unstable. Our key observation is that attention maps from different Vision Transformer layers encode complementary spatial cues: shallow layers capture global structural layouts, while deeper layers focus on local stroke intersections and object parts. This suggests that cross-layer aggregation provides a more robust structural prior than any individual layer alone. Leveraging this insight, we propose a structure-aware framework built upon Layer-wise Accumulated Structural Attention (LASA), which aggregates multi-layer attention to guide hierarchical semantic alignment under weak supervision and refine predictions during inference. Experiments on FS-COCO, SFSD, and FrISS show that LASA improves mIoU by +3.43, +8.01, and +15.74 over the prior weakly supervised baselines, demonstrating consistent gains in both segmentation accuracy and spatial coherence. Our source code will be made publicly available.
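The cross-layer aggregation idea can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the aggregation rule, so uniform averaging over heads and layers, the function names `accumulate_attention` and `refine_logits`, the tensor shapes, and the random inputs are all assumptions made for illustration.

```python
import numpy as np

def accumulate_attention(attn, layer_weights=None):
    """Combine per-layer attention into one patch-to-patch affinity matrix.

    attn: array of shape (L, H, N, N) holding attention over N patch tokens
          for L layers and H heads (assumed layout).
    Returns an (N, N) matrix whose rows sum to 1.
    """
    per_layer = attn.mean(axis=1)                 # average heads -> (L, N, N)
    num_layers = per_layer.shape[0]
    if layer_weights is None:
        # Assumption: uniform layer weights; the paper may weight layers differently.
        layer_weights = np.full(num_layers, 1.0 / num_layers)
    layer_weights = np.asarray(layer_weights, dtype=float)
    agg = np.tensordot(layer_weights, per_layer, axes=1)  # weighted sum -> (N, N)
    return agg / agg.sum(axis=-1, keepdims=True)  # renormalize each row

def refine_logits(patch_logits, agg_attn):
    """Propagate per-patch class scores (N, C) along the aggregated attention."""
    return agg_attn @ patch_logits

# Hypothetical sizes: 4 layers, 2 heads, 6 patches, 3 text categories.
rng = np.random.default_rng(0)
L, H, N, C = 4, 2, 6, 3
raw = rng.random((L, H, N, N))
attn = raw / raw.sum(axis=-1, keepdims=True)      # stand-in for softmax attention
agg = accumulate_attention(attn)
patch_logits = rng.normal(size=(N, C))            # stand-in for patch-text similarity
refined = refine_logits(patch_logits, agg)
labels = refined.argmax(axis=-1)                  # one category index per patch
```

In a real pipeline, `attn` would come from the Vision Transformer's self-attention and `patch_logits` from patch-to-text similarity; here both are random placeholders so the sketch runs on its own.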

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
