This paper introduces LASA, a weakly supervised method for open-vocabulary scene sketch semantic segmentation that aggregates attention maps from different layers of a Vision Transformer to improve semantic understanding of sparse line drawings. Because shallow and deep layers capture complementary spatial cues, the aggregated attention provides a more robust structural prior than any individual layer. On FS-COCO, SFSD, and FrISS, LASA improves mean Intersection over Union (mIoU) by +3.43, +8.01, and +15.74, respectively, over prior weakly supervised baselines, with gains in both segmentation accuracy and spatial coherence.
Cross-layer attention aggregation in LASA improves mIoU by up to +15.74 (on FrISS) over prior weakly supervised baselines, suggesting that multi-layer features are underused in sketch segmentation.
Open-vocabulary scene sketch semantic segmentation aims to assign dense semantic labels to sparse line drawings based on flexible category vocabularies specified at inference time, without relying on pixel-level annotations during training. Unlike natural images, sketches lack texture and color cues, so semantic understanding depends heavily on stroke layout and spatial configuration. This dependence makes single-layer vision-language features inherently unstable. Our key observation is that attention maps from different Vision Transformer layers encode complementary spatial cues: shallow layers capture global structural layouts, while deeper layers focus on local stroke intersections and object parts. This suggests that cross-layer aggregation provides a more robust structural prior than any individual layer alone. Leveraging this insight, we propose a structure-aware framework built upon \textbf{L}ayer-wise \textbf{A}ccumulated \textbf{S}tructural \textbf{A}ttention (\textbf{LASA}), which aggregates multi-layer attention to guide hierarchical semantic alignment under weak supervision and to refine predictions during inference. Experiments on FS-COCO, SFSD, and FrISS show that LASA improves mIoU by $+3.43$, $+8.01$, and $+15.74$, respectively, over the prior weakly supervised baselines, demonstrating consistent gains in both segmentation accuracy and spatial coherence. Our source code will be made publicly available.
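The abstract does not state how the per-layer attention maps are combined, so the following is only a minimal sketch of one plausible form of cross-layer aggregation, not the authors' implementation. It assumes each layer yields one 2D attention map of the same size, rescales each map to [0, 1], and takes a weighted average. The names `min_max_normalize` and `aggregate_layers`, the min-max rescaling, and the uniform default weights are all illustrative assumptions.

```python
from typing import List, Optional, Sequence

AttentionMap = List[List[float]]  # H x W grid of attention scores


def min_max_normalize(attn: AttentionMap) -> AttentionMap:
    """Rescale one map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in attn for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in attn]
    return [[(v - lo) / span for v in row] for row in attn]


def aggregate_layers(
    layer_maps: Sequence[AttentionMap],
    weights: Optional[Sequence[float]] = None,
) -> AttentionMap:
    """Weighted average of per-layer normalized attention maps.

    Assumption: uniform weights unless given. The source does not
    specify the aggregation rule LASA actually uses.
    """
    if not layer_maps:
        raise ValueError("need at least one layer map")
    if weights is None:
        weights = [1.0] * len(layer_maps)
    if len(weights) != len(layer_maps):
        raise ValueError("one weight per layer is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Normalize per layer so no single layer dominates by scale alone.
    normalized = [min_max_normalize(m) for m in layer_maps]
    height, width = len(normalized[0]), len(normalized[0][0])
    for m in normalized:
        if len(m) != height or any(len(row) != width for row in m):
            raise ValueError("all layer maps must share one shape")

    return [
        [
            sum(w * m[r][c] for w, m in zip(weights, normalized)) / total
            for c in range(width)
        ]
        for r in range(height)
    ]


if __name__ == "__main__":
    shallow = [[0.0, 2.0], [4.0, 2.0]]  # toy stand-in for a shallow layer
    deep = [[1.0, 1.0], [1.0, 3.0]]     # toy stand-in for a deep layer
    print(aggregate_layers([shallow, deep]))
```

Normalizing each layer before averaging keeps a layer with large raw scores from swamping the others. Passing `weights` lets a caller emphasize shallow or deep layers, which is one way to probe the complementarity the abstract describes.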