Large Vision-Language Models (LVLMs) mainly rely on template-generated textual descriptions to understand defects. Because such templates typically lack specialized knowledge, this reliance impairs the performance of these models on Industrial Defect Detection (IDD). Moreover, most existing IDD methods rely solely on a contrastive loss for image-to-text feature alignment, which limits their ability to focus on defective regions, and they usually use cosine similarity for contextual learning, which restricts their ability to understand and adapt to complex contexts. To address these issues, we first collect a large-scale defect data set with textual descriptions, namely the Text-Augmented Defect Data Set (TADD), to fine-tune an LVLM for defect description. We then propose a Self-prompted Generic Defect Diagnosis LVLM (covering both defect detection and defect description), i.e., the SPGDD-GPT. This method effectively exploits contextual information through two purpose-built components, a Multi-scale Self-prompted Memory Module (MSSPMM) and a Text-Driven Defect Focuser (TDDF), allowing it to adapt to unseen defect categories and attend to abnormal regions. Experimental results show that our method generally outperforms its counterparts across the 21 subsets of TADD under the 1-shot, 2-shot, and 4-shot defect detection settings, demonstrating strong detection and generalization capabilities. The proposed method can also generate a textual description of the defects contained in each test image. These promising results can be attributed to the proposed MSSPMM and TDDF and the large-scale TADD. The source code, model, and data set are available at https://github.com/INDTLab/SPGDD-GPT.

Note to Practitioners—The proposed SPGDD-GPT is developed on top of an LVLM.
It is specifically designed for the few-shot defect diagnosis task, covering both defect detection and defect description, and requires only a small number of training images. In real-world scenarios, the TADD effectively addresses the lack of detailed textual descriptions in training data, significantly alleviating the scarcity of textual data commonly encountered by practitioners in the field of defect diagnosis. By integrating the Text-Driven Defect Focuser (TDDF) and the Multi-scale Self-prompted Memory Module (MSSPMM), the SPGDD-GPT improves the alignment between visual and textual information, thereby improving the adaptability and robustness of the model in various scenarios. The TDDF explicitly adjusts the distance between normal and abnormal text embeddings through boundary hyperparameters, and achieves precise defect detection by reducing the Euclidean distance between abnormal image features and abnormal text representations. The MSSPMM uses multi-scale normal samples as self-prompts, allowing the model to rapidly adapt to novel object categories with limited samples and to effectively attend to defective regions. Furthermore, the TADD consists of 35,741 images divided into 21 defect subsets with detailed textual descriptions that we annotate, providing rich contextual information. This data set facilitates a more comprehensive understanding of defect characteristics and enhances the generalizability of the model in real-world scenarios.
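The two mechanisms described above can be illustrated with a minimal NumPy sketch. The function names, the hinge form of the TDDF-style loss, and the min-over-memory scoring for the MSSPMM-style self-prompts are our illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def tddf_margin_loss(img_feat, txt_normal, txt_abnormal, is_abnormal,
                     margin=1.0):
    """Hinge-style margin loss (illustrative): pull the image feature
    toward the matching text embedding in Euclidean distance, and push
    the opposite text embedding at least `margin` away."""
    d_normal = np.linalg.norm(img_feat - txt_normal)
    d_abnormal = np.linalg.norm(img_feat - txt_abnormal)
    if is_abnormal:
        # reduce distance to the abnormal text representation,
        # keep the normal one beyond the boundary hyperparameter
        return d_abnormal + max(0.0, margin - d_normal)
    return d_normal + max(0.0, margin - d_abnormal)

def multiscale_anomaly_score(query_feats, memory):
    """Self-prompted memory lookup (illustrative): `memory` maps each
    scale to an (N, D) bank of normal-sample features; the score averages,
    over scales, the distance to the nearest normal self-prompt."""
    per_scale = []
    for scale, q in query_feats.items():
        dists = np.linalg.norm(memory[scale] - q, axis=1)
        per_scale.append(dists.min())   # nearest normal sample at this scale
    return float(np.mean(per_scale))    # large score -> likely defective
```

A large `multiscale_anomaly_score` indicates that no stored normal sample resembles the query at any scale, which is the intuition behind using few normal shots as self-prompts for unseen categories.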