BIT | Fuzhou University | Information and Communication Branch | May 4, 2026

An On-Site Equipment Defect Recognition Method for UAV Inspection Scenarios via Spatial Reasoning Learning With Vision–Language Multimodal Large Models

AI Summary

This paper introduces an innovative method for on-site defect recognition in UAV inspections of power transmission lines and substations, leveraging spatial reasoning learning with multimodal large models. By employing a combination of Variational Autoencoders for defect image generation and a Swin Transformer-based visual encoder, the approach effectively addresses challenges posed by small-scale defects and complex backgrounds. The method achieves high precision and efficiency in defect detection, making it suitable for real-time edge inference in resource-constrained environments.

Key Contribution

A novel spatial reasoning approach makes high-precision defect detection in UAV inspections achievable in real time, even in complex environments.

Abstract

UAV-based power inspection has become a key means of monitoring the condition of transmission lines and substation equipment. In real-world scenarios, however, defect targets are typically small and weakly textured, and they appear against complex backgrounds with strong interference. Traditional models struggle to meet the demand for rapid, closed-loop handling of defects on site. To address these challenges, this paper proposes an on-site defect recognition method for UAV inspection scenarios based on spatial reasoning learning with vision-language multimodal large models. First, we develop a defect image generation method using Variational Autoencoders (VAE) and conditional score-matching diffusion. By embedding semantic information into the feature layers of the U-Net decoder via multi-layer Spatially-Adaptive Denormalization (SPADE) operators, we precisely control the location and morphology of the generated defect features. Second, at the model level, we construct a visual encoder based on the Swin Transformer, in which a hierarchical window attention mechanism extracts multi-scale defect features and scene topological features. The text encoder is built on a large language model pre-trained with next-token prediction and is then fine-tuned for the domain using power system terminology databases and equipment ledger knowledge. At the cross-modal fusion layer, a contrastive learning mechanism aligns defect images and textual descriptions within a unified vector space. For spatial reasoning learning, we construct region-level reasoning samples that interleave visual and linguistic content. Region-level instruction fine-tuning then drives the model to perform dynamic region cropping and multi-step reasoning analysis. For on-site deployment, we combine CPU-NPU heterogeneous co-acceleration with memory optimization strategies. Dynamic task allocation and data prefetching enable low-power, low-latency real-time edge inference. Experimental results demonstrate that our method delivers high-precision and high-efficiency defect detection in power inspection scenarios, effectively handling complex backgrounds and resource-constrained field environments.
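
The abstract says that a contrastive learning mechanism aligns defect images and textual descriptions in a unified vector space, but it does not state the loss. The sketch below illustrates one common way such alignment is done: a symmetric InfoNCE objective of the kind used in CLIP-style models. This choice is an assumption and is not confirmed by the paper. The function names, the toy embeddings, and the temperature of 0.07 are all hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss (assumed here, not taken from the paper).

    image_embs[i] and text_embs[i] are treated as a matching pair;
    every other combination in the batch is treated as a negative.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = diagonal_cross_entropy(logits)
    text_to_image = diagonal_cross_entropy(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


# Toy 2-D embeddings (hypothetical values, not data from the paper).
image_embeddings = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]   # each text points near its image
swapped_texts = [[0.1, 0.9], [0.9, 0.1]]   # each text points near the wrong image

loss_aligned = symmetric_contrastive_loss(image_embeddings, aligned_texts)
loss_swapped = symmetric_contrastive_loss(image_embeddings, swapped_texts)
```

In this toy setup the loss is lower when each description lies close to its own image than when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to reward.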

Computer Vision | Multimodal Models | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 32

Year: 2026

Venue: IEEE Access
