This paper proposes a hybrid CNN-Transformer-ResNet architecture for image classification, aiming to leverage the strengths of each component: CNNs for local feature extraction, Transformers for global dependency modeling, and ResNet for mitigating vanishing gradients. The authors hypothesize that this combination will improve classification accuracy and computational efficiency, especially on large-scale image datasets. Experimental results on standard datasets demonstrate improved performance and generalization ability compared to individual models.
Fusing CNNs, Transformers, and ResNets boosts image classification accuracy and efficiency, offering a more robust solution for intelligent vision systems.
With the rapid development of deep learning, Convolutional Neural Networks (CNNs), Vision Transformers, and Residual Networks (ResNets) have become standard, effective techniques for image classification in computer vision. However, achieving strong classification performance in certain specific settings remains a challenge. This study combines the advantages of CNNs, Transformers, and ResNets to improve image classification performance. Specifically, it first extracts local image features with a CNN, capturing fine-grained local detail. A Vision Transformer then effectively captures global dependencies in the image, further enhancing the model's information-processing ability. Finally, the residual structure of ResNet addresses the vanishing-gradient problem that can arise when training deep networks, improving training efficiency and stability. Experimental results show that the individual models achieve different performance metrics on standard datasets, and all exhibit good generalization when handling large-scale image data. Moreover, fusing the different models not only improves classification accuracy but also optimizes computational efficiency, providing a more reliable solution for intelligent vision systems in practical applications.
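The three-stage pipeline the abstract describes (local convolution, global self-attention, residual skip connection) can be sketched minimally in NumPy. All shapes, the toy kernel, the row-as-token treatment, and the projection-free single-head attention below are illustrative assumptions for exposition; the paper's actual architecture is not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w):
    """Valid 2D cross-correlation: the CNN role (local feature extraction)."""
    h, wd = x.shape
    k = w.shape[0]
    out = np.empty((h - k + 1, wd - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention: the Transformer role (global dependencies).
    Identity Q/K/V projections are used purely for brevity."""
    d = tokens.shape[-1]
    scores = softmax(tokens @ tokens.T / np.sqrt(d))  # pairwise token affinities
    return scores @ tokens

def hybrid_block(x, w):
    feat = conv2d(x, w)                 # 1) CNN: local features
    tokens = feat                       # each feature-map row acts as a token
    attended = self_attention(tokens)   # 2) Transformer: global mixing
    return tokens + attended            # 3) ResNet: residual (skip) connection

x = rng.standard_normal((6, 6))   # toy "image"
w = rng.standard_normal((3, 3))   # toy convolution kernel
out = hybrid_block(x, w)
print(out.shape)  # (4, 4): a 6x6 input with a 3x3 kernel yields 4 tokens of dim 4
```

The residual addition in the final step mirrors ResNet's identity shortcut: even if the attention output were uninformative, the convolutional features pass through unchanged, which is what stabilizes gradient flow in deep stacks.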