The paper introduces MLIF-Net, a multimodal architecture that fuses a Vision Transformer (ViT) with Large Language Models (LLMs) to improve detection of AI-generated images. It employs a Cross-Attention Mechanism for visual-semantic feature fusion and a Multiscale Contextual Reasoning Layer to capture both global and local image features. Experiments show that MLIF-Net achieves higher accuracy, recall, and Average Precision (AP) than existing AI-generated content detection methods.
Achieves state-of-the-art AI-generated image detection by fusing visual and semantic features through a novel cross-attention mechanism between a Vision Transformer and a Large Language Model.
This paper presents the Multimodal Language-Image Fusion Network (MLIF-Net), a new architecture for distinguishing AI-generated images from real ones. MLIF-Net combines a Vision Transformer (ViT) with Large Language Models (LLMs) in a multimodal feature-fusion network that improves the accuracy of AI-generated content detection. The model uses a Cross-Attention Mechanism to fuse visual and semantic features and a Multiscale Contextual Reasoning Layer to capture both global and local image features, while an Adaptive Loss Function improves the consistency and robustness of feature extraction. Experimental results show that MLIF-Net outperforms existing models in accuracy, recall, and Average Precision (AP). The approach enables more accurate detection of AI-generated content and may extend to other generative-content tasks.
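The abstract does not specify implementation details, so the following PyTorch sketch is only an illustration of the design it describes: ViT patch features attend over LLM token features via cross-attention, and a simple multiscale pooling step stands in for the Multiscale Contextual Reasoning Layer. All names here (CrossAttentionFusion, MultiscalePooling, Detector), the embedding dimension, the choice of visual tokens as queries, and the pooling scheme are assumptions for illustration, not the authors' implementation; the Adaptive Loss Function is omitted because the abstract gives no detail about its form.

```python
# Hypothetical sketch of an MLIF-Net-style fusion pipeline (not the authors' code).
# Assumed inputs: ViT patch features (B, Nv, D) and LLM token features (B, Nt, D),
# already projected to a shared dimension D.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionFusion(nn.Module):
    """Hypothetical fusion block: visual queries attend over LLM keys/values."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        # Queries come from ViT patches; keys/values from LLM token embeddings.
        fused, _ = self.attn(query=visual, key=semantic, value=semantic)
        return self.norm(visual + fused)  # residual connection around attention

class MultiscalePooling(nn.Module):
    """Crude stand-in for the Multiscale Contextual Reasoning Layer:
    average-pool the fused tokens at several window sizes and concatenate."""
    def __init__(self, dim: int = 768, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.proj = nn.Linear(dim * len(scales), dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); avg_pool1d expects (batch, dim, tokens).
        pooled = []
        for s in self.scales:
            p = F.avg_pool1d(x.transpose(1, 2), kernel_size=s, stride=s)
            pooled.append(p.mean(dim=2))  # per-scale global summary, (batch, dim)
        return self.proj(torch.cat(pooled, dim=1))

class Detector(nn.Module):
    """Binary real-vs-generated classifier on the fused representation."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.fusion = CrossAttentionFusion(dim)
        self.multiscale = MultiscalePooling(dim)
        self.head = nn.Linear(dim, 2)  # logits: [real, AI-generated]

    def forward(self, visual: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        return self.head(self.multiscale(self.fusion(visual, semantic)))

# Smoke test with random tensors standing in for ViT / LLM features.
vis = torch.randn(2, 196, 768)  # e.g. 14x14 ViT patch embeddings
sem = torch.randn(2, 32, 768)   # e.g. 32 caption-token embeddings
print(Detector()(vis, sem).shape)  # torch.Size([2, 2])
```

Using visual tokens as queries means each image patch gathers semantic evidence from the text side; the paper may equally use the reverse direction or a bidirectional scheme, which the abstract leaves unspecified.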