The paper introduces MLIF-Net, a multimodal architecture that fuses a Vision Transformer (ViT) with Large Language Models (LLMs) to improve detection of AI-generated images. It employs a Cross-Attention Mechanism for visual-semantic feature fusion and a Multiscale Contextual Reasoning Layer to capture both global and local image features. Experiments show that MLIF-Net achieves higher accuracy, recall, and Average Precision (AP) than existing AI-generated content detection methods.
Achieves state-of-the-art AI-generated image detection by fusing visual and semantic features through a novel cross-attention mechanism between a Vision Transformer and a Large Language Model.
This paper presents the Multimodal Language-Image Fusion Network (MLIF-Net), a new architecture for distinguishing AI-generated images from real ones. MLIF-Net combines a Vision Transformer (ViT) with Large Language Models (LLMs) in a multimodal feature-fusion network that improves the accuracy of AI-generated content detection. The model uses a Cross-Attention Mechanism to fuse visual and semantic features and a Multiscale Contextual Reasoning Layer to capture both global and local image features, while an Adaptive Loss Function improves the consistency and robustness of feature extraction. Experimental results show that MLIF-Net outperforms existing models in accuracy, recall, and Average Precision (AP). The approach enables more accurate detection of AI-generated content and may extend to other generative-content tasks.
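The abstract does not specify implementation details, so the following PyTorch sketch is only an illustration of the design it describes: ViT patch features attend over LLM token features via cross-attention, and a simple multiscale pooling step stands in for the Multiscale Contextual Reasoning Layer. All names here (CrossAttentionFusion, MultiscalePooling, Detector), the embedding dimension, the choice of visual tokens as queries, and the pooling scheme are assumptions for illustration, not the authors' implementation; the Adaptive Loss Function is omitted because the abstract gives no detail about its form.

```python
# Hypothetical sketch of an MLIF-Net-style fusion pipeline (not the authors' code).
# Assumed inputs: ViT patch features (B, Nv, D) and LLM token features (B, Nt, D),
# already projected to a shared dimension D.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionFusion(nn.Module):
    """Hypothetical fusion block: visual queries attend over LLM keys/values."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        # Queries come from ViT patches; keys/values from LLM token embeddings.
        fused, _ = self.attn(query=visual, key=semantic, value=semantic)
        return self.norm(visual + fused)  # residual connection around attention

class MultiscalePooling(nn.Module):
    """Crude stand-in for the Multiscale Contextual Reasoning Layer:
    average-pool the fused tokens at several window sizes and concatenate."""
    def __init__(self, dim: int = 768, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.proj = nn.Linear(dim * len(scales), dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); avg_pool1d expects (batch, dim, tokens).
        pooled = []
        for s in self.scales:
            p = F.avg_pool1d(x.transpose(1, 2), kernel_size=s, stride=s)
            pooled.append(p.mean(dim=2))  # per-scale global summary, (batch, dim)
        return self.proj(torch.cat(pooled, dim=1))

class Detector(nn.Module):
    """Binary real-vs-generated classifier on the fused representation."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.fusion = CrossAttentionFusion(dim)
        self.multiscale = MultiscalePooling(dim)
        self.head = nn.Linear(dim, 2)  # logits: [real, AI-generated]

    def forward(self, visual: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        return self.head(self.multiscale(self.fusion(visual, semantic)))

# Smoke test with random tensors standing in for ViT / LLM features.
vis = torch.randn(2, 196, 768)  # e.g. 14x14 ViT patch embeddings
sem = torch.randn(2, 32, 768)   # e.g. 32 caption-token embeddings
print(Detector()(vis, sem).shape)  # torch.Size([2, 2])
```

Using visual tokens as queries means each image patch gathers semantic evidence from the text side; the paper may equally use the reverse direction or a bidirectional scheme, which the abstract leaves unspecified.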