The paper introduces Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC), a system that encodes images into a compact vision-language semantic feature (VLF) using a pre-trained vision-language model (VLM) for transmission. This unified VLF representation is then used to condition both a language model for text generation and a diffusion model for image generation at the receiver. Results demonstrate that VLF-MSC achieves higher semantic accuracy for both modalities under low SNR with reduced bandwidth compared to unimodal baselines, highlighting its robustness and spectral efficiency.
Transmitting a single, compact vision-language representation beats sending separate image and text streams for multimodal semantic communication, especially under noisy conditions.
We propose Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC), a unified system that transmits a single compact vision-language representation to support both image and text generation at the receiver. Unlike existing semantic communication techniques that process each modality separately, VLF-MSC employs a pre-trained vision-language model (VLM) to encode the source image into a vision-language semantic feature (VLF), which is transmitted over the wireless channel. At the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on the VLF to produce descriptive text and a semantically aligned image. This unified representation eliminates the need for modality-specific streams or retransmissions, improving spectral efficiency and adaptability. By leveraging foundation models, the system achieves robustness to channel noise while preserving semantic fidelity. Experiments demonstrate that VLF-MSC outperforms text-only and image-only baselines, achieving higher semantic accuracy for both modalities under low SNR with significantly reduced bandwidth.
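To make the pipeline concrete, here is a minimal sketch of the transmit-once, decode-twice idea described in the abstract. All module names, feature dimensions, and the AWGN channel model below are illustrative assumptions standing in for the paper's pre-trained VLM, language model, and diffusion generator; they are not the authors' implementation.

```python
import math
import torch
import torch.nn as nn

# Placeholder modules standing in for the foundation models in VLF-MSC:
# a VLM encoder producing the vision-language feature (VLF), a language-model
# text decoder, and a diffusion-style image generator, both conditioned on
# the same received VLF.

class VLMEncoder(nn.Module):
    """Maps a source image to a compact vision-language feature (VLF)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))

    def forward(self, image):
        return self.backbone(image)            # (B, feat_dim) semantic feature


def awgn_channel(x, snr_db):
    """Simulate transmission of x over an AWGN channel at the given SNR (dB)."""
    signal_power = x.pow(2).mean()
    noise_power = signal_power / (10 ** (snr_db / 10))
    return x + torch.randn_like(x) * noise_power.sqrt()


class TextDecoder(nn.Module):
    """Toy language-model head: received VLF -> greedy caption token ids."""
    def __init__(self, feat_dim=512, vocab_size=1000, max_len=16):
        super().__init__()
        self.proj = nn.Linear(feat_dim, max_len * vocab_size)
        self.max_len, self.vocab_size = max_len, vocab_size

    def forward(self, vlf):
        logits = self.proj(vlf).view(-1, self.max_len, self.vocab_size)
        return logits.argmax(-1)                # token ids of the descriptive text


class ImageGenerator(nn.Module):
    """Toy stand-in for a diffusion model conditioned on the received VLF."""
    def __init__(self, feat_dim=512, image_shape=(3, 64, 64)):
        super().__init__()
        self.proj = nn.Linear(feat_dim, math.prod(image_shape))
        self.image_shape = image_shape

    def forward(self, vlf):
        return self.proj(vlf).view(-1, *self.image_shape)


# End-to-end pass: a single VLF serves both modalities at the receiver.
encoder, text_dec, img_gen = VLMEncoder(), TextDecoder(), ImageGenerator()
image = torch.randn(1, 3, 64, 64)               # source image
vlf = encoder(image)                            # transmitter: image -> VLF
received = awgn_channel(vlf, snr_db=0)          # noisy low-SNR wireless channel
caption_ids = text_dec(received)                # receiver: descriptive text
reconstruction = img_gen(received)              # receiver: semantically aligned image
```

The key design choice the sketch illustrates is that only one feature vector crosses the channel; both receiver-side decoders condition on the same (noisy) VLF, so no modality-specific streams or retransmissions are needed.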