The paper introduces Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC), a system that encodes images into a compact vision-language semantic feature (VLF) using a pre-trained vision-language model (VLM) for transmission. This unified VLF representation is then used to condition both a language model for text generation and a diffusion model for image generation at the receiver. Results demonstrate that VLF-MSC achieves higher semantic accuracy for both modalities under low SNR with reduced bandwidth compared to unimodal baselines, highlighting its robustness and spectral efficiency.
Transmitting a single, compact vision-language representation beats sending separate image and text streams for multimodal semantic communication, especially under noisy conditions.
We propose Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC), a unified system that transmits a single compact vision-language representation to support both image and text generation at the receiver. Unlike existing semantic communication techniques that process each modality separately, VLF-MSC employs a pre-trained vision-language model (VLM) to encode the source image into a vision-language semantic feature (VLF), which is transmitted over the wireless channel. At the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on the VLF to produce descriptive text and a semantically aligned image. This unified representation eliminates the need for modality-specific streams or retransmissions, improving spectral efficiency and adaptability. By leveraging foundation models, the system achieves robustness to channel noise while preserving semantic fidelity. Experiments demonstrate that VLF-MSC outperforms text-only and image-only baselines, achieving higher semantic accuracy for both modalities under low SNR with significantly reduced bandwidth.
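To make the pipeline concrete, here is a minimal sketch of the transmit-once, decode-twice idea described in the abstract. All module names, feature dimensions, and the AWGN channel model below are illustrative assumptions standing in for the paper's pre-trained VLM, language model, and diffusion generator; they are not the authors' implementation.

```python
import math
import torch
import torch.nn as nn

# Placeholder modules standing in for the foundation models in VLF-MSC:
# a VLM encoder producing the vision-language feature (VLF), a language-model
# text decoder, and a diffusion-style image generator, both conditioned on
# the same received VLF.

class VLMEncoder(nn.Module):
    """Maps a source image to a compact vision-language feature (VLF)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))

    def forward(self, image):
        return self.backbone(image)            # (B, feat_dim) semantic feature


def awgn_channel(x, snr_db):
    """Simulate transmission of x over an AWGN channel at the given SNR (dB)."""
    signal_power = x.pow(2).mean()
    noise_power = signal_power / (10 ** (snr_db / 10))
    return x + torch.randn_like(x) * noise_power.sqrt()


class TextDecoder(nn.Module):
    """Toy language-model head: received VLF -> greedy caption token ids."""
    def __init__(self, feat_dim=512, vocab_size=1000, max_len=16):
        super().__init__()
        self.proj = nn.Linear(feat_dim, max_len * vocab_size)
        self.max_len, self.vocab_size = max_len, vocab_size

    def forward(self, vlf):
        logits = self.proj(vlf).view(-1, self.max_len, self.vocab_size)
        return logits.argmax(-1)                # token ids of the descriptive text


class ImageGenerator(nn.Module):
    """Toy stand-in for a diffusion model conditioned on the received VLF."""
    def __init__(self, feat_dim=512, image_shape=(3, 64, 64)):
        super().__init__()
        self.proj = nn.Linear(feat_dim, math.prod(image_shape))
        self.image_shape = image_shape

    def forward(self, vlf):
        return self.proj(vlf).view(-1, *self.image_shape)


# End-to-end pass: a single VLF serves both modalities at the receiver.
encoder, text_dec, img_gen = VLMEncoder(), TextDecoder(), ImageGenerator()
image = torch.randn(1, 3, 64, 64)               # source image
vlf = encoder(image)                            # transmitter: image -> VLF
received = awgn_channel(vlf, snr_db=0)          # noisy low-SNR wireless channel
caption_ids = text_dec(received)                # receiver: descriptive text
reconstruction = img_gen(received)              # receiver: semantically aligned image
```

The key design choice the sketch illustrates is that only one feature vector crosses the channel; both receiver-side decoders condition on the same (noisy) VLF, so no modality-specific streams or retransmissions are needed.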