This paper introduces PRISM, a framework for multimodal sentiment analysis that addresses the limitations of early aggregation by organizing multimodal evidence in a shared prototype space, which enables structured cross-modal comparison. PRISM also applies dynamic modality reweighting during reasoning, so that modality contributions are refined continuously as semantic interactions deepen. Experiments on three benchmark datasets show that PRISM outperforms representative baselines.
Forget monolithic sentiment vectors: PRISM adaptively fuses multimodal cues by comparing them in a shared prototype space, and outperforms representative baselines on three benchmark datasets.
Multimodal sentiment analysis (MSA) aims to predict human sentiment from textual, acoustic, and visual information in videos. Recent studies improve multimodal fusion by modeling modality interaction and assigning different modality weights. However, they usually compress diverse sentiment cues into a single compact representation before sentiment reasoning. This early aggregation makes it difficult to preserve the internal structure of sentiment evidence, where different cues may complement, conflict with, or differ in reliability from each other. In addition, modality importance is often determined only once during fusion, so later reasoning cannot further adjust modality contributions. To address these issues, we propose PRISM, a framework that unifies structured affective extraction and adaptive modality evaluation. PRISM organizes multimodal evidence in a shared prototype space, which supports structured cross-modal comparison and adaptive fusion. It further applies dynamic modality reweighting during reasoning, allowing modality contributions to be continuously refined as semantic interactions become deeper. Experiments on three benchmark datasets show that PRISM outperforms representative baselines.
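As a rough illustration of the two ideas in the abstract (comparing modalities in a shared prototype space, and re-estimating modality weights over several reasoning steps rather than once), consider the minimal Python sketch below. It is not the authors' implementation: the abstract does not specify the architecture, so the cosine-similarity profiles, the agreement-based weight update, and every name used here (`prototype_profile`, `fuse_with_reweighting`, `steps`) are assumptions made purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def prototype_profile(features, prototypes):
    """Describe one modality's feature vector by its similarity to K shared prototypes.

    features:   (D,) vector for a single modality (assumed non-zero).
    prototypes: (K, D) matrix shared by all modalities (hypothetical).
    Returns a (K,) distribution over prototypes.
    """
    f = features / np.linalg.norm(features)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return softmax(p @ f)  # softmax over cosine similarities


def fuse_with_reweighting(profiles, steps=3):
    """Fuse (M, K) modality profiles, re-estimating modality weights at every step.

    Assumed update rule: a modality whose profile agrees more with the
    current fused profile receives a larger weight in the next step.
    """
    num_modalities = profiles.shape[0]
    weights = np.full(num_modalities, 1.0 / num_modalities)  # start uniform
    for _ in range(steps):
        fused = weights @ profiles      # (K,) current fused profile
        agreement = profiles @ fused    # (M,) agreement of each modality with it
        weights = softmax(agreement)    # refreshed modality weights
    return weights @ profiles, weights


# Toy usage with random stand-ins for text, audio, and visual features.
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(4, 8))
modality_features = rng.normal(size=(3, 8))
profiles = np.stack([prototype_profile(f, prototypes) for f in modality_features])
fused, weights = fuse_with_reweighting(profiles, steps=3)
print("modality weights:", weights)
print("fused prototype profile:", fused)
```

Because every modality is expressed over the same set of prototypes, the profiles can be compared directly before anything is collapsed into a single vector, and the loop shows in the simplest possible form how weights can keep changing after the first fusion pass. A trained model would learn both the prototypes and the update rule; here they are random or hand-picked.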