SarcasmMiner is a reinforcement learning-based post-training framework that improves multimodal sarcasm detection by addressing hallucination in cross-modal reasoning. The approach reformulates sarcasm detection as structured reasoning and employs a dual-track distillation strategy: high-quality teacher trajectories initialize a student model, while the full set of trajectories trains a generative reward model (GenRM) to evaluate reasoning quality. The student model is then optimized using group relative policy optimization (GRPO) with decoupled rewards for accuracy and reasoning quality, reaching an F1 of 70.22% on the MUStARD++ dataset, compared with 59.83% zero-shot and 68.23% with supervised fine-tuning.
SarcasmMiner uses reinforcement learning to teach models to detect sarcasm more robustly by rewarding accurate cross-modal reasoning, outperforming standard fine-tuning.
Multimodal sarcasm detection requires resolving pragmatic incongruity across textual, acoustic, and visual cues through cross-modal reasoning. To enable robust sarcasm reasoning with foundation models, we propose SarcasmMiner, a reinforcement learning-based post-training framework that resists hallucination in multimodal reasoning. We reformulate sarcasm detection as structured reasoning and adopt a dual-track distillation strategy: high-quality teacher trajectories initialize the student model, while the full set of trajectories trains a generative reward model (GenRM) to evaluate reasoning quality. The student is optimized with group relative policy optimization (GRPO) using decoupled rewards for accuracy and reasoning quality. On MUStARD++, SarcasmMiner achieves an F1 of 70.22%, up from 59.83% in the zero-shot setting and 68.23% with supervised fine-tuning. These findings suggest that reasoning-aware reward modeling enhances both performance and multimodal grounding.
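To make the idea of group-relative optimization with decoupled rewards more concrete, here is a minimal sketch of how per-response advantages could be computed for one group of sampled responses. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the two rewards are combined, so normalizing each reward separately within the group and then mixing them with fixed weights is an assumption, as are the function names, the weights `w_acc` and `w_quality`, and the example reward values.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    GRPO scores each response relative to the other responses drawn for
    the same prompt, rather than against a learned value baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_advantages(acc_rewards, quality_rewards, w_acc=1.0, w_quality=0.5):
    """Combine two separately normalized reward signals.

    Assumption: each reward is normalized on its own and then mixed with
    fixed weights. The source only states that the rewards are decoupled.
    """
    acc_adv = group_relative(acc_rewards)
    quality_adv = group_relative(quality_rewards)
    return [w_acc * a + w_quality * q for a, q in zip(acc_adv, quality_adv)]


# Hypothetical group of four responses sampled for a single input.
acc = [1.0, 0.0, 1.0, 0.0]      # 1.0 if the sarcasm label is correct
quality = [0.9, 0.2, 0.4, 0.6]  # hypothetical GenRM reasoning-quality scores
advantages = decoupled_advantages(acc, quality)
```

In this toy group, the first response is both correct and well reasoned, so it receives the largest advantage, while a correct answer with weaker reasoning (the third response) is still favored over incorrect ones but by a smaller margin.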