Hacettepe University, Koç University | Apr 16, 2026 | arXiv:2604.15210

Learning to Think Like a Cartoon Captionist: Incongruity-Resolution Supervision for Multimodal Humor Understanding

Hatice Merve Vural, Doga Kukul, Ege Erdem Ozlu, Demir Ekin Arikan, Bob Mankoff, Erkut Erdem, Aykut Erdem

AI Summary

This paper introduces Incongruity-Resolution Supervision (IRS), a framework that decomposes humor understanding into incongruity modeling, resolution modeling, and preference alignment, mirroring the cognitive processes of human captionists. IRS provides structured supervision for intermediate reasoning steps, guiding models to explicitly learn the path from visual perception to humorous interpretation. Experiments on the New Yorker Cartoon Caption Contest (NYCC) demonstrate that IRS significantly improves performance across various model sizes, achieving near expert-level ranking and showing zero-shot transferability, suggesting structured reasoning supervision is more effective than scale alone.

Key Contribution

Teaching models *how* to reason like a cartoon captionist brings humor ranking close to expert level and outperforms strong open and closed multimodal baselines, which suggests that supervising reasoning structure matters more than scale alone.

Abstract

Humor is one of the few cognitive tasks where getting the reasoning right matters as much as getting the answer right. While recent work evaluates humor understanding on benchmarks such as the New Yorker Cartoon Caption Contest (NYCC), it largely treats the task as black-box prediction, overlooking the structured reasoning processes underlying humor comprehension. We introduce IRS (Incongruity-Resolution Supervision), a framework that decomposes humor understanding into three components: incongruity modeling, which identifies mismatches in the visual scene; resolution modeling, which constructs coherent reinterpretations of these mismatches; and preference alignment, which evaluates candidate interpretations under human judgments. Grounded in incongruity-resolution theory and expert captionist practice, IRS supervises the intermediate reasoning process through structured traces that make the path from visual perception to humorous interpretation explicit and learnable. Across 7B, 32B, and 72B models on NYCC, IRS outperforms strong open and closed multimodal baselines on caption matching and ranking tasks, with our largest model approaching expert-level performance on ranking. Zero-shot transfer to external benchmarks shows that IRS learns generalizable reasoning patterns. Our results suggest that supervising reasoning structure, rather than scale alone, is key for reasoning-centric tasks.
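
As a rough illustration of the three-part decomposition described in the abstract, the sketch below represents one structured trace as a small data structure. It is not the paper's schema: the class names `Candidate` and `Trace`, every field name, the text layout produced by `render`, the numeric `preference` scores, and the example cartoon are all assumptions invented here for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Candidate:
    """One caption plus a hypothetical resolution and preference score."""
    caption: str
    resolution: str    # how the caption reinterprets the mismatch
    preference: float  # stand-in for a score reflecting human judgments


@dataclass
class Trace:
    """Hypothetical trace: scene -> incongruity -> resolution -> preference."""
    scene: str
    incongruities: List[str]
    candidates: List[Candidate] = field(default_factory=list)

    def ranked(self) -> List[Candidate]:
        # Preference-alignment step: order candidates by score, best first.
        return sorted(self.candidates, key=lambda c: c.preference, reverse=True)

    def render(self) -> str:
        # Serialize the trace as plain text, one line per reasoning stage.
        if not self.candidates:
            raise ValueError("a trace needs at least one candidate caption")
        best = self.ranked()[0]
        return "\n".join([
            f"Scene: {self.scene}",
            "Incongruity: " + "; ".join(self.incongruities),
            f"Resolution: {best.resolution}",
            f"Preferred caption: {best.caption}",
        ])


# Invented example data, not taken from NYCC.
trace = Trace(
    scene="Two fish in a bowl watch a cat typing at a laptop.",
    incongruities=["a cat is doing office work"],
    candidates=[
        Candidate("Nice weather today.", "ignores the mismatch", 0.1),
        Candidate(
            "He says he's working from home, but we know why he's here.",
            "the cat's remote work is a pretext for watching the fish",
            0.8,
        ),
    ],
)
print(trace.render())
```

Keeping the incongruity and resolution as separate fields mirrors the idea that each intermediate step can be supervised on its own, rather than only the final caption choice.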

Eval Frameworks & Benchmarks | Multimodal Models | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
