Rochester, Sichuan Agricultural University, Xiamen University

Mar 17, 2026

arXiv:2603.16859

SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang, Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng

AI Summary

The paper introduces SocialOmni, a new benchmark to evaluate the social interactivity of Omni-modal large language models (OLMs) by assessing their ability to handle speaker separation/identification, interruption timing, and natural interruption generation. The benchmark includes 2,000 perception samples and 209 interaction-generation instances with temporal and contextual constraints, as well as controlled audio-visual inconsistencies. Benchmarking 12 leading OLMs reveals a decoupling between perceptual accuracy and interruption generation, highlighting the limitations of understanding-centric metrics for evaluating conversational social competence.

Key Contribution

Current Omni-modal LLMs can ace perception tasks but still fail at basic social interactions like knowing when and how to jump into a conversation.

Abstract

Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios to test model robustness. We benchmark 12 leading OLMs and uncover significant variance in social-interaction capabilities across models. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. More encouragingly, these diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.

Eval Frameworks & Benchmarks, Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
