This paper introduces MIDI, a multilingual idiom dataset that embeds idiomatic expressions in both sentence-level and conversational contexts across 18 languages (3 high-, 3 medium-, and 12 low-resource). The study finds that conversational context improves idiom comprehension but does not eliminate performance disparities: comprehension degrades in low-resource languages, and in all resource tiers literal interpretations are substantially harder than figurative ones. Using controlled tests and interventions on hidden representations, the authors separate memorization from reasoning and expose core limitations of current models in idiom understanding.
Idiom comprehension degrades in low-resource languages, and literal readings are substantially harder than figurative ones in all resource tiers, even when conversational context is provided.
Idiomatic expressions pose a major challenge for multilingual NLP because their meanings shift between figurative and literal usage, often requiring context for accurate interpretation. Prior work has focused on high-resource languages and typically evaluates isolated idiom-meaning questions, overlooking realistic discourse. We introduce MIDI, a multilingual idiom dataset spanning 3 high-, 3 medium-, and 12 low-resource languages, curated by native speakers. Unlike previous datasets, MIDI provides idioms embedded in both sentence-level and conversational contexts, capturing both literal and figurative readings. Benchmarking state-of-the-art models shows that idiom comprehension degrades in low-resource languages and that, in all resource tiers, literal interpretations are substantially harder than figurative ones. Conversational context improves performance but does not eliminate these disparities. Through controlled tests and interventions on hidden representations, we further separate memorization from reasoning, exposing core limitations of current models.