This paper introduces a diagnostic dataset and NLI task to evaluate how language models handle the proviso problem, a challenge in pragmatics concerning presupposition projection in conditional sentences. Experiments with RoBERTa, DeBERTa, LLaMA, and Gemma reveal that while models often align with human judgments, they tend to rely on surface-level patterns instead of deeper semantic or pragmatic reasoning. The study underscores the necessity of diagnostic datasets and multi-faceted evaluation methods for assessing pragmatic competence in language models.
LLaMA and Gemma may seem to understand complex conditional statements, but they're really just pattern-matching, not grasping the underlying pragmatic nuances of presuppositions.
We investigate how language models handle the proviso problem, an unresolved issue in pragmatics in which the presuppositions that theories predict for conditional sentences diverge from the ones human readers actually infer. We reformulate this phenomenon as a Natural Language Inference task and introduce a diagnostic dataset designed to probe presupposition projection in conditionals. We evaluate RoBERTa, DeBERTa, LLaMA, and Gemma using explainability analyses. The results show that models broadly align with human judgments but rely on shallow pattern matching rather than semantic or pragmatic reasoning. Our work provides the first computational evaluation framework for the proviso problem and highlights the need for diagnostic, multi-method approaches to assess pragmatic competence and context-dependent meaning in language models.
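To make the NLI reformulation concrete, here is a minimal sketch of how one conditional sentence could be turned into two premise-hypothesis pairs, one testing the conditional presupposition and one testing the unconditional presupposition. The `NLIItem` class, the `proviso_items` helper, the `reading` labels, and the example sentence are all hypothetical illustrations; they are not the paper's dataset format or its items.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NLIItem:
    """One premise-hypothesis pair (hypothetical format, not the paper's)."""
    premise: str
    hypothesis: str
    reading: str  # which presupposition the hypothesis expresses


def proviso_items(antecedent: str, consequent: str, presupposition: str) -> list[NLIItem]:
    """Build two NLI pairs from one conditional sentence.

    The premise is "If <antecedent>, <consequent>." where the consequent
    carries a presupposition trigger. The two hypotheses spell out the
    conditional and the unconditional version of that presupposition.
    """
    premise = f"If {antecedent}, {consequent}."
    return [
        # Conditional presupposition: holds only given the antecedent.
        NLIItem(premise, f"If {antecedent}, {presupposition}.", "conditional"),
        # Unconditional presupposition: holds regardless of the antecedent.
        NLIItem(premise, f"{presupposition}.", "unconditional"),
    ]


# Illustrative sentence: "his wife" presupposes that Theo has a wife.
items = proviso_items(
    antecedent="Theo hates sonnets",
    consequent="his wife does too",
    presupposition="Theo has a wife",
)

for item in items:
    print(f"[{item.reading}] {item.premise} => {item.hypothesis}")
```

Each pair could then be given to an NLI model, and the predicted labels for the two readings compared against human judgments for the same sentence.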