NUS | SCU | Apr 13, 2026 | arXiv:2604.11502

METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models

Pengfeng Li, Chen Huang, Chaoqun Hao, Hongyao Chen, Xiao-Yong Wei, Wenqiang Lei, See-Kiong Ng

AI Summary

The paper introduces METER, a benchmark designed to evaluate LLMs' contextual causal reasoning across the levels of association, intervention, and counterfactual reasoning within a unified context. Experiments using METER reveal a significant performance drop as tasks increase in causal complexity, indicating limitations in LLMs' ability to maintain context faithfulness and resist distraction from irrelevant information. Mechanistic analysis identifies distraction by factually correct but causally irrelevant information and degraded context faithfulness as key failure modes.

Key Contribution

LLMs struggle to maintain context and avoid distraction when reasoning about causality, leading to a significant performance drop as tasks increase in complexity.

Abstract

Contextual causal reasoning is a critical yet challenging capability for Large Language Models (LLMs). Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy. To address this, we introduce METER to systematically benchmark LLMs across all three levels of the causal ladder under a unified context setting. Our extensive evaluation of various LLMs reveals a significant decline in proficiency as tasks ascend the causal hierarchy. To diagnose this degradation, we conduct a deep mechanistic analysis via both error pattern identification and internal information flow tracing. Our analysis reveals two primary failure modes: (1) LLMs are susceptible to distraction by causally irrelevant but factually correct information at lower levels of causality; and (2) as tasks ascend the causal hierarchy, faithfulness to the provided context degrades, leading to reduced performance. We believe our work advances the understanding of the mechanisms behind LLM contextual causal reasoning and establishes a critical foundation for future research. Our code and dataset are available at https://github.com/SCUNLP/METER.
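For readers unfamiliar with the three levels of the causal ladder (association, intervention, counterfactual), the following is a minimal sketch of how the same question can receive three different answers at the three levels. It is not taken from the paper or the METER repository: the toy structural causal model, its variable names (`Z`, `X`, `Y`), and the noise probabilities are all hypothetical choices made for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical exogenous noise distributions (illustrative numbers only).
P_UZ = {0: Fraction(1, 2), 1: Fraction(1, 2)}  # noise that sets the confounder Z
P_UX = {0: Fraction(3, 4), 1: Fraction(1, 4)}  # noise that flips X away from Z


def solve(u_z, u_x, do_x=None):
    """Evaluate the structural equations, optionally forcing X via do(X=do_x)."""
    z = u_z
    x = (z ^ u_x) if do_x is None else do_x
    y = x | z
    return {"Z": z, "X": x, "Y": y}


def worlds():
    """Enumerate every exogenous setting together with its prior probability."""
    for u_z, u_x in product(P_UZ, P_UX):
        yield (u_z, u_x), P_UZ[u_z] * P_UX[u_x]


def prob(event, given=lambda v: True, do_x=None):
    """P(event | given) in the observational or intervened model."""
    num = den = Fraction(0)
    for (u_z, u_x), p in worlds():
        v = solve(u_z, u_x, do_x)
        if given(v):
            den += p
            if event(v):
                num += p
    return num / den


def counterfactual(evidence, do_x, event):
    """Abduction, action, prediction: keep noise consistent with the evidence,
    then re-solve the model under do(X=do_x)."""
    num = den = Fraction(0)
    for (u_z, u_x), p in worlds():
        if evidence(solve(u_z, u_x)):
            den += p
            if event(solve(u_z, u_x, do_x)):
                num += p
    return num / den


def y_is_1(v):
    return v["Y"] == 1


# Level 1, association: P(Y=1 | X=0)
association = prob(y_is_1, given=lambda v: v["X"] == 0)
# Level 2, intervention: P(Y=1 | do(X=0))
intervention = prob(y_is_1, do_x=0)
# Level 3, counterfactual: having observed X=1 and Y=1, would Y be 1 had X been 0?
counterfact = counterfactual(lambda v: v["X"] == 1 and v["Y"] == 1, 0, y_is_1)

print(association, intervention, counterfact)  # prints: 1/4 1/2 3/4
```

In this toy model the three levels yield 1/4, 1/2, and 3/4 respectively, which is why a benchmark that fixes one context and varies only the causal level can separate the three kinds of reasoning.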

Eval Frameworks & Benchmarks | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
