This paper addresses Silent Scope Omission (SSO) in rule-following agents by introducing NormBench, a benchmark for evaluating defeasible scope parsing on Chinese, English, and cross-lingual legal and policy texts. The authors use Span-Grounded Deontic Trees (SG-DT), a structured representation that anchors each logical branch to source spans and makes explicit which clauses override others. Evaluations reveal two pathologies in current LLMs, Recursion Decay and the Auditability Trap, and show that SG-DT improves whole-tree fidelity and defeater recovery, with downstream gains concentrated in exception-active, SSO-prone cases.
LLMs struggle with complex legal rules, often silently dropping nested exceptions; a new benchmark and a span-grounded structured representation improve their handling of exception-active cases, although aggregate accuracy gains can be mixed.
Rule-following agents tasked with executing policies and regulations often fail via Silent Scope Omission (SSO): a model applies a general rule but silently drops nested exceptions or counter-exceptions, producing outputs that appear compliant yet break on important edge cases. Although such failures are often framed as an agentic-systems problem, the underlying bottleneck is statutory and policy understanding, a capability typically studied in legal NLP. However, most existing legal NLP benchmarks emphasize end-task outcomes, which can overlook the structural omissions that cause SSO. To diagnose and mitigate SSO, we introduce NormBench, a benchmark of 2,290 provisions spanning Chinese (laws and local policies), English (U.S. tax law, GDPR, and corporate policies), and cross-lingual settings, designed for defeasible scope parsing: identifying precisely which clause overrides which. NormBench uses Span-Grounded Deontic Trees (SG-DT), a compiler-style intermediate representation that anchors every logical branch to source spans and requires explicit exclusion guards, enabling deterministic compilation and audit. Evaluations of frontier LLMs reveal two recurring pathologies: (1) Recursion Decay, where performance drops sharply as defeater depth increases, and (2) an Auditability Trap, where models retrieve relevant spans but fail to assemble correct control flow. Using SG-DT as a constrained intermediate output improves whole-tree fidelity and defeater recovery, and downstream experiments show that its utility is mechanism-specific: gains concentrate on exception-active, SSO-prone cases, while aggregate accuracy can be mixed when the added structure is unnecessary or parser fidelity is low.
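To make the idea of a span-grounded tree with explicit exclusion guards concrete, the following is a minimal sketch and not the SG-DT schema used by NormBench, which the abstract does not specify. The provision text, the `Node` fields, the fact names, and the helpers `span_of`, `evaluate`, and `compile_branches` are all hypothetical, chosen only to illustrate a rule, an exception, and a counter-exception that are each tied to a span of the source text.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Set, Tuple

# Hypothetical provision, invented for illustration only.
TEXT = (
    "Employees must file expense reports within 30 days, "
    "unless they are on approved leave, "
    "except where the leave exceeds 90 days."
)

def span_of(phrase: str) -> Tuple[int, int]:
    """Locate a phrase in TEXT and return its (start, end) character offsets."""
    start = TEXT.index(phrase)
    return (start, start + len(phrase))

@dataclass
class Node:
    condition: str            # fact that activates this branch (assumed name)
    effect: str               # deontic outcome if this branch wins
    span: Tuple[int, int]     # source span that grounds the branch
    defeaters: List["Node"] = field(default_factory=list)  # clauses that override it

def evaluate(node: Node, facts: Set[str]) -> Optional[str]:
    """Return the effect of the deepest active branch, or None if the node does not apply."""
    if node.condition not in facts:
        return None
    for defeater in node.defeaters:  # sibling defeaters are checked in order
        outcome = evaluate(defeater, facts)
        if outcome is not None:
            return outcome
    return node.effect

def compile_branches(node: Node, required=(), excluded=()) -> Iterator[tuple]:
    """Flatten the tree into (required, excluded, effect, span) rows with explicit exclusions."""
    required = required + (node.condition,)
    own_excluded = excluded + tuple(d.condition for d in node.defeaters)
    yield (required, own_excluded, node.effect, node.span)
    earlier = excluded
    for d in node.defeaters:
        yield from compile_branches(d, required, earlier)
        earlier = earlier + (d.condition,)  # later siblings exclude earlier ones

counter = Node("leave_over_90_days", "OBLIGATED",
               span_of("except where the leave exceeds 90 days"))
exception = Node("on_approved_leave", "EXEMPT",
                 span_of("unless they are on approved leave"), [counter])
rule = Node("is_employee", "OBLIGATED",
            span_of("Employees must file expense reports within 30 days"), [exception])

for required, excluded, effect, (start, end) in compile_branches(rule):
    print(effect, "if", required, "and none of", excluded, "<-", repr(TEXT[start:end]))
```

In this sketch, an SSO-style error corresponds to emitting the first compiled row without `on_approved_leave` in its exclusion list: the general rule still looks correct in isolation, but an employee on approved leave would be wrongly treated as obligated.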