This paper investigates the safety of retrieval-augmented language models (RAG systems) in multi-turn interactions and identifies a "monitoring-control gap": models acknowledge contradictory evidence but fail to incorporate it into their final recommendations. The authors evaluated four model families (1.5B-32B parameters) in over 50,000 turn-level evaluations using a multi-turn document accumulation protocol. The key finding is that single-turn robustness metrics overestimate RAG safety, because acknowledging a contradiction is uncorrelated with resolving it safely, which is a critical vulnerability for RAG systems in high-stakes applications.
RAG systems can *know* the evidence contradicts their actions, yet still fail to act safely, revealing a dangerous monitoring-control gap that current evaluations miss.
Retrieval-augmented LLMs are deployed for tasks where evidence quality determines action safety, yet evaluation protocols assume that single-turn robustness predicts robustness when evidence accumulates across turns. We show this assumption is fundamentally incorrect. Models exhibit a monitoring-control gap: they readily acknowledge contradictory evidence, yet this awareness fails to constrain their final recommendations - detecting epistemic conflict does not imply resolving it safely. Through a multi-turn document accumulation protocol across four model families (1.5B-32B parameters) and over 50,000 turn-level evaluations, we demonstrate that single-turn diagnostics systematically overestimate RAG safety, that contradiction acknowledgement is uncorrelated with safe resolution, a pattern corroborated by targeted human validation, and that no universal prompt fix exists. Converging mechanism evidence - hidden-state probing, attention analysis, and response-strategy taxonomy - points to action selection as the most plausible locus of the deficit: danger-relevant information is internally represented and receives enhanced attention during unsafe generation, yet fails to constrain output behavior. The gap between what models recognize and what they do must be measured and closed before retrieval-augmented systems can be trusted in high-stakes settings.
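To make the distinction between acknowledging a contradiction and resolving it safely concrete, the following is a minimal sketch of how such a gap could be tallied from turn-level records. It is an illustration under stated assumptions, not the authors' protocol or metric: the `TurnRecord` schema, its field names, and the `monitoring_control_gap` function are all hypothetical, and the paper may define its quantities differently.

```python
from dataclasses import dataclass


@dataclass
class TurnRecord:
    """One turn-level evaluation. Hypothetical schema, not taken from the paper."""
    turn: int                     # 1-based turn index within a conversation
    contradiction_present: bool   # accumulated documents contradict the action
    acknowledged: bool            # response explicitly notes the contradiction
    safe_action: bool             # final recommendation was judged safe


def monitoring_control_gap(records):
    """Return (ack_rate, safe_given_ack, gap) over turns with a contradiction.

    ack_rate:       share of contradictory turns where the conflict is acknowledged
    safe_given_ack: share of acknowledged turns that end in a safe recommendation
    gap:            share of contradictory turns that are acknowledged but unsafe
    """
    relevant = [r for r in records if r.contradiction_present]
    if not relevant:
        return 0.0, 0.0, 0.0

    acked = [r for r in relevant if r.acknowledged]
    acked_safe = sum(1 for r in acked if r.safe_action)

    ack_rate = len(acked) / len(relevant)
    safe_given_ack = acked_safe / len(acked) if acked else 0.0
    gap = (len(acked) - acked_safe) / len(relevant)
    return ack_rate, safe_given_ack, gap
```

Under this assumed definition, a model with a high `ack_rate` and a large `gap` matches the pattern the abstract describes: it detects the conflict but its final recommendation does not change accordingly. Grouping the records by `turn` before calling the function would show how the quantities move as documents accumulate.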