Feb 25, 2026 · arXiv:2602.21496

Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

Umid Suleymanov, Zaur Rajabov, Emil Mirzazada, Murat Kantarcioglu

AI Summary

The paper introduces SemSIEdit, an inference-time framework in which an agentic "Editor" iteratively critiques and rewrites sensitive spans in LLM outputs to mitigate Semantic Sensitive Information (SemSI) leakage. The authors report a Privacy-Utility Pareto Frontier: agentic rewriting reduces SemSI leakage by 34.6% at a utility loss of 9.8%. The study also identifies a Scale-Dependent Safety Divergence, in which larger models achieve safety through constructive expansion while smaller models rely on destructive truncation, and a Reasoning Paradox, in which inference-time reasoning increases both baseline risk and defense efficacy.

Key Contribution

LLMs face a Scale-Dependent Safety Divergence: larger reasoning models achieve safety by adding nuance, whereas capacity-constrained models revert to deleting text.

Abstract

While defenses for structured PII are mature, Large Language Models (LLMs) pose a new threat: Semantic Sensitive Information (SemSI), where models infer sensitive identity attributes, generate reputation-harmful content, or hallucinate potentially wrong information. The capacity of LLMs to self-regulate these complex, context-dependent sensitive information leaks without destroying utility remains an open scientific question. To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer. Our analysis reveals a Privacy-Utility Pareto Frontier, where this agentic rewriting reduces leakage by 34.6% across all three SemSI categories while incurring a marginal utility loss of 9.8%. We also uncover a Scale-Dependent Safety Divergence: large reasoning models (e.g., GPT-5) achieve safety through constructive expansion (adding nuance), whereas capacity-constrained models revert to destructive truncation (deleting text). Finally, we identify a Reasoning Paradox: while inference-time reasoning increases baseline risk by enabling the model to make deeper sensitive inferences, it simultaneously empowers the defense to execute safe rewrites.
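The critique-and-rewrite loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Span`, `edit_loop`, `toy_critic`, and `toy_rewriter` are hypothetical, and the critic and rewriter here are simple string-matching stand-ins for what the paper describes as an LLM-based Editor. The sketch only shows the control flow: flag sensitive spans, rewrite them in place rather than refusing, and repeat until nothing is flagged or a round limit is reached.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Span:
    """A flagged character range in the draft (hypothetical structure)."""
    start: int
    end: int
    reason: str


# Hypothetical interfaces: in practice both would be backed by a model call.
Critic = Callable[[str], List[Span]]
Rewriter = Callable[[str, Span], str]


def edit_loop(draft: str, critic: Critic, rewriter: Rewriter,
              max_rounds: int = 3) -> str:
    """Iteratively critique the text and rewrite flagged spans in place."""
    text = draft
    for _ in range(max_rounds):
        spans = critic(text)
        if not spans:
            break  # nothing left to fix
        # Apply rewrites right to left so earlier offsets stay valid.
        for span in sorted(spans, key=lambda s: s.start, reverse=True):
            replacement = rewriter(text, span)
            text = text[:span.start] + replacement + text[span.end:]
    return text


def toy_critic(text: str) -> List[Span]:
    """Stand-in critic: flags one hard-coded phrase (assumption)."""
    flagged = {"is definitely bankrupt": "unverified reputational claim"}
    spans = []
    for phrase, reason in flagged.items():
        idx = text.find(phrase)
        if idx != -1:
            spans.append(Span(idx, idx + len(phrase), reason))
    return spans


def toy_rewriter(text: str, span: Span) -> str:
    """Stand-in rewriter: returns a fixed hedged phrasing (assumption)."""
    return "has been the subject of unverified financial rumors"


result = edit_loop("The founder is definitely bankrupt.",
                   toy_critic, toy_rewriter)
print(result)
```

Rewriting the flagged span instead of dropping the sentence keeps the surrounding text intact, which loosely mirrors the abstract's distinction between constructive rewriting and destructive truncation.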

Constitutional AI & AI Ethics · Eval Frameworks & Benchmarks · Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
