Mar 3, 2026 · arXiv:2603.02675

From Shallow to Deep: Pinning Semantic Intent via Causal GRPO

Shuyi Zhou, Zeen Song, Wenwen Qiang, Jiyan Sun, Yao Zhou, Yinlong Liu, Wei Ma

AI Summary

The paper addresses the vulnerability of Large Language Models (LLMs) to adversarial prefix attacks, which the authors attribute to "Shallow Safety Alignment" caused by semantic representation decay: the model's internal signal of malicious intent fades as it generates a compliant prefix. The authors introduce Two-Stage Causal-GRPO (TSC-GRPO), a framework that first trains a causal intent probe to disentangle invariant intent from stylistic perturbations, and then internalizes this causal awareness into the policy using Group Relative Policy Optimization (GRPO). By applying a cumulative causal penalty, TSC-GRPO trains the model to learn that accumulating harmful tokens monotonically decreases reward. The authors report improved robustness against jailbreak attacks while preserving general utility.

Key Contribution

The paper argues that LLMs' vulnerability to adversarial prefixes reflects more than a shortage of safety training data: the underlying problem is "semantic representation decay", which the authors propose to counter with a causal approach.

Abstract

Large Language Models remain vulnerable to adversarial prefix attacks (e.g., "Sure, here is") despite robust standard safety. We diagnose this vulnerability as Shallow Safety Alignment, stemming from a pathology we term semantic representation decay: as the model generates compliant prefixes, its internal malicious intent signal fades. To address this, we propose Two-Stage Causal-GRPO (TSC-GRPO), a framework designed to achieve intent pinning. First, grounded in causal identifiability theory, we train a causal intent probe to disentangle invariant intent from stylistic perturbations. Second, we internalize this causal awareness into the policy via Group Relative Policy Optimization. By employing a cumulative causal penalty within "fork-in-the-road" training scenarios, we force the model to learn that accumulating harmful tokens monotonically decreases reward, enabling robust late-stage refusals. Experiments show that TSC-GRPO significantly outperforms baselines in defending against jailbreak attacks while preserving general utility.
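
The abstract does not give the exact form of the cumulative causal penalty, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each generated token already has a harm score (for instance, from an intent probe), subtracts the running sum of those scores from a base reward, and then normalizes rewards within a group of completions in the way GRPO computes group-relative advantages. All function names, the `weight` parameter, the clamping of negative scores, and the example scores are hypothetical.

```python
from statistics import mean, pstdev


def cumulative_causal_penalty(harm_scores, weight=1.0):
    """Return the running penalty after each token.

    Assumption: harm_scores are per-token scores from some intent probe.
    Negative scores are clamped to zero so the running penalty never
    decreases as more tokens are generated.
    """
    running = 0.0
    penalties = []
    for score in harm_scores:
        running += weight * max(score, 0.0)
        penalties.append(running)
    return penalties


def sequence_reward(base_reward, harm_scores, weight=1.0):
    """Base reward minus the total accumulated penalty (hypothetical form)."""
    penalties = cumulative_causal_penalty(harm_scores, weight)
    return base_reward - (penalties[-1] if penalties else 0.0)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three made-up completions sampled from the same harmful prefix:
# an early refusal, a late refusal, and full compliance.
group_harm_scores = [
    [0.1, 0.0],
    [0.6, 0.5, 0.0],
    [0.6, 0.5, 0.7, 0.8],
]
rewards = [sequence_reward(1.0, scores) for scores in group_harm_scores]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, the early refusal receives the highest advantage and full compliance the lowest, while the late refusal still scores better than continuing, which is the ordering the "fork-in-the-road" description suggests.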

Constitutional AI & AI Ethics · Red-Teaming & Adversarial Robustness · RLHF & Preference Learning

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
