The paper introduces TraceGuard, a process-guided security framework designed to defend against reasoning backdoors in large language models (LLMs) by focusing on the Chain-of-Thought (CoT) reasoning trace. TraceGuard employs automated forensic synthesis to identify logical fractures, step-aware supervised fine-tuning to instill structural verification, and verifier-guided reinforcement learning to prevent lexical overfitting. Experiments show that TraceGuard enables smaller models (4B parameters) to achieve forensic precision comparable to much larger models, demonstrating robustness against unseen and adaptive attacks.
A 4B-parameter model, fortified with TraceGuard, can detect reasoning backdoors as effectively as models 100x larger, even against unseen and adaptive attacks.
The deployment of Large Reasoning Models (LRMs) in high-stakes decision-making pipelines has introduced a novel and opaque attack surface: reasoning backdoors. In these attacks, the model's intermediate Chain-of-Thought (CoT) is manipulated to provide a linguistically plausible but logically fallacious justification for a malicious conclusion. While frontier models exhibit an intrinsic capacity to detect these fractures, compact, deployable models suffer from a fundamental verification gap, relying on fragile lexical heuristics that are easily bypassed by motivated adversaries. To bridge this gap, we propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls. Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy through three synergistic phases: (1) Automated Forensic Synthesis, which generates contrastive reasoning pairs to isolate the specific logical point of fracture; (2) Step-Aware Supervised Fine-Tuning (SSFT), which instills a structural verification grammar; and (3) Verifier-Guided Reinforcement Learning (VGRL), which uses Group Relative Policy Optimization. We identify and mitigate a critical failure mode of baseline alignment, lexical overfitting, whereby verifiers memorize adversarial triggers rather than auditing logical integrity. Our empirical evaluation demonstrates that TraceGuard acts as a security force multiplier: a 4B-parameter verifier achieves forensic precision on unseen attacks (including latent backdoors and post-hoc rationalizations) that rivals architectures two orders of magnitude larger. We further demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive for the Trusted Computing Base.
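The abstract names Group Relative Policy Optimization as the optimizer for the VGRL phase but does not spell out the update. As a rough illustration only, the sketch below shows the group-relative advantage that the general technique is known for: several outputs are sampled for the same input, and each reward is normalized against the mean and standard deviation of its own group. The function name, the reward values, the `eps` constant, and the use of population standard deviation are all assumptions made for illustration. They are not taken from the paper, and TraceGuard's actual reward design is not described in this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for several outputs sampled for one input.
    eps: small constant (an assumption) that avoids division by zero
         when every output in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled audits of one reasoning trace:
# 1.0 = correct verdict, 0.0 = incorrect verdict (made-up values).
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

In this sketch, outputs that beat their group's average get a positive advantage and the rest get a negative one, so no separate value network is needed to provide a baseline. When every output in a group earns the same reward, all advantages are zero and that group contributes no learning signal.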