UCSD | Apr 1, 2026 | arXiv:2604.00938

WARP: Guaranteed Inner-Layer Repair of NLP Transformers

Hsin-Ling Hsu, Minyu Chen, Nai-Chia Chen, Yanru Chen, Yi-Ling Chang, Fang Yu

AI Summary

The paper introduces WARP, a constraint-based framework for repairing Transformer models against adversarial perturbations by adjusting weights beyond the final layer. WARP formulates repair as a convex quadratic program derived from a first-order linearization of the logit gap, which keeps optimization tractable over a high-dimensional parameter space. Provided the first-order approximation holds, the formulation gives three per-sample guarantees: correct classification of repaired inputs, preservation of behavior on a designated remain set, and a certified robustness radius. Experiments on encoder-only Transformers indicate that these guarantees hold in practice while robustness to adversarial inputs improves.

Key Contribution

Provable adversarial repair of Transformers is now possible beyond the last layer, thanks to a new framework that formulates repair as a tractable convex optimization problem.

Abstract

Transformer-based NLP models remain vulnerable to adversarial perturbations, yet existing repair methods face a fundamental trade-off: gradient-based approaches offer flexibility but lack verifiability and often overfit; methods that do provide repair guarantees are restricted to the final layer or small networks, significantly limiting the parameter search space available for repair. We present WARP (Weight-Adjusted Repair with Provability), a constraint-based repair framework that extends repair beyond the last layer of Transformer models. WARP formulates repair as a convex quadratic program derived from a first-order linearization of the logit gap, enabling tractable optimization over a high-dimensional parameter space. Under the condition that the first-order approximation holds, this formulation induces three per-sample guarantees: (i) a positive margin constraint ensuring correct classification on repaired inputs, (ii) preservation constraints over a designated remain set, and (iii) a certified robustness radius derived from Lipschitz continuity. To ensure feasibility across varying model architectures, we introduce a sensitivity-based preprocessing step that conditions the optimization landscape accordingly. We further show that the iterative optimization procedure converges to solutions satisfying all repair constraints under mild assumptions. Empirical evaluation on encoder-only Transformers with varying layer architectures validates that these guarantees hold in practice while improving robustness to adversarial inputs. Our results demonstrate that guaranteed, generalizable Transformer repair is achievable through principled constraint-based optimization.
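To make the abstract's formulation concrete, here is a minimal sketch of the two ingredients it names: a minimum-norm weight update under a linearized margin constraint, and a radius derived from a Lipschitz bound. It is not the paper's implementation. It assumes a single repair constraint, for which the quadratic program has a closed-form solution, and it omits the remain-set preservation constraints entirely. The function names, the flat gradient vector, and the Lipschitz constant are all hypothetical.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def min_norm_repair_step(gap: float, grad: List[float], margin: float) -> List[float]:
    """Solve  min ||delta||^2  s.t.  gap + grad . delta >= margin.

    `gap` is the current logit gap (correct-class logit minus the runner-up),
    `grad` is its gradient with respect to the weights being repaired, and
    `margin` is the required positive margin. All three are illustrative
    inputs (assumptions), not quantities taken from the paper.
    """
    shortfall = margin - gap
    if shortfall <= 0:
        # The constraint already holds, so the cheapest update is no update.
        return [0.0] * len(grad)
    grad_norm_sq = dot(grad, grad)
    if grad_norm_sq == 0:
        raise ValueError("zero gradient: the linearized constraint is infeasible")
    # Projection onto the half-space: move along grad just far enough.
    scale = shortfall / grad_norm_sq
    return [scale * g for g in grad]


def certified_radius(gap: float, lipschitz: float) -> float:
    """Radius within which the sign of an L-Lipschitz gap cannot flip.

    If the gap is L-Lipschitz in the input and equals `gap` > 0 at x0, it stays
    positive for every x with ||x - x0|| < gap / L.
    """
    if lipschitz <= 0:
        raise ValueError("the Lipschitz constant must be positive")
    return max(gap, 0.0) / lipschitz


if __name__ == "__main__":
    gap, grad, margin = -1.0, [3.0, 4.0], 0.5
    delta = min_norm_repair_step(gap, grad, margin)
    linearized_gap = gap + dot(grad, delta)
    print("update:", delta)
    print("linearized gap after update:", linearized_gap)
    print("radius for L = 2:", certified_radius(margin, 2.0))
```

With several repair and preservation constraints there is no closed form and a general quadratic programming solver would be needed. The margin guarantee transfers to the actual model only to the extent that the first-order approximation holds, which is the condition the abstract states.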

Architecture Design (Transformers, SSMs, MoE) | Natural Language Processing | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
