The paper introduces WARP, a constraint-based framework for provably repairing Transformer models against adversarial attacks by adjusting weights in layers beyond just the final one. WARP formulates repair as a convex quadratic program based on a first-order linearization of the logit gap, which keeps optimization tractable over a large parameter space. When the linearization holds, it provides per-sample guarantees on correct classification, preservation of existing behavior, and robustness. Experiments on encoder-only Transformers indicate that these guarantees hold in practice and that repair improves robustness to adversarial inputs.
Provable adversarial repair of Transformers is now possible beyond the last layer, thanks to a new framework that formulates repair as a tractable convex optimization problem.
Transformer-based NLP models remain vulnerable to adversarial perturbations, yet existing repair methods face a fundamental trade-off: gradient-based approaches offer flexibility but lack verifiability and often overfit; methods that do provide repair guarantees are restricted to the final layer or small networks, significantly limiting the parameter search space available for repair. We present WARP (Weight-Adjusted Repair with Provability), a constraint-based repair framework that extends repair beyond the last layer of Transformer models. WARP formulates repair as a convex quadratic program derived from a first-order linearization of the logit gap, enabling tractable optimization over a high-dimensional parameter space. Under the condition that the first-order approximation holds, this formulation induces three per-sample guarantees: (i) a positive margin constraint ensuring correct classification on repaired inputs, (ii) preservation constraints over a designated remain set, and (iii) a certified robustness radius derived from Lipschitz continuity. To ensure feasibility across varying model architectures, we introduce a sensitivity-based preprocessing step that conditions the optimization landscape accordingly. We further show that the iterative optimization procedure converges to solutions satisfying all repair constraints under mild assumptions. Empirical evaluation on encoder-only Transformers with varying layer architectures validates that these guarantees hold in practice while improving robustness to adversarial inputs. Our results demonstrate that guaranteed, generalizable Transformer repair is achievable through principled constraint-based optimization.
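To make the formulation concrete, the following is a minimal sketch of a repair step of this general shape, not the authors' implementation. It assumes each sample's logit gap has already been linearized around the current weights as gap + grad · delta, and it finds the smallest-norm weight update delta that pushes every linearized gap above a margin. The solver is Hildreth's dual coordinate ascent, swapped in here only because it fits in a few lines; the abstract does not say which solver WARP uses. The names `min_norm_repair` and `certified_radius`, and the toy numbers, are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def min_norm_repair(grads, gaps, margins, sweeps=200):
    """Solve: minimize 0.5 * ||delta||^2
    subject to gaps[i] + grads[i] . delta >= margins[i] for every sample i.

    grads[i] is the (assumed, precomputed) gradient of sample i's logit gap
    with respect to the weights being repaired. Repair samples and remain-set
    samples are both expressed as rows of this same constraint system.
    """
    n = len(grads[0])
    delta = [0.0] * n            # weight update, equal to sum_i lam[i] * grads[i]
    lam = [0.0] * len(grads)     # dual variables, kept non-negative
    for _ in range(sweeps):
        for i, (a, g, m) in enumerate(zip(grads, gaps, margins)):
            violation = m - (g + dot(a, delta))
            # Hildreth update: move lam[i] toward feasibility, clipped at zero.
            step = max(-lam[i], violation / dot(a, a))
            lam[i] += step
            for j in range(n):
                delta[j] += step * a[j]
    return delta


def certified_radius(gap, lipschitz_constant):
    """If the logit gap is L-Lipschitz in the input, the predicted class
    cannot change within distance gap / L of the input."""
    return max(gap, 0.0) / lipschitz_constant


# Toy problem: sample 0 is misclassified (negative gap) and must be repaired;
# sample 1 is in the remain set and must stay correctly classified.
grads = [[1.0, 0.0], [-1.0, 1.0]]
gaps = [-0.5, 0.5]
margins = [0.1, 0.1]
delta = min_norm_repair(grads, gaps, margins)
repaired_gaps = [g + dot(a, delta) for a, g in zip(grads, gaps)]
```

In this toy problem, repairing sample 0 alone would break sample 1, so the solver trades off both constraints and lands on an update close to [0.6, 0.2], with both linearized gaps at the margin. The guarantee is only as good as the linearization: the true gaps after applying `delta` to a real model can differ, which is why the abstract conditions its guarantees on the first-order approximation holding.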