WashU | Feb 16, 2026 | arXiv:2602.15143

Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

AI Summary

The paper explores methods to protect LLMs from unauthorized knowledge distillation by rewriting teacher-generated reasoning traces, both to degrade their usefulness as training data (anti-distillation) and to embed verifiable signatures in student models (API watermarking). The authors introduce LLM-based and gradient-based techniques for dynamically rewriting reasoning outputs while preserving answer correctness and semantic coherence. Experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect and enables reliable watermark detection, while maintaining or even improving teacher performance.

Key Contribution

Deter unauthorized LLM distillation by rewriting reasoning traces to degrade student model training and embed verifiable watermarks, without sacrificing teacher performance.

Abstract

Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) anti-distillation, or degrading the training usefulness of query responses, and (2) API watermarking, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. Furthermore, we show that our rewriting approach also enables highly reliable watermark detection with essentially no false alarms.
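The instruction-based rewriting idea in the abstract can be pictured as a thin wrapper around the teacher's response. The sketch below is illustrative only and is not the paper's implementation: the prompt text, the function names `rewrite_response` and `toy_rewriter`, and the substring check used as a stand-in for "answer correctness" are all assumptions. It shows only the answer-preservation guard; it does not attempt the anti-distillation or watermarking objectives themselves.

```python
from typing import Callable, Tuple

# Hypothetical instruction; the paper's actual prompt is not given in the abstract.
REWRITE_INSTRUCTION = (
    "Rewrite the reasoning below so that it stays coherent and reaches the "
    "same final answer. Do not change the final answer."
)


def rewrite_response(
    reasoning: str,
    answer: str,
    rewriter: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (reasoning, answer), using the rewritten reasoning only when safe.

    `rewriter` is any callable mapping a prompt to text; in practice it would
    wrap an LLM call, which is assumed here rather than implemented.
    """
    prompt = (
        f"{REWRITE_INSTRUCTION}\n\n"
        f"Reasoning:\n{reasoning}\n\n"
        f"Final answer: {answer}"
    )
    candidate = rewriter(prompt).strip()
    # Crude proxy for answer preservation (an assumption of this sketch):
    # fall back to the original trace if the rewrite is empty or no longer
    # mentions the final answer.
    if not candidate or answer not in candidate:
        return reasoning, answer
    return candidate, answer


def toy_rewriter(prompt: str) -> str:
    """Stand-in for an LLM call, returning a fixed rewritten trace."""
    return "After checking both terms, the result is 4."


rewritten, final = rewrite_response("Add 2 and 2 to get 4.", "4", toy_rewriter)
```

The fallback branch reflects the constraint stated in the abstract that rewriting must preserve answer correctness: when the check fails, the caller still receives the teacher's original output.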

Inference & Quantization | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
