Jun 16, 2026

arXiv:2606.18430

Signature filtering: a lightweight enhancement for statistical watermark detection in large language models

AI Summary

This paper introduces signature filtering, a detection-time module that improves statistical watermark detection in large language models (LLMs) without altering watermark embedding or text generation. The method learns a small set of "signature" tokens whose presence makes watermark tests unreliable and removes them before detection. In weak-signal and low-entropy settings, 2- and 3-gram signatures raise detection rates from 8-31% without filtering to 78-99% with filtering across four watermark families, four benchmark corpora, and six LLMs. False positives remain controllable and often negligible, and under scrambling and token-perturbation stress tests the 2-gram filters for Kgw-style watermarks often match or outperform the WinMax detector.

Key Contribution

Signature filtering raises watermark detection rates in weak-signal and low-entropy settings from 8-31% to 78-99% by removing a small learned set of signature tokens before detection, without changing how the watermark is embedded.

Abstract

Statistical watermarks help organizations attribute large language model (LLM) outputs, yet existing detectors often struggle when watermark signals are weak, texts are repetitive, or watermarked texts are edited. We propose signature filtering, a detection-time module that enhances watermark detection without modifying watermark embedding or text generation. It learns a small set of "signature" tokens whose presence makes watermark tests unreliable, and removes these tokens before detection. The signatures are obtained by solving a mixed-integer linear program on a small training set, with constraints that maximize the true positive rate. We additionally derive finite-sample and asymptotic bounds under several attacker models (color-blind, color-adaptive, and distributionally correlated). On four well-known watermark families (Kgw, Sweet, Unigram, Exp), four benchmark corpora (C4, MBPP, HumanEval, Code-Search-Net), and six LLMs (Opt-1.3b, Opt-6.7b, Llama2-13b, Llama3.1-8b, Qwen2.5-14b, Phi-3-medium-14b), 2- and 3-gram signatures raise detection rates in weak-signal and low-entropy settings from 8-31% without filtering to 78-99% with filtering, while keeping false positives controllable and often negligible. In stress tests where we scramble sentences and perturb 25-50% of tokens by dilution, deletion, and substitution, 2-gram filters for Kgw-style watermarks preserve most of the clean-text detection gains, often matching or outperforming the advanced WinMax watermark detector. Signature filtering thus provides a simple, scalable, and model-agnostic add-on to strengthen watermark-based provenance checks for LLM text in information processing workflows.
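To make the filter-then-detect idea concrete, the following is a minimal sketch of one possible reading of the abstract, not the authors' implementation. It assumes a Kgw-style detector that counts "green" tokens and applies a one-proportion z-test, and it treats "removing" signature tokens as excluding them from the test statistic. The names `signature_mask`, `is_green`, `green_counts`, `z_score`, and `detect`, the SHA-256 green-list stand-in, the green fraction `gamma=0.5`, and the threshold `4.0` are all hypothetical choices for illustration. The signature set is taken as given here; the mixed-integer linear program that learns it is not sketched.

```python
import hashlib
import math


def signature_mask(tokens, signatures, n=2):
    """Return booleans; True marks a token covered by a signature n-gram."""
    drop = [False] * len(tokens)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in signatures:
            for j in range(i, i + n):
                drop[j] = True
    return drop


def is_green(prev_token, token, gamma=0.5):
    # Assumption: a toy stand-in for a keyed green-list lookup. A real
    # Kgw-style scheme seeds a PRNG from the previous token id and a secret key.
    digest = hashlib.sha256(f"{prev_token}|{token}".encode("utf-8")).digest()
    return digest[0] / 256.0 < gamma


def green_counts(tokens, drop, gamma=0.5):
    """Count (green, scored) over tokens that survive filtering."""
    green = 0
    scored = 0
    for i in range(1, len(tokens)):
        if drop[i]:
            continue  # signature tokens are left out of the statistic
        scored += 1
        green += is_green(tokens[i - 1], tokens[i], gamma)
    return green, scored


def z_score(green, scored, gamma=0.5):
    """One-proportion z-statistic for the green-token count."""
    if scored == 0:
        return 0.0
    return (green - gamma * scored) / math.sqrt(scored * gamma * (1.0 - gamma))


def detect(tokens, signatures, n=2, gamma=0.5, threshold=4.0):
    """Flag text as watermarked if the filtered z-score reaches the threshold."""
    drop = signature_mask(tokens, signatures, n)
    green, scored = green_counts(tokens, drop, gamma)
    return z_score(green, scored, gamma) >= threshold
```

In this sketch the green test for a surviving token still uses its original predecessor, so filtering shrinks the sample without changing the hash context of the remaining tokens. Whether the paper scores the filtered sequence this way or re-hashes it after removal is not stated in the abstract.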

Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
