This paper addresses the problem of segmenting human-LLM co-authored text by adapting change point detection techniques from time-series analysis. The authors introduce a weighted algorithm and a generalized algorithm to handle detection scores whose variability differs across the text. The approach achieves strong empirical performance in localizing human- and LLM-authored segments, outperforming existing binary classification methods.
Pinpointing exactly where humans end and LLMs begin in co-authored text is now possible, thanks to a clever adaptation of time-series change point detection.
The rise of large language models (LLMs) has created an urgent need to distinguish between human-written and LLM-generated text to ensure authenticity and societal trust. Existing detectors typically provide a binary classification for an entire passage; however, this is insufficient for human-LLM co-authored text, where the objective is to localize specific segments authored by humans or LLMs. To bridge this gap, we propose algorithms to segment text into human- and LLM-authored pieces. Our key observation is that such a segmentation task is conceptually similar to classical change point detection in time-series analysis. Leveraging this analogy, we adapt change point detection to LLM-generated text detection, develop a weighted algorithm and a generalized algorithm to accommodate heterogeneous detection score variability, and establish the minimax optimality of our procedure. Empirically, we demonstrate the strong performance of our approach against a wide range of existing baselines.
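To make the change point analogy concrete, here is a minimal sketch of a textbook single change point estimator applied to per-sentence detection scores. It is not the paper's algorithm: the function name `weighted_change_point`, the idea of one score per sentence, and the use of per-sentence weights (for example, inverse score variances) are all assumptions made for illustration. The estimator picks the split that minimizes the weighted within-segment squared error.

```python
from typing import Optional, Sequence


def weighted_change_point(scores: Sequence[float],
                          weights: Optional[Sequence[float]] = None) -> int:
    """Return the split index t (1 <= t < n) that minimizes the weighted
    squared error of scores[:t] and scores[t:] around their weighted means.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    if weights is None:
        weights = [1.0] * n  # unweighted case: every score is equally reliable
    if len(weights) != n:
        raise ValueError("scores and weights must have the same length")

    def cost(lo: int, hi: int) -> float:
        # Weighted squared error of scores[lo:hi] around its weighted mean.
        w_sum = sum(weights[lo:hi])
        mean = sum(w * x for w, x in zip(weights[lo:hi], scores[lo:hi])) / w_sum
        return sum(w * (x - mean) ** 2
                   for w, x in zip(weights[lo:hi], scores[lo:hi]))

    # Exhaustive search over all splits; O(n^2) but fine for a short passage.
    return min(range(1, n), key=lambda t: cost(0, t) + cost(t, n))


if __name__ == "__main__":
    # Assumed convention: low scores look human-written, high scores LLM-written.
    scores = [0.1, 0.2, 0.15, 0.9, 0.85, 0.95]
    print(weighted_change_point(scores))
```

Down-weighting a score that is believed to be unreliable can move the estimated boundary, which is the intuition behind accounting for heterogeneous score variability. Segmenting a passage with several authorship switches would need a multiple change point procedure, which this sketch does not attempt.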