This chapter surveys existing detectors for AI-generated and AI-assisted essays and the guidelines for their responsible use, then empirically evaluates how well such detectors generalize across LLMs. The study trains detectors on essays generated by one LLM and tests them on essays from other LLMs, with all essays written in response to public GRE writing prompts. The results offer practical guidance on developing and retraining detectors so that they remain effective across diverse LLMs.
Detectors for AI-generated essays may perform poorly on text from an LLM other than the one they were trained on, so they should be deployed with caution.
Writing is a foundational literacy skill that underpins effective communication, fosters critical thinking, facilitates learning across disciplines, and enables individuals to organize and articulate complex ideas. Consequently, writing assessment plays a vital role in evaluating language proficiency, communicative effectiveness, and analytical reasoning. The rapid advancement of large language models (LLMs) has made it increasingly easy to generate coherent, high-quality essays, raising significant concerns about the authenticity of student-submitted work. This chapter first provides an overview of the current landscape of detectors for AI-generated and AI-assisted essays, along with guidelines for their responsible use. It then presents empirical analyses to evaluate how well detectors trained on essays from one LLM generalize to identifying essays produced by other LLMs, based on essays generated in response to public GRE writing prompts. These findings provide guidance for developing and retraining detectors for practical applications.
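The cross-LLM evaluation described above (train a detector on essays from one LLM, then test it on essays from every LLM) can be sketched as a small train/test matrix. This is only an illustrative sketch, not the chapter's method: the marker-word detector, the function names, the generator labels `llm_a` and `llm_b`, and the toy essays are all hypothetical, and the toy data is contrived so that the accuracies say nothing about real detectors.

```python
def train_detector(human_texts, ai_texts):
    """Toy detector (hypothetical): keep words seen in AI essays but never in human ones."""
    human_vocab = {w for t in human_texts for w in t.lower().split()}
    ai_vocab = {w for t in ai_texts for w in t.lower().split()}
    return ai_vocab - human_vocab


def predict_ai(marker_words, text):
    """Flag a text as AI-generated if it contains any learned marker word."""
    return any(w in marker_words for w in text.lower().split())


def accuracy(marker_words, human_texts, ai_texts):
    """Fraction of human and AI test essays that are labeled correctly."""
    correct = sum(not predict_ai(marker_words, t) for t in human_texts)
    correct += sum(predict_ai(marker_words, t) for t in ai_texts)
    return correct / (len(human_texts) + len(ai_texts))


def cross_llm_matrix(human_train, human_test, ai_train, ai_test):
    """Train one detector per source LLM and score it on essays from every LLM.

    ai_train and ai_test map a generator name to a list of essays.
    Returns {(train_llm, test_llm): accuracy}.
    """
    results = {}
    for train_name, train_essays in ai_train.items():
        detector = train_detector(human_train, train_essays)
        for test_name, test_essays in ai_test.items():
            results[(train_name, test_name)] = accuracy(
                detector, human_test, test_essays
            )
    return results


# Contrived toy essays (assumption): each hypothetical LLM has its own stock phrases.
human_train = ["i think schools should teach cooking", "my view is that art matters"]
human_test = ["i believe sports should stay in schools"]
ai_train = {
    "llm_a": ["moreover education fosters growth", "moreover art fosters empathy"],
    "llm_b": ["ultimately schools cultivate curiosity", "ultimately art cultivates insight"],
}
ai_test = {
    "llm_a": ["moreover sports fosters teamwork"],
    "llm_b": ["ultimately sports cultivate discipline"],
}

results = cross_llm_matrix(human_train, human_test, ai_train, ai_test)
for (train_name, test_name), acc in sorted(results.items()):
    print(f"train={train_name} test={test_name} accuracy={acc:.2f}")
```

The diagonal entries of the matrix (same LLM for training and testing) measure in-distribution performance, while the off-diagonal entries measure cross-LLM generalization. A real study would replace the toy detector with a trained classifier and use held-out essays for each generator.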