This paper evaluates AI text detection methods, including TF-IDF logistic regression, BiLSTM, and DistilBERT, on the HC3 and DAIGT v2 datasets, using a topic-based data split to prevent information leakage. While TF-IDF logistic regression provides a baseline accuracy of 82.87%, the deep learning models perform better: DistilBERT reaches 88.11% accuracy and the highest ROC-AUC of 0.96. The results emphasize the advantage of contextual semantic modeling over lexical features and the need for robust evaluation protocols that rule out topic memorization.
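The topic-based split described above can be sketched with scikit-learn's `GroupShuffleSplit`, which keeps every document of a given topic on a single side of the split. The toy documents and topic tags below are illustrative, not drawn from HC3 or DAIGT v2.

```python
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy corpus: each document carries a topic tag.
texts  = ["essay on volcanoes", "essay on trade", "essay on volcanoes 2",
          "essay on music", "essay on trade 2", "essay on music 2"]
labels = [0, 1, 0, 1, 0, 1]          # 0 = human-written, 1 = AI-generated
topics = ["volcanoes", "trade", "volcanoes", "music", "trade", "music"]

# Grouping by topic means no topic appears in both train and test,
# so the classifier cannot score well by memorizing topic vocabulary.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=0)
train_idx, test_idx = next(splitter.split(texts, labels, groups=topics))

train_topics = {topics[i] for i in train_idx}
test_topics  = {topics[i] for i in test_idx}
assert train_topics.isdisjoint(test_topics)  # no topic leakage
```

A plain random split would instead scatter documents from the same topic across both sets, letting a lexical model exploit shared topic words rather than genuine human-versus-AI signal.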
DistilBERT can detect AI-generated text with 88% accuracy and 0.96 ROC-AUC, outperforming traditional methods and highlighting the importance of contextual understanding.
The rapid development of large language models has led to a surge in AI-generated text, with students increasingly submitting LLM-generated content as their own work in violation of academic integrity. This paper presents an evaluation of AI text detection methods spanning both traditional machine learning models and transformer-based architectures. We combine two datasets, HC3 and DAIGT v2, into a unified benchmark and apply a topic-based data split to prevent information leakage, ensuring robust generalization to unseen domains. Our experiments show that TF-IDF logistic regression achieves a reasonable baseline accuracy of 82.87%, but the deep learning models outperform it: the BiLSTM classifier reaches 88.86% accuracy, while DistilBERT achieves a comparable 88.11% accuracy with the highest ROC-AUC score of 0.96, demonstrating the strongest overall performance. These results indicate that contextual semantic modeling is substantially superior to lexical features and underscore the importance of evaluation protocols that mitigate topic memorization. The main limitations of this work are dataset diversity and computational constraints. In future work, we plan to broaden dataset diversity and adopt parameter-efficient fine-tuning methods such as LoRA, along with smaller or distilled models, more efficient batching strategies, and hardware-aware optimization.
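The lexical baseline reported above can be sketched as a TF-IDF vectorizer feeding a logistic regression classifier. The tiny corpus below is a made-up stand-in for the benchmark data, so this only illustrates the pipeline shape, not the reported 82.87% accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for HC3 / DAIGT v2 documents (hypothetical examples).
train_texts = [
    "honestly i found the essay prompt kind of confusing",
    "As an AI language model, I can provide a structured summary.",
    "we argued about this in class for like an hour",
    "Certainly! Here is a comprehensive overview of the topic.",
]
train_labels = [0, 1, 0, 1]   # 0 = human-written, 1 = AI-generated

# TF-IDF over word uni- and bigrams into a linear classifier:
# purely lexical features, which is why this baseline trails the
# contextual BiLSTM and DistilBERT models in the paper.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
baseline.fit(train_texts, train_labels)

# Probability that a new document is AI-generated.
prob_ai = baseline.predict_proba(
    ["Certainly! Here is a brief overview."]
)[0, 1]
```

Under a topic-based split, this pipeline would be fit only on training-topic documents and scored on held-out topics, which is where its reliance on surface vocabulary becomes a measurable weakness.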