This paper explores lightweight feature-based methods for detecting LLM-generated code in SemEval-2026 Task 13, focusing on binary classification (Subtask A). The approach extracts stylometric signals using ratio-based features, parsing engines, a programming-language classifier, and a code-vs-text line classifier. A shallow decision tree combined with heuristic rules achieves competitive performance with near-instant inference, offering a computationally efficient alternative to large pretrained models.
Forget the heavy transformers: lightweight stylometric features and a shallow decision tree can detect LLM-generated code competitively, with near-instant inference.
SemEval-2026 Task 13 investigates machine-generated code detection across multiple programming languages and application scenarios, asking participating systems to generalize to unseen languages and domains. This paper describes our participation in Subtask A (binary classification) and explores both pretrained code encoders and lightweight feature-based methods. We design ratio-based features that are less sensitive to snippet length. To extract descriptiveness-related signals, we use parsing engines and a programming-language classifier. Additionally, we train a separate code-vs-text line classifier to identify raw natural-language segments embedded within samples. We combine a shallow decision tree with heuristic rules derived from data analysis to produce the final predictions. Our approach is computationally efficient, requires only CPU resources for training, and achieves near-instant inference, offering a lightweight alternative to large pretrained models.
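To illustrate why ratio-based features are less sensitive to snippet length, here is a minimal sketch. It is not the authors' feature set: the three ratios, the comment prefixes (`#` and `//`), and the names `RatioFeatures`, `extract_ratio_features`, and `predict` are assumptions made for illustration. The `predict` helper shows one plausible way to combine heuristic rules with a fallback classifier, such as a shallow decision tree. The abstract does not say how the two are combined, so the rule-first ordering is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class RatioFeatures:
    """Hypothetical length-normalized features (not taken from the paper)."""
    comment_line_ratio: float
    blank_line_ratio: float
    whitespace_char_ratio: float


def extract_ratio_features(snippet: str) -> RatioFeatures:
    """Compute counts divided by totals, so scale depends little on length."""
    lines = snippet.splitlines()
    n_lines = max(len(lines), 1)    # guard against empty input
    n_chars = max(len(snippet), 1)
    stripped = [line.strip() for line in lines]
    # Assumption: only "#" and "//" line comments are recognized here.
    comment_lines = sum(1 for s in stripped if s.startswith(("#", "//")))
    blank_lines = sum(1 for s in stripped if not s)
    whitespace_chars = sum(1 for ch in snippet if ch.isspace())
    return RatioFeatures(
        comment_line_ratio=comment_lines / n_lines,
        blank_line_ratio=blank_lines / n_lines,
        whitespace_char_ratio=whitespace_chars / n_chars,
    )


# A rule returns a label when it fires, or None to abstain.
Rule = Callable[[RatioFeatures], Optional[int]]


def predict(
    features: RatioFeatures,
    rules: Sequence[Rule],
    fallback: Callable[[RatioFeatures], int],
) -> int:
    """Assumed combination: the first rule that fires wins, else use fallback."""
    for rule in rules:
        label = rule(features)
        if label is not None:
            return label
    return fallback(features)


if __name__ == "__main__":
    snippet = "# add two numbers\ndef add(a, b):\n\n    return a + b\n"
    short = extract_ratio_features(snippet)
    long = extract_ratio_features(snippet * 3)  # same style, three times longer
    print(short)
    print(long)
```

Repeating the snippet three times triples every raw count but leaves each ratio unchanged, which is the property that makes such features comparable across short and long samples. In practice, `fallback` could wrap a trained tree (for example, one with a small maximum depth), and any rule thresholds would need to come from data analysis rather than from this sketch.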