The paper introduces AICD Bench, a large-scale benchmark for AI-generated code detection, addressing limitations of existing datasets by incorporating distribution shifts, model family attribution, and fine-grained human-machine classification. The benchmark comprises 2 million examples, 77 models across 11 families, and 9 programming languages. Experiments using neural and classical detectors reveal significant performance gaps, especially under distribution shift and for hybrid/adversarial code, highlighting the need for more robust detection methods.
Detecting AI-generated code is harder than you think: even state-of-the-art detectors fail to reliably identify machine-written code, especially when faced with distribution shifts or adversarial attacks.
Large language models (LLMs) are increasingly capable of generating functional source code, raising concerns about authorship, accountability, and security. While detecting AI-generated code is critical, existing datasets and benchmarks are narrow, typically limited to binary human-machine classification under in-distribution settings. To bridge this gap, we introduce AICD Bench, the most comprehensive benchmark for AI-generated code detection. It spans 2M examples, 77 models across 11 families, and 9 programming languages, including recent reasoning models. Beyond scale, AICD Bench introduces three realistic detection tasks: (i) Robust Binary Classification under distribution shifts in language and domain, (ii) Model Family Attribution, grouping generators by architectural lineage, and (iii) Fine-Grained Human-Machine Classification across human, machine, hybrid, and adversarial code. Extensive evaluation on neural and classical detectors shows that performance remains far below practical usability, particularly under distribution shift and for hybrid or adversarial code. We release AICD Bench as a unified, challenging evaluation suite to drive the next generation of robust approaches for AI-generated code detection. The data and the code are available at https://huggingface.co/AICD-bench.
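To make the evaluation setup concrete, the sketch below scores a detector on the fine-grained four-way task (human, machine, hybrid, adversarial) with macro-averaged F1, a standard metric for such imbalanced multi-class detection tasks. The label names, toy predictions, and metric choice are illustrative assumptions, not the benchmark's actual schema or official scoring code.

```python
# Hypothetical label spaces mirroring two of the three AICD Bench tasks
# (illustrative only; the benchmark's actual label schema may differ).
BINARY = ["human", "machine"]
FINE_GRAINED = ["human", "machine", "hybrid", "adversarial"]

def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores,
    so rare classes (e.g. adversarial code) count as much as common ones."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Toy fine-grained predictions: the detector collapses hybrid and
# adversarial samples into "machine", a failure mode the abstract highlights.
y_true = ["human", "machine", "hybrid", "adversarial", "machine", "hybrid"]
y_pred = ["human", "machine", "machine", "machine", "machine", "hybrid"]
print(round(macro_f1(y_true, y_pred, FINE_GRAINED), 3))  # → 0.583
```

Note how the binary accuracy on these same samples would look acceptable (every non-human sample is flagged as machine-generated), while the macro F1 exposes the missed hybrid and adversarial classes, which is why the fine-grained task is the harder and more informative formulation.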