UCSC | Mar 4, 2026 | arXiv:2603.04212

Code Fingerprints: Disentangled Attribution of LLM-Generated Code

Jiaxun Guo, Ziyuan Yang, Mengyu Sun, H. Wang, Jingfeng Lu, Yi Zhang

AI Summary

The paper introduces the problem of model-level code attribution: identifying the source LLM behind a generated code snippet. To address it, the authors propose the Disentangled Code Attribution Network (DCAN), which uses contrastive learning to separate source-agnostic semantic information from source-specific stylistic representations in code. Experiments on a new large-scale benchmark of code generated by four LLMs (DeepSeek, Claude, Qwen, and ChatGPT) across four programming languages (Python, Java, C, and Go) show that DCAN can reliably attribute code to its originating LLM.

Key Contribution

Think your LLM's code is anonymous? This paper shows it can be reliably attributed to its source model, across four LLMs and four programming languages.

Abstract

The rapid adoption of Large Language Models (LLMs) has transformed modern software development by enabling automated code generation at scale. While these systems improve productivity, they introduce new challenges for software governance, accountability, and compliance. Existing research primarily focuses on distinguishing machine-generated code from human-written code; however, many practical scenarios--such as vulnerability triage, incident investigation, and licensing audits--require identifying which LLM produced a given code snippet. In this paper, we study the problem of model-level code attribution, which aims to determine the source LLM responsible for generated code. Although attribution is challenging, differences in training data, architectures, alignment strategies, and decoding mechanisms introduce model-dependent stylistic and structural variations that serve as generative fingerprints. Leveraging this observation, we propose the Disentangled Code Attribution Network (DCAN), which separates Source-Agnostic semantic information from Source-Specific stylistic representations. Through a contrastive learning objective, DCAN isolates discriminative model-dependent signals while preserving task semantics, enabling multi-class attribution across models and programming languages. To support systematic evaluation, we construct the first large-scale benchmark dataset comprising code generated by four widely used LLMs (DeepSeek, Claude, Qwen, and ChatGPT) across four programming languages (Python, Java, C, and Go). Experimental results demonstrate that DCAN achieves reliable attribution performance across diverse settings, highlighting the feasibility of model-level provenance analysis in software engineering contexts. The dataset and implementation are publicly available at https://github.com/mtt500/DCAN.
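The abstract does not spell out DCAN's contrastive objective, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a supervised contrastive loss (in the style of SupCon) applied to hypothetical "source-specific" style embeddings, where two snippets count as a positive pair when they come from the same LLM. The function names, the temperature value, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Mean supervised contrastive loss over anchors with at least one positive.

    embeddings: list of style vectors (hypothetical source-specific features).
    labels:     source-LLM label for each vector; same label = positive pair.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        logits = {j: dot(z[i], z[j]) / temperature for j in range(n) if j != i}
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average negative log-probability of the positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Toy data: two snippets per model, labelled by (assumed) source LLM.
labels = ["deepseek", "deepseek", "claude", "claude"]
clustered = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # same-model vectors close
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]      # same-model vectors far apart

loss_clustered = supervised_contrastive_loss(clustered, labels)
loss_mixed = supervised_contrastive_loss(mixed, labels)
```

Under these assumptions, the loss is lower when embeddings of snippets from the same model sit close together, which is the property a style branch would be trained toward. How DCAN obtains the embeddings and how it keeps the source-agnostic semantic branch free of model-specific signal are not described in the abstract and are left out here.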

Code Generation & Program Synthesis | Constitutional AI & AI Ethics | Open-Source Models & Weights

Citation Metrics

Citations: 0

Influential citations: 0

References: 27

Year: 2026

Venue: N/A
