USC · Apr 20, 2026 · arXiv:2604.17814

Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective

Meifang Chen, Mei-Yen Chen, Nianchen Huang, Huang Nianchen, Yizhan Huang, Zihan Li, Michael R. Lyu

AI Summary

This paper investigates secret leakage in Code LLMs, revealing a "gibberish bias" stemming from Byte-Pair Encoding (BPE) tokenization. The authors find that secrets with high character-level entropy but low token-level entropy are disproportionately memorized, and they attribute this to a token distribution shift between the training data and the secret data. The study highlights the vulnerability of current tokenization methods and suggests potential mitigation strategies.

Key Contribution

Code LLMs are surprisingly good at memorizing gibberish secrets, thanks to quirks in how they tokenize code.

Abstract

Code secrets are sensitive assets for software developers, and their leakage poses significant cybersecurity risks. Despite the rapid development of AI code assistants powered by Code Large Language Models (CLLMs), CLLMs have been shown to inadvertently leak such secrets due to a notorious memorization phenomenon. This study first reveals that Byte-Pair Encoding (BPE) tokenization leads to unexpected secret memorization behavior, which we term gibberish bias. Specifically, we find that some secrets are among the easiest for CLLMs to memorize. These secrets have high character-level entropy but low token-level entropy. We then support this claim of bias with numerical data, and we identify the root of the bias as the token distribution shift between the CLLM training data and the secret data. We further discuss how gibberish bias manifests under the "larger vocabulary" trend. Finally, we discuss potential mitigation strategies and the broader implications for current tokenizer design.
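The contrast between character-level and token-level entropy can be illustrated with a small sketch. The abstract does not give the paper's exact definitions, so the code below assumes one simple reading: the empirical Shannon entropy (bits per symbol) of a single string, computed once over its characters and once over its tokens. The tokenizer is a toy longest-match stand-in rather than real BPE, and the vocabulary and secret are hypothetical values chosen only for illustration.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Empirical Shannon entropy, in bits per symbol, of a sequence."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def greedy_tokenize(text, vocab):
    """Toy longest-match tokenizer (an assumption; NOT real BPE).

    At each position, take the longest vocabulary entry that matches,
    falling back to a single character when nothing matches.
    """
    max_len = max(len(t) for t in vocab)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary: two long "gibberish" strings that happen to be
# single tokens, as could occur if they were frequent in the tokenizer's
# training corpus.
vocab = {"a1B2c3D4", "e5F6g7H8"}
secret = "a1B2c3D4e5F6g7H8"  # hypothetical secret, 16 distinct characters

tokens = greedy_tokenize(secret, vocab)
char_entropy = shannon_entropy(secret)    # 16 equally frequent characters
token_entropy = shannon_entropy(tokens)   # 2 equally frequent tokens

print(tokens)         # ['a1B2c3D4', 'e5F6g7H8']
print(char_entropy)   # 4.0
print(token_entropy)  # 1.0
```

In this toy case the string looks maximally random character by character (4 bits per character over 16 characters) yet collapses to just two tokens, which is the kind of mismatch the abstract describes. A real analysis would use the model's actual BPE tokenizer and the paper's own entropy definition.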

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 42

Year: 2026

Venue: N/A
