This paper addresses model extraction attacks against large language models (LLMs) served through hosted APIs. It frames monitoring as benign-calibrated traffic-window distribution testing: incoming queries are embedded into a semantic space, and maximum mean discrepancy (MMD) measures how far their aggregate distribution deviates from historical benign traffic. Across three random seeds, the detector achieves a 0.3% benign false positive rate and a 100% true positive rate on pure-attacker traffic, which the authors present as a strong empirical baseline for model extraction detection in LLM API traffic.
An embarrassingly simple detector reaches a 100% true positive rate on pure-attacker traffic while keeping the benign false positive rate at 0.3%.
Large language models (LLMs) are increasingly deployed through hosted APIs, making model extraction a practical threat to model ownership and service security. However, individual extraction queries often resemble benign requests, and existing evaluations often focus on single-query anomaly scoring or pure benign-versus-attacker user settings. We formulate model extraction monitoring as benign-calibrated traffic-window distribution testing and show that an embarrassingly simple detector is effective: embed incoming queries into a semantic space and test whether their aggregate distribution deviates from historical benign traffic. We instantiate the detector with maximum mean discrepancy (MMD), using only benign-vs-benign comparisons to set the decision threshold. We evaluate on fourteen attacker-normal query pairs from four extraction scenarios and compare with adapted PRADA, SEAT, CAP, DATE, and marginal Mahalanobis baselines. Across three random seeds, MMD achieves 0.3% benign FPR, 100.0% pure-attacker TPR, 90.5% average TPR over attacker fractions, and 95.1% balanced accuracy. These results show that benign-calibrated distribution testing is a strong empirical baseline for model extraction detection in both user-level and mixed multi-user LLM API traffic. Code is released at: https://github.com/LabRAI/mmd-llm-mea-detection.
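The idea of benign-calibrated window testing can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). The RBF kernel, the median-heuristic bandwidth, the quantile-based threshold, the window size, and every function and variable name below are assumptions made for illustration, and synthetic Gaussian vectors stand in for real query embeddings.

```python
import numpy as np

def sq_dists(a, b):
    """Squared Euclidean distance between every row of a and every row of b."""
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by rounding

def median_bandwidth(x):
    """Median heuristic: median pairwise distance within x (an assumed choice)."""
    upper = sq_dists(x, x)[np.triu_indices(len(x), k=1)]
    return float(np.sqrt(np.median(upper)))

def rbf_mmd2(x, y, bandwidth):
    """Biased estimate of squared MMD between samples x and y (RBF kernel)."""
    def k(a, b):
        return np.exp(-sq_dists(a, b) / (2.0 * bandwidth**2))
    return float(k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean())

def calibrate_threshold(reference, calibration, window, n_trials, target_fpr, bandwidth, rng):
    """Set the threshold from benign-vs-benign comparisons only."""
    scores = []
    for _ in range(n_trials):
        idx = rng.choice(len(calibration), size=window, replace=False)
        scores.append(rbf_mmd2(calibration[idx], reference, bandwidth))
    return float(np.quantile(scores, 1.0 - target_fpr))

rng = np.random.default_rng(0)
dim, window = 16, 50

# Stand-ins for embedded queries: benign traffic around the origin,
# attacker traffic shifted in every dimension (purely synthetic).
reference = rng.normal(0.0, 1.0, size=(500, dim))
calibration = rng.normal(0.0, 1.0, size=(500, dim))
attack_window = rng.normal(1.0, 1.0, size=(window, dim))

bandwidth = median_bandwidth(reference)
threshold = calibrate_threshold(
    reference, calibration, window, n_trials=200,
    target_fpr=0.01, bandwidth=bandwidth, rng=rng,
)

attack_score = rbf_mmd2(attack_window, reference, bandwidth)
print(f"threshold={threshold:.4f}  attack_score={attack_score:.4f}")
print("flag attack window:", attack_score > threshold)
```

Because the threshold is taken from benign-vs-benign scores alone, no attacker data is needed at calibration time. In this sketch the `target_fpr` quantile controls the benign false positive rate only approximately, and only to the extent that future benign windows resemble the calibration pool.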