The paper identifies 17 shadow APIs that claim to provide access to frontier LLMs and audits three representative ones against the official APIs, revealing significant discrepancies in utility, safety, and model identity. The audit finds performance divergence of up to 47.21%, unpredictable safety behaviors, and identity verification failures in 45.83% of fingerprint tests. These findings highlight the risk that shadow APIs undermine the validity of research and harm users.
Shadow APIs promising access to top LLMs like GPT-5 and Gemini-2.5 can diverge from the official APIs by up to 47.21% in performance and fail identity verification in 45.83% of fingerprint tests, casting doubt on research that relies on them.
Access to frontier large language models (LLMs), such as GPT-5 and Gemini-2.5, is often hindered by high pricing, payment barriers, and regional restrictions. These limitations drive the proliferation of $\textit{shadow APIs}$, third-party services that claim to provide indirect access to official model services without regional limitations. Despite their widespread use, it remains unclear whether shadow APIs deliver outputs consistent with those of the official APIs, raising concerns about the reliability of downstream applications and the validity of research findings that depend on them. In this paper, we present the first systematic audit comparing official LLM APIs with their corresponding shadow APIs. We first identify 17 shadow APIs that have been used in 187 academic papers, with the most popular one reaching 5,966 citations and 58,639 GitHub stars as of December 6, 2025. Through multidimensional auditing of three representative shadow APIs across utility, safety, and model verification, we uncover both indirect and direct evidence of deceptive practices in shadow APIs. Specifically, we reveal performance divergence reaching up to $47.21\%$, significant unpredictability in safety behaviors, and identity verification failures in $45.83\%$ of fingerprint tests. These deceptive practices critically undermine the reproducibility and validity of scientific research, harm the interests of shadow API users, and damage the reputation of official model providers.
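To make the two headline metrics concrete, the following is a minimal Python sketch of how a performance divergence and a fingerprint failure rate could be computed. It is not the authors' method: the abstract does not define either metric, so the sketch assumes divergence is the largest absolute accuracy gap in percentage points and that each fingerprint test yields a simple pass or fail. The function names, benchmark names, and all numbers are hypothetical.

```python
from typing import Dict, List


def max_divergence(official: Dict[str, float], shadow: Dict[str, float]) -> float:
    """Largest absolute accuracy gap (percentage points) over shared benchmarks.

    Assumption: divergence is an absolute gap; the paper may define it differently.
    """
    shared = official.keys() & shadow.keys()
    if not shared:
        raise ValueError("no shared benchmarks to compare")
    return max(abs(official[name] - shadow[name]) for name in shared)


def failure_rate(passed: List[bool]) -> float:
    """Percentage of fingerprint tests that did not match the claimed model."""
    if not passed:
        raise ValueError("no fingerprint tests recorded")
    return 100.0 * sum(not ok for ok in passed) / len(passed)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only; not results from the paper.
    official_acc = {"bench_a": 80.0, "bench_b": 65.0}
    shadow_acc = {"bench_a": 78.5, "bench_b": 40.0}
    fingerprint_passed = [True, False, True, False]

    print(f"max divergence: {max_divergence(official_acc, shadow_acc):.2f} points")
    print(f"fingerprint failure rate: {failure_rate(fingerprint_passed):.2f}%")
```

Reporting the maximum gap rather than the mean mirrors the abstract's "up to" phrasing; a mean or per-benchmark breakdown would be an equally reasonable choice.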