The paper addresses the problem of detecting training data contamination in Reinforcement Learning with Verifiable Rewards (RLVR) for reasoning models, a setting where traditional likelihood-based methods are ineffective. The authors observe that RLVR training induces structural convergence in reasoning: RL-seen prompts generate more rigid and mutually similar trajectories than unseen prompts. They introduce Min-$k$NN Distance, a black-box detector based on nearest-neighbor edit distances between multiple completions of a prompt, which effectively identifies RLVR training data.
RLVR training leaves a tell-tale sign: prompts encountered during fine-tuning produce unusually similar reasoning trajectories, detectable without access to model internals.
Reinforcement learning with verifiable rewards (RLVR) is central to training modern reasoning models, but the undisclosed training data raises concerns about benchmark contamination. Unlike pretraining methods, which optimize models using token-level probabilities, RLVR fine-tunes models based on reward feedback from self-generated reasoning trajectories, making conventional likelihood-based detection methods less effective. We show that RLVR induces a distinctive behavioral signature: prompts encountered during RLVR training result in more rigid and similar generations, while unseen prompts retain greater diversity. We introduce Min-$k$NN Distance, a simple black-box detector that quantifies this collapse by sampling multiple completions for a given prompt and computing the average of the $k$ smallest nearest-neighbor edit distances. Min-$k$NN Distance requires no access to the reference model or token probabilities. Experiments across multiple RLVR-trained reasoning models show that Min-$k$NN Distance reliably distinguishes RL-seen examples from unseen ones and outperforms existing membership inference and RL contamination detection baselines.
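A minimal sketch of the detector as the abstract describes it: sample multiple completions for a prompt, compute each completion's nearest-neighbor edit distance to the other completions, and average the $k$ smallest of those distances. Lower scores indicate collapsed, near-duplicate generations, i.e. a likely RL-seen prompt. Function names and the choice of character-level Levenshtein distance are illustrative assumptions, not the paper's exact implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via two-row dynamic programming.
    (Character-level; the paper may operate on tokens instead.)"""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def min_knn_distance(completions: list[str], k: int = 3) -> float:
    """Average of the k smallest nearest-neighbor edit distances
    among the sampled completions for one prompt. Lower scores
    suggest the prompt was seen during RLVR training."""
    nn_dists = []
    for i, ci in enumerate(completions):
        nn_dists.append(min(edit_distance(ci, cj)
                            for j, cj in enumerate(completions) if j != i))
    nn_dists.sort()
    k = min(k, len(nn_dists))
    return sum(nn_dists[:k]) / k
```

In practice the completions would come from sampling the target model several times on the same prompt at nonzero temperature; the score requires no logits or reference model, only generated text, which matches the black-box setting the abstract claims.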