PKU | Tencent AI | May 26, 2026 | arXiv:2605.26971

RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data

Hsiu-Yuan Huang, Weijie Liu, Chenming Tang, Sanwoo Lee, Kai Yang, Yangkun Chen, Saiyong Yang

AI Summary

The paper introduces ATLAS, a framework for tracing the lineage of Reinforcement Learning from Verifiable Rewards (RLVR) datasets to their atomic sources, attributing over 99.7% of 1.45M instances to 20 sources. Analysis using ATLAS reveals that many RLVR datasets are derived from a small set of shared upstream sources, highlighting data contamination risks and a lack of genuinely new data. To address this, the authors curate a new decontaminated dataset, DAPO++, using Source-level Counterfactual Attribution (SCA) and a composite quality score Q, demonstrating improved performance on Qwen3 models.

Key Contribution

Most RLVR datasets are remixes of a small set of shared upstream sources; this paper shows how to trace them back to those sources and exposes the data contamination risks that result.

Abstract

The proliferation of Reinforcement Learning from Verifiable Rewards (RLVR) datasets has exacerbated provenance collapse due to unclear lineage among existing datasets. To bridge this fragmented RLVR data landscape, we propose Atomic-source Tracing via Lineage-Aware Search (ATLAS), a systematic framework for tracing RLVR datasets back to their atomic sources, attributing over 99.7% of 1.45M instances to 20 atomic sources. Our analysis reveals that most RLVR datasets are variants of a small set of shared upstream sources, with few introducing genuinely new data, and many facing data contamination risks. These findings naturally motivate us to curate a new RLVR dataset, DAPO++, and to benchmark existing datasets from a lineage-aware perspective. To this end, we propose Source-level Counterfactual Attribution (SCA) as a guiding principle to curate a decontaminated training dataset with concentrated learning signals. Essentially, SCA measures a sample's marginal utility by comparing per-atomic-source RL checkpoints against a shared base model. Building upon these attribution signals, we further design a composite dataset quality score Q that strongly correlates with downstream RLVR performance. Experiments on Qwen3 series models verify that DAPO++ consistently improves performance on held-out benchmarks, while Q reliably predicts downstream RLVR training effectiveness. Our code and data are available at https://github.com/Celine-hxy/ATLAS.
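
The abstract describes SCA only at a high level, so the following is a minimal illustrative sketch of one possible reading, not the authors' definition. It assumes that a sample's marginal utility with respect to an atomic source can be approximated as the change in verified pass rate between a checkpoint trained on that source and the shared base model. The function names, the data layout, and the pass-rate difference itself are all hypothetical.

```python
# Illustrative sketch only: the abstract does not give the SCA formula.
# Assumption: utility = pass rate of a per-source RL checkpoint minus the
# pass rate of the shared base model, measured on the same sample.

def pass_rate(outcomes):
    """Fraction of rollouts whose answer was verified correct (1) vs not (0)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def counterfactual_gain(base_outcomes, ckpt_outcomes):
    """Hypothetical per-sample gain of one source checkpoint over the base model."""
    return pass_rate(ckpt_outcomes) - pass_rate(base_outcomes)

def source_gains(base, checkpoints):
    """
    base:        {sample_id: [0/1, ...]} rollout outcomes of the base model
    checkpoints: {source_name: {sample_id: [0/1, ...]}} outcomes per checkpoint
    Returns {source_name: {sample_id: gain}} for samples seen by both models.
    """
    gains = {}
    for source, per_sample in checkpoints.items():
        gains[source] = {
            sample_id: counterfactual_gain(base[sample_id], outcomes)
            for sample_id, outcomes in per_sample.items()
            if sample_id in base
        }
    return gains

# Toy data with made-up identifiers: 4 rollouts per sample.
base = {"q1": [0, 0, 1, 0], "q2": [1, 1, 1, 1]}
checkpoints = {"source_a": {"q1": [1, 1, 1, 0], "q2": [1, 1, 1, 1]}}
gains = source_gains(base, checkpoints)
```

Under this reading, a sample the base model already solves every time (`q2` here) carries no gain, while a sample that only becomes solvable after training on a given source (`q1`) carries a positive one. How the paper actually aggregates such signals into the score Q is not stated in the abstract.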

Data Curation & Synthetic Data | Robotics & Embodied AI

Year: 2026

Venue: N/A
