The paper introduces Dual-Consensus Weak-to-Strong (DC-W2S), a framework for training reliable Process Reward Models (PRMs) using noisy weak supervision in scientific reasoning tasks. DC-W2S stratifies supervision signals based on Self-Consensus (SC) and Neighborhood-Consensus (NC) metrics to identify high-quality training data. By using instance-level balanced sampling and label-level reliability-aware masking, DC-W2S trains robust PRMs without requiring extensive expert annotations.
Strategic data curation using a dual-consensus approach beats brute-force training on large noisy datasets for process reward modeling in biological reasoning.
In scientific reasoning tasks, the veracity of the reasoning process is as critical as the final outcome. While Process Reward Models (PRMs) offer a solution to the coarse-grained supervision problems inherent in Outcome Reward Models (ORMs), their deployment is hindered by the prohibitive cost of obtaining expert-verified step-wise labels. This paper addresses the challenge of training reliable PRMs using abundant but noisy "weak" supervision. We argue that existing Weak-to-Strong Generalization (W2SG) theories lack prescriptive guidelines for selecting high-quality training signals from noisy data. To bridge this gap, we introduce the Dual-Consensus Weak-to-Strong (DC-W2S) framework. By intersecting Self-Consensus (SC) metrics among weak supervisors with Neighborhood-Consensus (NC) metrics in the embedding space, we stratify supervision signals into distinct reliability regimes. We then employ a curriculum of instance-level balanced sampling and label-level reliability-aware masking to guide the training process. We demonstrate that DC-W2S enables the training of robust PRMs for complex reasoning without exhaustive expert annotation, indicating that strategic data curation is more effective than indiscriminate training on large-scale noisy datasets.
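To make the stratification idea concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not define SC or NC precisely, so everything here is an assumption made for illustration: SC is taken as the fraction of weak supervisors that agree with the majority label for a step, NC as the fraction of a step's k nearest neighbours (Euclidean distance on toy 2-D embeddings) that share its majority label, and the thresholds, regime names, and binary mask are hypothetical. Instance-level balanced sampling is omitted.

```python
from collections import Counter
import math

def self_consensus(votes):
    """Return (majority label, fraction of weak supervisors agreeing with it)."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

def neighborhood_consensus(idx, embeddings, labels, k=2):
    """Fraction of the k nearest neighbours that share labels[idx]."""
    dists = sorted(
        (math.dist(embeddings[idx], emb), j)
        for j, emb in enumerate(embeddings)
        if j != idx
    )
    neighbours = [j for _, j in dists[:k]]
    return sum(labels[j] == labels[idx] for j in neighbours) / len(neighbours)

def stratify(sc, nc, sc_thresh=0.8, nc_thresh=0.5):
    """Intersect the two signals into three hypothetical reliability regimes."""
    if sc >= sc_thresh and nc >= nc_thresh:
        return "reliable"
    if sc < sc_thresh and nc < nc_thresh:
        return "unreliable"
    return "ambiguous"

# Toy data (made up): six reasoning steps, three weak supervisors each.
votes = [[1, 1, 1], [1, 1, 1], [0, 0, 1], [0, 0, 0], [0, 0, 0], [0, 0, 1]]
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
              (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

consensus = [self_consensus(v) for v in votes]
labels = [label for label, _ in consensus]
regimes = [
    stratify(sc, neighborhood_consensus(i, embeddings, labels))
    for i, (_, sc) in enumerate(consensus)
]

# Assumed masking rule: drop labels in the "unreliable" regime from the loss.
mask = [0 if regime == "unreliable" else 1 for regime in regimes]

for i, (regime, keep) in enumerate(zip(regimes, mask)):
    print(i, labels[i], regime, keep)
```

In this toy setup, step 2 has split supervisor votes and disagrees with both of its neighbours, so it falls in the unreliable regime and is masked. Step 5 has the same split votes but sits among neighbours that share its label, so it is treated as ambiguous and kept. How the actual framework weights or schedules each regime is not specified in the abstract.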