Search papers, labs, and topics across Lattice.
Beihang University
3
0
6
Overcome the prohibitive cost of ground-truth labels in reinforcement learning by actively acquiring labels for only the most valuable samples, leading to stable training and improved performance even with limited annotation budgets.
RLHF can be made more stable and effective by explicitly verifying and reinforcing policy improvements against a historical baseline, rather than relying solely on instantaneous reward signals.
Forget noisy, biased LLM evaluators: CDRRM distills preference insights into compact rubrics, letting a frozen judge model leapfrog fully fine-tuned baselines with just 3k training samples.