RLHF can be made more stable and effective by explicitly verifying and reinforcing policy improvements against a historical baseline, rather than relying solely on instantaneous reward signals.
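A minimal sketch of that gating idea, not the paper's actual method: a candidate policy update is accepted only if its measured return beats a rolling historical baseline, rather than trusting the instantaneous reward alone. The names `evaluate`, `train`, and the scalar "policy skill" are hypothetical stand-ins for illustration.

```python
from collections import deque
import random

def evaluate(policy, n_episodes=32):
    """Stand-in evaluator: mean noisy reward of a scalar 'policy skill'."""
    return sum(policy + random.gauss(0, 0.5) for _ in range(n_episodes)) / n_episodes

def train(steps=100, lr=0.05, window=10):
    policy = 0.0
    history = deque(maxlen=window)            # rolling historical baseline
    history.append(evaluate(policy))
    for _ in range(steps):
        candidate = policy + lr * random.gauss(1.0, 1.0)  # proposed update
        baseline = sum(history) / len(history)
        score = evaluate(candidate)
        if score > baseline:                   # verify improvement vs. history
            policy = candidate                 # reinforce only verified gains
            history.append(score)
    return policy

if __name__ == "__main__":
    print(f"final policy skill: {train():.2f}")
```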
Forget noisy, biased LLM evaluators: CDRRM distills preference insights into compact rubrics, letting a frozen judge model leapfrog fully fine-tuned baselines with just 3k training samples.
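A toy sketch of the distill-then-judge pattern the summary describes, under loose assumptions: rubric criteria are mined from preference pairs, then a frozen judge scores new responses against the rubric instead of raw preferences. The keyword-counting rubric and scoring function below are illustrative stand-ins (CDRRM's rubrics and judge are LLM-based), and `distill_rubric` / `judge_with_rubric` are hypothetical names.

```python
from collections import Counter

def distill_rubric(preference_pairs, top_k=5):
    """Crude rubric: words frequent in chosen but absent from rejected responses."""
    counts = Counter()
    for chosen, rejected in preference_pairs:
        counts.update(set(chosen.lower().split()) - set(rejected.lower().split()))
    return [word for word, _ in counts.most_common(top_k)]

def judge_with_rubric(response, rubric):
    """Frozen 'judge': score = fraction of rubric criteria the response meets."""
    words = set(response.lower().split())
    return sum(criterion in words for criterion in rubric) / len(rubric)

pairs = [
    ("cites sources and stays concise", "rambles without evidence"),
    ("concise answer with sources", "long vague answer"),
]
rubric = distill_rubric(pairs)
print(rubric, judge_with_rubric("a concise reply that cites sources", rubric))
```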