Beijing University of Posts and Telecommunications
Ditch the black-box reward function: this new rubric-based RL framework uses LLMs to judge responses against interpretable criteria, offering a more robust and transparent approach to alignment.
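
To make the idea of judging a response against interpretable criteria concrete, here is a minimal sketch of a rubric-based reward. It is not the paper's implementation: the names `Criterion`, `rubric_reward`, and `keyword_judge` are hypothetical, the weighted-average aggregation is an assumption, and the keyword check is only a toy stand-in for an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Criterion:
    """One human-readable rubric item (hypothetical structure)."""
    name: str
    description: str
    weight: float


# A judge maps (prompt, response, criterion) to a score in [0, 1].
# In a real system this would call an LLM; here it is left abstract.
Judge = Callable[[str, str, Criterion], float]


def rubric_reward(prompt: str, response: str,
                  rubric: Sequence[Criterion], judge: Judge) -> float:
    """Weighted average of per-criterion judge scores (assumed aggregation)."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    weighted = 0.0
    for c in rubric:
        # Clamp the judge's score so a misbehaving judge cannot leave [0, 1].
        score = min(1.0, max(0.0, judge(prompt, response, c)))
        weighted += c.weight * score
    return weighted / total_weight


def keyword_judge(prompt: str, response: str, criterion: Criterion) -> float:
    """Toy stand-in for an LLM judge: 1.0 if the criterion name appears."""
    return 1.0 if criterion.name.lower() in response.lower() else 0.0


rubric = [
    Criterion("units", "States the units of the answer.", 2.0),
    Criterion("source", "Cites where the figure comes from.", 1.0),
]
reward = rubric_reward("How long is the rod?",
                       "The rod is 5 in SI units (metres).",
                       rubric, keyword_judge)
```

Because each criterion is scored separately, the per-criterion scores can be logged next to the final reward, which is what makes this kind of signal easier to inspect than a single opaque scalar.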