RLHF can be significantly improved for complex tasks by explicitly modeling preference relationships both within and between training examples, unlocking better instruction following without relying on expensive human annotation or biased LLM-generated data.
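As a rough illustration of what modeling preferences "within and between training examples" could mean for reward-model training, here is a minimal sketch combining a standard intra-pair Bradley-Terry loss with a cross-example term that also ranks chosen responses against rejected responses from other examples in the batch. The function name, `inter_weight`, and the exact form of the cross-example term are illustrative assumptions, not the paper's actual method.

```python
# Hedged sketch: pairwise (intra-example) Bradley-Terry loss plus an
# assumed inter-example term. The reward model is abstracted away; the
# function takes precomputed scalar reward scores per response.
import torch
import torch.nn.functional as F

def preference_loss(r_chosen, r_rejected, inter_weight=0.5):
    """r_chosen, r_rejected: [batch] reward scores for paired responses."""
    # Intra-example term: each chosen response should outscore the
    # rejected response from the *same* example (standard Bradley-Terry).
    intra = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Inter-example term (assumption): each chosen response should also
    # outscore rejected responses drawn from *other* examples in the batch.
    diff = r_chosen.unsqueeze(1) - r_rejected.unsqueeze(0)  # [B, B] score gaps
    mask = ~torch.eye(len(r_chosen), dtype=torch.bool)      # off-diagonal pairs only
    inter = -F.logsigmoid(diff[mask]).mean()
    return intra + inter_weight * inter

# Toy usage with random reward scores for a batch of 4 preference pairs.
torch.manual_seed(0)
r_c, r_r = torch.randn(4), torch.randn(4)
print(preference_loss(r_c, r_r))
```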
RLHF reward models can be made significantly less susceptible to length bias by explicitly modeling and disentangling semantic preferences from length requirements.
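One concrete way to realize this kind of disentanglement is a two-head reward model, loosely in the spirit of schemes such as ODIN: the sum of a quality head and a length head is fit to the preference labels, while the quality head is pushed to be independent of response length, so that only the length-agnostic quality score is used as the RL reward. The head names, the Pearson-correlation penalty, and the hyperparameters below are simplifying assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch: two-head reward model with a length-decorrelation penalty.
import torch
import torch.nn.functional as F

class TwoHeadReward(torch.nn.Module):
    def __init__(self, hidden_dim=16):
        super().__init__()
        # Two scalar heads over a shared (here: precomputed) representation.
        self.quality_head = torch.nn.Linear(hidden_dim, 1)
        self.length_head = torch.nn.Linear(hidden_dim, 1)

    def forward(self, h):
        # h: [batch, hidden_dim] pooled response representations.
        return self.quality_head(h).squeeze(-1), self.length_head(h).squeeze(-1)

def disentangled_loss(model, h_c, h_r, len_c, len_r, corr_weight=1.0):
    q_c, l_c = model(h_c)
    q_r, l_r = model(h_r)
    # Preference term: the *summed* reward should prefer the chosen response.
    pref = -F.logsigmoid((q_c + l_c) - (q_r + l_r)).mean()
    # Decorrelation term (simplification): penalize the approximate Pearson
    # correlation between the quality head's scores and response length.
    q = torch.cat([q_c, q_r])
    ln = torch.cat([len_c, len_r]).float()
    qz = (q - q.mean()) / (q.std() + 1e-8)
    lz = (ln - ln.mean()) / (ln.std() + 1e-8)
    corr = (qz * lz).mean().abs()
    return pref + corr_weight * corr

# Toy usage: random representations and token lengths for 8 preference pairs.
torch.manual_seed(0)
model = TwoHeadReward()
h_c, h_r = torch.randn(8, 16), torch.randn(8, 16)
len_c, len_r = torch.randint(10, 200, (8,)), torch.randint(10, 200, (8,))
loss = disentangled_loss(model, h_c, h_r, len_c, len_r)
loss.backward()
print(loss.item())
```

At RL time, under this setup, only `quality_head` would be queried for the reward, which is what makes the trained policy less prone to padding responses for length.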