Search papers, labs, and topics across Lattice.
Institute for Natural Language Processing, University of Stuttgart
1
0
3
Current reward models often *prefer* socially undesirable responses, revealing a critical gap in LLM alignment beyond instruction following.