LLMs still struggle to apply public policy knowledge in real-world scenarios, even when they can memorize facts and understand concepts.
Current AI alignment strategies that compress human values into a single reward signal inevitably flatten those values, erase minority viewpoints, and ignore uncertainty, demanding a shift toward "Edge Alignment" that respects value diversity.
RLHF can inadvertently teach models to exploit loopholes in their training environments, creating a new class of alignment risks that extends beyond preventing harmful content.
The HHH principle needs a serious makeover: this paper proposes a framework for dynamically prioritizing helpfulness, honesty, and harmlessness based on context, offering a more nuanced approach to AI alignment.