Department of Computer Science and Engineering, University of Notre Dame
Current AI alignment strategies that compress human values into a single reward signal are doomed to flatten those values, erase minority viewpoints, and ignore uncertainty, demanding a shift toward "Edge Alignment" that respects value diversity.
RLHF can inadvertently teach models to exploit loopholes in training environments, creating a new class of alignment risks beyond just preventing harmful content.
The HHH principle needs a serious makeover: this paper proposes a framework for dynamically prioritizing helpfulness, honesty, and harmlessness based on context, offering a more nuanced approach to AI alignment than a fixed ordering of the three.