University of Notre Dame
LLMs still struggle to apply public policy knowledge in real-world scenarios, even when they can memorize facts and understand concepts.
93% of the "reasoning steps" identified by keyword matching are noise, but a simple stability filter and content-subspace projection can boost steering-vector performance by 5-6% and enable cross-model transfer.
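The blurb names two ingredients, a stability filter and a content-subspace projection, without spelling out how either works. The sketch below is one plausible reading, not the paper's implementation: the function names, the cosine-similarity threshold, and the PCA-derived subspace are all assumptions.

```python
# Hypothetical sketch: filter noisy steering-vector candidates by directional
# stability, then project survivors onto a PCA "content subspace". Threshold
# and subspace construction are assumptions, not the paper's actual method.
import numpy as np

def stability_filter(candidates, threshold=0.8):
    """Keep candidates whose direction is consistent across re-extractions.

    candidates: list of (n_resamples, d) arrays, one per putative reasoning
    step. A candidate passes if the mean pairwise cosine similarity of its
    resampled vectors meets the threshold; most keyword-matched "steps"
    would be expected to fail this test.
    """
    kept = []
    for vecs in candidates:
        unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
        sims = unit @ unit.T                         # pairwise cosine matrix
        n = len(unit)
        mean_sim = (sims.sum() - n) / (n * (n - 1))  # exclude the diagonal
        if mean_sim >= threshold:
            kept.append(unit.mean(axis=0))           # averaged stable direction
    return kept

def content_subspace_projection(vec, activations, k=32):
    """Project a steering vector onto the top-k principal directions of
    layer activations, discarding components outside the content subspace."""
    centered = activations - activations.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                                   # (k, d) orthonormal rows
    return basis.T @ (basis @ vec)                   # projection onto span(basis)
```

A shared low-dimensional activation basis of this kind is also one way the reported cross-model transfer could work, by mapping vectors between each model's content subspace; the blurb does not say whether that is the mechanism used.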
LLMs can be taught to be dignified peers instead of evasive sycophants by carefully balancing anti-sycophancy and trustworthiness with empathy and creativity.
RLHF can inadvertently teach models to exploit loopholes in training environments, creating a class of alignment risks that extends beyond generating harmful content.
The HHH principle needs a serious makeover: this paper proposes a framework for dynamically prioritizing helpfulness, honesty, and harmlessness based on context, offering a more nuanced approach to AI alignment.