This paper introduces AlphaPO, a new Direct Alignment Algorithm (DAA) that uses an α parameter to change the shape of the reward function beyond the standard log reward, giving fine-grained control over likelihood displacement and over-optimization.
AlphaPO unlocks 7-50% better LLM alignment by showing that reward function *shape* is a surprisingly powerful lever in Direct Alignment Algorithms.