Search papers, labs, and topics across Lattice.
1
0
0
0
Al-phaPO is introduced, a new DAA method that leverages an α -parameter to help change the shape of the reward function beyond the standard log reward, and helps maintain fine-grained control over likelihood displacement and over-optimization.