This paper generalizes the connection between Direct Preference Optimization (DPO) and human choice theory, extending the normative framework underlying DPO. By reworking standard human choice theory, the authors show that any compliant analytical choice model from machine learning can be embedded within any human choice model. The generalization supports non-convex losses and provides a unifying framework for DPO extensions such as margins and length correction.
DPO's success isn't just clever engineering: it's deeply rooted in human choice theory, unlocking a surprisingly flexible framework for preference optimization and justifying many DPO extensions.
Normative theories allow one to elicit key parts of an ML algorithm from first principles, which is crucial at a time when scrutiny of ML work is widely championed. Direct Preference Optimization (DPO) cleverly bypasses reward modeling by making an explicit link with a specific normative model of human choice. Our paper elevates this connection to the full generality of DPO's normative framework. Getting there requires reworking human choice theory's textbook path for a better fit with RLHF and ML. The result is a remarkably broad viewpoint on preference optimization, given the current panorama of DPO follow-ups. It also unveils unexpected riches for ML, chief among them support for non-convex losses, the fact that any compliant ML analytical choice can be embedded with any human choice model, and a normative umbrella wide enough to safeguard DPO's extensions (margins, length correction, ...). A toy experiment "far away" from the DPO crowd is also given.
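For readers outside the DPO literature, the specific normative choice model the abstract alludes to is the Bradley-Terry model, and the resulting objective from Rafailov et al. (2023) is the convex loss that this paper's framework generalizes. The sketch below uses the standard DPO notation (policy $\pi_\theta$, frozen reference $\pi_{\mathrm{ref}}$, temperature $\beta$); it is background material rather than this paper's own formulation:

$$
p^*(y_w \succ y_l \mid x) = \sigma\big(r^*(x, y_w) - r^*(x, y_l)\big),
\qquad
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
$$

The left identity is the Bradley-Terry choice probability with latent reward $r^*$; substituting the reward implied by the KL-regularized RLHF objective yields the DPO loss on the right. It is this derivation path that the paper reworks and broadens, e.g., to non-convex losses and to margin and length-correction variants.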