Stop letting SFT ruin your LMMs: PRISM uses on-policy distillation to realign your model *before* RL, boosting performance by up to 6%.
Compressing multi-dimensional human preferences into single binary labels severely degrades DPO training, but a semi-supervised approach can recover state-of-the-art performance without additional human annotation.
Scaling visual preference optimization hinges on data quality, as a massive, high-resolution dataset renders complex optimization algorithms unnecessary.