Jointly training MTP and RL doesn't have to hurt: a simple coefficient calibration scheme unlocks performance gains on mathematical reasoning tasks.
Overcome the prohibitive cost of ground-truth labels in reinforcement learning by actively acquiring labels for only the most valuable samples, leading to stable training and improved performance even with limited annotation budgets.