Tencent PCG, Corresponding Author
Jointly training MTP and RL doesn't have to hurt: a simple coefficient calibration scheme unlocks performance gains on mathematical reasoning tasks.
Even GPT-4 struggles to maintain its reasoning prowess when faced with the rigor and efficiency demands of a realistic high school exam, suggesting current LMMs are far from being ready for prime time as intelligent tutors.
LLM unlearning via counterfactual tuning can backfire, increasing hallucination rates in unexpected areas due to inconsistencies in the "fake" knowledge it's trained on.
Overcome the prohibitive cost of ground-truth labels in reinforcement learning by actively acquiring labels for only the most valuable samples, leading to stable training and improved performance even with limited annotation budgets.