Search papers, labs, and topics across Lattice.
Nanyang Technological University
2
0
5
Mismatched SFT data hurting your LLM's reasoning? DART uses RL to transform it into perfectly aligned training examples, boosting generalization and efficiency.
Overcome the prohibitive cost of ground-truth labels in reinforcement learning by actively acquiring labels for only the most valuable samples, leading to stable training and improved performance even with limited annotation budgets.