Beijing Institute of Technology
Forget monolithic policies: splitting an LLM's RL policy into an accuracy-focused mode and an exploration-driven mode unlocks better performance and diversity.
Open-source 7B LLMs can now rival GPT-4o on validation tasks, thanks to a novel reinforcement learning approach that uses calibrated self-evaluation as a dense reward signal.
LLMs ace the setup but fumble the execution in mathematical modeling, revealing a critical gap that scaling alone will not fix.