Search papers, labs, and topics across Lattice.
1
0
2
0
By selectively injecting teacher demonstrations only during failure, HAPO overcomes the limitations of both pure RL and mixed-policy optimization in sparse-reward RLVR, enabling models to surpass static teacher forcing.