Search papers, labs, and topics across Lattice.
School of Computer Science and Technology, Harbin Institute of Technology, China
1
0
2
Token-level policy gradients fall short in complex reasoning tasks, but treating sequences of tokens as unified actions can significantly boost performance in mathematical and coding benchmarks.