Zhejiang University, The Chinese University of Hong Kong ♦, Eastern Institute of Technology
ISPO reduces critical reasoning failures in RLVR by transforming reward structures, leading to superior performance on complex reasoning tasks.
GUI agents can learn world knowledge more efficiently by internalizing causal relationships during mid-training, rather than relying on implicit learning through action annotations or reward signals in post-training.