VESPO stabilizes off-policy RL training for LLMs by directly reshaping sequence-level importance weights, tolerating 64x policy staleness and asynchronous execution without collapse.
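For readers unfamiliar with the term, a sequence-level importance weight is the ratio of the probability the current policy assigns to a whole sampled response to the probability the (possibly stale) behavior policy assigned when it generated that response. The sketch below computes only that raw ratio from per-token log-probabilities. The function name and inputs are hypothetical, and the reshaping that VESPO applies to this weight is not specified in the summary, so it is not shown here.

```python
import math


def sequence_importance_weight(logp_current, logp_behavior):
    """Raw sequence-level importance weight for one sampled response.

    logp_current:  per-token log-probs under the policy being trained
    logp_behavior: per-token log-probs under the policy that sampled the response
    (Both names are illustrative assumptions, not taken from the paper.)
    """
    if len(logp_current) != len(logp_behavior):
        raise ValueError("per-token log-prob lists must have equal length")
    # Summing token log-ratios gives the log of the sequence-level ratio;
    # working in log space avoids underflow on long responses.
    log_ratio = sum(c - b for c, b in zip(logp_current, logp_behavior))
    return math.exp(log_ratio)


# When the two policies agree on every token, the weight is 1.
on_policy_weight = sequence_importance_weight([-1.0, -2.0], [-1.0, -2.0])
```

Because the ratio is a product over tokens, it can grow or shrink very quickly as the behavior policy becomes stale, which is the instability that methods operating on sequence-level weights have to control.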