Search papers, labs, and topics across Lattice.
2
0
4
Free exploration in multi-armed bandits can lead to sharp phase transitions in accumulated regret, offering significant savings compared to standard regret minimization.
RLVR's reasoning gains hinge on high-entropy tokens, revealing a critical inefficiency in uniform reward broadcast that EAPO effectively addresses.