Search papers, labs, and topics across Lattice.
2
0
3
Stop committing to a single policy in offline-to-online RL: adaptively select and fine-tune policies based on predicted performance to maximize returns under interaction budgets.
Differential TD learning, a practical algorithm for average reward RL, now has stronger theoretical footing, converging under more realistic conditions.