Forget hand-engineered reward shaping: PPO-LTL lets you specify complex safety requirements as LTL formulas and automatically penalizes violations during RL training.
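The core idea of penalizing LTL violations during training can be sketched as reward shaping driven by a runtime monitor. This is an illustrative toy, not the paper's implementation: the names (`SafetyMonitor`, `shaped_reward`, `VIOLATION_PENALTY`) are hypothetical, and only the simplest safety formula, G(¬unsafe) ("an unsafe state never occurs"), is monitored here.

```python
# Hypothetical sketch: shaping RL rewards with a monitor for the LTL
# safety formula G(!unsafe). All names are illustrative assumptions,
# not taken from PPO-LTL.

VIOLATION_PENALTY = -10.0  # assumed penalty magnitude

class SafetyMonitor:
    """Tracks G(!unsafe) as a two-state automaton: ok -> violated.

    The violated state is absorbing: once the formula is falsified on a
    trajectory, it stays falsified for the rest of the episode.
    """
    def __init__(self):
        self.violated = False

    def step(self, unsafe: bool) -> bool:
        if unsafe:
            self.violated = True
        return self.violated

def shaped_reward(env_reward: float, monitor: SafetyMonitor, unsafe: bool) -> float:
    """Add a penalty whenever the trajectory violates the formula."""
    if monitor.step(unsafe):
        return env_reward + VIOLATION_PENALTY
    return env_reward

# Example trajectory: the third step enters an unsafe state, after which
# every subsequent reward is penalized.
monitor = SafetyMonitor()
rewards = [shaped_reward(1.0, monitor, u) for u in (False, False, True, False)]
print(rewards)  # [1.0, 1.0, -9.0, -9.0]
```

In a full system, the monitor would be a deterministic automaton compiled from an arbitrary LTL formula and stepped alongside the environment inside the PPO rollout loop; the two-state version above is the degenerate case for a single invariant.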
RLHF's generalization gap can be decomposed into distinct error terms arising from reward shift and KL clipping, offering a more nuanced understanding of its limitations.
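The claimed decomposition can be written schematically as follows. The notation is hypothetical, introduced only to make the two error sources explicit; the paper's exact terms and bounds may differ.

```latex
\underbrace{J_{\mathrm{true}}(\pi) - J_{\mathrm{RLHF}}(\pi)}_{\text{generalization gap}}
\;=\;
\underbrace{\varepsilon_{\mathrm{reward\,shift}}}_{\text{proxy reward vs.\ true reward}}
\;+\;
\underbrace{\varepsilon_{\mathrm{KL\,clip}}}_{\text{bias from KL regularization/clipping}}
```

Here $J_{\mathrm{true}}$ is performance under the true objective, $J_{\mathrm{RLHF}}$ is performance under the learned proxy, and the two $\varepsilon$ terms isolate the contributions of reward-model shift and KL clipping respectively.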