Mobile agents trained with RL struggle to generalize to new app interfaces, improving only 8.3% over supervised baselines despite a 26.1% gain on unseen task instances.
Overcome policy lag in distributed RL with TV-ACPO, a method that aligns advantage functions and constrains policy updates for more robust, scalable on-policy learning.
Forget hand-engineered reward functions: this method learns complex exploratory behaviors by simply predicting which states lead to unpredictable futures.