This paper introduces two neuro-symbolic extensions to Proximal Policy Optimization (PPO), H-PPO-Product and H-PPO-SymLoss, which reuse logical policy specifications learned in simpler environments to guide learning in more complex, sparse-reward settings. H-PPO-Product biases the action distribution at sampling time, while H-PPO-SymLoss adds a symbolic regularization term to the PPO loss. Experiments on OfficeWorld, WaterWorld, and DoorKey show that both methods learn faster and reach higher returns than PPO and a Reward Machine baseline, even when the symbolic knowledge is imperfect.
Neuro-symbolic guidance can dramatically accelerate reinforcement learning in sparse-reward environments, even when the symbolic knowledge is imperfect.
Deep Reinforcement Learning (DRL) algorithms often require a large amount of data and struggle in sparse-reward domains with long planning horizons and multiple sub-goals. In this paper, we propose a neuro-symbolic extension of Proximal Policy Optimization (PPO) that transfers partial logical policy specifications learned in easier instances to guide learning in more challenging settings. We introduce two integrations of symbolic guidance: (i) H-PPO-Product, which biases the action distribution at sampling time, and (ii) H-PPO-SymLoss, which augments the PPO loss with a symbolic regularization term. We evaluate our methods on three benchmarks (OfficeWorld, WaterWorld, and DoorKey), showing consistently faster learning and higher return at convergence than PPO and a Reward Machine baseline, even when the symbolic knowledge is imperfect.
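The two integrations named in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the symbolic guidance is modeled as a per-action prior `symbolic_prior`, the "product" is taken to be an elementwise product with renormalization, the regularizer is taken to be a cross-entropy term, and the weight `lam` and all function names are hypothetical. It is not the authors' implementation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of action logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def product_policy(policy_probs, symbolic_prior, eps=1e-8):
    """Assumed form of sampling-time biasing: multiply the neural policy by a
    symbolic prior over actions, then renormalize to a valid distribution."""
    weighted = policy_probs * (symbolic_prior + eps)
    return weighted / weighted.sum()

def symbolic_regularizer(policy_probs, symbolic_prior, eps=1e-8):
    """Assumed form of the symbolic term: cross-entropy between the symbolic
    prior and the policy, small when the policy agrees with the prior."""
    return -np.sum(symbolic_prior * np.log(policy_probs + eps))

def loss_with_symbolic_term(ppo_loss, policy_probs, symbolic_prior, lam=0.1):
    """PPO loss plus a weighted symbolic regularization term.
    `lam` is a hypothetical trade-off coefficient."""
    return ppo_loss + lam * symbolic_regularizer(policy_probs, symbolic_prior)

# Toy example with four actions: an untrained (uniform) policy and a
# symbolic prior that prefers action 0.
pi = softmax(np.zeros(4))
prior = np.array([0.7, 0.1, 0.1, 0.1])

# Sampling-time biasing: sample from the product distribution.
biased = product_policy(pi, prior)
rng = np.random.default_rng(0)
action = rng.choice(len(biased), p=biased)

# Loss-time guidance: the placeholder PPO loss value 1.0 is arbitrary.
loss = loss_with_symbolic_term(1.0, pi, prior)
```

In this sketch the first variant changes only which actions are sampled and leaves the training objective alone, while the second leaves sampling alone and pulls the policy toward the prior through the gradient of the added term. A prior that keeps nonzero mass on every action, as above, is one simple way to stay tolerant of imperfect symbolic knowledge.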