This paper introduces a method for safe optimal control in high-dimensional environments. It decomposes the Bellman Value function for a complex temporal logic task into a graph of simpler Bellman Values, each governed by its own Bellman equation: Reach-Avoid, Avoid, or the newly introduced Reach-Avoid-Loop. The decomposition gives the task a structured representation and avoids both cumbersome automata and laborious reward tuning. The authors propose VDPPO, which embeds the decomposed Value graph into a two-layer neural network and bootstraps the dependencies between Values to solve for the optimal policy. In simulated and hardware experiments, the authors report improved performance over existing baselines.
Decomposing Bellman values into a graph of simpler objectives lets agents master complex, high-dimensional tasks with less tuning and better safety.
Real-world tasks involve nuanced combinations of goal and safety specifications. In high dimensions, the challenge is exacerbated: formal automata become cumbersome, and the combination of sparse rewards tends to require laborious tuning. In this work, we consider the innate structure of the Bellman Value as a means to naturally organize the problem for improved automatic performance. Namely, we prove that the Bellman Value for a complex task defined in temporal logic can be decomposed into a graph of Bellman Values, connected by a set of Bellman equations (BEs): the well-known Reach-Avoid BE and Avoid BE, and a novel type, the Reach-Avoid-Loop BE. To solve for the Value and optimal policy, we propose VDPPO, which embeds the decomposed Value graph into a two-layer neural net, bootstrapping the implicit dependencies. We conduct a variety of simulated and hardware experiments to test our method on complex, high-dimensional tasks involving heterogeneous teams and nonlinear dynamics. Ultimately, we find this approach greatly improves performance over existing baselines, balancing safety and liveness automatically.
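For readers unfamiliar with the two well-known building blocks named in the abstract, the following is a minimal sketch of undiscounted Reach-Avoid and Avoid Bellman backups on a toy one-dimensional chain. It is not the paper's formulation and not VDPPO. It assumes one common sign convention (r(x) > 0 inside the target, q(x) > 0 in safe states), a hypothetical three-action chain (left, stay, right), and plain tabular fixed-point iteration. The function names and the toy environment are illustrative only. The Reach-Avoid-Loop BE is omitted because the abstract does not state its form.

```python
import numpy as np


def chain_successors(n):
    """Successor table for a hypothetical 1D chain: left, stay, right (clipped)."""
    idx = np.arange(n)
    return np.stack([np.clip(idx - 1, 0, n - 1), idx, np.clip(idx + 1, 0, n - 1)])


def reach_avoid_value(r, q, max_iters=1000):
    """Iterate V(x) = min{ q(x), max{ r(x), max_u V(f(x, u)) } } to a fixed point.

    Assumed convention: r[x] > 0 iff x is in the target, q[x] > 0 iff x is safe.
    V[x] > 0 then means the target can be reached from x without leaving the safe set.
    """
    r = np.asarray(r, dtype=float)
    q = np.asarray(q, dtype=float)
    succ = chain_successors(len(r))
    v = np.minimum(r, q)  # zero-step value
    for _ in range(max_iters):
        best_next = v[succ].max(axis=0)  # best successor value over the three actions
        v_new = np.minimum(q, np.maximum(r, best_next))
        if np.array_equal(v_new, v):
            break
        v = v_new
    return v


def avoid_value(q, max_iters=1000):
    """Iterate V(x) = min{ q(x), max_u V(f(x, u)) } to a fixed point.

    V[x] > 0 means the agent can stay in the safe set indefinitely from x.
    """
    q = np.asarray(q, dtype=float)
    succ = chain_successors(len(q))
    v = q.copy()
    for _ in range(max_iters):
        v_new = np.minimum(q, v[succ].max(axis=0))
        if np.array_equal(v_new, v):
            break
        v = v_new
    return v


# Toy example: six cells, target at cell 5, obstacle at cell 2.
r = [-1, -1, -1, -1, -1, 1]
q = [1, 1, -1, 1, 1, 1]
print(reach_avoid_value(r, q))
print(avoid_value(q))
```

In this toy chain the obstacle at cell 2 separates cells 0 and 1 from the target, so their Reach-Avoid value stays negative even though their Avoid value is positive. This shows in miniature why goal and safety objectives have to be combined carefully.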