This paper introduces a framework for the safe, governed evolution of embodied agent capabilities, addressing how to deploy new capability versions without violating policy constraints, execution assumptions, or recovery guarantees. The framework uses a lifecycle-aware upgrade process organized as a staged runtime pipeline: candidate validation, sandbox evaluation, shadow deployment, gated activation, online monitoring, and rollback. In the empirical evaluation, governed upgrade retains task success comparable to naive upgrade (67.4% versus 72.9%) while eliminating unsafe activations.
Naive upgrades of embodied agent capabilities drive unsafe activations to 60% by the final upgrade round, whereas a governed upgrade framework retains comparable task success (67.4% versus 72.9%) with zero unsafe activations.
Embodied agents are increasingly expected to improve over time by updating their executable capabilities rather than rewriting the agent itself. Prior work has separately studied modular capability packaging, capability evolution, and runtime governance. However, a key systems problem remains underexplored: once an embodied capability module evolves into a new version, how can the hosting system deploy it safely without breaking policy constraints, execution assumptions, or recovery guarantees? We formulate governed capability evolution as a first-class systems problem for embodied agents. We propose a lifecycle-aware upgrade framework in which every new capability version is treated as a governed deployment candidate rather than an immediately executable replacement. The framework introduces four upgrade compatibility checks -- interface, policy, behavioral, and recovery -- and organizes them into a staged runtime pipeline comprising candidate validation, sandbox evaluation, shadow deployment, gated activation, online monitoring, and rollback. We evaluate over 6 rounds of capability upgrade with 15 random seeds. Naive upgrade achieves 72.9% task success but drives unsafe activation to 60% by the final round; governed upgrade retains comparable success (67.4%) while maintaining zero unsafe activations across all rounds (Wilcoxon p=0.003). Shadow deployment reveals 40% of regressions invisible to sandbox evaluation alone, and rollback succeeds in 79.8% of post-activation drift scenarios.
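The staged pipeline described in the abstract can be illustrated with a minimal Python sketch. Only the stage names and the four compatibility check names come from the abstract. The function names (`governed_upgrade`, `failed_checks`), the pass/fail callables, and the result type are hypothetical, and this is not the authors' implementation. The sketch also assumes rollback always succeeds, which is a simplification: the abstract reports rollback succeeding in 79.8% of post-activation drift scenarios.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Stage and check names follow the abstract; everything else is hypothetical.
PRE_ACTIVATION_STAGES = [
    "candidate_validation",
    "sandbox_evaluation",
    "shadow_deployment",
    "gated_activation",
]
CHECK_NAMES = ["interface", "policy", "behavioral", "recovery"]

# A gate takes a candidate version id and returns True on pass.
Gate = Callable[[str], bool]


@dataclass
class UpgradeResult:
    active_version: str  # version left running after the attempt
    outcome: str         # "activated", "rejected", or "rolled_back"
    log: List[str] = field(default_factory=list)


def failed_checks(checks: Dict[str, Gate], candidate: str) -> List[str]:
    """Return the names of the compatibility checks the candidate fails."""
    return [name for name in CHECK_NAMES if not checks[name](candidate)]


def governed_upgrade(current: str, candidate: str,
                     stages: Dict[str, Gate], monitor: Gate) -> UpgradeResult:
    """Treat the candidate as a deployment candidate, not a drop-in replacement."""
    log: List[str] = []
    # Every pre-activation stage must pass before the candidate goes live.
    for name in PRE_ACTIVATION_STAGES:
        passed = stages[name](candidate)
        log.append(f"{name}:{'pass' if passed else 'fail'}")
        if not passed:
            # The current version keeps running; the candidate never activates.
            return UpgradeResult(current, "rejected", log)
    # The candidate is now active; online monitoring watches for drift.
    healthy = monitor(candidate)
    log.append(f"online_monitoring:{'pass' if healthy else 'fail'}")
    if not healthy:
        # Assumption: rollback to the previous version always succeeds here.
        log.append("rollback")
        return UpgradeResult(current, "rolled_back", log)
    return UpgradeResult(candidate, "activated", log)
```

In this sketch a candidate that fails any pre-activation stage leaves the current version untouched, and a candidate that drifts after activation is reverted. Where the four compatibility checks run within the stages is not specified in the abstract, so `failed_checks` is kept as a separate helper that a stage callable could invoke.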