Reward-Conditioned Reinforcement Learning (RCRL) trains a single RL agent to optimize a family of reward specifications, even while collecting experience under only one nominal objective. This is achieved by conditioning the agent on reward parameterizations and learning multiple reward objectives from a shared replay buffer entirely off-policy. Experiments across diverse benchmarks show that RCRL improves performance under the nominal reward and enables efficient adaptation to new reward parameterizations, leading to robust and steerable policies.
Train one RL agent to handle a whole family of reward functions, unlocking robust and adaptable policies without the complexity of multi-task training.
RL agents are typically trained under a single, fixed reward function, which makes them brittle to reward misspecification and limits their ability to adapt to changing task preferences. We introduce Reward-Conditioned Reinforcement Learning (RCRL), a framework that trains a single agent to optimize a family of reward specifications while collecting experience under only one nominal objective. RCRL conditions the agent on reward parameterizations and learns multiple reward objectives from shared replay data entirely off-policy, enabling a single policy to represent reward-specific behaviors. Across single-task, multi-task, and vision-based benchmarks, we show that RCRL not only improves performance under the nominal reward parameterization, but also enables efficient adaptation to new parameterizations. Our results demonstrate that RCRL provides a scalable mechanism for learning robust, steerable policies without sacrificing the simplicity of single-task training.
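The core idea in the abstract (act under one nominal reward, store experience in a shared buffer, and relabel each stored transition with every reward parameterization for off-policy updates) can be illustrated with a small toy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the five-state chain environment, the linear reward family `r_w = w . phi(s')`, the hyperparameters, the function names, and the use of tabular Q-learning in place of whatever learner the authors actually use.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2            # toy chain; actions: 0 = left, 1 = right
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.5  # illustrative hyperparameters

# Hypothetical reward family: r_w = w . phi(s'),
# with phi(s') = [reached right end, reached left end].
REWARD_PARAMS = np.array([
    [1.0, 0.0],  # index 0: nominal reward, the only one used to choose actions
    [0.0, 1.0],  # index 1: alternative reward, learned purely from relabeled data
])


def step(state, action):
    """Move one cell along the chain; both ends are terminal."""
    next_state = state + (1 if action == 1 else -1)
    phi = np.array([float(next_state == N_STATES - 1), float(next_state == 0)])
    done = next_state in (0, N_STATES - 1)
    return next_state, phi, done


def train(num_episodes=300, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    # One Q-table per reward parameterization: the tabular analogue of
    # feeding the reward parameters to a network as an extra input.
    q = np.zeros((len(REWARD_PARAMS), N_STATES, N_ACTIONS))
    buffer = []  # shared replay data; stores reward features, not rewards
    for _ in range(num_episodes):
        state, done = N_STATES // 2, False
        while not done:
            # Experience is collected epsilon-greedily under the nominal reward only.
            if rng.random() < EPSILON:
                action = int(rng.integers(N_ACTIONS))
            else:
                action = int(np.argmax(q[0, state]))
            next_state, phi, done = step(state, action)
            buffer.append((state, action, next_state, phi, done))
            state = next_state

            # Off-policy updates: relabel each sampled transition with every
            # reward parameterization and update the matching Q-table.
            for idx in rng.integers(len(buffer), size=batch_size):
                s, a, s2, f, d = buffer[idx]
                for k, w in enumerate(REWARD_PARAMS):
                    reward = float(w @ f)
                    target = reward + (0.0 if d else GAMMA * q[k, s2].max())
                    q[k, s, a] += ALPHA * (target - q[k, s, a])
    return q


def greedy_action(q, param_index, state):
    """Greedy action for the chosen reward parameterization."""
    return int(np.argmax(q[param_index, state]))
```

After `q = train()`, calling `greedy_action(q, 0, 2)` and `greedy_action(q, 1, 2)` queries the same learned tables under the nominal and the alternative parameterization. In this toy setting the two are expected to head toward opposite ends of the chain, even though the alternative reward never influenced data collection.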