This paper introduces EnvRL, a framework that improves reinforcement learning for long-horizon agentic tasks by adding two auxiliary objectives, state prediction and inverse dynamics, to the primary RL objective. Treating interaction trajectories as implicit supervision lets the agent build a more accurate internal model of its environment, which helps when outcome rewards are sparse. Trained with GRPO, EnvRL raises the success rate of Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld and from 56.8% to 67.0% on WebShop, outperforming RL-only baselines.
By harnessing implicit supervision from environment dynamics, EnvRL raises RL success rates by 4.6 to 10.2 percentage points on two long-horizon agentic benchmarks (ALFWorld and WebShop, Qwen-2.5-1.5B-Instruct trained with GRPO).
Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuitively, relying on outcome rewards alone overlooks the rich environment dynamics information contained in rollout interaction trajectories. We argue that the interaction experience inherently serves as an implicit supervision signal, reveals the underlying transition mechanisms of the environment, and enables the agent to construct a more accurate internal model of the environment. Therefore, in this work, we investigate how to leverage this additional signal to improve policy learning. Specifically, we propose EnvRL, a framework that incorporates environment dynamics learning into agentic RL via two auxiliary objectives: state prediction and inverse dynamics. By jointly optimizing these objectives with the primary RL objective, we encourage the agent to internalize environment dynamics from its own interaction experience. Extensive experiments on two long-horizon agentic benchmarks demonstrate that EnvRL achieves significant improvements in success rate over RL-only baselines, e.g., when trained with GRPO, lifting Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld, and from 56.8% to 67.0% on WebShop.
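The abstract does not spell out how the two auxiliary objectives are formed or weighted, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each rollout step can be written as a (state, action, next state) triple, that both auxiliary tasks are scored by mean negative log-likelihood of the target tokens, and that the three terms are combined as a weighted sum. The names `Transition`, `build_auxiliary_examples`, `mean_nll`, and `joint_loss`, the prompt templates, and the weights `lambda_sp` and `lambda_id` are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Transition:
    """One rollout step: the agent observed `state`, took `action`, then observed `next_state`."""
    state: str
    action: str
    next_state: str


def build_auxiliary_examples(
    trajectory: Sequence[Transition],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Turn a rollout into (prompt, target) pairs for the two auxiliary tasks.

    The prompt templates are illustrative assumptions, not taken from the paper.
    """
    state_prediction = []  # (state, action) -> next_state
    inverse_dynamics = []  # (state, next_state) -> action
    for t in trajectory:
        state_prediction.append(
            (f"State: {t.state}\nAction: {t.action}\nNext state:", t.next_state)
        )
        inverse_dynamics.append(
            (f"State: {t.state}\nNext state: {t.next_state}\nAction:", t.action)
        )
    return state_prediction, inverse_dynamics


def mean_nll(target_token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the target tokens under the model."""
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)


def joint_loss(
    rl_loss: float,
    sp_token_probs: Sequence[float],
    id_token_probs: Sequence[float],
    lambda_sp: float = 0.1,  # assumed weight for state prediction
    lambda_id: float = 0.1,  # assumed weight for inverse dynamics
) -> float:
    """Weighted sum of the primary RL loss and the two auxiliary losses (assumed form)."""
    return (
        rl_loss
        + lambda_sp * mean_nll(sp_token_probs)
        + lambda_id * mean_nll(id_token_probs)
    )
```

In a real training loop the token probabilities would come from the policy model itself and `rl_loss` from the RL algorithm in use (GRPO in the reported experiments). The point of the sketch is only that the same rollouts already collected for RL supply the supervision for both auxiliary tasks, with no extra labels.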