Tsinghua AIAI Laboratory | SEU | Jun 16, 2026 | arXiv:2606.17680

EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning

Zhitong Wang, Songze Li, Hao Peng, Shuzheng Si, Yi Wang, Maosong Sun, Juanzi Li

AI Summary

This paper introduces EnvRL, a framework that improves reinforcement learning for long-horizon agentic tasks by adding two auxiliary objectives, state prediction and inverse dynamics, that teach the agent the dynamics of its environment. Using interaction trajectories as implicit supervision, EnvRL helps the agent build a more accurate internal model of the environment, which compensates for sparse outcome rewards. On two long-horizon agentic benchmarks, EnvRL raises success rates over RL-only baselines; for example, with GRPO training, Qwen-2.5-1.5B-Instruct improves from 72.8% to 77.4% on ALFWorld and from 56.8% to 67.0% on WebShop.

Key Contribution

By using environment dynamics as implicit supervision, EnvRL raises success rates over RL-only GRPO training for Qwen-2.5-1.5B-Instruct by 4.6 percentage points on ALFWorld and by 10.2 percentage points on WebShop, both long-horizon agentic benchmarks.

Abstract

Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuitively, relying on outcome rewards alone overlooks the rich environment dynamics information contained in rollout interaction trajectories. We argue that the interaction experience inherently serves as an implicit supervision signal, reveals the underlying transition mechanisms of the environment, and enables the agent to construct a more accurate internal model of the environment. Therefore, in this work, we investigate how to leverage this additional signal to improve policy learning. Specifically, we propose EnvRL, a framework that incorporates environment dynamics learning into agentic RL via two auxiliary objectives: state prediction and inverse dynamics. By jointly optimizing these objectives with the primary RL objective, we encourage the agent to internalize environment dynamics from its own interaction experience. Extensive experiments on two long-horizon agentic benchmarks demonstrate that EnvRL achieves significant improvements in success rates over RL-only baselines, e.g., when trained with GRPO, lifting Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld, and from 56.8% to 67.0% on WebShop.
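The two auxiliary objectives named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the usual meanings of the terms: state prediction maps (state, action) to the next state, and inverse dynamics maps (state, next state) to the action. The `Step` container, the prompt templates, the weights `sp_weight` and `id_weight`, and the weighted-sum form of the joint loss are all hypothetical; the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    """One interaction step from a rollout (hypothetical container)."""
    state: str       # observation before the agent acts
    action: str      # action the agent took
    next_state: str  # observation returned by the environment


def state_prediction_examples(traj: List[Step]) -> List[Tuple[str, str]]:
    """Build (prompt, target) pairs: given state and action, predict next state."""
    return [
        (f"State: {s.state}\nAction: {s.action}\nNext state:", s.next_state)
        for s in traj
    ]


def inverse_dynamics_examples(traj: List[Step]) -> List[Tuple[str, str]]:
    """Build (prompt, target) pairs: given state and next state, recover the action."""
    return [
        (f"State: {s.state}\nNext state: {s.next_state}\nAction:", s.action)
        for s in traj
    ]


def joint_loss(rl_loss: float, sp_loss: float, id_loss: float,
               sp_weight: float = 0.1, id_weight: float = 0.1) -> float:
    """Weighted sum of the RL loss and the two auxiliary losses.

    The weighted-sum form and the default weights are assumptions made
    for illustration only.
    """
    return rl_loss + sp_weight * sp_loss + id_weight * id_loss


# A made-up two-step trajectory in the style of a household task.
trajectory = [
    Step("You are in the kitchen.", "open fridge", "The fridge is open."),
    Step("The fridge is open.", "take apple", "You hold an apple."),
]
sp_pairs = state_prediction_examples(trajectory)
id_pairs = inverse_dynamics_examples(trajectory)
```

Each rollout step yields one training pair per auxiliary objective, so the same trajectories collected for RL supply the extra supervision at no additional interaction cost.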

Tool Use & Agents | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
