Cambridge · Chongqing · HKUST · Jun 16, 2026 · arXiv:2606.17682

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

Chao Chen, Chengzu Li, Zhiwei Li, Yinhong Liu, Zhijiang Guo

AI Summary

This paper introduces the LLM-as-Environment-Engineer framework, which automates the redesign of reinforcement learning training environments by having the current policy model analyze its own failure trajectories and propose modifications to the next-stage environment configuration. On the MAPF-FrozenLake testbed, the environment engineer is conditioned on structured summaries of policy behavior, failure cases, and environment statistics. With Qwen3-4B as the backbone, the framework achieves the strongest aggregate performance on the authors' benchmarks, outperforming larger proprietary LLMs and fixed-environment training baselines. A key finding is that the current RL checkpoint is a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.

Key Contribution

The current RL checkpoint outperforms larger LLMs in redesigning training environments, revealing that iterative learning enhances diagnostic capabilities.

Abstract

Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.
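The staged loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the configuration fields (`grid_size`, `num_agents`, `hole_density`), the `StageSummary` type, and the rule-based `toy_engineer` are all hypothetical stand-ins. In the framework itself, the engineer is the current policy LLM reading a structured summary, and the abstract does not list the actual configuration dimensions of MAPF-FrozenLake.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EnvConfig:
    # Hypothetical knobs; the real generator's dimensions are not given in the abstract.
    grid_size: int
    num_agents: int
    hole_density: float


@dataclass
class StageSummary:
    # Mirrors the three context types named in the abstract:
    # policy behavior, failure cases, and environment statistics.
    success_rate: float
    failure_cases: List[str]
    env_stats: Dict[str, float]


def toy_engineer(summary: StageSummary, config: EnvConfig) -> EnvConfig:
    """Rule-based stand-in for the LLM environment engineer (an assumption)."""
    if summary.success_rate >= 0.8:
        # The policy handles this setting, so make the task harder.
        return replace(config, num_agents=config.num_agents + 1)
    if any("collision" in case for case in summary.failure_cases):
        # Use failure evidence to change one knob and keep the rest unchanged.
        return replace(config, grid_size=config.grid_size + 1)
    return config


def run_stages(
    initial: EnvConfig,
    train_and_evaluate: Callable[[EnvConfig], StageSummary],
    engineer: Callable[[StageSummary, EnvConfig], EnvConfig],
    num_stages: int,
) -> List[EnvConfig]:
    """Alternate training on a config with redesigning the next config."""
    config = initial
    history = [config]
    for _ in range(num_stages):
        summary = train_and_evaluate(config)
        config = engineer(summary, config)
        history.append(config)
    return history


def fake_train_and_evaluate(config: EnvConfig) -> StageSummary:
    """Placeholder for an RL training stage; returns made-up statistics."""
    rate = 0.9 if config.num_agents <= 2 else 0.5
    failures = [] if rate >= 0.8 else ["agent collision at step 3"]
    return StageSummary(rate, failures, {"grid_size": float(config.grid_size)})


if __name__ == "__main__":
    start = EnvConfig(grid_size=4, num_agents=2, hole_density=0.1)
    for stage, cfg in enumerate(run_stages(start, fake_train_and_evaluate, toy_engineer, 3)):
        print(stage, cfg)
```

Passing the engineer as a callable keeps the loop independent of how the next configuration is produced, so a fixed-environment baseline (an engineer that returns `config` unchanged) or an LLM-backed engineer could be swapped in without touching `run_stages`.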

Training Efficiency & Optimization · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
