This paper introduces COSPLAY, a co-evolution framework for LLMs in long-horizon tasks. An LLM decision agent retrieves skills from a learned skill bank to guide its actions, while a skill pipeline agent discovers reusable skills from unlabeled rollouts to populate that bank. The framework iteratively improves both sides: the decision agent's skill retrieval and action generation, and the skill bank's ability to extract and refine skills. Experiments across six game environments show that COSPLAY, using an 8B base model, achieves over 25.1% average reward improvement against four frontier LLM baselines in single-player games and remains competitive in multi-player social reasoning games.
LLMs can master long-horizon tasks by co-evolving a decision-making agent with a skill bank that learns and refines reusable skills from the agent's own experience.
Long-horizon interactive environments are a testbed for evaluating agents' skill-usage abilities. These environments demand multi-step reasoning, the chaining of multiple skills over many timesteps, and robust decision-making under delayed rewards and partial observability. Games are a natural instance of such environments. Large Language Models (LLMs) offer a promising alternative as game-playing agents, but they often struggle with consistent long-horizon decision-making because they lack a mechanism to discover, retain, and reuse structured skills across episodes. We present COSPLAY, a co-evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action-taking, while an agent-managed skill pipeline discovers reusable skills from the agent's unlabeled rollouts to form the skill bank. The two components improve together: the decision agent learns better skill retrieval and action generation, while the skill bank agent continually extracts, refines, and updates skills together with their contracts. Experiments across six game environments show that COSPLAY with an 8B base model achieves over 25.1% average reward improvement against four frontier LLM baselines on single-player game benchmarks while remaining competitive on multi-player social reasoning games.
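The data flow described in the abstract (rollouts feed a skill pipeline, which fills a skill bank, which the decision agent then queries) can be illustrated with a toy sketch. This is not the authors' implementation: it uses no LLM and no training, and every name in it (`Skill`, `SkillBank`, `discover_skills`, the n-gram heuristic, the word-overlap retrieval, and the idea of storing an observation as the skill's "contract") is a hypothetical stand-in chosen only to make the loop concrete.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    actions: tuple   # reusable action sequence
    contract: str    # assumption: free text describing when the skill applies


@dataclass
class SkillBank:
    skills: dict = field(default_factory=dict)

    def add(self, skill):
        # Keep the first version of each skill; refinement is out of scope here.
        self.skills.setdefault(skill.name, skill)

    def retrieve(self, observation, k=1):
        # Toy retrieval: rank skills by word overlap between the observation
        # and each contract (ties broken by name). A skill with zero overlap
        # can still be returned if nothing better exists.
        obs_words = set(observation.lower().split())
        ranked = sorted(
            self.skills.values(),
            key=lambda s: (-len(obs_words & set(s.contract.lower().split())), s.name),
        )
        return ranked[:k]


def discover_skills(rollouts, n=2, min_count=2):
    """Toy skill pipeline: action n-grams seen at least min_count times become skills.

    Each rollout is a list of (observation, action) pairs with no reward labels.
    """
    counts = Counter()
    first_obs = {}
    for rollout in rollouts:
        for i in range(len(rollout) - n + 1):
            ngram = tuple(action for _, action in rollout[i:i + n])
            counts[ngram] += 1
            # Use the observation where the n-gram first started as its contract.
            first_obs.setdefault(ngram, rollout[i][0])
    return [
        Skill(name="+".join(ngram), actions=ngram, contract=first_obs[ngram])
        for ngram, count in sorted(counts.items())
        if count >= min_count
    ]


# Two unlabeled rollouts from a made-up environment.
rollouts = [
    [("door locked key nearby", "pick_key"), ("holding key", "open_door"), ("room empty", "wait")],
    [("door locked key nearby", "pick_key"), ("holding key", "open_door")],
]

bank = SkillBank()
for skill in discover_skills(rollouts):
    bank.add(skill)

# The decision agent would query the bank with its current observation.
best = bank.retrieve("a locked door")[0]
print(best.name, best.actions)
```

In a co-evolving setup as the abstract describes it, the rollouts produced while acting with retrieved skills would be fed back into the pipeline on the next round, so the bank and the agent change together; the sketch shows only a single pass of that cycle.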