Princeton, SJTU | Jun 9, 2026 | arXiv:2606.11182

EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

AI Summary

This paper introduces EEVEE, a multi-dataset test-time prompt learning framework that lets large language model (LLM) agents handle heterogeneous input streams in real-world applications. A router partitions incoming inputs into task clusters and assigns each cluster a suitable prompt configuration, which addresses the cross-dataset interference that limits existing single-dataset methods and improves robustness across diverse data streams. In the reported experiments, EEVEE surpasses the state-of-the-art methods GEPA and ACE by up to 37.2% and 48.2% on average multi-benchmark scores.

Key Contribution

EEVEE's router-prompt co-evolution strategy interleaves router and prompt learning phases so that LLM agents can learn prompts at test time from heterogeneous, multi-dataset task streams, surpassing existing methods by up to 48.2%.

Abstract

In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, which limits their practical applicability, because real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations. This design is optimized via a router-prompt co-evolution strategy, which interleaves router and prompt learning phases to address their mutual dependency. Experiments across multiple datasets demonstrate that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, surpassing SOTA methods GEPA and ACE by up to 37.2% and 48.2%.
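
The routing idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the bag-of-words embedding, the cosine threshold, the `Router` class, the `refine_prompt` placeholder, and the `co_evolve` loop are all hypothetical names and simplifications chosen for illustration. It only shows the general shape of partitioning a mixed input stream into clusters, keeping one prompt per cluster, and alternating between a router update phase and a prompt update phase.

```python
import math
from collections import Counter

DEFAULT_PROMPT = "You are a helpful agent."  # hypothetical starting prompt


def embed(text):
    """Toy bag-of-words embedding (an assumption, not the paper's encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Router:
    """Nearest-centroid router over task clusters (illustrative only)."""

    def __init__(self, threshold=0.2):
        self.threshold = threshold
        self.centroids = []  # one token-count vector per cluster

    def route(self, text):
        """Return the best matching cluster index, or None if none is close."""
        vec = embed(text)
        best, best_score = None, self.threshold
        for index, centroid in enumerate(self.centroids):
            score = cosine(vec, centroid)
            if score >= best_score:
                best, best_score = index, score
        return best

    def learn(self, text):
        """Assign text to a cluster (creating one if needed) and update it."""
        cluster = self.route(text)
        if cluster is None:
            self.centroids.append(Counter())
            cluster = len(self.centroids) - 1
        self.centroids[cluster].update(embed(text))
        return cluster


def refine_prompt(prompt, examples):
    """Stand-in for an LLM-driven prompt update (hypothetical placeholder)."""
    return f"{prompt} | notes from {len(examples)} example(s)"


def co_evolve(stream, batch_size=2, threshold=0.2):
    """Alternate a router phase and a prompt phase over batches of the stream."""
    router, prompts = Router(threshold), {}
    for start in range(0, len(stream), batch_size):
        batch = stream[start:start + batch_size]
        # Router phase: update clusters while prompts stay fixed.
        assignments = {}
        for text in batch:
            assignments.setdefault(router.learn(text), []).append(text)
        # Prompt phase: update each cluster's prompt while the router stays fixed.
        for cluster, examples in assignments.items():
            base = prompts.get(cluster, DEFAULT_PROMPT)
            prompts[cluster] = refine_prompt(base, examples)
    return router, prompts
```

Calling `co_evolve` on a stream that mixes, say, equation-solving requests and flight-booking requests yields a router with one centroid per task family and a separate prompt for each; `router.route(text)` then selects which prompt to use for a new input, and a `None` result could fall back to `DEFAULT_PROMPT`.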

Natural Language Processing, Tool Use & Agents

Year: 2026

Venue: N/A
