This paper introduces ParetoPO, a two-stage multi-objective optimization framework that enhances the alignment of tool-using large language models (LLMs) by balancing task accuracy with tool-use efficiency. By employing hypervolume-guided dynamic scalarization and Pareto-ranking-based advantage computation, the method enables fine-grained optimization across conflicting objectives, leading to improved performance in complex reasoning tasks. Experimental results demonstrate that ParetoPO outperforms static and heuristic baselines, achieving superior accuracy-efficiency trade-offs in mathematical reasoning and multi-hop question answering tasks.
ParetoPO optimizes tool-integrated agents for both task accuracy and tool-use efficiency, yielding better accuracy-efficiency trade-offs for practical deployment.
Recent advances in tool-integrated language agents have significantly improved their ability to solve complex reasoning tasks. However, existing alignment methods predominantly focus on maximizing task accuracy while overlooking auxiliary objectives such as tool-use efficiency, which are essential for practical deployment. To address this gap, we introduce ParetoPO, a two-stage multi-objective optimization framework for aligning tool-using large language models (LLMs) under competing objectives. In the first stage, ParetoPO leverages hypervolume-guided dynamic scalarization to adapt reward weights based on global Pareto frontier progress. In the second stage, it replaces scalarized learning signals with Pareto-ranking-based advantage computation, promoting nondominated trajectories through dominance-aware credit assignment. This design enables fine-grained, action-level optimization across multiple conflicting objectives. Experimental results on mathematical reasoning and multi-hop QA tasks show that ParetoPO consistently discovers policies with superior accuracy-efficiency trade-offs compared to static and heuristic baselines.
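The two quantities the abstract relies on, Pareto ranking and hypervolume, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`dominates`, `pareto_ranks`, `rank_advantages`, `hypervolume_2d`), the two-objective setup with both objectives maximized, the reference point, and the mean-centered negative rank used as an advantage are all assumptions made for illustration. The abstract does not say how hypervolume progress is mapped to reward weights, so that step is omitted.

```python
from typing import List, Sequence, Tuple

# Each trajectory is scored as (accuracy, efficiency); both are maximized (assumption).
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_ranks(points: Sequence[Point]) -> List[int]:
    """Non-dominated sorting: rank 0 is the Pareto front, rank 1 is the front
    that remains after removing rank 0, and so on."""
    ranks = [-1] * len(points)
    remaining = set(range(len(points)))
    rank = 0
    while remaining:
        front = {
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        }
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks


def rank_advantages(points: Sequence[Point]) -> List[float]:
    """Hypothetical advantage signal: negative Pareto rank, centered on the group mean,
    so nondominated trajectories receive the largest advantage."""
    scores = [-float(r) for r in pareto_ranks(points)]
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


def hypervolume_2d(points: Sequence[Point], ref: Point = (0.0, 0.0)) -> float:
    """Area dominated by the Pareto front relative to the reference point `ref`."""
    ranks = pareto_ranks(points)
    front = sorted({p for p, r in zip(points, ranks) if r == 0},
                   key=lambda p: p[0], reverse=True)
    area = 0.0
    prev_y = ref[1]
    # Sweep from the largest first objective; the second objective rises along the front.
    for x, y in front:
        if x > ref[0] and y > prev_y:
            area += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return area


# Made-up scores for five sampled trajectories.
samples = [(1.0, 0.2), (0.8, 0.6), (0.5, 0.9), (0.7, 0.5), (0.4, 0.4)]
ranks = pareto_ranks(samples)
advantages = rank_advantages(samples)
hv = hypervolume_2d(samples)
```

In this toy batch the first three trajectories are mutually nondominated and share rank 0, while the last two are dominated and fall into later fronts. A growing hypervolume across training batches would indicate that the frontier is advancing, which is the kind of global progress signal the first stage is described as using.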