HKU | Huawei | ZJU | Jun 15, 2026 | arXiv:2606.16111

Towards Pareto-Optimal Tool-Integrated Agents with Pareto Ranking Policy Optimization

Junyi Li, Xiaowei Qian, Yingyi Zhang, Wenlin Zhang, Guojing Li, Sheng Zhang, Xiao Han, Yichao Wang, Xiangyu Zhao

AI Summary

This paper introduces ParetoPO, a two-stage multi-objective optimization framework that enhances the alignment of tool-using large language models (LLMs) by balancing task accuracy with tool-use efficiency. By employing hypervolume-guided dynamic scalarization and Pareto-ranking-based advantage computation, the method enables fine-grained optimization across conflicting objectives, leading to improved performance in complex reasoning tasks. Experimental results demonstrate that ParetoPO outperforms static and heuristic baselines, achieving superior accuracy-efficiency trade-offs in mathematical reasoning and multi-hop question answering tasks.

Key Contribution

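ParetoPO combines hypervolume-guided dynamic scalarization with Pareto-ranking-based advantage computation, and achieves better accuracy-efficiency trade-offs than static and heuristic baselines on mathematical reasoning and multi-hop QA tasks.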

Abstract

Recent advances in tool-integrated language agents have significantly improved their ability to solve complex reasoning tasks. However, existing alignment methods predominantly focus on maximizing task accuracy, while overlooking auxiliary objectives such as tool-use efficiency, which are essential for practical deployment. To address this gap, we introduce ParetoPO, a two-stage multi-objective optimization framework for aligning tool-using large language models (LLMs) under competing objectives. In the first stage, ParetoPO leverages hypervolume-guided dynamic scalarization to adapt reward weights based on global Pareto frontier progress. In the second stage, it replaces scalarized learning signals with Pareto-ranking-based advantage computation, promoting nondominated trajectories through dominance-aware credit assignment. This design enables fine-grained, action-level optimization across multiple conflicting objectives. Experimental results on mathematical reasoning and multi-hop QA tasks show that ParetoPO consistently discovers policies with superior accuracy-efficiency trade-offs compared to static and heuristic baselines.
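The two quantities named in the abstract, Pareto ranking and hypervolume, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the weight-update rule or the exact advantage formula, so the sketch only shows generic non-dominated sorting, a simple group-centered advantage derived from the ranks (an assumption), and the two-objective hypervolume relative to a reference point. The function names, the `(accuracy, efficiency)` tuple layout, and the sample values are all hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: (accuracy, efficiency), both objectives maximized.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_ranks(points: Sequence[Point]) -> List[int]:
    """Non-dominated sorting: rank 0 is the nondominated front, rank 1 the next one, etc."""
    ranks = [-1] * len(points)
    remaining = set(range(len(points)))
    rank = 0
    while remaining:
        # Points not dominated by any other point still in the pool form the current front.
        front = {
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        }
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks


def rank_advantages(ranks: Sequence[int]) -> List[float]:
    """Assumed advantage: negated rank, centered on the group mean (better front, higher value)."""
    scores = [-float(r) for r in ranks]
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


def hypervolume_2d(points: Sequence[Point], ref: Point = (0.0, 0.0)) -> float:
    """Area dominated by the points and bounded below by the reference point."""
    pts = sorted(
        (p for p in points if p[0] > ref[0] and p[1] > ref[1]),
        key=lambda p: p[0],
        reverse=True,
    )
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:
            # Sweep from large x to small x, adding each new horizontal strip.
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


# Made-up rollouts for one prompt, for illustration only.
rollouts = [(1.0, 0.2), (0.8, 0.6), (0.5, 0.5), (0.4, 0.9)]
ranks = pareto_ranks(rollouts)
advantages = rank_advantages(ranks)
hv = hypervolume_2d(rollouts)
```

In this sketch, a rollout receives a positive advantage when it sits on a better front than the group average, regardless of how any fixed weighting would score it. A change in `hv` between training steps is one plausible signal of frontier progress that a dynamic scalarization scheme could react to.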

RLHF & Preference Learning | Scalable Oversight & Alignment Theory | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
