CAS | of Artificial Intelligence (TeleAI) | ZJU | Jun 1, 2026 | arXiv:2606.02132

Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

Liuji Chen, Dianxing Tang, Xing Shi, Dingshuo Chen, Qiang Liu, Shu Wu, Liang Wang

AI Summary

This paper introduces EAPO, an Efficient Agentic Policy Optimization framework that mitigates tool abuse in agentic reinforcement learning by promoting selective tool use. By adding tool-free trajectories to each rollout group and applying difficulty-aware reward shaping, EAPO balances tool usage against internal reasoning. Across nine benchmarks and three models, it improves average performance over GRPO by 7.27% to 10.45% while reducing average tool calls by 18.33% to 24.59%, indicating that agents can reason well without excessive reliance on external tools.

Key Contribution

EAPO enables agents to learn when to forgo tool use: relative to GRPO, it improves average performance by up to 10.45% while reducing average tool calls by at least 18.33%.

Abstract

Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning. Existing approaches mitigate this issue with uniform tool-use penalties or hard limits, which reduce tool frequency but may also suppress useful tool-assisted exploration. We propose EAPO, an Efficient Agentic Policy Optimization framework that learns selective tool use. EAPO introduces tool-free trajectories into each rollout group, applies difficulty-aware reward shaping to penalize redundant tool calls mainly on easier queries, and uses confidence-aware token reweighting to improve policy learning. Across nine mathematical and knowledge-intensive reasoning benchmarks, EAPO consistently improves the accuracy-efficiency trade-off on Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B. Compared with GRPO, EAPO improves average performance by 10.45%, 7.27%, and 9.69%, while reducing average tool calls by 18.33%, 18.33%, and 24.59%, respectively. These results show that agents can learn when not to use tools without compromising tool-integrated reasoning.
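
The abstract gives no formulas, so the following is only a minimal sketch of how difficulty-aware reward shaping could sit on top of GRPO-style group-relative advantages. Everything specific here is an assumption for illustration and not the paper's method: the function names `shaped_rewards` and `group_advantages`, the use of the group's pass rate as a difficulty proxy, the `penalty` and `max_discount` values, and the multiplicative form of the discount. Confidence-aware token reweighting is not sketched.

```python
from statistics import mean, pstdev


def shaped_rewards(correct, tool_calls, penalty=0.1, max_discount=0.5):
    """Hypothetical difficulty-aware shaping for one rollout group.

    correct      -- 0/1 outcome rewards, one per trajectory
    tool_calls   -- tool-call counts (0 marks a tool-free trajectory)
    penalty      -- assumed per-call penalty scale (not from the paper)
    max_discount -- assumed cap so a correct answer keeps a positive reward
    """
    # Assumed difficulty proxy: the group's pass rate. A high pass rate
    # suggests an easy query, where tool calls are more likely redundant.
    easiness = mean(correct)
    rewards = []
    for c, n in zip(correct, tool_calls):
        # The discount grows with easiness and call count, capped at
        # max_discount; it shrinks toward zero on hard queries.
        discount = min(penalty * easiness * n, max_discount)
        rewards.append(c * (1.0 - discount))
    return rewards


def group_advantages(rewards):
    """GRPO-style advantages: standardize rewards within the group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        # All trajectories tied: no learning signal from this group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Easy query: three of four trajectories succeed, one of them tool-free.
easy = shaped_rewards(correct=[1, 1, 1, 0], tool_calls=[0, 2, 2, 3])
# Hard query: only the tool-heavy trajectory succeeds.
hard = shaped_rewards(correct=[0, 0, 0, 1], tool_calls=[0, 0, 1, 3])
easy_adv = group_advantages(easy)
```

Under these assumptions, the tool-free correct trajectory on the easy query receives the largest advantage in its group, while the lone tool-assisted success on the hard query is discounted only lightly, so useful tool-assisted exploration is not suppressed.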

RLHF & Preference Learning | Scalable Oversight & Alignment Theory | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
