This paper investigates the use of Group Relative Policy Optimization (GRPO) to improve tool-use accuracy in Small Language Models (SLMs). A reward system was designed to reinforce structured JSON output, correct tool selection, and precise parameter usage during RL training. Results demonstrate that GRPO significantly enhances SLMs' tool-use capabilities, enabling more effective function calling and JSON output generation.
SLMs can punch above their weight in tool use with the right RL training, rivaling LLMs in specific function-calling tasks.
The ability to use tools effectively has become a defining feature of Large Language Models (LLMs), allowing them to access external data and internal resources. As tool-augmented AI agents grow more sophisticated, tool-use capabilities have become indispensable. While LLMs have made significant progress in this area, Small Language Models (SLMs) still struggle to integrate tool use accurately, especially in resource-constrained settings. This study investigates how Reinforcement Learning, specifically Group Relative Policy Optimization (GRPO), can enhance the tool-use accuracy of SLMs. By designing a well-defined reward system that reinforces structured JSON output, correct tool selection, and precise parameter usage, we demonstrate that GRPO enables SLMs to achieve significant improvements in tool-use capabilities (function calling/JSON output). Our findings highlight the ability of GRPO to empower SLMs, which are traditionally constrained in tool use, and our approach provides a computationally efficient training method that supports SLMs' practical deployment in real-world AI applications.
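To make the reward design concrete, here is a minimal sketch, not the authors' code, of a reward function covering the three criteria named in the abstract (valid structured JSON, correct tool selection, precise parameter usage), together with the group-relative advantage normalization that characterizes GRPO. The weighting scheme (0.3/0.3/0.4), the partial-credit rule for parameters, and names such as `expected_call` are illustrative assumptions, not values from the paper.

```python
# Sketch of a tool-use reward plus GRPO-style group-relative advantages.
# Reward weights and helper names are assumed for illustration only.
import json


def tool_use_reward(completion: str, expected_call: dict) -> float:
    """Score a completion on: parsable JSON, correct tool name, and
    matching parameters. The 0.3/0.3/0.4 split is an assumed example."""
    try:
        call = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0  # malformed output earns no reward

    reward = 0.3  # valid structured JSON
    if call.get("name") == expected_call["name"]:
        reward += 0.3  # correct tool selected
        expected_args = expected_call.get("arguments", {})
        got_args = call.get("arguments", {})
        if expected_args:
            # partial credit per exactly-matching parameter value
            matches = sum(1 for k, v in expected_args.items()
                          if got_args.get(k) == v)
            reward += 0.4 * matches / len(expected_args)
        else:
            reward += 0.4  # tool takes no arguments
    return reward


def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each sampled completion's
    reward by its group's mean and std, avoiding a learned value model."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]


if __name__ == "__main__":
    expected = {"name": "get_weather",
                "arguments": {"city": "Paris", "unit": "celsius"}}
    group = [  # a sampling group of candidate completions
        '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}',
        '{"name": "get_weather", "arguments": {"city": "paris"}}',
        '{"name": "search_web", "arguments": {"query": "Paris weather"}}',
        'call get_weather(Paris)',  # not JSON
    ]
    rewards = [tool_use_reward(c, expected) for c in group]
    print(rewards)                 # [1.0, 0.6, 0.3, 0.0]
    print(grpo_advantages(rewards))
```

Under this kind of scheme, completions that are merely well-formed still earn some signal, while fully correct calls dominate the group, which is the gradient the abstract's reward design aims to produce.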