Mar 15, 2026 | arXiv:2603.14602

$PA^3$: $\textbf{P}$olicy-$\textbf{A}$ware $\textbf{A}$gent $\textbf{A}$lignment through Chain-of-Thought

Shubhashis Roy Dipta, Daniel Bis, Kun Zhou, Lichao Wang, Benjamin Z. Yao, Chenlei Guo, Ruhi Sarikaya

AI Summary

The paper introduces $PA^3$, a multi-stage alignment method that improves LLMs' adherence to complex business rules in conversational tool-use scenarios. $PA^3$ trains models to recall and apply the relevant policies during chain-of-thought reasoning, so the full policy text does not have to be included in the context window. Using a novel PolicyRecall reward and a Hallucination Penalty during GRPO training, the best model outperforms the baseline by 16 points and comparable in-context baselines of similar size by 3 points, while using 40% fewer words.

Key Contribution

LLMs can learn to internalize and apply complex business rules during chain-of-thought reasoning, outperforming in-context learning while using significantly shorter prompts.

Abstract

Conversational assistants powered by large language models (LLMs) excel at tool-use tasks but struggle with adhering to complex, business-specific rules. While models can reason over business rules provided in context, including all policies for every query introduces high latency and wastes compute. Furthermore, these lengthy prompts lead to long contexts, harming overall performance due to the "needle-in-the-haystack" problem. To address these challenges, we propose a multi-stage alignment method that teaches models to recall and apply relevant business policies during chain-of-thought reasoning at inference time, without including the full business policy in-context. Furthermore, we introduce a novel PolicyRecall reward based on the Jaccard score and a Hallucination Penalty for GRPO training. Altogether, our best model outperforms the baseline by 16 points and surpasses comparable in-context baselines of similar model size by 3 points, while using 40% fewer words.
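The abstract states only that the PolicyRecall reward is based on the Jaccard score and that a Hallucination Penalty is applied during GRPO training; it does not give the exact formulas. The following is a minimal sketch of how such a reward could be computed, not the authors' implementation. The function names, the policy-ID representation, the `catalog` argument, the definition of a hallucinated policy as one absent from the catalog, and the `penalty_weight` value are all assumptions made for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard score |a ∩ b| / |a ∪ b|; taken as 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def policy_recall_reward(recalled, gold, catalog, penalty_weight=0.5):
    """Hypothetical reward: Jaccard overlap minus a hallucination penalty.

    recalled: policy IDs the model cited in its chain of thought (assumed format)
    gold:     policy IDs that are actually relevant to the query
    catalog:  all policy IDs that exist in the business policy (assumed input)
    """
    recalled, gold, catalog = set(recalled), set(gold), set(catalog)

    # Overlap between what the model recalled and what it should have recalled.
    recall_score = jaccard(recalled, gold)

    # Assumed penalty: fraction of recalled policies that do not exist at all.
    hallucinated = recalled - catalog
    penalty = len(hallucinated) / len(recalled) if recalled else 0.0

    return recall_score - penalty_weight * penalty


catalog = {"refund_window", "id_check", "fee_waiver"}
exact = policy_recall_reward(["refund_window", "id_check"],
                             ["refund_window", "id_check"], catalog)
partial = policy_recall_reward(["refund_window", "made_up_rule"],
                               ["refund_window", "id_check"], catalog)
print(exact, partial)
```

In this sketch a perfect recall with no invented policies scores 1.0, while citing one correct and one nonexistent policy is penalized twice: once through the lower Jaccard overlap and once through the hallucination term.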

Constitutional AI & AI Ethics | Reasoning & Chain-of-Thought | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
