Mar 30, 2026 · arXiv:2603.28681

Functional Natural Policy Gradients

Aurelien Bibaut, Houssam Zenati, Thibaud Rahier, Nathan Kallus

AI Summary

This paper introduces a cross-fitted debiasing device for offline policy learning that achieves $\sqrt{N}$ regret even for policy classes whose complexity exceeds Donsker, provided a product-of-errors nuisance remainder is $O(N^{-1/2})$. The regret bound factors into a plug-in policy error term, governed by policy-class complexity, and an environment nuisance term, governed by the complexity of the environment dynamics. This factorization makes explicit how one source of complexity can be traded against the other.
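To make the idea of cross-fitted debiasing concrete, here is a minimal sketch of a generic cross-fitted doubly robust value estimate for a fixed policy in a two-action contextual bandit. It is a standard textbook construction used only as an illustration, not the estimator proposed in the paper. The function names (`fit_linear`, `cross_fitted_dr_value`), the linear outcome model, the context-independent logging propensity, and the simulated data are all assumptions made for this example.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept; returns a prediction function."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ beta


def cross_fitted_dr_value(X, A, R, policy, n_folds=2, seed=0):
    """Cross-fitted doubly robust estimate of the value of `policy`.

    Illustrative assumptions: actions are in {0, 1}, rewards are modeled
    linearly per action, and the logging propensity does not depend on X.
    """
    n = len(R)
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    target = policy(X)  # action the evaluated policy would take
    scores = np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        # Nuisances are fitted on the other folds only (cross-fitting).
        q = [fit_linear(X[train & (A == a)], R[train & (A == a)]) for a in (0, 1)]
        p1 = A[train].mean()  # estimated probability of logging action 1
        Xt, At, Rt, Tt = X[test], A[test], R[test], target[test]
        q_all = np.column_stack([q[0](Xt), q[1](Xt)])
        idx = np.arange(len(Rt))
        prop = np.where(At == 1, p1, 1 - p1)
        # Plug-in term plus an importance-weighted residual correction.
        scores[test] = q_all[idx, Tt] + (At == Tt) / prop * (Rt - q_all[idx, At])
    return scores.mean()


# Simulated logged data (hypothetical): uniform logging policy, reward x0 * (2a - 1).
rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 2))
A = rng.integers(0, 2, size=n)
R = X[:, 0] * (2 * A - 1) + 0.1 * rng.normal(size=n)

value_hat = cross_fitted_dr_value(X, A, R, lambda X: (X[:, 0] > 0).astype(int))
```

In this simulation the evaluated policy picks action 1 when the first context coordinate is positive, so its true value is $E|X_0| = \sqrt{2/\pi}$, and `value_hat` should land close to that. The key design choice is that each observation's score uses nuisance estimates fitted without that observation, which is what lets the correction term remove first-order bias from the plug-in estimate.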

Key Contribution

Achieves $\sqrt{N}$ regret in offline policy learning, even for policy classes more complex than Donsker, by trading policy-class complexity against environment complexity.

Abstract

We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is $\sqrt N$ regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is $O(N^{-1/2})$. The regret bound factors into a plug-in policy error factor governed by policy-class complexity and an environment nuisance factor governed by the complexity of the environment dynamics, making explicit how one may be traded against the other.
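The product-of-errors condition can be illustrated with a short rate calculation. The exponents $a$ and $b$ and the error symbols below are hypothetical notation introduced for this example, not quantities defined in the abstract.

```latex
% Assume (hypothetically) polynomial rates for the two factors:
%   plug-in policy error:        \varepsilon_{\pi}(N) = O_P(N^{-a}), \quad a > 0,
%   environment nuisance error:  \varepsilon_{\mathrm{env}}(N) = O_P(N^{-b}), \quad b > 0.
\[
  \varepsilon_{\pi}(N)\,\varepsilon_{\mathrm{env}}(N)
  = O_P\!\left(N^{-(a+b)}\right)
  = O_P\!\left(N^{-1/2}\right)
  \quad\text{whenever}\quad a + b \ge \tfrac{1}{2}.
\]
% Example: a = 1/8 (a slowly learned, complex policy class) and
% b = 3/8 (a better-estimated environment) give a + b = 1/2.
```

Under these assumed rates, a slower policy error rate can be offset by a faster environment nuisance rate, and vice versa, which is one way to read the trade-off described in the abstract.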

RLHF & Preference Learning · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
