Feb 24, 2026 · arXiv:2602.21424

On the Structural Non-Preservation of Epistemic Behaviour under Policy Transformation

AI Summary

This paper formalizes "behavioural dependency" in RL agents under partial observability, defining it as the variation in action selection with respect to internal information under fixed observations. It then establishes three structural results: the set of policies exhibiting non-trivial behavioural dependency is not closed under convex aggregation; behavioural distance contracts under convex combination; and there is a sufficient local condition under which gradient ascent on a skewed mixture objective decreases behavioural distance. Minimal bandit and partially observable gridworld experiments support these findings: in the examined settings, behavioural distance decreases under convex aggregation and under continued optimisation with skewed latent priors.

Key Contribution

Mixing RL policies can unexpectedly erase learned information-seeking behaviors, even when individual policies exhibit them strongly.

Abstract

Reinforcement learning (RL) agents under partial observability often condition actions on internally accumulated information such as memory or inferred latent context. We formalise such information-conditioned interaction patterns as behavioural dependency: variation in action selection with respect to internal information under fixed observations. This induces a probe-relative notion of $ε$-behavioural equivalence and a within-policy behavioural distance that quantifies probe sensitivity. We establish three structural results. First, the set of policies exhibiting non-trivial behavioural dependency is not closed under convex aggregation. Second, behavioural distance contracts under convex combination. Third, we prove a sufficient local condition under which gradient ascent on a skewed mixture objective decreases behavioural distance when a dominant-mode gradient aligns with the direction of steepest contraction. Minimal bandit and partially observable gridworld experiments provide controlled witnesses of these mechanisms. In the examined settings, behavioural distance decreases under convex aggregation and under continued optimisation with skewed latent priors, and in these experiments it precedes degradation under latent prior shift. These results identify structural conditions under which probe-conditioned behavioural separation is not preserved under common policy transformations.
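The first two structural results can be illustrated with a minimal numerical sketch. The abstract does not give the paper's exact definition of behavioural distance, so the code below assumes a simple stand-in: the total variation distance between a policy's action distributions under two values of a binary internal state, at one fixed observation. The policies `pi_a` and `pi_b`, the helper names, and the pointwise mixing rule are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def behavioural_distance(policy):
    """Assumed stand-in for behavioural distance (not the paper's definition).

    policy has shape (2, n_actions): row z is the action distribution
    pi(. | o, z) for internal state z at one fixed observation o.
    Returns the total variation distance between the two rows.
    """
    return 0.5 * np.abs(policy[0] - policy[1]).sum()

def mix(pi_a, pi_b, lam):
    """Pointwise convex combination of two policies' action distributions."""
    return lam * pi_a + (1.0 - lam) * pi_b

# Two hypothetical policies that both depend strongly on the internal
# state, but in opposite directions.
pi_a = np.array([[0.9, 0.1],
                 [0.1, 0.9]])
pi_b = np.array([[0.1, 0.9],
                 [0.9, 0.1]])

d_a = behavioural_distance(pi_a)
d_b = behavioural_distance(pi_b)

# Non-closure: the equal-weight mixture ignores the internal state entirely.
d_mix = behavioural_distance(mix(pi_a, pi_b, 0.5))

# Contraction: under this assumed distance, the mixture's distance never
# exceeds the same convex combination of the individual distances.
lams = np.linspace(0.0, 1.0, 11)
contracts = all(
    behavioural_distance(mix(pi_a, pi_b, lam)) <= lam * d_a + (1.0 - lam) * d_b + 1e-12
    for lam in lams
)

print(f"d(pi_a)={d_a:.2f}  d(pi_b)={d_b:.2f}  d(mixture)={d_mix:.2f}  contracts={contracts}")
```

Under these assumptions, each policy separates the two internal states by a distance of 0.8, while their equal-weight mixture assigns the same action distribution to both states. The contraction check follows from the triangle inequality for the chosen distance, so it holds for any pair of policies under this stand-in, not only the pair shown here.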

Interpretability & Mechanistic Interp · RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
