This paper investigates the detection of gradual observation drift in RL agents using world model-based self-monitoring across four MuJoCo environments. It identifies a sharp detection threshold (ε*) below which drift is absorbed and above which detection occurs rapidly, and shows that the threshold's existence and sigmoid shape are invariant across detector families and model capacities, although its position is not. The study also finds that sinusoidal drift is undetectable by every detector family tested, and that ε* follows a power law in detector parameters within each environment while the same fit fails to predict thresholds across environments, which points to environment-specific dynamics as the missing variable.
RL agents can completely miss gradual observation drift until it's too late, with a sharp "boiling frog" threshold determining when they finally wake up to the problem.
When an RL agent's observations are gradually corrupted, at what drift rate does it "wake up" -- and what determines this boundary? We study world model-based self-monitoring under continuous observation drift across four MuJoCo environments, three detector families (z-score, variance, percentile), and three model capacities. We find that (1) a sharp detection threshold $\varepsilon^*$ exists universally: below it, drift is absorbed as normal variation; above it, detection occurs rapidly. The threshold's existence and sigmoid shape are invariant across all detector families and model capacities, though its position depends on the interaction between detector sensitivity, noise floor structure, and environment dynamics. (2) Sinusoidal drift is completely undetectable by all detector families -- including variance and percentile detectors with no temporal smoothing -- establishing this as a world model property rather than a detector artifact. (3) Within each environment, $\varepsilon^*$ follows a power law in detector parameters ($R^2 = 0.89$–$0.97$), but cross-environment prediction fails ($R^2 = 0.45$), revealing that the missing variable is environment-specific dynamics structure $\partial \mathrm{PE}/\partial\varepsilon$. (4) In fragile environments, agents collapse before any detector can fire ("collapse before awareness"), creating a fundamentally unmonitorable failure mode. Our results reframe $\varepsilon^*$ from an emergent world model property to a three-way interaction between noise floor, detector, and environment dynamics, providing a more defensible and empirically grounded account of self-monitoring boundaries in RL agents.
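As a rough illustration of the threshold idea in finding (1), the sketch below runs a z-score detector over a synthetic prediction-error stream and measures how often it fires at a given drift rate. It is a toy under stated assumptions, not the paper's setup: the names `detects_drift` and `detection_rate`, the Gaussian noise floor, the fixed calibration baseline, the finite horizon, and the linear growth of prediction error with time are all hypothetical choices made for illustration. In particular, there is no world model or MuJoCo environment here, so the sharp transition in this toy comes only from the finite horizon and the fixed threshold.

```python
import random
from statistics import mean, stdev


def detects_drift(drift_rate, horizon=500, calib=200, z_thresh=6.0,
                  noise_std=1.0, seed=0):
    """Return True if a z-score detector fires within `horizon` steps.

    Assumption: prediction error (PE) is modeled as Gaussian noise plus a
    linear term drift_rate * t. This is a stand-in for a real world model.
    """
    rng = random.Random(seed)
    # Calibration phase: estimate the noise floor from drift-free PE samples.
    baseline = [rng.gauss(0.0, noise_std) for _ in range(calib)]
    mu, sigma = mean(baseline), stdev(baseline)
    # Monitoring phase: flag the first PE sample whose z-score is too high.
    for t in range(horizon):
        pe = drift_rate * t + rng.gauss(0.0, noise_std)
        if (pe - mu) / sigma > z_thresh:
            return True
    return False


def detection_rate(drift_rate, trials=50):
    """Fraction of seeded trials in which the detector fires."""
    hits = sum(detects_drift(drift_rate, seed=s) for s in range(trials))
    return hits / trials
```

Sweeping `drift_rate` over a grid, for example `[detection_rate(r) for r in (0.0, 0.005, 0.01, 0.02, 0.05)]`, traces a detection curve for this toy; the drift rate at which the curve crosses 0.5 would play the role of $\varepsilon^*$. Changing `z_thresh` or `noise_std` moves that crossing point, which loosely mirrors the abstract's claim that the threshold's position depends on detector sensitivity and noise floor.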