Mar 4, 2026 | arXiv:2603.03778

Inverse Contextual Bandits without Rewards: Learning from a Non-Stationary Learner via Suffix Imitation

AI Summary

This paper tackles the Inverse Contextual Bandit (ICB) problem, in which an observer tries to recover the underlying problem parameters from a learner's actions alone, without access to the learner's rewards. The key challenge is that the learner's behavior is non-stationary: it shifts from exploration to exploitation as learning proceeds. The authors propose a "Two-Phase Suffix Imitation" framework that discards data from an initial burn-in phase and performs empirical risk minimization on the remaining imitation-phase data. They derive a predictive decision loss bound that characterizes the bias-variance trade-off in the burn-in length, and show that the observer attains a $\tilde O(1/\sqrt{N})$ convergence rate, matching the asymptotic efficiency of a fully reward-aware learner despite the information deficit.

Key Contribution

A reward-free observer can achieve the same asymptotic efficiency as a fully reward-aware learner in Inverse Contextual Bandits by discarding the learner's initial burn-in actions and performing empirical risk minimization on the remaining actions.

Abstract

We study the Inverse Contextual Bandit (ICB) problem, in which a learner seeks to optimize a policy while an observer, who cannot access the learner's rewards and only observes actions, aims to recover the underlying problem parameters. During the learning process, the learner's behavior naturally transitions from exploration to exploitation, resulting in non-stationary action data that poses significant challenges for the observer. To address this issue, we propose a simple and effective framework called Two-Phase Suffix Imitation. The framework discards data from an initial burn-in phase and performs empirical risk minimization using only data from a subsequent imitation phase. We derive a predictive decision loss bound that explicitly characterizes the bias-variance trade-off induced by the choice of burn-in length. Despite the severe information deficit, we show that a reward-free observer can achieve a convergence rate of $\tilde O(1/\sqrt{N})$, matching the asymptotic efficiency of a fully reward-aware learner. This result demonstrates that a passive observer can effectively uncover the optimal policy from actions alone, attaining performance comparable to that of the learner itself.
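The two-phase idea described above (drop a burn-in prefix, then fit by empirical risk minimization on the suffix) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the stylized learner in `simulate_learner_log` (uniform exploration with probability decaying as `explore_scale / (t + 1)`), the choice of softmax regression with cross-entropy loss as the ERM step, the identity matrix used for `theta`, and all names and hyperparameters.

```python
import numpy as np

def simulate_learner_log(n_rounds, theta, rng, explore_scale=20.0):
    """Hypothetical non-stationary learner: explores uniformly at random with
    probability min(1, explore_scale / (t + 1)), otherwise plays the optimal arm."""
    n_actions, dim = theta.shape
    contexts = rng.normal(size=(n_rounds, dim))
    greedy = np.argmax(contexts @ theta.T, axis=1)
    eps = np.minimum(1.0, explore_scale / (np.arange(n_rounds) + 1.0))
    explore = rng.random(n_rounds) < eps
    random_arms = rng.integers(n_actions, size=n_rounds)
    actions = np.where(explore, random_arms, greedy)
    return contexts, actions  # the observer sees only these; no rewards

def suffix_imitation(contexts, actions, burn_in, n_actions, lr=0.5, n_steps=300):
    """Discard the first `burn_in` rounds, then minimize the empirical
    cross-entropy of a softmax policy on the remaining (context, action) pairs."""
    X, a = contexts[burn_in:], actions[burn_in:]
    Y = np.eye(n_actions)[a]                      # one-hot action labels
    W = np.zeros((X.shape[1], n_actions))
    for _ in range(n_steps):
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        W -= lr * X.T @ (P - Y) / len(X)          # full-batch gradient step
    return W

def policy(W, contexts):
    """Greedy policy induced by the fitted score matrix."""
    return np.argmax(contexts @ W, axis=1)

rng = np.random.default_rng(0)
theta = np.eye(3)                                 # assumed true arm parameters
contexts, actions = simulate_learner_log(4000, theta, rng)
W_hat = suffix_imitation(contexts, actions, burn_in=1000, n_actions=3)

# Compare the imitated policy with the optimal policy on fresh contexts.
test_contexts = rng.normal(size=(2000, 3))
optimal = np.argmax(test_contexts @ theta.T, axis=1)
agreement = np.mean(policy(W_hat, test_contexts) == optimal)
print(f"agreement with optimal policy: {agreement:.3f}")
```

In this sketch `burn_in` plays the role of the trade-off mentioned in the abstract: a longer burn-in removes more exploratory (off-policy) actions but leaves fewer samples for the fit.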

Recommendation & Information Retrieval | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
