Mila - Quebec AI Institute, Université de Montréal, McGill University. Correspondence to amcb6@cam.ac.uk.

Abstract

In-context reinforcement learning (ICRL) promises fast adaptation to unseen environments without parameter updates, but current methods either cannot improve beyond the training distribution or require near-optimal data, limiting practical adoption. We introduce SPICE, a Bayesian ICRL method that learns a prior over Q-values via a deep ensemble and updates this prior at test time with in-context information through Bayesian updates. To recover from poor priors resulting from training on suboptimal data, our online inference follows an Upper-Confidence-Bound rule that favours exploration and adaptation. We prove that SPICE achieves regret-optimal behaviour in both stochastic bandits and finite-horizon MDPs, even when pretrained only on suboptimal trajectories. We validate these findings empirically across bandit and control benchmarks: SPICE achieves near-optimal decisions on unseen tasks, substantially reduces regret compared to prior ICRL and meta-RL approaches, adapts rapidly to new tasks, and remains robust under distribution shift.

1 Introduction

Following the success of transformers and their in-context learning abilities (vaswani2017attention), In-Context Reinforcement Learning (ICRL) has emerged as a promising paradigm (chen2021decision; zheng2022online). ICRL aims to adapt a policy to new tasks using only a context of logged interactions and no parameter updates. This approach is particularly attractive for practical deployment in domains where training classic online RL is risky or expensive, where abundant historical logs are available, or where fast gradient-free adaptation is required; examples include robotics, autonomous driving, and building energy management systems. ICRL improves upon classic offline RL by amortising knowledge across tasks: a single model is pretrained on trajectories from many environments and then used at test time with only a small history of interactions from the test task. The model must make good decisions in new environments using this in-context dataset as its only source of information (moeini2025survey).

Existing ICRL approaches suffer from three main limitations. First, behaviour-policy bias from supervised training objectives: methods trained with Maximum Likelihood Estimation (MLE) on actions inherit the action distribution of the behaviour policy. When the behaviour policy is suboptimal, the learned model performs poorly; many ICRL methods fail to improve beyond the pretraining data distribution and essentially perform imitation learning (dong2025context; lee2023supervised). Second, existing methods lack uncertainty quantification and inference-time control. Successful online adaptation requires epistemic uncertainty over action values to enable temporally coherent exploration, yet most ICRL methods expose logits but not actionable posteriors over Q-values, which are needed for principled exploration such as Upper Confidence Bound (UCB) or Thompson Sampling (TS) (lakshminarayanan2017simple; osband2016deep; osband2018randomized; auer2002using; russo2018tutorial). Third, current algorithms have unrealistic data requirements that make them unusable in most real-world deployments: Algorithm Distillation (AD) (laskin2022context) requires learning traces from trained RL algorithms, while the Decision-Pretrained Transformer (DPT) (lee2023supervised) needs an optimal policy to label actions.
Recent work has attempted to loosen these requirements, such as the Decision Importance Transformer (DIT) (dong2025context) and In-Context Exploration with Ensembles (ICEE) (dai2024context). However, these methods lack an explicit measure of uncertainty and a test-time controller for exploration and efficient adaptation.

To address these limitations, we introduce SPICE (Shaping Policies In-Context with Ensemble prior), a Bayesian ICRL algorithm that maintains a prior over Q-values using a deep ensemble and updates this prior with state-weighted evidence from the context dataset. The resulting per-action posteriors can be used greedily in offline settings or with a posterior-UCB rule for online exploration, enabling test-time adaptation to unseen tasks without parameter updates. We prove that SPICE achieves regret-optimal performance in both stochastic bandits and finite-horizon MDPs, even when pretrained only on suboptimal trajectories. We evaluate our algorithm in bandit and Dark Room environments against prior work, demonstrating that it achieves near-optimal decision-making on unseen tasks while substantially reducing regret compared to prior ICRL and meta-RL approaches. This work paves the way for real-world deployment of ICRL methods, which should feature good uncertainty quantification and test-time adaptation to new tasks without relying on unrealistic optimal control trajectories for training.

2 Related Work

Meta-RL. Classical meta-reinforcement learning aims to learn to adapt across tasks with limited experience. Representative methods include RL2 (duan2016rl), gradient-based meta-learning such as MAML (finn2017model), and probabilistic context-variable methods such as PEARL (rakelly2019efficient). These approaches typically require online interaction and task-aligned adaptation loops during deployment.

Sequence modelling for decision-making. Treating control as sequence modelling has proven effective, with seminal works such as the Decision Transformer (DT) (chen2021decision) and the Trajectory Transformer (janner2021offline). Scaling variants extend DT to many games and longer horizons (lee2022multi; correia2023hierarchical), while the Online Decision Transformer (ODT) blends offline pretraining with online fine-tuning via parameter updates (zheng2022online). These works paved the way for in-context decision-making.

In-context RL via supervised pretraining. Two influential ICRL methods are Algorithm Distillation (AD) (laskin2022context), which distills the learning dynamics of a base RL algorithm into a Transformer that improves in-context without gradients, and the Decision-Pretrained Transformer (DPT) (lee2023supervised), which is trained to map a query state and in-context experience to optimal actions and is theoretically connected to posterior sampling. Both rely on labels generated by strong or optimal policies (or full learning traces) and therefore inherit behaviour-policy biases from the data (moeini2025survey). DIT (dong2025context) improves over behaviour cloning by reweighting a supervised policy with in-context advantage estimates, but it remains a purely supervised objective: it exposes no calibrated uncertainty, produces no per-action posterior, and lacks any inference-time controller or regret guarantees. ICEE (dai2024context) induces exploration-exploitation behaviour inside a Transformer at test time, yet it does so heuristically, without explicit Bayesian updates, calibrated posteriors, or theoretical analysis.
By contrast, SPICE is the first ICRL method to (i) learn an explicit value prior with uncertainty from suboptimal data, (ii) perform Bayesian context fusion at test time to obtain per-action posteriors, and (iii) act with posterior-UCB, yielding principled exploration and a provable $O(\log K)$ regret bound with only a constant warm-start term.

3 SPICE: Bayesian In-Context Decision Making

In this section, we introduce the key components of our approach. We begin by formalising the ICRL problem and providing a high-level overview of our method in Sec. 3.1. The main elements of the model architecture and training objective are described in Sec. 3.2. Our main contribution, the test-time Bayesian fusion policy, is introduced in Sec. 3.3.

3.1 Method Overview

Consider a set $\mathcal{T}$ of tasks with a state space $\mathcal{S}$, an action space $\mathcal{A}$, a horizon $H$, a per-step reward $r_t$, and a discount factor $\gamma$. In in-context reinforcement learning, given a task $T \sim \mathcal{T}$, the agent must choose actions to maximise the expected discounted return over the trajectory. During training, the agent learns from trajectories collected either offline or online on different tasks. At test time, the agent is given a new task, a context $C=\{(s_t,a_t,r_t,s_{t+1})\}_{t=1}^{h}$, and a query state $s_{\mathrm{qry}}$. The context either comes from offline data or is collected online. The goal is to choose an action $a=\pi(s_{\mathrm{qry}},C)$ that maximises the expected return. The policy adapts to the new task in context, without any parameter updates.

Our algorithm, Shaping Policies In-Context with Ensemble prior (SPICE), solves the ICRL problem with a Bayesian approach. It combines a value prior learned from training tasks with task-specific evidence extracted from the test-time context. SPICE first encodes the query and context using a transformer trunk and then produces a calibrated per-action value prior via a deep ensemble. Weighted statistics are extracted from the context using a kernel that measures state similarity. Prior and context evidence are then fused through a closed-form Bayesian update. Actions can be selected greedily or with a posterior-UCB rule for principled exploration. This design enables SPICE to adapt quickly to new tasks and to overcome behaviour-policy bias, even when trained on suboptimal data.

SPICE introduces three key contributions: (1) a value-ensemble prior that provides calibrated epistemic uncertainty from suboptimal data, (2) a weighted representation-shaping objective that enables the trunk to support reliable value estimation, and (3) a test-time Bayesian fusion controller that produces per-action posteriors and enables coherent in-context exploration via posterior-UCB. The approach is summarised in Fig. 1 and the full algorithm is described in Algo. 1, along with the detailed architecture in Fig. 5. Note that our approach focuses on discrete action spaces $\mathcal{A}$, but it extends naturally to continuous actions.
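To make the fusion step concrete, the sketch below illustrates one way the prior-plus-context update and the posterior-UCB rule could be realised for discrete actions. It is a minimal illustration under stated assumptions, not the exact implementation: we assume a per-action Gaussian prior whose mean and variance come from the ensemble spread, an RBF state-similarity kernel, empirical returns as the context evidence, and a fixed exploration coefficient beta; all function names (gaussian_kernel, ensemble_prior, fuse_context, posterior_ucb_action) are hypothetical.

```python
import numpy as np

def gaussian_kernel(s_query, s_context, bandwidth=1.0):
    """State-similarity weights w_t = exp(-||s_qry - s_t||^2 / (2 * bandwidth^2))."""
    d2 = np.sum((s_context - s_query) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def ensemble_prior(q_ensemble):
    """Per-action Gaussian prior from a deep ensemble of Q-heads.

    q_ensemble: array of shape (num_members, num_actions) with the Q-values
    each ensemble member predicts for the query state.
    """
    mu0 = q_ensemble.mean(axis=0)
    var0 = q_ensemble.var(axis=0) + 1e-6  # epistemic spread across members
    return mu0, var0

def fuse_context(mu0, var0, context, s_query, obs_noise=1.0, bandwidth=1.0):
    """Closed-form Gaussian update of the prior with kernel-weighted context evidence.

    context: list of (state, action, return) tuples from the test-time dataset.
    The empirical returns act as noisy observations of the action value.
    """
    num_actions = mu0.shape[0]
    mu_post, var_post = mu0.copy(), var0.copy()
    states = np.array([s for s, _, _ in context])
    weights = gaussian_kernel(s_query, states, bandwidth)
    for a in range(num_actions):
        mask = np.array([a_t == a for _, a_t, _ in context])
        if not mask.any():
            continue  # no evidence for this action: keep the prior
        w = weights[mask]
        g = np.array([g_t for _, _, g_t in context])[mask]
        n_eff = w.sum()                    # effective sample size for action a
        g_bar = (w * g).sum() / n_eff      # kernel-weighted empirical value estimate
        prec0, prec_ctx = 1.0 / var0[a], n_eff / obs_noise
        var_post[a] = 1.0 / (prec0 + prec_ctx)
        mu_post[a] = var_post[a] * (prec0 * mu0[a] + prec_ctx * g_bar)
    return mu_post, var_post

def posterior_ucb_action(mu_post, var_post, beta=2.0):
    """Optimistic action choice: argmax_a mu_a + beta * sigma_a."""
    return int(np.argmax(mu_post + beta * np.sqrt(var_post)))

# Toy usage: 5 ensemble members, 3 actions, 2-dimensional states.
rng = np.random.default_rng(0)
q_ens = rng.normal(size=(5, 3))
ctx = [(rng.normal(size=2), int(rng.integers(3)), float(rng.normal())) for _ in range(20)]
s_qry = rng.normal(size=2)
mu0, var0 = ensemble_prior(q_ens)
mu_post, var_post = fuse_context(mu0, var0, ctx, s_qry)
action = posterior_ucb_action(mu_post, var_post)
```

Setting beta to zero recovers the greedy, posterior-mean action used in the offline setting, while a positive beta yields the optimistic exploration behaviour described above.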