This paper introduces a theoretical framework for in-context continual learning, modeling how Transformers process sequential tasks within a single prompt using shared attention. By analyzing linear and masked linear self-attention, the authors derive prediction-error expressions showing that standard attention mechanisms induce intertask interference and, with it, systematic bias. A bias-variance-interference decomposition of the prediction error characterizes when historical context yields positive or negative transfer, exposing limits of attention-based continual inference.
Standard attention mechanisms inevitably cause intertask interference in in-context continual learning, leading to systematic bias and performance degradation in long prompts.
In-context learning (ICL) derives its power from enabling Large Language Models to adapt to new tasks via prompt-based reasoning alone, entirely bypassing the need for parameter updates. Existing theories primarily study ICL in single-task settings, while real-world prompts often contain sequences of heterogeneous tasks, leaving a gap in understanding whether Large Language Models implicitly perform continual learning during inference. To bridge this gap, we propose the first theoretical framework for in-context continual learning, modeling how a pretrained Transformer processes multiple sequential tasks within a single prompt through shared attention mechanisms. Focusing on linear and masked linear self-attention, we derive error expressions for model predictions under sequential task prompts and analyze their generalization and forgetting behavior. Our results reveal that standard attention mechanisms inevitably induce intertask interference by uniformly or causally aggregating historical contexts, leading to systematic bias. We further provide a bias-variance-interference decomposition of prediction error, characterizing when historical in-context information yields positive transfer or provable negative transfer. This analysis exposes fundamental limits of attention-based continual inference and offers theoretical explanations for order sensitivity and performance degradation in long prompts.
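The interference effect described above can be illustrated with a small numerical sketch. This is not the authors' construction. It assumes a simplified linear-attention estimator that predicts with the uniform average of y_i * x_i over all context pairs, isotropic Gaussian inputs, and two noiseless linear regression tasks placed back to back in one prompt. The function names `linear_attention_predict` and `make_prompt`, and all sizes, are hypothetical choices made for illustration.

```python
import numpy as np


def linear_attention_predict(X, y, x_query):
    """Simplified linear-attention readout (an assumption, not the paper's model).

    Uniformly aggregates every context pair into w_hat = mean_i(y_i * x_i)
    and predicts x_query @ w_hat.
    """
    w_hat = (X * y[:, None]).mean(axis=0)
    return float(x_query @ w_hat), w_hat


def make_prompt(rng, w_old, w_new, n_old, n_new, noise=0.0):
    """Build a prompt with n_old examples of an old task followed by n_new of a new one."""
    d = w_old.shape[0]
    X_old = rng.standard_normal((n_old, d))
    X_new = rng.standard_normal((n_new, d))
    y_old = X_old @ w_old + noise * rng.standard_normal(n_old)
    y_new = X_new @ w_new + noise * rng.standard_normal(n_new)
    return np.vstack([X_old, X_new]), np.concatenate([y_old, y_new])


rng = np.random.default_rng(0)

# Two orthogonal tasks, so the old task carries no useful signal for the new one.
w_old = np.array([1.0, 0.0, 0.0, 0.0])
w_new = np.array([0.0, 1.0, 0.0, 0.0])
n_old, n_new = 20_000, 20_000

X, y = make_prompt(rng, w_old, w_new, n_old, n_new)

# Query aligned with the new task, so the ideal prediction is w_new @ w_new = 1.
x_query = w_new

# Full prompt: the old task's examples are averaged in with the new task's.
pred_mixed, w_mixed = linear_attention_predict(X, y, x_query)

# New-task examples only: no historical context to interfere.
pred_new_only, w_new_only = linear_attention_predict(X[n_old:], y[n_old:], x_query)

print("implied weights, full prompt:   ", np.round(w_mixed, 2))
print("implied weights, new task only: ", np.round(w_new_only, 2))
print("prediction, full prompt:        ", round(pred_mixed, 2))
print("prediction, new task only:      ", round(pred_new_only, 2))
```

Under these assumptions, the weights implied by the full prompt concentrate around the mixture (n_old * w_old + n_new * w_new) / (n_old + n_new) rather than around w_new, so adding more examples shrinks the variance but leaves the bias toward the old task in place. In this simplified setting, a query at the end of the prompt still sees every earlier pair under a causal mask, so the same mixing applies there. Changing the ratio of `n_old` to `n_new` shifts how strongly the old task's weights are mixed in.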