Apr 28, 2026 · arXiv:2604.25779

Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment

AI Summary

This paper investigates subliminal learning, where a student model acquires unintended traits from a teacher model even when trained only on no-class logits via distillation. The authors empirically demonstrate that sustained, albeit weak, positive alignment between the trait and distillation gradients throughout multi-step training causally contributes to this trait acquisition. They also show that liminal training, a method designed to mitigate this effect, fails because it attenuates but does not eliminate this gradient alignment.

Key Contribution

Even when a student is distilled only on logits that correspond to no class, a weak but sustained positive alignment between the trait and distillation gradients can lead it to acquire an unintended teacher trait.

Abstract

In the MNIST auxiliary logit distillation experiment, a student can acquire an unintended teacher trait despite distilling only on no-class logits through a phenomenon called subliminal learning. Under a single-step gradient descent assumption, subliminal learning theory attributes this effect to alignment between the trait and distillation gradients, but does not guarantee that this alignment persists in a multi-step setting. We empirically show that gradient alignment remains weakly but consistently positive throughout training and causally contributes to trait acquisition. We show that a mitigation method called liminal training works by attenuating the alignment and fails to stop trait acquisition in this setup. These results suggest that mitigation methods that operate in this regime may not reliably suppress trait acquisition when the first-order drive dominates.
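The role of gradient alignment can be illustrated with a toy example. To first order, one distillation step with learning rate lr changes the trait loss by about -lr times the inner product of the trait gradient and the distillation gradient, so a positive inner product means the step also lowers the trait loss. The sketch below is only an illustration under stated assumptions: the quadratic losses, the targets `t` and `d`, and the helper names are hypothetical and are not the paper's MNIST setup or code. It updates a parameter vector using the distillation gradient alone and records the cosine alignment and the trait loss at every step.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_alignment(g_trait, g_distill):
    """Cosine similarity between two flattened gradient vectors."""
    norm = math.sqrt(dot(g_trait, g_trait)) * math.sqrt(dot(g_distill, g_distill))
    return dot(g_trait, g_distill) / norm if norm > 0 else 0.0

# Hypothetical stand-in losses (assumption, not the paper's losses):
#   L(theta) = 0.5 * ||theta - target||^2, with gradient theta - target.
def loss_quadratic(theta, target):
    return 0.5 * sum((p - c) ** 2 for p, c in zip(theta, target))

def grad_quadratic(theta, target):
    return [p - c for p, c in zip(theta, target)]

def run(theta, t, d, lr=0.1, steps=20):
    """Train on the distillation loss only; log (alignment, trait loss) per step."""
    history = []
    for _ in range(steps):
        g_trait = grad_quadratic(theta, t)    # gradient of the trait loss (never applied)
        g_distill = grad_quadratic(theta, d)  # gradient of the distillation loss
        history.append((cosine_alignment(g_trait, g_distill), loss_quadratic(theta, t)))
        theta = [p - lr * g for p, g in zip(theta, g_distill)]
    return history

# Toy targets chosen so that the two objectives partially overlap.
history = run(theta=[0.0, 0.0, 0.0], t=[1.0, 1.0, 0.0], d=[1.0, 0.5, 0.5])
for step, (alignment, trait_loss) in enumerate(history):
    print(step, round(alignment, 4), round(trait_loss, 4))
```

In this toy case the alignment stays positive at every step and the trait loss falls even though the trait gradient is never applied, which mirrors the multi-step picture the abstract describes. In a real experiment the two gradients would come from the student network's parameters rather than from closed-form quadratics.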

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
