Ingonyama · Apr 30, 2026 · arXiv:2604.27743

Why Self-Supervised Encoders Want to Be Normal

AI Summary

This paper presents a geometric and information-theoretic framework for encoder-decoder learning based on the Information Bottleneck (IB) principle, showing that optimal representations are soft clusterings of the predictive manifold. The authors derive a chain of transformations linking the maximum entropy prior on the simplex to Euclidean space, quantify the entropy overhead at each step, and show that Sketched Isotropic Gaussian Regularization (SIGReg) implements a Gaussian relaxation of the IB principle. Experiments on toy problems and FashionMNIST confirm the predicted rate-distortion trade-offs and show that the non-parametric estimator is competitive with the standard variational approach.

Key Contribution

Self-supervised encoders implicitly perform soft clustering on a "predictive manifold" in probability space, and this geometric perspective yields a practical Gaussian regularizer (SIGReg) competitive with variational IB.

Abstract

We develop a geometric and information-theoretic framework for encoder-decoder learning built on the Information Bottleneck (IB) principle. Recasting IB as a rate-distortion problem with Kullback-Leibler (KL) divergence as distortion, we show that the optimal representation at any distortion level is a soft clustering of the \emph{predictive manifold} $\mathcal{M}=\{p(Y|x):x\in\mathcal{X}\}$ inside the probability simplex, admitting a linear decoder in the canonical parameterization. We derive a chain of exact transformations, from flat Dirichlet to exponential to isotropic Gaussian, connecting the maximum entropy prior on the simplex to Euclidean space, with quantified entropy overhead at each step, and show that Sketched Isotropic Gaussian Regularization (SIGReg) implements a Gaussian relaxation of this principle whose overhead affects rate accounting but not achievable prediction. This relaxation provides a principled distributional regularizer for learning with limited or no supervision. Using the Conditional Entropy Bottleneck (CEB) decomposition, we derive concrete encoder losses for supervised and semi-supervised settings, estimated via minibatch marginals without variational bounds. In the self-supervised setting, the CEB conditional rate is replaced by a view-prediction proxy. SIGReg serves as the distributional regularizer for both the semi-supervised and self-supervised settings. Experiments on toy problems and FashionMNIST confirm the predicted rate-distortion trade-offs and show that the non-parametric estimator is competitive with the standard variational approach.
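As a small numerical illustration of the rate-distortion reading of IB described in the abstract, the sketch below builds a toy discrete problem and checks two standard facts: the decoder distribution $p(Y|z)$ is a convex combination of points $p(Y|x)$ on the predictive manifold (a soft cluster centroid), and the expected KL distortion equals $I(X;Y) - I(Z;Y)$. The alphabet sizes, the random encoder, and all variable names are assumptions chosen for illustration; this is not the paper's code or its experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y, n_z = 6, 3, 2  # hypothetical alphabet sizes for X, Y, Z

p_x = rng.dirichlet(np.ones(n_x))                      # p(x)
p_y_given_x = rng.dirichlet(np.ones(n_y), size=n_x)    # rows are points p(Y|x) on the manifold
p_z_given_x = rng.dirichlet(np.ones(n_z), size=n_x)    # an arbitrary soft encoder p(z|x)

p_xz = p_x[:, None] * p_z_given_x                      # joint p(x, z), shape (n_x, n_z)
p_z = p_xz.sum(axis=0)                                 # marginal p(z)
p_x_given_z = p_xz / p_z                               # columns are p(x|z)

# Soft-cluster centroids: p(y|z) = sum_x p(x|z) p(y|x), a convex combination of manifold points.
p_y_given_z = p_x_given_z.T @ p_y_given_x              # shape (n_z, n_y)

def kl(p, q):
    """KL divergence along the last axis (all entries are strictly positive here)."""
    return np.sum(p * np.log(p / q), axis=-1)

def mutual_information(p_joint):
    """Mutual information (in nats) of a 2-D joint distribution."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    return np.sum(p_joint * np.log(p_joint / (pa * pb)))

# Expected distortion E_{x,z}[ KL(p(Y|x) || p(Y|z)) ].
d = kl(p_y_given_x[:, None, :], p_y_given_z[None, :, :])  # shape (n_x, n_z)
expected_distortion = np.sum(p_xz * d)

i_xy = mutual_information(p_x[:, None] * p_y_given_x)
i_zy = mutual_information(p_z[:, None] * p_y_given_z)

print(f"E[KL]            = {expected_distortion:.6f}")
print(f"I(X;Y) - I(Z;Y)  = {i_xy - i_zy:.6f}")
```

The two printed values agree up to floating-point error for any encoder $p(z|x)$, which is why minimizing the expected KL distortion is equivalent to maximizing the predictive information $I(Z;Y)$ retained by the representation.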

Architecture Design (Transformers, SSMs, MoE) · Natural Language Processing · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
