Google Research | Feb 19, 2026 | arXiv:2602.17270

Unified Latents (UL): How to train your latents

Jonathan Heek, Emiel Hoogeboom, Thomas Mensink, Tim Salimans

AI Summary

The paper introduces Unified Latents (UL), a framework for learning latent representations jointly regularized by a diffusion prior and decoded by a diffusion model. UL links the encoder's output noise to the prior's minimum noise level, resulting in a training objective that provides a tight upper bound on the latent bitrate. Experiments on ImageNet-512 and Kinetics-600 demonstrate that UL achieves competitive or state-of-the-art generative performance with improved training efficiency compared to existing latent diffusion models.

Key Contribution

Ditch Stable Diffusion's latents: Unified Latents (UL) achieves state-of-the-art video generation and competitive image generation with fewer training FLOPs.

Abstract

We present Unified Latents (UL), a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model. By linking the encoder's output noise to the prior's minimum noise level, we obtain a simple training objective that provides a tight upper bound on the latent bitrate. On ImageNet-512, our approach achieves a competitive FID of 1.4, with high reconstruction quality (PSNR), while requiring fewer training FLOPs than models trained on Stable Diffusion latents. On Kinetics-600, we set a new state-of-the-art FVD of 1.3.
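As an illustration only, here is a minimal sketch of one way to read "linking the encoder's output noise to the prior's minimum noise level". It assumes a variance-preserving noise schedule parameterized by log-SNR, which the abstract does not specify. The names `LAMBDA_MAX`, `alpha_sigma`, `encode_mean`, and `sample_latent`, the value 5.0, and the toy average-pool encoder are all hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical value: log-SNR at the diffusion prior's lowest noise level.
LAMBDA_MAX = 5.0


def alpha_sigma(log_snr):
    """Assumed variance-preserving schedule:
    alpha^2 = sigmoid(log_snr), sigma^2 = sigmoid(-log_snr)."""
    alpha = np.sqrt(1.0 / (1.0 + np.exp(-log_snr)))
    sigma = np.sqrt(1.0 / (1.0 + np.exp(log_snr)))
    return alpha, sigma


def encode_mean(x):
    """Stand-in for a learned deterministic encoder (here: a 2x average pool)."""
    return x.reshape(x.shape[0], -1, 2).mean(axis=-1)


def sample_latent(x, rng, log_snr=LAMBDA_MAX):
    """Sample z_0 = alpha_0 * encode_mean(x) + sigma_0 * eps.

    sigma_0 is fixed (not learned) and equals the noise level at which
    the prior's schedule starts, so the encoder output already sits at
    the prior's minimum noise level.
    """
    alpha0, sigma0 = alpha_sigma(log_snr)
    mean = encode_mean(x)
    eps = rng.standard_normal(mean.shape)
    return alpha0 * mean + sigma0 * eps


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # toy batch of 4 inputs with 8 features
z0 = sample_latent(x, rng)        # latents of shape (4, 4)
```

Under these assumptions the encoder has no learned variance: the noise added to its output is exactly the noise the prior expects at its first (least noisy) level. How the paper turns this into a bound on the latent bitrate is not shown here.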

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Training Efficiency & Optimization

Citation Metrics

Citations: 0
Influential citations: 0
References: 17
Year: 2026
Venue: N/A
