This paper introduces VibeToken, a novel 1D Transformer-based image tokenizer that encodes images into a variable-length sequence of tokens, enabling resolution-agnostic autoregressive image generation. The authors then present VibeToken-Gen, a class-conditioned AR generator built on VibeToken, which synthesizes high-resolution images (1024x1024) with significantly fewer tokens and FLOPs than diffusion models and fixed-resolution AR models such as LlamaGen. VibeToken-Gen achieves a gFID of 3.94 at 1024x1024 using only 64 tokens and 179G FLOPs; by comparison, a diffusion-based alternative requires 1,024 tokens and attains 5.87 gFID, and LlamaGen requires 11T FLOPs at the same resolution.
Autoregressive image models can now compete with diffusion models in image quality and efficiency, thanks to a variable-length tokenization scheme that decouples compute from resolution.
We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32-256 tokens, achieving a state-of-the-art efficiency and performance trade-off. Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator with out-of-the-box support for arbitrary resolutions while requiring significantly fewer compute resources. Notably, VibeToken-Gen synthesizes 1024x1024 images using only 64 tokens and achieves 3.94 gFID; by comparison, a diffusion-based state-of-the-art alternative requires 1,024 tokens and attains 5.87 gFID. In contrast to fixed-resolution AR models such as LlamaGen -- whose inference FLOPs grow quadratically with resolution (11T FLOPs at 1024x1024) -- VibeToken-Gen maintains a constant 179G FLOPs (63.4x more efficient) independent of resolution. We hope VibeToken can help unlock the wide adoption of AR visual generative models in production use cases.
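The scaling argument in the abstract can be illustrated with a small back-of-the-envelope sketch. It is not the authors' code: the `grid_tokens` helper and its downsample factor of 16 are assumptions chosen to represent a generic 2D patch-grid tokenizer, and the FLOPs values are the rounded figures quoted above. With those rounded inputs the ratio comes out near 61x, so the quoted 63.4x presumably relies on unrounded measurements.

```python
def grid_tokens(height: int, width: int, downsample: int = 16) -> int:
    """Token count for a 2D patch-grid tokenizer.

    The downsample factor of 16 is an illustrative assumption,
    not a value stated in the abstract.
    """
    return (height // downsample) * (width // downsample)


# Token budget the abstract quotes for VibeToken-Gen at 1024x1024.
VARIABLE_LENGTH_TOKENS = 64

# A grid tokenizer's sequence length grows with the pixel count,
# while the variable-length budget stays fixed.
for side in (256, 512, 1024):
    n_grid = grid_tokens(side, side)
    print(f"{side}x{side}: grid={n_grid} tokens, "
          f"variable-length={VARIABLE_LENGTH_TOKENS} tokens")

# FLOPs ratio computed from the rounded figures in the abstract.
llamagen_flops = 11e12   # 11T FLOPs at 1024x1024
vibetoken_flops = 179e9  # 179G FLOPs, stated as resolution-independent
ratio = llamagen_flops / vibetoken_flops
print(f"approximate FLOPs ratio: {ratio:.1f}x")
```

Under the assumed factor of 16, the grid tokenizer needs 16 times more tokens each time the side length quadruples, which is why attention cost climbs so steeply for fixed-resolution AR models while a fixed token budget keeps compute flat.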