Apr 27, 2026 | arXiv:2604.24885

VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

Maitreya Patel, Jingtao Li, Weiming Zhuang, Yezhou Yang, LingJuan Lv

AI Summary

This paper introduces VibeToken, a 1D Transformer-based image tokenizer that encodes images into a variable-length sequence of tokens, enabling resolution-agnostic autoregressive image generation. It then presents VibeToken-Gen, a class-conditioned AR generator built on VibeToken, which synthesizes high-resolution (1024x1024) images with far fewer tokens and FLOPs than diffusion models and fixed-resolution AR models such as LlamaGen. At 1024x1024, VibeToken-Gen reaches 3.94 gFID using only 64 tokens and 179G FLOPs; by comparison, a diffusion-based alternative requires 1,024 tokens and attains 5.87 gFID, and LlamaGen requires 11T FLOPs.

Key Contribution

Autoregressive image models can now compete with diffusion models in image quality and efficiency, thanks to a variable-length tokenization scheme that decouples compute from resolution.

Abstract

We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32-256 tokens, achieving a state-of-the-art efficiency and performance trade-off. Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator with out-of-the-box support for arbitrary resolutions while requiring significantly fewer compute resources. Notably, VibeToken-Gen synthesizes 1024x1024 images using only 64 tokens and achieves 3.94 gFID; by comparison, a diffusion-based state-of-the-art alternative requires 1,024 tokens and attains 5.87 gFID. In contrast to fixed-resolution AR models such as LlamaGen -- whose inference FLOPs grow quadratically with resolution (11T FLOPs at 1024x1024) -- VibeToken-Gen maintains a constant 179G FLOPs (63.4x efficient) independent of resolution. We hope VibeToken can help unlock the wide adoption of AR visual generative models in production use cases.
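To illustrate why a fixed token budget decouples sequence length from resolution, here is a minimal sketch comparing the token count of a generic fixed-grid 2D tokenizer with a constant 1D budget. The downsampling factor of 16 and the helper names `grid_token_count` and `clamp_budget` are assumptions made for illustration and are not taken from the paper; only the 32-256 token range and the 64-token setting come from the abstract.

```python
# Illustrative sketch only: the downsampling factor and helper names are
# assumptions, not details from the VibeToken paper.

TOKEN_MIN, TOKEN_MAX = 32, 256  # token range quoted in the abstract


def grid_token_count(height: int, width: int, downsample: int = 16) -> int:
    """Tokens from a fixed-grid tokenizer: one token per downsample x downsample patch."""
    return (height // downsample) * (width // downsample)


def clamp_budget(requested: int) -> int:
    """Clamp a requested 1D token budget to the quoted 32-256 range."""
    return max(TOKEN_MIN, min(TOKEN_MAX, requested))


if __name__ == "__main__":
    budget = clamp_budget(64)  # the 64-token setting from the abstract
    for side in (256, 512, 1024):
        grid = grid_token_count(side, side)
        print(f"{side}x{side}: grid tokens = {grid}, 1D budget = {budget}")
```

Under this assumed factor, the grid token count grows with the square of the image side length, while the 1D budget stays fixed regardless of resolution.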

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 52

Year: 2026

Venue: N/A
