CAS | UCAS | Feb 15, 2026 | arXiv:2602.14041

BitDance: Scaling Autoregressive Generative Models with Binary Tokens

Yuang Ai, Jiaming Han, Shaobin Zhuang, Weijia Mao, Xuefeng Hu, Ziyan Yang, Huaibo Huang, Xiangyu Yue

AI Summary

The paper introduces BitDance, an autoregressive (AR) image generator that predicts binary visual tokens instead of codebook indices, giving a compact discrete representation with up to 2^{256} states per token. Because sampling from a token space this large is difficult with standard softmax classification, BitDance uses a binary diffusion head that generates the binary tokens through continuous-space diffusion. On ImageNet 256x256 it reaches an FID of 1.24, the best among AR models. With next-patch diffusion, which predicts multiple tokens in parallel, a 260M-parameter BitDance beats 1.4B-parameter parallel AR models at an 8.7x speedup, and at 1024x1024 resolution the speedup over prior AR models exceeds 30x.

Key Contribution

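Forget codebook indices: BitDance uses a binary diffusion head to predict high-entropy binary tokens, reaching the best FID among AR models on ImageNet 256x256 (1.24) and, with next-patch diffusion, beating 1.4B-parameter parallel AR models with 5.4x fewer parameters (260M) and an 8.7x speedup.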

Abstract

We present BitDance, a scalable autoregressive (AR) image generator that predicts binary visual tokens instead of codebook indices. With high-entropy binary latents, BitDance lets each token represent up to 2^{256} states, yielding a compact yet highly expressive discrete representation. Sampling from such a huge token space is difficult with standard classification. To resolve this, BitDance uses a binary diffusion head: instead of predicting an index with softmax, it employs continuous-space diffusion to generate the binary tokens. Furthermore, we propose next-patch diffusion, a new decoding method that predicts multiple tokens in parallel with high accuracy, greatly speeding up inference. On ImageNet 256x256, BitDance achieves an FID of 1.24, the best among AR models. With next-patch diffusion, BitDance beats state-of-the-art parallel AR models that use 1.4B parameters, while using 5.4x fewer parameters (260M) and achieving 8.7x speedup. For text-to-image generation, BitDance trains on large-scale multimodal tokens and generates high-resolution, photorealistic images efficiently, showing strong performance and favorable scaling. When generating 1024x1024 images, BitDance achieves a speedup of over 30x compared to prior AR models. We release code and models to facilitate further research on AR foundation models. Code and models are available at: https://github.com/shallowdream204/BitDance.
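The arithmetic behind two of the abstract's claims can be checked with a short sketch. It assumes each token carries 256 bits (inferred from the "up to 2^{256} states" figure) and that a head producing one value per bit is a fair stand-in for the output width of a continuous-space diffusion head; neither detail is confirmed by the abstract, and the helper names `to_bits` and `from_bits` are hypothetical, not part of the BitDance code.

```python
from typing import List

# Assumption: 256 bits per token, inferred from "up to 2^{256} states".
BITS_PER_TOKEN = 256

# Number of distinct values a single binary token can take.
states_per_token = 2 ** BITS_PER_TOKEN

# A softmax classifier needs one logit per possible state, whereas a head
# that emits one value per bit (assumed here) needs only 256 outputs.
softmax_outputs = states_per_token
per_bit_outputs = BITS_PER_TOKEN

# Parameter ratio quoted in the abstract: 1.4B baseline vs. 260M BitDance.
param_ratio = 1.4e9 / 260e6


def to_bits(index: int, width: int = BITS_PER_TOKEN) -> List[int]:
    """Expand an integer state index into a list of bits, MSB first."""
    return [(index >> (width - 1 - i)) & 1 for i in range(width)]


def from_bits(bits: List[int]) -> int:
    """Collapse a list of bits (MSB first) back into its integer index."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


if __name__ == "__main__":
    print(f"states per token has {len(str(states_per_token))} decimal digits")
    print(f"outputs needed: softmax 2^{BITS_PER_TOKEN} vs per-bit {per_bit_outputs}")
    print(f"parameter ratio: {param_ratio:.1f}x")
```

The round trip through `to_bits` and `from_bits` shows that a codebook index and a bit vector are two views of the same discrete state; what changes is that the bit view never requires enumerating all states, which is why a classifier over indices does not scale to this token space.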

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
