Apr 16, 2026 · arXiv:2604.15239

TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens

Jiawei Ren, Michal Jan Tyszkiewicz, Jiahui Huang, Zan Gojcic

AI Summary

This paper introduces TokenGS, a novel 3D Gaussian Splatting (3DGS) prediction method that decouples Gaussian parameter regression from pixel dependencies by directly regressing 3D mean coordinates using a self-supervised rendering loss. By adopting an encoder-decoder architecture with learnable Gaussian tokens, TokenGS overcomes limitations of encoder-only architectures tied to input image resolution and view count. Experiments demonstrate that TokenGS achieves state-of-the-art feed-forward reconstruction performance, improved robustness to pose noise, and emergent scene understanding capabilities like static-dynamic decomposition.

Key Contribution

Forget pixel-aligned Gaussians: TokenGS uses learnable tokens to directly regress 3D Gaussians, unlocking robustness to pose noise and emergent scene understanding.

Abstract

In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss. This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby unbinding the number of predicted primitives from input image resolution and number of views. Our resulting method, TokenGS, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, producing more regularized geometry and more balanced 3DGS distribution, while seamlessly recovering emergent scene attributes such as static-dynamic decomposition and scene flow.
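The contrast the abstract draws between the two parameterizations can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the pinhole intrinsics `K`, and the linear head (`W`, `b`) standing in for a Transformer decoder are all assumptions made for illustration. The first function unprojects one Gaussian mean per pixel from a depth map, so the primitive count is fixed by image resolution. The second regresses means straight from a set of tokens, so the count is whatever number of tokens is chosen.

```python
import numpy as np

def pixel_aligned_means(depth, K, cam_to_world):
    """Depth-along-ray parameterization: one 3D mean per input pixel.

    depth:        (h, w) z-depth per pixel
    K:            (3, 3) pinhole intrinsics (assumed camera model)
    cam_to_world: (4, 4) camera pose
    """
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Homogeneous pixel centers, shape (h, w, 3)
    pix = np.stack([u + 0.5, v + 0.5, np.ones_like(u, dtype=float)], axis=-1)
    rays = pix @ np.linalg.inv(K).T           # camera-space rays with z = 1
    pts_cam = rays * depth[..., None]         # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((h, w, 1))], axis=-1)
    pts_world = pts_h @ cam_to_world.T        # depends on the (possibly noisy) pose
    return pts_world[..., :3].reshape(-1, 3)  # (h * w, 3)

def token_means(tokens, W, b):
    """Direct parameterization: each token yields an xyz mean.

    A single linear head (hypothetical) stands in for the decoder output layer.
    tokens: (n_tokens, d), W: (d, 3), b: (3,)
    """
    return tokens @ W + b                     # (n_tokens, 3)

# Pixel-aligned: the number of means is tied to the 4 x 6 image.
K = np.array([[2.0, 0.0, 3.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]])
means_px = pixel_aligned_means(np.full((4, 6), 2.0), K, np.eye(4))

# Token-based: the number of means is the number of tokens (10 here).
rng = np.random.default_rng(0)
means_tok = token_means(rng.normal(size=(10, 8)), rng.normal(size=(8, 3)), np.zeros(3))
```

In the pixel-aligned case the camera pose enters the mean computation directly, which is one plausible reading of why the abstract reports better robustness to pose noise for the direct formulation.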

Architecture Design (Transformers, SSMs, MoE) · Computer Vision

Citation Metrics

Citations: 0

Influential citations: 0

References: 57

Year: 2026

Venue: N/A
