This paper investigates scaling laws for geometry-free view synthesis transformers, comparing encoder-decoder and decoder-only architectures. The authors demonstrate that encoder-decoder architectures can be compute-optimal for novel view synthesis, contrary to previous findings that favored decoder-only models. They introduce the Scalable View Synthesis Model (SVSM), an encoder-decoder architecture that achieves state-of-the-art performance on real-world NVS benchmarks with reduced training compute, establishing a superior performance-compute Pareto frontier.
Encoder-decoder architectures can be compute-optimal for novel view synthesis: SVSM scales as effectively as decoder-only transformers and surpasses the previous state of the art with substantially less training compute.
Geometry-free view synthesis transformers have recently achieved state-of-the-art performance in Novel View Synthesis (NVS), outperforming traditional approaches that rely on explicit geometry modeling. Yet the factors governing their scaling with compute remain unclear. We present a systematic study of scaling laws for view synthesis transformers and derive design principles for training compute-optimal NVS models. Contrary to prior findings, we show that encoder-decoder architectures can be compute-optimal; we trace earlier negative results to suboptimal architectural choices and comparisons across unequal training compute budgets. Across several compute levels, we demonstrate that our encoder-decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance-compute Pareto frontier, and surpasses the previous state-of-the-art on real-world NVS benchmarks with substantially reduced training compute.
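To make the notion of a performance-compute Pareto frontier concrete, here is a minimal sketch of how one might extract it from a set of training runs. It is an illustration only, not the authors' evaluation code: the function name `pareto_frontier`, the choice of PSNR as the quality metric, and every number in `runs` are assumptions made up for the example and are not results from the paper.

```python
def pareto_frontier(points):
    """Return the (compute, score) pairs that no other pair dominates.

    A pair is dominated if another pair uses no more compute and
    reaches a score at least as high (lower compute and higher score are better).
    """
    frontier = []
    best_score = float("-inf")
    # Sort by compute ascending; on ties, put the higher score first.
    for compute, score in sorted(points, key=lambda p: (p[0], -p[1])):
        # Keep a run only if it beats every cheaper (or equally cheap) run.
        if score > best_score:
            frontier.append((compute, score))
            best_score = score
    return frontier


# Hypothetical (training FLOPs, PSNR in dB) pairs, NOT values from the paper.
runs = [(1e18, 22.0), (2e18, 21.5), (4e18, 24.0), (8e18, 23.9), (1.6e19, 25.5)]
frontier = pareto_frontier(runs)
print(frontier)
```

Under this definition, one model family has a superior frontier to another when, at each compute budget, its frontier reaches a score at least as high as the other's.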