This paper introduces a reward-modeling approach to improve spatial understanding in text-to-image generation. The authors construct SpatialReward-Dataset, a set of over 80k preference pairs, and train a reward model, SpatialScore, to evaluate the accuracy of spatial relationships in generated images. Online reinforcement learning with SpatialScore as the reward significantly improves spatial understanding in generated images, outperforming existing models on spatial relationship benchmarks.
A reward model trained on spatial-relationship preferences surpasses leading proprietary models at evaluating spatial accuracy in text-to-image outputs, and enables online RL that improves spatial generation.
Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity, particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Building on this dataset, we develop SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation, achieving performance that even surpasses leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.
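To make the idea of training a reward model on preference pairs concrete, here is a minimal sketch of a pairwise objective and a pairwise accuracy check. The abstract does not state which training objective SpatialScore uses; the Bradley-Terry loss below is a common choice for preference data and is an assumption here, as are the function names and the example scores.

```python
import math


def bradley_terry_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred image outranks the rejected one.

    Assumption: a Bradley-Terry objective; the source does not specify the loss.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def pairwise_accuracy(scored_pairs):
    """Fraction of pairs in which the preferred image receives the higher score."""
    correct = sum(1 for s_pref, s_rej in scored_pairs if s_pref > s_rej)
    return correct / len(scored_pairs)


# Hypothetical reward scores for (preferred, rejected) images of the same prompt.
example_pairs = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0), (3.0, 0.0)]
mean_loss = sum(bradley_terry_loss(p, r) for p, r in example_pairs) / len(example_pairs)
accuracy = pairwise_accuracy(example_pairs)
```

Under this objective the loss shrinks as the reward model scores the spatially correct image further above the incorrect one, and pairwise accuracy is one simple way to compare reward models on held-out preference pairs.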