I-CompBench, GenEval, and WISE demonstrate new state-of-the-art performance, especially in enhancing spatial relationships.

2 Related Works

2.1 Autoregressive Models in Image Generation

Inspired by the success of autoregressive generation in large language models [31, 32, 41], the autoregressive (AR) paradigm has gradually been applied to image generation, discretizing images into sequential tokens and generating them step by step through next-token prediction to produce high-quality images. LlamaGen [18] is an early exploration, showing that autoregressive models can achieve competitive results in image generation. Subsequent research has focused on improving efficiency, resolution, and semantic consistency. AiM [16] leverages the Mamba architecture to optimize long-sequence modeling and accelerate inference; Token-Shuffle [20] reduces token count to enable higher-resolution generation; CTF [8] applies coarse-to-fine token prediction to improve quality; GigaTok [40] uses semantic regularization to handle complex visual tokenizers. Open-MAGVIT2 [19] improves semantic consistency with super-large codebooks and sub-token prediction. More recent works propose next-scale/next-X [26] prediction frameworks and AR models based on large-scale continuous tokens (NextStep-1) [30], achieving significant improvements in generation efficiency, quality, and high-resolution capability. However, challenges remain in optimizing intermediate reasoning and ensuring semantic consistency in complex scenarios.

Figure 2: Overview of CoR-Painter: (a) illustration of the text-to-image generation process, and (b) Dual-Objective GRPO. $R_{\text{SA}}$, $R_{\text{SP}}$, and $R_{\text{HA}}$ represent the Semantic Anchoring Reward, Semantic Projection Reward, and Holistic Alignment Reward, respectively.
2.2 CoT Reasoning and RL in Image Generation

Building on CoT reasoning in large language models, researchers have integrated reasoning and reinforcement learning (RL) into autoregressive image generation, advancing structured frameworks to guide the process. BiCoT-GRPO [14] combines semantic- and token-level reasoning with generation rewards. PARM and PARM++ [9] enhance autoregressive generation through stepwise evaluation using potential assessment rewards and reflection mechanisms. GoT [6] and GoT-R1 [3] integrate semantic-spatial reasoning and RL-based rewards to improve compositional and spatial alignment. Research also focuses on reward mechanisms and training methods to reinforce reasoning-guided generation. SUDER [10] and CoRL [15] explore self-supervised dual rewards and co-reinforcement learning for multimodal optimization. FocusDiff [22] uses RL for fine-grained text-image alignment, addressing semantic subtleties. Despite these advancements, existing approaches rarely optimize textual reasoning and image generation separately. Our method addresses this by introducing objective-specific rewards for dedicated optimization while maintaining cross-modal consistency.

3 Method

In this section, we provide the details of CoR-Painter, starting with the generation process and then outlining how to train the model to achieve high-quality images with RL.

3.1 Image Generation Pipeline

As previously stated, our generation process follows the “How-to-What” paradigm for image generation. Given an input prompt, we sequentially perform textual reasoning in terms of “How to draw” and “What to draw”, producing constraint-guided instructions and structured logical descriptions that are then mapped to the image generation process, thereby serving as a bridge between linguistic understanding and the final visual rendering. The pipeline of this process is shown in Fig. 2(a).
We use Janus-Pro [2] as the base model (i.e., the “Auto-regressive Unified MLLM”) to jointly model text and image tokens within a shared representation space. To guide the model in generation, we design an instruction prompt for the MLLM based on the original prompt. The designed prompt specifies the structure of the textual reasoning content by introducing specific tags that define two components: the <thought></thought> part and the <description></description> part. This design ensures that the generated reasoning aligns with the “How to draw” and “What to draw” processes. The constructed instruction prompt is as follows:

Instruction Prompt for the Original Prompt
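
Separately from the instruction prompt itself, whose wording is not reproduced in this sketch, the tagged output format described above can be illustrated with a small parser. The following is a minimal sketch and not the authors' implementation: it assumes the model emits a <thought></thought> part followed by a <description></description> part, and the names `Reasoning` and `parse_reasoning`, as well as the sample string, are hypothetical and introduced only for illustration.

```python
import re
from typing import NamedTuple, Optional


class Reasoning(NamedTuple):
    """Hypothetical container for the two tagged reasoning parts."""
    thought: str       # "How to draw": constraint-guided instructions
    description: str   # "What to draw": structured description


# Non-greedy matching with DOTALL so each tag body may span several lines.
# The pattern requires <thought> to come before <description>, mirroring
# the sequential "How to draw" then "What to draw" order.
_TAGGED = re.compile(
    r"<thought>(.*?)</thought>\s*<description>(.*?)</description>",
    re.DOTALL,
)


def parse_reasoning(text: str) -> Optional[Reasoning]:
    """Return the two tagged parts, or None if the format is not followed."""
    match = _TAGGED.search(text)
    if match is None:
        return None
    thought, description = (part.strip() for part in match.groups())
    if not thought or not description:
        return None  # an empty part counts as a format violation
    return Reasoning(thought, description)


# Made-up model output, for illustration only.
sample = (
    "<thought>Place the cat to the left of the vase; keep both fully visible."
    "</thought>\n"
    "<description>A grey cat sitting to the left of a blue vase on a wooden "
    "table.</description>"
)
parsed = parse_reasoning(sample)
```

Returning `None` for malformed output, rather than raising, is one possible design choice: a caller could then treat a missing or empty part as a format failure and handle it however the training or inference loop requires.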