ACE Robotics | CUHK | Jun 11, 2026 | arXiv:2606.13679

InterleaveThinker: Reinforcing Agentic Interleaved Generation

Dian Zheng, Harry Lee, Ha-Ram Lee, Manyuan Zhang, Kaituo Feng, Zoey Guo, Ray Zhang, Hongsheng Li

AI Summary

This paper introduces InterleaveThinker, a multi-agent pipeline that gives existing image generators interleaved (text-image sequence) generation capabilities, which matter for applications such as visual narratives, guidance, and embodied manipulation. A planner agent organizes the image-text sequence and instructs the generator at each step; a critic agent evaluates the outputs, flags samples that deviate from the plan, and refines the instructions for regeneration. The critic is reinforced with single-step RL (GRPO) using an accuracy reward and a step-wise reward, because a single trajectory may involve over 25 generator calls and optimizing it end to end is computationally impractical. On interleaved generation benchmarks the method performs comparably to Nano Banana and GPT-5, and it also improves base models on reasoning-based benchmarks, for example 4-step FLUX.2-klein on WISE and RISE.

Key Contribution

InterleaveThinker equips standard image generators with interleaved generation, reaching performance comparable to Nano Banana and GPT-5 on interleaved generation benchmarks while also improving the base models on reasoning-based benchmarks.

Abstract

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (producing a text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. We then develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose an accuracy reward and a step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
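The planner, generator, and critic loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Step`, `Verdict`, `run_interleaved`, and `max_retries`, as well as the toy planner, generator, and critic, are hypothetical stand-ins for the real agents and image model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    text: str         # text segment of the interleaved output (hypothetical field)
    instruction: str  # instruction handed to the image generator


@dataclass
class Verdict:
    ok: bool                                   # does the image follow the instruction?
    revised_instruction: Optional[str] = None  # critic's refined instruction, if any


def run_interleaved(
    prompt: str,
    planner: Callable[[str], List[Step]],
    generator: Callable[[str], str],
    critic: Callable[[str, str], Verdict],
    max_retries: int = 2,  # assumed retry budget; not specified in the abstract
) -> List[Tuple[str, str]]:
    """Plan the sequence, then generate and critique one step at a time."""
    outputs = []
    for step in planner(prompt):
        instruction = step.instruction
        image = generator(instruction)
        for _ in range(max_retries):
            verdict = critic(instruction, image)
            if verdict.ok:
                break
            # Regenerate with the critic's refined instruction.
            instruction = verdict.revised_instruction or instruction
            image = generator(instruction)
        outputs.append((step.text, image))
    return outputs


# Toy stand-ins so the sketch runs without any model.
calls: List[str] = []


def toy_planner(prompt: str) -> List[Step]:
    return [
        Step("Crack the egg.", "egg cracked into bowl"),
        Step("Whisk it.", "whisked egg"),
    ]


def toy_generator(instruction: str) -> str:
    calls.append(instruction)
    return f"<image:{instruction}>"


def toy_critic(instruction: str, image: str) -> Verdict:
    # Pretend an instruction is only satisfied once it asks for detail.
    if "detailed" in instruction:
        return Verdict(ok=True)
    return Verdict(ok=False, revised_instruction=instruction + ", detailed")


result = run_interleaved("how to make an omelette", toy_planner, toy_generator, toy_critic)
```

In this toy run every step is rejected once and regenerated, so a two-step plan costs four generator calls. That multiplication of calls is consistent with the abstract's point that full-trajectory optimization becomes expensive, which is why the rewards are applied at the single-step level.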

Computer Vision, Multimodal Models, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 53

Year: 2026

Venue: N/A
