NUS | BIT | Edinburgh | NTU | Apr 12, 2026 | arXiv:2604.10741

Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation

Fangda Ye, Zhifei Xie, Yuxin Hu, Yihang Yin, Shurui Huang, Shikai Dong, Jianzhu Bao, Shuicheng Yan

AI Summary

The paper introduces Deep-Reporter, a novel agentic framework designed for grounded multimodal long-form generation, addressing the limitations of existing text-centric agentic search frameworks. Deep-Reporter orchestrates agentic multimodal search and filtering, checklist-guided incremental synthesis, and recurrent context management. The authors curate a dataset of 8K high-quality agentic traces and introduce M2LongBench, a comprehensive benchmark, demonstrating that effective post-training can improve multimodal selection and integration for this challenging task.

Key Contribution

Deep-Reporter moves agentic search beyond text-only pipelines, showing how to build multimodal agents that draw on both text and visuals for grounded long-form generation.

Abstract

Recent agentic search frameworks enable deep research via iterative planning and retrieval, reducing hallucinations and enhancing factual grounding. However, they remain text-centric, overlooking the multimodal evidence that characterizes real-world expert reports. We introduce a pressing task: multimodal long-form generation. Accordingly, we propose Deep-Reporter, a unified agentic framework for grounded multimodal long-form generation. It orchestrates: (i) Agentic Multimodal Search and Filtering to retrieve and filter textual passages and information-dense visuals; (ii) Checklist-Guided Incremental Synthesis to ensure coherent image-text integration and optimal citation placement; and (iii) Recurrent Context Management to balance long-range coherence with local fluency. We develop a rigorous curation pipeline producing 8K high-quality agentic traces for model optimization. We further introduce M2LongBench, a comprehensive testbed comprising 247 research tasks across 9 domains and a stable multimodal sandbox. Extensive experiments demonstrate that long-form multimodal generation is a challenging task, especially in multimodal selection and integration, and effective post-training can bridge the gap.
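The three stages named in the abstract can be pictured with a small control-flow sketch. This is not the authors' implementation: every name here (`Evidence`, `Context`, `filter_evidence`, `write_section`, `update_context`, `generate_report`, `toy_retrieve`) is hypothetical, the relevance threshold and summary length are arbitrary, and the model calls are replaced by simple string stubs. It only illustrates, under those assumptions, how filtering, checklist-driven section writing, and a rolling context might fit together.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    source_id: str   # hypothetical identifier used as the citation key
    kind: str        # "text" or "image"
    content: str     # passage text, or a caption for an image
    score: float     # assumed relevance score in [0, 1]


@dataclass
class Context:
    summary: str = ""        # compressed memory of earlier sections
    last_section: str = ""   # most recent section, kept verbatim


def filter_evidence(pool, min_score=0.5):
    """Keep evidence at or above the threshold, highest score first."""
    kept = [e for e in pool if e.score >= min_score]
    return sorted(kept, key=lambda e: e.score, reverse=True)


def write_section(item, evidence, ctx):
    """Stub for a model call; a real writer would also condition on ctx."""
    cites = " ".join(f"[{e.source_id}]" for e in evidence)
    figures = "".join(
        f"\n![{e.content}]({e.source_id})" for e in evidence if e.kind == "image"
    )
    return f"## {item}\nDiscussion of {item}. {cites}{figures}"


def update_context(ctx, item, section, max_summary_chars=200):
    """Append the finished item to a bounded summary; keep the last section."""
    summary = (ctx.summary + " " + item).strip()[-max_summary_chars:]
    return Context(summary=summary, last_section=section)


def generate_report(checklist, retrieve):
    """Write one section per checklist item, carrying context forward."""
    ctx = Context()
    sections = []
    for item in checklist:
        evidence = filter_evidence(retrieve(item))
        section = write_section(item, evidence, ctx)
        sections.append(section)
        ctx = update_context(ctx, item, section)
    return "\n\n".join(sections), ctx


def toy_retrieve(item):
    """Hypothetical retriever returning fixed evidence for any query."""
    return [
        Evidence("doc-1", "text", "a relevant passage", 0.9),
        Evidence("fig-1", "image", "a chart caption", 0.7),
        Evidence("low-1", "text", "a weak match", 0.2),
    ]
```

Calling `generate_report(["Market size", "Risks"], toy_retrieve)` yields a two-section draft in which the low-scoring item is dropped and the image is placed next to its citation. Keeping only a bounded summary plus the last section is one simple way to trade long-range coherence against prompt length; the paper's actual mechanism may differ.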

Multimodal Models | Recommendation & Information Retrieval | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
