Beihang, PKU | Jun 1, 2026 | arXiv:2606.02320

TVIR: Building Deep Research Agents Towards Text-Visual Interleaved Report Generation

Xinkai Ma, Zhiqi Bai, Dingling Zhang, Pei Liu, Yishuo Yuan, He Zhu, Jiakai Wang, Qianqian Xie, Yifan Zhao, Xinlong Yang, Hao Cong, Zhiheng Yao, Fengxia Xie, Feng Xie, Zihao Xu, Haoran Xu, Zhaohui Wang, Minghao Liu, Shirong Lin, Yingshui Tan, Yuchi Xu, Wenbo Su, Zhaoxiang Zhang, Bo Zheng, Jiaheng Liu

AI Summary

This paper introduces TVIR, a framework for Text-Visual Interleaved Report Generation that addresses the text-centric focus of existing deep research benchmarks by requiring visual elements that support the analysis. The authors present TVIR-Bench, a benchmark of 100 expert-curated multimodal tasks, and TVIR-Agent, a hierarchical multi-agent system that constructs outlines, retrieves images, generates charts with traceable sources, and writes the report section by section. In experiments across nine deep research systems, TVIR-Agent achieves strong overall performance, which the authors present as evidence that explicit multimodal design and evaluation matter for evidence-driven report generation.

Key Contribution

TVIR combines a benchmark (TVIR-Bench), a baseline multi-agent system (TVIR-Agent), and a dual-path evaluation framework that assesses both the text and the visual elements of generated reports.

Abstract

Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text-Visual Interleaved Report Generation), which includes TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals, and TVIR-Agent, a hierarchical multi-agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.
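One way to read "context-aware sequential writing" is that each section is drafted in outline order with the previously written text available as context. The following is a minimal sketch under that assumption, not the authors' implementation: the `Section` structure, `compose_report`, and `stub_writer` are hypothetical names introduced here for illustration, and the stub stands in for whatever model call a real system would make.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Section:
    """One outline entry (hypothetical structure, not taken from the paper)."""
    title: str
    visuals: List[str] = field(default_factory=list)  # IDs of charts/images planned for this section


def compose_report(
    outline: List[Section],
    write_section: Callable[[Section, str], str],
) -> str:
    """Write sections in outline order, passing all earlier text as context."""
    written: List[str] = []
    for section in outline:
        context = "\n\n".join(written)  # everything written so far
        written.append(write_section(section, context))
    return "\n\n".join(written)


def stub_writer(section: Section, context: str) -> str:
    """Stand-in for a model call: records how much prior text it could see."""
    refs = "".join(f"\n![{v}]({v})" for v in section.visuals)
    return f"## {section.title}\n(prior context: {len(context)} chars){refs}"


outline = [
    Section("Market overview", visuals=["chart_revenue.png"]),
    Section("Outlook"),
]
report = compose_report(outline, stub_writer)
print(report)
```

In this sketch only the first section sees an empty context; every later section receives the accumulated text, which is what would let a writer keep its references to earlier charts and claims consistent.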

Eval Frameworks & Benchmarks | Multimodal Models


Year: 2026

Venue: N/A
