EpiBench is a new benchmark that evaluates multimodal agents on multi-turn research workflows requiring proactive literature search, evidence integration from figures and tables, and sustained memory use. Agents must navigate papers over multiple turns to align evidence and answer questions that demand cross-paper comparisons and multi-figure integration. Experiments reveal that even state-of-the-art models achieve only 29.23% accuracy on the hard split, exposing significant limitations in current agents' ability to carry out complex research tasks.
Current multimodal agents are surprisingly bad at research workflows, struggling to integrate evidence across papers and figures in multi-turn settings.
Scientific research follows multi-turn, multi-step workflows that require proactively searching the literature, consulting figures and tables, and integrating evidence across papers to align experimental settings and support reproducible conclusions. This joint capability is not systematically assessed by existing benchmarks, which largely under-evaluate proactive search, multi-evidence integration, and sustained evidence use over time. In this work, we introduce EpiBench, an episodic, multi-turn, multimodal benchmark that instantiates short research workflows. Given a research task, agents must navigate across papers over multiple turns, align evidence from figures and tables, and use the evidence accumulated in memory to answer objective questions that require cross-paper comparisons and multi-figure integration. EpiBench introduces a process-level evaluation framework for fine-grained testing and diagnosis of research agents. Our experiments show that even the leading model achieves only 29.23% accuracy on the hard split, indicating substantial room for improvement in multi-turn, multi-evidence research workflows. EpiBench thus provides an evaluation platform for verifiable and reproducible research agents.
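To make the episodic setup concrete, here is a minimal Python sketch of what such a multi-turn evaluation loop might look like. The `Episode` and `AgentState` types, the `read`/`answer` actions, and the scripted agent are illustrative assumptions, not EpiBench's actual interface.

```python
# Hypothetical sketch of an episodic multi-turn evaluation loop.
# All names below are assumptions for illustration, not EpiBench's API.
from dataclasses import dataclass, field

@dataclass
class Episode:
    """One multi-turn research task with an objective, checkable answer."""
    task: str
    evidence: dict[str, str]   # evidence id -> figure/table content
    answer: str                # gold answer to the final question
    max_turns: int = 8

@dataclass
class AgentState:
    memory: list[str] = field(default_factory=list)  # accumulated evidence

def run_episode(episode: Episode, agent) -> bool:
    """Roll out one episode: the agent consults evidence over multiple
    turns, accumulating it in memory, then commits a final answer."""
    state = AgentState()
    for _ in range(episode.max_turns):
        action, arg = agent.act(episode.task, state.memory)
        if action == "read":       # consult a figure/table from a paper
            state.memory.append(episode.evidence.get(arg, "not found"))
        elif action == "answer":   # final answer, scored objectively
            return arg.strip() == episode.answer
    return False                   # ran out of turns without answering

def accuracy(episodes: list[Episode], agent) -> float:
    results = [run_episode(ep, agent) for ep in episodes]
    return sum(results) / len(results)

# Usage with a trivial scripted agent: read one figure, then answer.
class ScriptedAgent:
    def act(self, task: str, memory: list[str]):
        return ("read", "fig1") if not memory else ("answer", "42")

episodes = [Episode(task="Compare reported accuracy across two papers",
                    evidence={"fig1": "Table 2: accuracy = 42"},
                    answer="42")]
print(accuracy(episodes, ScriptedAgent()))  # 1.0
```

A process-level framework like the one the abstract describes could score not only the final answer (as above) but also intermediate turns, e.g., whether the agent read the relevant evidence before answering.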