AI Laboratory, Fudan, Nankai University, Shanda AI Research Tokyo, Shanghai Innovation, SJTU, University of Science and Technology, WHU
Jun 4, 2026 | arXiv:2606.05949

Faithful, Enriched, and Precise: Benchmarking Natural-Science Illustration Generation by T2I models

Yifan Chang, Jiaxin Ai, Jianwen Sun, Yuandong Pu, Siqi Luo, Liangliang Zhao, Yuchen Ren, Minghao Liu, Yunfei Yu, Yu Qiao, Kaipeng Zhang, Yihao Liu

AI Summary

This paper introduces FEPBench, a novel benchmark for evaluating Text-to-Image (T2I) models specifically in the context of scientific illustration generation. By employing multimodal large language models (MLLMs) and human expert annotations, the authors assess T2I outputs on three critical dimensions: instruction faithfulness, reasoning enrichment, and semantic precision, while also breaking down performance across various visual and textual elements. The results reveal that even leading models struggle with text-rendering and reasoning, highlighting significant areas for improvement in T2I applications for scientific communication.

Key Contribution

Even state-of-the-art T2I models falter in generating scientifically accurate illustrations, revealing critical gaps in text-rendering and reasoning capabilities.

Abstract

Scientific illustrations are essential tools for communicating research findings, especially in natural science, where they visualize complex concepts and processes. As Text-to-Image (T2I) models become increasingly capable, researchers have started to use them for scientific illustration generation. However, existing benchmarks often assess outputs at a holistic level, overlooking fine-grained elements, while scientific reasoning ability and output conciseness remain under-quantified. We introduce FEPBench, a benchmark built from carefully selected high-quality scientific illustrations across multiple disciplines and layout types. With the assistance of multimodal large language models (MLLMs) and human experts, we provide fine-grained atom set annotations and systematically evaluate T2I models along three dimensions: instruction faithfulness, reasoning enrichment, and semantic precision. Our evaluation further decomposes model performance across visual, textual, relation, and layout elements. Results show that even state-of-the-art (SOTA) closed-source models, such as GPT Image 2 and Nano Banana Pro, still suffer from text-rendering bottlenecks, limited reasoning enrichment, and difficulty balancing generation richness with precision. These findings provide practical guidance for improving and deploying T2I models in scientific illustration generation. We will release the benchmark data, atom set annotations, and evaluation code.

Computer Vision | Eval Frameworks & Benchmarks | Scientific Discovery & Drug Design


Year: 2026

Venue: N/A
