This paper introduces FEPBench, a benchmark for evaluating Text-to-Image (T2I) models on scientific illustration generation. Using fine-grained annotations produced with multimodal large language models (MLLMs) and human experts, the authors assess T2I outputs on three dimensions: instruction faithfulness, reasoning enrichment, and semantic precision. They also break down performance across visual, textual, relation, and layout elements. The results show that even leading closed-source models struggle with text rendering and reasoning, and have difficulty balancing richness with precision, which points to concrete areas for improvement in T2I applications for scientific communication.
Even state-of-the-art T2I models falter in generating scientifically accurate illustrations, revealing critical gaps in text-rendering and reasoning capabilities.
Scientific illustrations are essential tools for communicating research findings, especially in the natural sciences, where they visualize complex concepts and processes. As Text-to-Image (T2I) models become increasingly capable, researchers have started to use them for scientific illustration generation. However, existing benchmarks often assess outputs at a holistic level, overlooking fine-grained elements, while scientific reasoning ability and output conciseness remain under-quantified. We introduce FEPBench, a benchmark built from carefully selected high-quality scientific illustrations across multiple disciplines and layout types. With the assistance of multimodal large language models (MLLMs) and human experts, we provide fine-grained atom set annotations and systematically evaluate T2I models along three dimensions: instruction faithfulness, reasoning enrichment, and semantic precision. Our evaluation further decomposes model performance across visual, textual, relation, and layout elements. Results show that even state-of-the-art (SOTA) closed-source models, such as GPT Image 2 and Nano Banana Pro, still suffer from text-rendering bottlenecks, limited reasoning enrichment, and difficulty balancing generation richness with precision. These findings provide practical guidance for improving and deploying T2I models in scientific illustration generation. We will release the benchmark data, atom set annotations, and evaluation code.
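To make the atom-based evaluation concrete, the following is a minimal sketch of how scores along the three dimensions and the four element types could be aggregated from atom-level judgments. The abstract does not give the metric definitions, so everything here is an assumption for illustration: the `Atom` fields, the split of reference atoms into "instruction" and "reasoning" sources, the coverage-style scoring, and the function names are hypothetical and are not FEPBench's released code.

```python
from dataclasses import dataclass

# Element types named in the abstract.
ELEMENT_TYPES = ("visual", "textual", "relation", "layout")


@dataclass(frozen=True)
class Atom:
    """One annotated unit of a reference illustration (hypothetical schema)."""
    id: str
    element_type: str  # one of ELEMENT_TYPES
    source: str        # assumed: "instruction" (stated in the prompt) or "reasoning" (must be inferred)


def coverage(atoms, matched_ids):
    """Fraction of the given atoms judged present in the generated image."""
    if not atoms:
        return None  # undefined when there is nothing to cover
    return sum(a.id in matched_ids for a in atoms) / len(atoms)


def score_image(reference, matched_ids, generated_total, generated_correct):
    """Aggregate atom-level judgments into per-dimension and per-element scores.

    Assumed definitions (not taken from the paper):
      - instruction faithfulness: coverage of atoms stated in the prompt
      - reasoning enrichment: coverage of atoms the model had to infer
      - semantic precision: share of generated atoms judged correct
    """
    instruction = [a for a in reference if a.source == "instruction"]
    reasoning = [a for a in reference if a.source == "reasoning"]
    dimensions = {
        "instruction_faithfulness": coverage(instruction, matched_ids),
        "reasoning_enrichment": coverage(reasoning, matched_ids),
        "semantic_precision": (
            generated_correct / generated_total if generated_total else None
        ),
    }
    per_element = {
        t: coverage([a for a in reference if a.element_type == t], matched_ids)
        for t in ELEMENT_TYPES
    }
    return dimensions, per_element
```

Under these assumptions, `matched_ids` would come from an MLLM or human judge deciding which reference atoms appear in the generated image, and `generated_total` and `generated_correct` from a separate pass over what the image actually contains. Keeping precision separate from the two coverage scores reflects the tension the abstract describes between generation richness and precision: a model can raise coverage by adding more content while lowering the share of that content that is correct.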