EduIllustrate is introduced as a benchmark for evaluating LLMs on their ability to generate interleaved text-diagram explanations for K-12 STEM problems, comprising 230 problems across five subjects and three grade levels. The benchmark uses a standardized generation protocol with sequential anchoring to ensure cross-diagram visual consistency and an 8-dimension evaluation rubric. Experiments across ten LLMs reveal that Gemini 3.0 Pro Preview achieves the highest score (87.8%), while sequential anchoring improves visual consistency by 13% at 94% lower cost.
LLMs can now generate coherent, diagram-rich explanations for K-12 STEM problems with high accuracy, opening new avenues for automated educational content creation.
Large language models are increasingly used as educational assistants, yet evaluation of their educational capabilities remains concentrated on question-answering and tutoring tasks. A critical gap exists for multimedia instructional content generation: the ability to produce coherent, diagram-rich explanations that combine geometrically accurate visuals with step-by-step reasoning. We present EduIllustrate, a benchmark for evaluating LLMs on interleaved text-diagram explanation generation for K-12 STEM problems. The benchmark comprises 230 problems spanning five subjects and three grade levels, a standardized generation protocol with sequential anchoring to enforce cross-diagram visual consistency, and an 8-dimension evaluation rubric grounded in multimedia learning theory covering both text and visual quality. Evaluation of ten LLMs reveals a wide performance spread: Gemini 3.0 Pro Preview leads at 87.8%, while Kimi-K2.5 achieves the best cost-efficiency (80.8% at $0.12/problem). Workflow ablation confirms sequential anchoring improves Visual Consistency by 13% at 94% lower cost. Human evaluation with 20 expert raters validates LLM-as-judge reliability for objective dimensions (ρ ≥ 0.83) while revealing limitations on subjective visual assessment.
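The abstract names sequential anchoring as the mechanism behind cross-diagram visual consistency but does not spell out the protocol. A minimal sketch of one plausible reading, assuming each diagram call is conditioned on the specs of all earlier diagrams in the same explanation (the `generate_diagram` function is a hypothetical stand-in for a real LLM call, not the paper's actual prompt format):

```python
# Hedged sketch of sequential anchoring: every diagram in an explanation is
# generated with the visual specs of the earlier diagrams as context, so
# shared elements (colors, labels, layout) can stay consistent across the
# sequence. All names here are illustrative assumptions.

def generate_diagram(step_text: str, anchors: list[str]) -> str:
    """Stand-in for an LLM diagram-generation call conditioned on anchors."""
    spec = f"diagram for: {step_text}"
    if anchors:
        # Prior specs are passed back in as reuse context.
        spec += " | reuse: " + "; ".join(anchors)
    return spec

def explain_with_sequential_anchoring(steps: list[str]) -> list[str]:
    anchors: list[str] = []
    diagrams: list[str] = []
    for step in steps:
        spec = generate_diagram(step, anchors)
        diagrams.append(spec)
        # Each generated spec becomes an anchor for every later diagram,
        # which is what enforces cross-diagram consistency.
        anchors.append(spec)
    return diagrams

diagrams = explain_with_sequential_anchoring(
    ["draw the triangle", "add the altitude", "label the right angle"]
)
```

Because each call sees only the accumulated text specs rather than the rendered images, a scheme like this would need only one generation pass per diagram, which is consistent with the abstract's claim that anchoring is far cheaper than alternatives.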