BIT, HIT, JD.com, SenseTime, SJTU | Jun 17, 2026 | arXiv:2606.19256

X+Slides: Benchmarking Audience-Conditioned Slide Generation

Haodong Chen, Xuanhe Zhou, Wei Zhou, Xinyue Shao, Yanbing Zhu, Bo Wang, Jiawei Hong, Anya Jia, Fan Wu

AI Summary

This paper introduces X+Slides, a benchmark for audience-conditioned slide generation that evaluates how well generated slides meet the needs of different audiences, such as specialists versus decision-makers. Using a diverse corpus and a dynamic evaluation framework built from 8,133 source-grounded probes, the authors report four metrics: Audience Coverage, Domain-wise Coverage, Efficiency, and Correctness. Experiments show that existing systems such as DeepPresenter and SlideTailor recover a substantial but still incomplete share of audience-essential information, which motivates audience-specific, source-grounded evaluation of slide generation.

Key Contribution

Current slide generation systems miss audience-essential information: at $\tau_A = 0.7$, DeepPresenter's best Audience Coverage is only 0.714.

Abstract

Automatically generating slide decks from source documents is an important application of large language models (LLMs). Existing benchmarks primarily assess slide completeness and technical depth, while overlooking the target audience as a critical real-world factor. For instance, specialists demand rigorous proofs, whereas decision-makers prioritize actionable conclusions. To bridge this gap, we introduce X+Slides, a benchmark specifically designed for audience-conditioned slide generation. Built on a diverse corpus spanning 113 topics and seven presentation scenes, X+Slides employs a dynamic evaluation framework constructed from 8,133 deduplicated, source-grounded probes. By assigning audience-specific utility weights to the same source-grounded probes, X+Slides reports four complementary metrics: Audience Coverage measures how much audience-essential information is conveyed, Domain-wise Coverage shows which information types are covered, Efficiency measures delivered utility per unit of attention cost, and Correctness verifies whether slide claims are supported by the source. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems can recover a substantial but still incomplete part of audience-essential information: at $\tau_A = 0.7$, DeepPresenter reaches a best Audience Coverage of 0.714, SlideTailor reaches 0.594, and the NotebookLM ablation reaches 0.853 while showing clear grounding differences. These results indicate that visual quality and broad topic coverage should not be treated as evidence support without source-grounded evaluation.
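The abstract does not give the metric formulas, so the following is only a minimal sketch of how audience-weighted probe scoring could work. It assumes, hypothetically, that each probe carries a per-audience utility weight in [0, 1], that $\tau_A$ is a threshold selecting the probes that count as essential for an audience, that Audience Coverage is the weight-normalized share of essential probes a deck conveys, and that attention cost is a single positive number (for example, a slide count). The names `Probe`, `audience_coverage`, and `efficiency` are illustrative and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One source-grounded probe (hypothetical structure, not the paper's)."""
    probe_id: str
    weights: dict[str, float]  # audience name -> assumed utility weight in [0, 1]
    covered: bool              # whether the generated deck conveys this probe


def audience_coverage(probes: list[Probe], audience: str, tau: float) -> float:
    """Weighted share of essential probes that are covered.

    Assumption: a probe is 'essential' for an audience when its weight >= tau.
    """
    essential = [p for p in probes if p.weights.get(audience, 0.0) >= tau]
    total = sum(p.weights[audience] for p in essential)
    if total == 0:
        return 0.0  # no essential probes for this audience at this threshold
    hit = sum(p.weights[audience] for p in essential if p.covered)
    return hit / total


def efficiency(probes: list[Probe], audience: str, attention_cost: float) -> float:
    """Delivered utility per unit of attention cost (assumed scalar cost)."""
    if attention_cost <= 0:
        raise ValueError("attention_cost must be positive")
    delivered = sum(p.weights.get(audience, 0.0) for p in probes if p.covered)
    return delivered / attention_cost


# The same probes are scored differently depending on the audience.
probes = [
    Probe("proof-detail", {"specialist": 1.0, "decision_maker": 0.25}, covered=True),
    Probe("cost-impact", {"specialist": 0.75, "decision_maker": 1.0}, covered=False),
    Probe("recommendation", {"specialist": 0.25, "decision_maker": 0.75}, covered=True),
]

specialist_cov = audience_coverage(probes, "specialist", tau=0.7)
decision_cov = audience_coverage(probes, "decision_maker", tau=0.7)
specialist_eff = efficiency(probes, "specialist", attention_cost=5)
```

Under these assumptions, one deck can score differently for each audience even though the probe set is fixed, which is the property the benchmark is built around.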

Eval Frameworks & Benchmarks, Natural Language Processing

