Jun 1, 2026 · arXiv:2606.01811

"I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise

Matthew Khoriaty, David Williams-King, Shi Feng

AI Summary

This paper introduces the "Decan" metric, $D_{Ca_n} = C \times a_n$, which uses in-context learning to measure the diversity of AI-generated and human-written response sets without additional training, an embedding model, a reference corpus, or human labels. The score is computed per byte from a base language model's per-token log-probabilities, using a single forward pass per permutation. On the McDiv prompt_gen set, where it performs best, the metric reaches OCA 0.846, behind the strongest neural baseline reported by Tevet and Berant (SentBERT, 0.897). On the OLMo-2-7B post-training pipeline, it drops monotonically across the base, SFT, DPO, and RLVR stages.

Key Contribution

The Decan metric quantifies the diversity of AI-generated and human-written text without additional training, and it registers a monotonic drop in diversity across the OLMo-2-7B post-training stages (base, SFT, DPO, RLVR).

Abstract

Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quantifying creative behavior in both AI and human writing. We propose a new approach to measuring diversity using in-context learning, of which the "Decan" metric, $D_{Ca_n} = C \times a_n$, is the working instance we evaluate: a per-byte score read off the per-token log-probabilities of a base model $\theta$ in a single forward pass per permutation, with no embedding model, no reference corpus, and no human labels. This approach is grounded in information theory, makes use of language model in-context learning to detect a wide range of similarities between any number of inputs, and obviates the need to train a special-purpose model. The same pipeline scores AI samples and human-written response sets, with diversity treated as a property of (responses, prompt, scoring model). On Tevet and Berant's human-grounded McDiv benchmark, $D_{Ca_n}$ reaches OCA 0.846 on the McDiv prompt_gen set where it performs best, behind the strongest neural baseline reported in Tevet and Berant (SentBERT, 0.897). On the OLMo-2-7B post-training pipeline, $D_{Ca_n}$ drops monotonically across the base $\to$ SFT $\to$ DPO $\to$ RLVR stages, detecting the type of diversity loss that creative-writing applications care about.
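The per-byte, per-permutation scoring described in the abstract can be illustrated with a minimal sketch. It assumes the per-token log-probabilities (natural log) have already been obtained from a base model with all responses placed in one context, and it only computes the average per-byte surprise at each context position across permutations. The function names `per_byte_surprise` and `progressive_curve` are hypothetical, and because this page does not define $C$ or $a_n$, the sketch stops at the surprise curve and does not compute $D_{Ca_n}$ itself.

```python
import math
from statistics import mean


def per_byte_surprise(token_logprobs, text):
    """Surprise of one response in bits per UTF-8 byte.

    token_logprobs: natural-log probabilities of the response's tokens,
    each conditioned on everything earlier in the context (assumed input).
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("response text must be non-empty")
    total_nats = -sum(token_logprobs)
    # Convert nats to bits, then normalize by byte length.
    return total_nats / (math.log(2) * n_bytes)


def progressive_curve(permutations):
    """Mean per-byte surprise at each context position.

    permutations: list of orderings; each ordering is a list of
    (text, token_logprobs) pairs in the order they appeared in context.
    """
    n = len(permutations[0])
    if any(len(p) != n for p in permutations):
        raise ValueError("all permutations must have the same length")
    curve = []
    for k in range(n):
        scores = [per_byte_surprise(p[k][1], p[k][0]) for p in permutations]
        curve.append(mean(scores))
    return curve


# Toy data (not real model output): the second response is cheaper to
# encode once the first has been seen in context.
ln2 = math.log(2)
ordering = [("ab", [-ln2, -ln2]), ("ab", [-ln2 / 2, -ln2 / 2])]
curve = progressive_curve([ordering, ordering])
```

Under these assumptions, a curve that falls quickly with position suggests that later responses are predictable from earlier ones, while a flat curve suggests each response stays surprising; how the paper turns such quantities into $C$ and $a_n$ is not stated here.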

Eval Frameworks & Benchmarks · Natural Language Processing

Year: 2026

Venue: N/A
