This study investigates output diversity collapse in post-trained language models and finds that the loss of varied outputs is closely linked to the composition of the training data, not only to the post-training method. By tracing three post-training lineages of Olmo 3 (Think, Instruct, and RL-Zero) across 15 tasks, the authors find that the Think lineage loses most of its semantic diversity during supervised fine-tuning, while direct preference optimization (DPO) has a larger effect in the Instruct lineage. The study concludes that diversity collapse is determined by training data choices and cannot be addressed at inference time alone.
Output diversity in post-trained models collapses due to training data composition, not just post-training methods, challenging assumptions about inference-time fixes.
Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples, and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods, without separating the role of training data composition from the method, or the generation format from the model weights. We trace output diversity through three parallel post-training lineages of Olmo 3: Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, across 15 tasks and four text diversity metrics. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. Suppressing chain-of-thought reasoning at inference in Think models drops accuracy on hard tasks, yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs) reveals that the split is task-dependent, and Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate. Our results indicate that diversity collapse is determined during training by data composition and cannot be addressed at inference time alone.
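The decomposition described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact formulas or name the four metrics, so the code assumes a toy diversity metric (the number of distinct outputs) and an additive split in which the quality-control component is the diversity the base model loses when its incorrect outputs are filtered out, and the residual is whatever loss remains. The names `distinct_count`, `decompose_diversity_loss`, and `is_correct` are hypothetical.

```python
from typing import Callable, Dict, Sequence


def distinct_count(samples: Sequence[str]) -> int:
    """Toy diversity metric (an assumption): number of distinct outputs."""
    return len(set(samples))


def decompose_diversity_loss(
    base_samples: Sequence[str],
    post_samples: Sequence[str],
    is_correct: Callable[[str], bool],
) -> Dict[str, int]:
    """Split total diversity loss into quality-control and residual parts.

    Assumed definitions (not taken from the paper):
      total           = D(base) - D(post)
      quality_control = D(base) - D(correct base outputs only)
      residual        = total - quality_control
    """
    base_all = distinct_count(base_samples)
    base_correct = distinct_count([s for s in base_samples if is_correct(s)])
    post_all = distinct_count(post_samples)

    total = base_all - post_all
    quality_control = base_all - base_correct
    residual = total - quality_control
    return {
        "total": total,
        "quality_control": quality_control,
        "residual": residual,
    }


# Hypothetical samples for one verifiable task with two acceptable answers.
base = ["x=2", "x=2.0", "x=3", "x=5"]
post = ["x=2", "x=2", "x=2", "x=2"]
result = decompose_diversity_loss(base, post, lambda s: s in {"x=2", "x=2.0"})
print(result)
```

In this toy case, two of the three lost distinct outputs were incorrect answers (quality control) and one was a correct answer the post-trained model no longer produces (residual). A semantic metric such as embedding distance could replace `distinct_count`, although the components are then no longer guaranteed to be non-negative.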