The paper introduces Selective Abstraction (SA), a framework for improving the reliability of long-form text generation by allowing LLMs to selectively reduce the specificity of uncertain content instead of abstaining entirely. The authors formalize SA using selective risk and coverage metrics and propose Atom-wise Selective Abstraction, which decomposes responses into atomic claims and replaces uncertain claims with more general abstractions. Empirical evaluation on the FactScore and LongFact-Objects benchmarks demonstrates that Atom-wise SA significantly improves the risk-coverage trade-off compared to claim removal, boosting AURC by up to 27.73% across six open-source models.
LLMs can significantly boost factual accuracy in long-form generation by strategically "toning down" uncertain details, rather than simply omitting them.
LLMs are widely used, yet they remain prone to factual errors that erode user trust and limit adoption in high-risk settings. One approach to mitigating this risk is to equip models with uncertainty estimation mechanisms that abstain when confidence is low. However, this binary "all-or-nothing" approach is excessively restrictive in long-form settings, often discarding valuable information. We introduce Selective Abstraction (SA), a framework that enables LLMs to trade specificity for reliability by selectively reducing the detail of uncertain content. We first formalize SA through the lenses of selective risk and coverage. We then propose Atom-wise Selective Abstraction, a claim-level instantiation that decomposes responses into atomic claims (short, self-contained statements each expressing a single fact) and replaces uncertain atoms with higher-confidence, less specific abstractions. To evaluate this framework, we develop a novel end-to-end pipeline for open-ended generation that instantiates risk as factual correctness and measures coverage using an information-theoretic measure of retained information. Across six open-source models on the FactScore and LongFact-Objects benchmarks, atom-wise SA consistently outperforms existing baselines, improving the area under the risk-coverage curve (AURC) by up to 27.73% over claim removal, demonstrating that reducing specificity can boost accuracy and reliability while preserving most of the original meaning.
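The mechanics described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the atomic claims, confidence scores, and hand-written abstractions are hypothetical inputs (in practice the decomposition, uncertainty estimates, and abstractions would come from an LLM pipeline), and selective risk is computed here from given correctness labels rather than an automated fact checker.

```python
# Toy sketch of Atom-wise Selective Abstraction (illustrative; not the paper's code).
# Assumes each response is pre-decomposed into atomic claims, each with a
# confidence score and a less specific, higher-confidence "abstraction".
from dataclasses import dataclass

@dataclass
class Atom:
    claim: str          # specific atomic claim
    confidence: float   # estimated probability the claim is correct
    abstraction: str    # less specific, higher-confidence version

def abstract_response(atoms, threshold):
    """Keep confident atoms verbatim; abstract the uncertain ones."""
    return [a.claim if a.confidence >= threshold else a.abstraction
            for a in atoms]

def risk_coverage_curve(atoms, correct, thresholds):
    """Selective risk = error rate among atoms kept at full specificity;
    coverage = fraction of atoms kept at full specificity."""
    curve = []
    for t in thresholds:
        kept = [c for a, c in zip(atoms, correct) if a.confidence >= t]
        coverage = len(kept) / len(atoms)
        risk = 0.0 if not kept else sum(1 - c for c in kept) / len(kept)
        curve.append((coverage, risk))
    return curve

# Hypothetical example inputs.
atoms = [
    Atom("The observatory opened on 12 March 1923", 0.9,
         "The observatory opened in the 1920s"),
    Atom("Its main telescope has a 2.4-metre mirror", 0.3,
         "Its main telescope has a mirror over a metre wide"),
]
print(abstract_response(atoms, threshold=0.5))
```

Sweeping the threshold trades coverage for risk: at a low threshold every specific claim survives (high coverage, higher risk); at a high threshold uncertain atoms are abstracted away (lower coverage among specific claims, lower risk), which is the curve AURC summarizes.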