HBKU | Feb 15, 2026 | arXiv:2602.14189

Knowing When Not to Answer: Abstention-Aware Scientific Reasoning

Samir Abdaljalil, Erchin Serpedin, Hasan Kurban

AI Summary

This paper introduces an abstention-aware framework for scientific claim verification that decomposes each claim into minimal conditions, audits each condition against the evidence using NLI, and then decides whether to support, refute, or abstain. The authors evaluate the framework on SciFact and PubMedQA with six diverse language models and find that abstention substantially reduces risk even when accuracy improvements are limited. The key finding is that, in scientific reasoning, determining when the evidence is sufficient to justify an answer matters more than selecting a single best model.
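The decompose, audit, and decide steps can be illustrated with a minimal Python sketch. The summary does not specify the aggregation rule, so everything here is an assumption made for illustration: the `ConditionAudit` type, the `decide` function, the 0.7 threshold, and the rule itself (refute on any confident contradiction, support only if every condition is confidently entailed, otherwise abstain). The per-condition labels are taken as given rather than produced by a real NLI model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConditionAudit:
    """Hypothetical record of one audited condition."""
    condition: str     # one minimal condition extracted from the claim
    label: str         # "entailment", "contradiction", or "neutral"
    confidence: float  # probability assigned to `label`, in [0, 1]


def decide(audits: List[ConditionAudit], threshold: float = 0.7) -> str:
    """Illustrative aggregation rule (an assumption, not the paper's)."""
    if not audits:
        # No auditable conditions means no evidence to act on.
        return "ABSTAIN"
    # A single confidently contradicted condition refutes the claim.
    if any(a.label == "contradiction" and a.confidence >= threshold
           for a in audits):
        return "REFUTE"
    # Support requires every condition to be confidently entailed.
    if all(a.label == "entailment" and a.confidence >= threshold
           for a in audits):
        return "SUPPORT"
    # Anything else is treated as insufficient evidence.
    return "ABSTAIN"
```

Under this rule, one weakly supported or neutral condition is enough to trigger abstention, which reflects the idea that an unsupported conclusion can be more harmful than no answer.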

Key Contribution

In scientific reasoning tasks, abstaining when evidence is weak substantially reduces risk at moderate coverage, even when raw accuracy differs only modestly across models.

Abstract

Large language models are increasingly used to answer and verify scientific claims, yet existing evaluations typically assume that a model must always produce a definitive answer. In scientific settings, however, unsupported or uncertain conclusions can be more harmful than abstaining. We study this problem through an abstention-aware verification framework that decomposes scientific claims into minimal conditions, audits each condition against available evidence using natural language inference (NLI), and selectively decides whether to support, refute, or abstain. We evaluate this framework on two complementary scientific benchmarks, SciFact and PubMedQA, covering both closed-book and open-domain evidence settings. Experiments are conducted with six diverse language models, including encoder-decoder models, open-weight chat models, and proprietary APIs. Across all benchmarks and models, we observe that raw accuracy varies only modestly across architectures, while abstention plays a critical role in controlling error. In particular, confidence-based abstention substantially reduces risk at moderate coverage levels, even when absolute accuracy improvements are limited. Our results suggest that in scientific reasoning tasks, the primary challenge is not selecting a single best model, but rather determining when available evidence is sufficient to justify an answer. This work highlights abstention-aware evaluation as a practical and model-agnostic lens for assessing scientific reliability, and provides a unified experimental basis for future work on selective reasoning in scientific domains. Code is available at https://github.com/sabdaljalil2000/ai4science.
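The trade-off between risk and coverage under confidence-based abstention can be made concrete with a short Python sketch. It uses the standard selective-prediction definitions, which the abstract does not spell out and which are assumed here: coverage is the fraction of examples the system answers, and risk is the error rate among the answered examples. The function name `risk_coverage` and its inputs are hypothetical and are not taken from the authors' repository.

```python
from typing import List, Tuple


def risk_coverage(confidences: List[float],
                  correct: List[bool]) -> List[Tuple[float, float]]:
    """Return (coverage, risk) points, answering the most confident
    examples first and abstaining on the rest."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    n = len(confidences)
    # Rank examples from most to least confident.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    points = []
    errors = 0
    for k, i in enumerate(order, start=1):
        if not correct[i]:
            errors += 1
        # Answering the top-k examples gives coverage k/n and
        # risk equal to the error rate among those k answers.
        points.append((k / n, errors / k))
    return points
```

For example, with confidences `[0.9, 0.8, 0.6, 0.4]` and correctness `[True, True, False, True]`, answering only the two most confident examples gives a risk of 0 at 50% coverage, while answering everything gives a risk of 0.25 at full coverage. Accuracy at full coverage is unchanged by abstention; what changes is how many errors are actually emitted.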

Eval Frameworks & Benchmarks | Reasoning & Chain-of-Thought | Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
