GE HealthCareMar 18, 2026arXiv:2603.17580

Negation is Not Semantic: Diagnosing Dense Retrieval Failure Modes for Trade-offs in Contradiction-Aware Biomedical QA

S. Sahoo, Soumya Ranjan Sahoo, N. Gagan, Gagan N., Sanand Sasidharan, Divya Bharti

AI Summary

This paper investigates failure modes of dense retrieval methods in contradiction-aware biomedical question answering, finding that complex dense retrieval strategies suffer from "Semantic Collapse" where negation signals are lost. To address this, they propose a Decoupled Lexical Architecture built on a BM25 backbone, which balances support recall and contradiction surfacing. Experiments on a SciFact-based proxy benchmark and the TREC 2025 BioGen track demonstrate that this approach achieves state-of-the-art performance in contradiction detection and citation coverage, enabling more reliable and trustworthy biomedical QA.

Key Contribution

Seemingly sophisticated dense retrieval methods can catastrophically fail at contradiction detection due to "Semantic Collapse," highlighting the surprising effectiveness of a simple, decoupled lexical approach for reliable biomedical QA.

Abstract

Large Language Models (LLMs) have demonstrated strong capabilities in biomedical question answering, yet their tendency to generate plausible but unverified claims poses serious risks in clinical settings. To mitigate these risks, the TREC 2025 BioGen track mandates grounded answers that explicitly surface contradictory evidence (Task A) and the generation of narrative driven, fully attributed responses (Task B). Addressing the absence of target ground truth, we present a proxy-based development framework using the SciFact dataset to systematically optimize retrieval architectures. Our iterative evaluation revealed a"Simplicity Paradox": complex adversarial dense retrieval strategies failed catastrophically at contradiction detection (MRR 0.023) due to Semantic Collapse, where negation signals become indistinguishable in vector space. We further identify a Retrieval Asymmetry: filtering dense embeddings improves contradiction detection but degrades support recall, compromising reliability. We resolve this via a Decoupled Lexical Architecture built on a unified BM25 backbone, balancing semantic support recall (0.810) with precise contradiction surfacing (0.750). This approach achieves the highest Weighted MRR (0.790) on the proxy benchmark while remaining the only viable strategy for scaling to the 30 million document PubMed corpus. For answer generation, we introduce Narrative Aware Reranking and One-Shot In-Context Learning, improving citation coverage from 50% (zero-shot) to 100%. Official TREC results confirm our findings: our system ranks 2nd on Task A contradiction F1 and 3rd out of 50 runs on Task B citation coverage (98.77%), achieving zero citation contradict rate. Our work transforms LLMs from stochastic generators into honest evidence synthesizers, showing that epistemic integrity in biomedical AI requires precision and architectural scalability isolated metric optimization.

Eval Frameworks & Benchmarks Natural Language Processing Recommendation & Information Retrieval

Citation Metrics

Citations0

Influential citations0

References17

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Negation is Not Semantic: Diagnosing Dense Retrieval Failure Modes for Trade-offs in Contradiction-Aware Biomedical QA

Related Papers