The paper introduces BMBE, a modular diagnostic dialogue framework that separates natural language communication (handled by an LLM) from probabilistic reasoning (handled by a Bayesian engine). This separation enables calibrated selective diagnosis, privacy by construction, and robustness to adversarial inputs, properties not achievable by standalone LLMs. Experiments using both empirical and LLM-generated knowledge bases demonstrate that BMBE, even with a cheap LLM sensor, outperforms frontier standalone LLMs in diagnostic accuracy and robustness.
Separating language understanding from probabilistic reasoning in medical dialogue agents yields a system that's not only more accurate and robust, but also private and auditable by design.
Large language models are increasingly deployed as autonomous diagnostic agents, yet they conflate two fundamentally different capabilities: natural-language communication and probabilistic reasoning. We argue that this conflation is an architectural flaw, not an engineering shortcoming. We introduce BMBE (Bayesian Medical Belief Engine), a modular diagnostic dialogue framework that enforces a strict separation between language and reasoning: an LLM serves only as a sensor, parsing patient utterances into structured evidence and verbalising questions, while all diagnostic inference resides in a deterministic, auditable Bayesian engine. Because patient data never enters the LLM, the architecture is private by construction; because the statistical backend is a standalone module, it can be replaced per target population without retraining. This separation yields three properties no autonomous LLM can offer: calibrated selective diagnosis with a continuously adjustable accuracy-coverage tradeoff, a statistical separation gap where even a cheap sensor paired with the engine outperforms a frontier standalone model from the same family at a fraction of the cost, and robustness to adversarial patient communication styles that cause standalone doctors to collapse. We validate across empirical and LLM-generated knowledge bases against frontier LLMs, confirming the advantage is architectural, not informational.
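The separation the abstract describes can be sketched in a few lines: the LLM acts only as a sensor that maps a patient utterance to a structured finding, while diagnosis is a deterministic Bayesian posterior update with an abstention threshold for calibrated selective diagnosis. The sketch below is illustrative only; all names (`bayes_update`, `selective_diagnose`, the toy knowledge base) are assumptions, not the paper's actual API, and the LLM sensor is stubbed out.

```python
def bayes_update(prior, likelihoods, finding, present):
    """Deterministic engine step: posterior ∝ prior × P(finding | disease).

    `likelihoods[d][finding]` is P(finding present | disease d); findings
    absent from the toy knowledge base default to an uninformative 0.5.
    """
    post = {}
    for d, p in prior.items():
        lik = likelihoods[d].get(finding, 0.5)
        post[d] = p * (lik if present else 1.0 - lik)
    z = sum(post.values())
    return {d: p / z for d, p in post.items()}

def selective_diagnose(belief, tau):
    """Calibrated selective diagnosis: answer only when the top posterior
    clears the threshold tau; otherwise abstain (return None). Sweeping tau
    traces out the accuracy-coverage tradeoff."""
    d, p = max(belief.items(), key=lambda kv: kv[1])
    return d if p >= tau else None

# Toy knowledge base (illustrative numbers, not from the paper).
prior = {"flu": 0.5, "cold": 0.5}
likelihoods = {"flu": {"fever": 0.9}, "cold": {"fever": 0.2}}

# The LLM sensor would parse "I've had a high temperature since yesterday"
# into structured evidence; here we hard-code its hypothetical output.
finding, present = "fever", True

belief = bayes_update(prior, likelihoods, finding, present)
diagnosis = selective_diagnose(belief, tau=0.8)
```

Because the engine is a standalone module, swapping `prior` and `likelihoods` for a different target population changes the diagnosis without touching the sensor, and every posterior is auditable by inspection.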