This paper investigates the effectiveness of self-reflective prompting for improving medical question answering accuracy in LLMs. Using GPT-4o and GPT-4o-mini, the authors compare standard chain-of-thought prompting with an iterative self-reflection loop across MedQA, HeadQA, and PubMedQA datasets. The results indicate that self-reflection's impact is inconsistent, yielding modest gains on MedQA but limited or negative benefits on HeadQA and PubMedQA, suggesting that reasoning transparency does not guarantee correctness.
Self-reflection in LLMs doesn't consistently improve medical QA accuracy, revealing a disconnect between reasoning transparency and correctness that challenges the assumption that more reflection always leads to better performance.
Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning. Self-reflective (self-corrective) prompting, which asks an LLM to critique and revise its own reasoning, has been widely claimed to enhance model reliability, yet its effectiveness in safety-critical medical settings remains unclear. In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering. Using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track how predictions evolve across reflection steps on three widely used medical QA benchmarks (MedQA, HeadQA, and PubMedQA). We analyze whether self-reflection leads to error correction, error persistence, or the introduction of new errors. Our results show that self-reflective prompting does not consistently improve accuracy and that its impact is highly dataset- and model-dependent: it yields modest gains on MedQA but provides limited or negative benefits on HeadQA and PubMedQA, and increasing the number of reflection steps does not guarantee better performance. These findings highlight a gap between reasoning transparency and reasoning correctness, suggesting that self-reflective reasoning is better viewed as an analytical tool for understanding model behavior than as a standalone solution for improving medical QA reliability.
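The reflection loop and the three outcomes analyzed above (error correction, error persistence, new errors) can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the function names (`reflect_loop`, `classify`, `summarize`), the outcome labels, and the `ask` callable standing in for a model call are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List


def reflect_loop(question: str, ask: Callable[[str], str], steps: int = 2) -> List[str]:
    """Return the answer from the initial CoT pass plus one answer per reflection step.

    `ask` is a hypothetical stand-in for a model call that maps a prompt to an
    option letter; the prompt texts below are illustrative, not the paper's.
    """
    answers = [ask(f"{question}\nThink step by step, then give one option letter.")]
    for _ in range(steps):
        prompt = (
            f"{question}\nYour previous answer was {answers[-1]}. "
            "Critique your reasoning, then give one option letter."
        )
        answers.append(ask(prompt))
    return answers


def classify(gold: str, answers: List[str]) -> str:
    """Label a trajectory by comparing its first and last answers to the gold label."""
    first_ok = answers[0] == gold
    last_ok = answers[-1] == gold
    if first_ok and last_ok:
        return "stayed_correct"
    if not first_ok and last_ok:
        return "corrected"
    if first_ok and not last_ok:
        return "new_error"
    return "persistent_error"


def summarize(golds: List[str], trajectories: List[List[str]]) -> Counter:
    """Count outcome labels over a set of questions."""
    return Counter(classify(g, t) for g, t in zip(golds, trajectories))
```

Under these assumptions, a net accuracy gain from reflection corresponds to the `corrected` count exceeding the `new_error` count. A dataset where the two are close, or where `new_error` dominates, would show the limited or negative benefit described above.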