The paper introduces TARSE, a test-time adaptation method for clinical question answering agents that leverages retrieval of both clinical skills (guidelines, protocols) and prior experiences (reasoning trajectories). TARSE constructs skill and experience libraries and uses a step-aware retriever to select relevant items for each case, adapting the language model to align its reasoning with clinically valid logic. Experiments on medical QA benchmarks demonstrate that TARSE outperforms medical RAG baselines and prompting-only methods, indicating the benefits of explicitly retrieving and aligning with clinical skills and experiences.
TARSE improves clinical question answering by retrieving relevant skills and prior reasoning experiences at test time and adapting the language model to them, so that its reasoning aligns with clinically valid logic.
Complex clinical decision making often fails not because a model lacks facts, but because it cannot reliably select and apply the right procedural knowledge and the right prior example at the right reasoning step. We frame clinical question answering as an agent problem with two explicit, retrievable resources: skills, reusable clinical procedures such as guidelines, protocols, and pharmacologic mechanisms; and experience, verified reasoning trajectories from previously solved cases (e.g., chain-of-thought solutions and their step-level decompositions). At test time, the agent retrieves both relevant skills and experiences from curated libraries and performs lightweight test-time adaptation to align the language model's intermediate reasoning with clinically valid logic. Concretely, we build (i) a skills library from guideline-style documents organized as executable decision rules, (ii) an experience library of exemplar clinical reasoning chains indexed by step-level transitions, and (iii) a step-aware retriever that selects the most useful skill and experience items for the current case. We then adapt the model on the retrieved items to reduce instance-step misalignment and to prevent reasoning from drifting toward unsupported shortcuts. Experiments on medical question-answering benchmarks show consistent gains over strong medical RAG baselines and prompting-only reasoning methods. Our results suggest that explicitly separating and retrieving clinical skills and experience, and then aligning the model at test time, is a practical approach to more reliable medical agents.
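To make the retrieval step concrete, the following is a minimal sketch of one possible step-aware retriever over separate skill and experience entries. It is not the authors' implementation. The `Item` structure, the step labels, the bag-of-words cosine similarity, and the `step_bonus` weight are all assumptions chosen for illustration, and the library entries are toy placeholder text rather than clinical guidance. The test-time adaptation stage is not shown.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    kind: str  # "skill" or "experience" (hypothetical labels)
    step: str  # hypothetical reasoning-step label, e.g. "diagnosis"
    text: str  # toy placeholder content, not clinical guidance


def bag_of_words(text: str) -> Counter:
    """Lowercase token counts; a stand-in for a learned text encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(case: str, step: str, library: list[Item], kind: str,
             k: int = 1, step_bonus: float = 0.5) -> list[Item]:
    """Rank items of one kind by similarity to the case, plus an assumed
    additive bonus when the item's step label matches the current step."""
    query = bag_of_words(case)

    def score(item: Item) -> float:
        bonus = step_bonus if item.step == step else 0.0
        return cosine(query, bag_of_words(item.text)) + bonus

    candidates = [item for item in library if item.kind == kind]
    return sorted(candidates, key=score, reverse=True)[:k]


# Toy library with made-up entries, for illustration only.
LIBRARY = [
    Item("skill", "diagnosis",
         "Suspected pulmonary embolism: apply Wells score before imaging"),
    Item("skill", "treatment",
         "Confirmed pulmonary embolism: start anticoagulation unless contraindicated"),
    Item("experience", "diagnosis",
         "Patient with pleuritic chest pain and tachycardia; Wells score computed, "
         "then CT angiography ordered"),
    Item("experience", "treatment",
         "Patient with confirmed pulmonary embolism and no bleeding risk; "
         "anticoagulation started"),
]

if __name__ == "__main__":
    case = "chest pain with suspected pulmonary embolism"
    for step in ("diagnosis", "treatment"):
        skill = retrieve(case, step, LIBRARY, "skill")[0]
        experience = retrieve(case, step, LIBRARY, "experience")[0]
        print(step, "|", skill.text, "|", experience.text)
```

In this sketch the same case text retrieves different skill and experience items depending on the current step label, which is the behavior the abstract attributes to step-aware retrieval. A real system would presumably replace the token-overlap score with a learned retriever.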