Mar 3, 2026 | arXiv:2603.03054

PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems

AI Summary

PrivMedChat is introduced as an end-to-end differentially private RLHF (DP-RLHF) framework tailored for medical dialogue systems, addressing privacy concerns associated with using sensitive doctor-patient conversations for training. The framework employs DP-SGD during medical SFT and reward model learning, and limits privacy expenditure during alignment by applying DP-SGD to the PPO actor and critic while keeping the reward model fixed. Experiments demonstrate that PrivMedChat achieves state-of-the-art performance among DP models on medical dialogue benchmarks, with strong utility and near-chance membership inference signals.
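As a rough illustration of the DP-SGD update that the framework relies on, the sketch below clips each example's gradient to a fixed L2 norm, sums the clipped gradients, adds Gaussian noise, and averages. It is a generic NumPy sketch of the standard DP-SGD step, not the authors' implementation; the function name `dp_sgd_step` and the parameters `clip_norm` and `noise_multiplier` are illustrative assumptions, and privacy accounting (tracking the resulting $\varepsilon$) is omitted.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One generic DP-SGD update (illustrative; hypothetical names, no accounting)."""
    # per_example_grads has shape (batch_size, num_params)
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale down any gradient whose L2 norm exceeds clip_norm; leave the rest unchanged
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    # Gaussian noise with standard deviation noise_multiplier * clip_norm,
    # added once to the sum of clipped gradients
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / per_example_grads.shape[0]
    return params - lr * noisy_mean

# Usage with toy values (all numbers here are made up for illustration)
rng = np.random.default_rng(0)
params = np.zeros(2)
grads = np.array([[3.0, 4.0], [0.0, 0.5]])
new_params = dp_sgd_step(params, grads, lr=0.1, clip_norm=1.0,
                         noise_multiplier=1.0, rng=rng)
```

Clipping bounds how much any single example can move the parameters, and the noise scale is tied to that bound; together these are what make a formal privacy guarantee possible.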

Key Contribution

You can now train medical dialogue LLMs with differential privacy guarantees, achieving strong utility while minimizing the risk of memorizing and leaking sensitive patient data.

Abstract

Large language models are increasingly used for patient-facing medical assistance and clinical decision support, but adapting them to clinical dialogue often requires supervision derived from doctor-patient conversations that may contain sensitive information. Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content. We present PrivMedChat, an end-to-end framework for differentially private RLHF (DP-RLHF) for medical dialogue. Our design enforces differential privacy at every training stage that directly accesses dialogue-derived supervision: (i) Differentially Private Stochastic Gradient Descent (DP-SGD) for medical SFT and (ii) DP-SGD for reward model learning from preference pairs. To limit additional privacy expenditure during alignment, we apply DP-SGD to the PPO actor and critic when operating on dialogue-derived prompts, while the reward model remains fixed after DP training. We also introduce an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations to produce scalable preference data without clinician labeling. Experiments on medical dialogue benchmarks show that PrivMedChat at $\varepsilon=7$ achieves the highest ROUGE-L of 0.156 among all DP models, reduces clinical hallucinations to 1.4% and harmful advice to 0.4%, and obtains the highest overall score of 2.86 in a 3-model LLM-jury evaluation, while producing membership-inference signals that are near chance (AUC 0.510-0.555). We open-source our code at https://github.com/sudip-bhujel/privmedchat.
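To make "near chance" concrete: a membership-inference AUC is the probability that a randomly chosen training example receives a higher attack score than a randomly chosen held-out example, so 0.5 means the attacker does no better than guessing. The sketch below computes that quantity directly from two score lists. It is only an illustration of the metric; the abstract does not specify the attack or the score used, so the function name `mia_auc` and the idea of a per-example score (for instance, negative loss) are assumptions.

```python
def mia_auc(member_scores, nonmember_scores):
    """AUC as the fraction of (member, non-member) pairs where the member
    scores higher; ties count as half a win. Hypothetical helper."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# Identical score distributions give chance-level AUC
chance = mia_auc([1.0, 2.0], [1.0, 2.0])      # 0.5
# Perfectly separated scores give AUC 1.0 (full leakage of membership)
separable = mia_auc([3.0, 4.0], [1.0, 2.0])   # 1.0
```

Under this reading, the reported range of 0.510 to 0.555 sits close to the 0.5 baseline rather than the 1.0 extreme.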

Constitutional AI & AI Ethics | Natural Language Processing | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 46

Year: 2026

Venue: N/A
